Cutio

[LLM Sponsor Detection Benchmark]

March 25, 2026

Cutio uses LLMs to detect sponsor segments in YouTube videos in real time. Model choice directly affects user experience: a false positive skips actual content, a false negative forces the user to sit through an ad. We need models that are accurate, fast, cheap, and reliable, but these goals are in tension. This benchmark evaluates five models on detecting and classifying sponsored and self-promotional segments across 65 videos spanning diverse categories and creators, and quantifies the trade-offs.

The task is harder than it looks. Sponsors are often woven naturally into content (stealth integration), creators make meta-commentary about their own ads, and self-promo segments can look identical to sponsored reads. Ground truth is built via multi-model consensus from three frontier LLMs with judge arbitration, and the prompt encodes a decision tree that forces models to reason about context before classifying.

> Multi-Model Consensus Ground Truth

Building reliable ground truth for ad-segment detection is hard — human annotation is expensive, slow, and subjective. We use a multi-model consensus protocol that combines three frontier-class LLMs as independent annotators with a fourth model as a tiebreaker judge.

Multi-model consensus protocol flowchart — 3 reference models annotate independently, IoU clustering, agreement level branching into Hard GT, Judge Arbitration → Soft GT, or Discarded

Reference models: Claude Opus 4.6, Gemini 3.1 Pro, and GPT-5.4 annotate every video independently with extended thinking enabled. Overlapping segments are clustered across models using IoU ≥ 0.3.

Tiered confidence: Segments where all three models agree become hard ground truth (median boundaries). Where only two of the three agree, Claude Sonnet 4.6 acts as a blind judge: it receives the transcript fragment and anonymized annotations without knowing which model produced which. The judge's verdict becomes soft ground truth. Single-model detections are discarded.
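The tiering rule can be sketched as a small function. This is a minimal illustration, not the benchmark's actual code: it assumes the IoU clustering has already grouped overlapping annotations, and the function name, tuple layout, and tier labels are hypothetical.

```python
from statistics import median


def consolidate(cluster, n_models=3):
    """Assign a confidence tier to one cluster of overlapping annotations.

    cluster: list of (model_name, start_s, end_s), at most one per model.
    The tuple layout and tier labels are assumptions for illustration.
    """
    votes = len({model for model, _, _ in cluster})
    if votes == n_models:
        # Unanimous: hard ground truth with median boundaries.
        return {
            "tier": "hard",
            "start": median(s for _, s, _ in cluster),
            "end": median(e for _, _, e in cluster),
        }
    if votes * 2 > n_models:
        # Majority only: boundaries and label are left to the blind judge.
        return {"tier": "needs_judge"}
    # Single-model detection: dropped.
    return {"tier": "discarded"}


unanimous = [("a", 10, 70), ("b", 12, 68), ("c", 11, 75)]
result = consolidate(unanimous)
```

Using the median rather than the mean keeps one outlying annotator from dragging a hard boundary.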

Inter-annotator agreement measured via Krippendorff's alpha reaches α = 0.856 — well above the 0.8 threshold typically considered reliable. 21 segment clusters were disputed and discarded.

  • Total segments: 95
  • Hard (unanimous): 87
  • Soft (judged): 8
  • Krippendorff's α: 0.856

> Prompt Design

Each model receives the same structured prompt: video metadata (author, title, category, description, duration) plus the full timestamped transcript. The prompt encodes a decision tree that forces sequential reasoning:

  • Q1. Is this promotional language sincere? → ironic / skit → skip
  • Q2. Does the video exist for this purpose? → primary topic → skip
  • Q3. Third-party brand + CTA? → sponsor
  • Q4. Creator's own stuff in >5s block? → self-promo
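The ordering of the four questions matters, because an earlier "skip" short-circuits the later checks. The sketch below writes that ordering out as plain code. In the benchmark the LLM does this reasoning from the transcript; the boolean inputs, the function name, and the `"sponsor"` / `"self_promo"` labels here are hypothetical stand-ins.

```python
def classify(sincere, is_primary_topic, third_party_brand, has_cta,
             own_product, block_seconds):
    """Walk Q1..Q4 in order; return a category or None (no segment)."""
    if not sincere:
        return None              # Q1: ironic or skit, skip
    if is_primary_topic:
        return None              # Q2: the video exists for this purpose, skip
    if third_party_brand and has_cta:
        return "sponsor"         # Q3
    if own_product and block_seconds > 5:
        return "self_promo"      # Q4
    return None


# A sincere third-party read with a call to action, inside a normal video.
label = classify(sincere=True, is_primary_topic=False, third_party_brand=True,
                 has_cta=True, own_product=False, block_seconds=45)
```

A product review, for example, exits at Q2 even though it would otherwise satisfy Q3.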

The output is structured JSON validated with Zod — segment start/end in integer seconds with category classification. An Anthropic-specific workaround removes minimum constraints from the schema (Claude rejects them), then re-validates locally with the strict schema.
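The relax-then-revalidate workaround can be illustrated in Python over a JSON-Schema-style dict. The source uses Zod, so this is an analogy rather than the actual implementation: the schema shape, the category names, and the `end > start` rule are assumptions made for the example.

```python
STRICT_SCHEMA = {
    "type": "object",
    "properties": {
        "segments": {
            "type": "array",
            "items": {
                "type": "object",
                "properties": {
                    "start": {"type": "integer", "minimum": 0},
                    "end": {"type": "integer", "minimum": 1},
                    "category": {"enum": ["sponsor", "self_promo"]},
                },
                "required": ["start", "end", "category"],
            },
        }
    },
    "required": ["segments"],
}

BOUND_KEYWORDS = {"minimum", "exclusiveMinimum"}


def relax(schema):
    """Return a copy of the schema with minimum constraints removed.

    Caveat: a property literally named "minimum" would also be dropped.
    """
    if isinstance(schema, dict):
        return {k: relax(v) for k, v in schema.items() if k not in BOUND_KEYWORDS}
    if isinstance(schema, list):
        return [relax(v) for v in schema]
    return schema


def validate_strict(payload):
    """Hand-rolled stand-in for the local strict re-validation step."""
    errors = []
    for i, seg in enumerate(payload.get("segments", [])):
        start, end = seg.get("start"), seg.get("end")
        if not (isinstance(start, int) and isinstance(end, int)):
            errors.append(f"segment {i}: start/end must be integers")
            continue
        if start < 0:
            errors.append(f"segment {i}: start below minimum")
        if end <= start:
            errors.append(f"segment {i}: end must be after start")
        if seg.get("category") not in ("sponsor", "self_promo"):
            errors.append(f"segment {i}: unknown category")
    return errors


relaxed = relax(STRICT_SCHEMA)   # what would be sent to the provider
errors = validate_strict({"segments": [{"start": -3, "end": 10, "category": "sponsor"}]})
```

The relaxed schema only loosens what the provider sees. The bounds are still enforced before a response is accepted.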

> Evaluation Protocol

Matching: Predicted segments are matched to ground truth using greedy IoU-based assignment. All candidate pairs with IoU ≥ 0.5 are sorted by IoU descending; the best pair is matched first, then both segments are removed from further consideration. A match also requires correct category classification.
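A minimal sketch of the greedy assignment follows. It assumes segments are `(start, end, category)` tuples and applies the category requirement when building candidate pairs; the text does not say whether the category check happens before or after pairing, so treat that as an assumption.

```python
def iou(a, b):
    """Temporal IoU of two (start, end, ...) segments, in seconds."""
    inter = max(0, min(a[1], b[1]) - max(a[0], b[0]))
    union = (a[1] - a[0]) + (b[1] - b[0]) - inter
    return inter / union if union > 0 else 0.0


def greedy_match(preds, gts, threshold=0.5):
    """Return (pred_index, gt_index, iou) triples, best IoU first."""
    candidates = [
        (iou(p, g), i, j)
        for i, p in enumerate(preds)
        for j, g in enumerate(gts)
        if p[2] == g[2] and iou(p, g) >= threshold
    ]
    candidates.sort(key=lambda c: c[0], reverse=True)

    used_preds, used_gts, matches = set(), set(), []
    for score, i, j in candidates:
        # Each prediction and each ground-truth segment is used at most once.
        if i not in used_preds and j not in used_gts:
            matches.append((i, j, score))
            used_preds.add(i)
            used_gts.add(j)
    return matches


# Hypothetical example data, not taken from the benchmark.
preds = [(100, 160, "sponsor"), (300, 320, "self_promo"), (500, 560, "sponsor")]
gts = [(102, 158, "sponsor"), (300, 330, "sponsor"), (700, 760, "sponsor")]

matches = greedy_match(preds, gts)
tp = len(matches)
fp = len(preds) - tp
fn = len(gts) - tp
```

In this example the second prediction overlaps its ground-truth segment well but carries the wrong category, so it counts as both a false positive and a false negative.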

Metrics: F1 is micro-averaged across all videos. Timing accuracy is measured by MAE (Mean Absolute Error) of start and end boundaries for matched pairs. Mean IoU captures temporal overlap quality. Bootstrap confidence intervals (1000 iterations, block-resampled by video) provide uncertainty estimates.
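The sketch below shows one way to combine micro-averaged F1 with a video-level bootstrap. It assumes each video has already been reduced to `(tp, fp, fn)` counts and uses a plain percentile interval; the source does not state which interval method it uses, and the function names are hypothetical.

```python
import random


def micro_f1(counts):
    """Pool TP/FP/FN over videos, then compute F1 once."""
    tp = sum(c[0] for c in counts)
    fp = sum(c[1] for c in counts)
    fn = sum(c[2] for c in counts)
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def bootstrap_ci(per_video, iterations=1000, alpha=0.05, seed=0):
    """Percentile CI for micro F1, resampling whole videos with replacement."""
    rng = random.Random(seed)
    n = len(per_video)
    scores = sorted(
        micro_f1([per_video[rng.randrange(n)] for _ in range(n)])
        for _ in range(iterations)
    )
    lo_idx = int(iterations * alpha / 2)
    hi_idx = iterations - 1 - lo_idx
    return scores[lo_idx], scores[hi_idx]


# Hypothetical per-video counts, not benchmark data.
per_video = [(2, 0, 0), (1, 1, 0), (0, 0, 1), (3, 0, 1), (1, 0, 0), (2, 1, 1)]
point = micro_f1(per_video)
low, high = bootstrap_ci(per_video)
```

Resampling whole videos rather than individual segments keeps segments from the same video together, which is what "block-resampled by video" refers to.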

Hard / Soft split: Metrics are reported separately for high-confidence (unanimous) and low-confidence (judge-arbitrated) segments, revealing how models perform on clear-cut versus ambiguous cases.

> Overall Ranking

The table below ranks models by F1 score with 95% bootstrap confidence intervals. The intervals are ~12–16 pp wide, so differences between the top models are not statistically significant; treat relative rankings with caution.

Overall model ranking by F1 score, precision, recall, cost, latency, and timing accuracy

| Model | F1 [95% CI] | F1 (hard) | Precision | Recall | Cost/Video | Latency (med) | MAE Start | MAE End | Mean IoU |
|---|---|---|---|---|---|---|---|---|---|
| gemini-3.1-flash-lite | 85.4% [78.6%–91.4%] | 85.9% | 87.8% | 83.2% | $0.0057 | 4.8s | 3.9s | 2.0s | 87.8% |
| gemini-3-flash | 85.1% [76.8%–92.2%] | 84.5% | 83.0% | 87.4% | $0.0077 | 2.9s | 2.9s | 2.0s | 89.1% |
| qwen-3.5-flash | 84.6% [77.6%–91.3%] | 85.1% | 88.5% | 81.0% | $0.0023 | 8.1s | 3.8s | 1.9s | 87.2% |
| gpt-5.4-nano | 82.4% [75.8%–89.2%] | 80.6% | 78.8% | 86.3% | $0.0036 | 6.4s | 5.1s | 1.5s | 85.7% |
| claude-haiku-4.5 | 81.1% [72.5%–89.0%] | 83.7% | 85.9% | 76.8% | $0.0178 | 2.6s | 4.0s | 2.3s | 87.3% |

> Detection Accuracy

F1 score captures the balance between precision and recall. Mean IoU shows how precisely predicted segment boundaries overlap with ground truth — a model can correctly detect a segment (high F1) while placing boundaries poorly (low IoU).

> F1 Score

Harmonic mean of precision and recall — the primary ranking metric.
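As a quick sanity check, the harmonic-mean identity can be applied to the pooled precision and recall reported in the ranking table. The helper name is hypothetical; the inputs are the gemini-3.1-flash-lite and claude-haiku-4.5 values (87.8% / 83.2% and 85.9% / 76.8%).

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall (both as fractions)."""
    total = precision + recall
    return 2 * precision * recall / total if total else 0.0


flash_lite = f1_score(0.878, 0.832)
haiku = f1_score(0.859, 0.768)
```

Both results round to the reported F1 values (85.4% and 81.1%), which is what one would expect when precision, recall, and F1 are all computed from the same pooled counts.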

> Precision vs Recall

Conservative models cluster top-left, aggressive ones bottom-right. Top-right is ideal.

The precision-recall scatter reveals distinct model strategies. Precision-biased models (upper-left) are conservative — they miss some sponsors but rarely make false claims. Recall-biased models (lower-right) catch more sponsors but produce more false positives.

For a skip-ahead UX like Cutio's, high recall is critical — missed sponsors degrade the user experience more than occasional false skips, which users can easily undo. But precision below ~80% creates an annoyingly trigger-happy experience. The ideal operating point is the top-right corner: high in both dimensions.

> The Hard/Soft Gap

The most striking finding: soft segments are nearly impossible for every model. These are segments where reference annotators disagreed and a judge had to arbitrate. If even frontier models can't agree on whether something is an ad, it's unsurprising that test models struggle too.

A structural factor amplifies this effect: because models don't distinguish "hard" from "soft" predictions, all unmatched predictions count as false positives against the small pool of soft GT segments (only 8 out of 95). This makes soft precision inherently near-zero — the metric captures genuine difficulty but also reflects the imbalance between tiers.

Hard vs soft segment F1, precision, and recall per model

| Model | Hard F1 | Soft F1 | Gap |
|---|---|---|---|
| gemini-3.1-flash-lite | 85.9% | 6.1% | 79.8pp |
| qwen-3.5-flash | 85.1% | 6.3% | 78.7pp |
| gemini-3-flash | 84.5% | 7.4% | 77.1pp |
| claude-haiku-4.5 | 83.7% | 2.1% | 81.6pp |
| gpt-5.4-nano | 80.6% | 8.9% | 71.7pp |

Hard F1 ranges from 80.6% to 85.9%, while soft F1 never exceeds 8.9%. This suggests the benchmark's difficulty is bimodal — clear-cut segments are solved well, but ambiguous edge cases remain open. The soft tier effectively measures a model's ability to handle content that even experts disagree on.

> Timing Accuracy

Beyond detection, where a model places segment boundaries matters. A correctly detected segment with sloppy boundaries creates a jarring skip experience. We measure MAE (Mean Absolute Error) for start and end timestamps separately.
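Boundary MAE is computed over matched pairs only, so unmatched predictions and missed segments do not affect it. A minimal sketch, assuming each pair is `((pred_start, pred_end), (gt_start, gt_end))` in seconds; the function name and sample values are hypothetical.

```python
def boundary_mae(pairs):
    """Return (mae_start, mae_end) in seconds, or None with no matched pairs."""
    if not pairs:
        return None
    n = len(pairs)
    mae_start = sum(abs(pred[0] - gt[0]) for pred, gt in pairs) / n
    mae_end = sum(abs(pred[1] - gt[1]) for pred, gt in pairs) / n
    return mae_start, mae_end


# Hypothetical matched pairs, not benchmark data.
pairs = [((100, 160), (104, 161)), ((300, 360), (298, 360))]
mae_start, mae_end = boundary_mae(pairs)
```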

Timing accuracy — MAE start, MAE end, and mean IoU per model

| Model | MAE Start | MAE End | Ratio |
|---|---|---|---|
| gemini-3-flash | 2.9s | 2.0s | 1.4× |
| qwen-3.5-flash | 3.8s | 1.9s | 1.9× |
| gemini-3.1-flash-lite | 3.9s | 2.0s | 1.9× |
| claude-haiku-4.5 | 4.0s | 2.3s | 1.8× |
| gpt-5.4-nano | 5.1s | 1.5s | 3.4× |

Start boundaries are harder than end boundaries for every model: start MAE is roughly 2× end MAE for most models, ranging from 1.4× (gemini-3-flash) to 3.4× (gpt-5.4-nano). This makes intuitive sense: sponsor segments often begin with a natural transition ("speaking of which..."), while endings are marked by clearer signals ("anyway, back to..."). The prompt also instructs models to start 2–3 seconds before the transition phrase and to include music gaps, and that leeway makes start detection inherently noisier.

> Cost-Performance Frontier

For a production service processing thousands of videos daily, cost per video is a critical dimension. The spread across models is dramatic: roughly an 8× cost difference between the cheapest and most expensive model.

> Cost vs F1

Cost-performance Pareto frontier. Top-left corner is the sweet spot.

> Cost per Video

API cost for a single video analysis, sorted cheapest first.

> Latency

Response time in seconds. Solid = median, faded = p95.

The cost-efficiency sweet spot belongs to qwen-3.5-flash at 364 F1-points per dollar. The raw accuracy leader gemini-3.1-flash-lite costs 2.5× more per video for a 0.8 pp F1 improvement, a steep marginal cost that may or may not be justified depending on precision requirements.
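The efficiency figure and the Pareto frontier can be approximated from the ranking table. The sketch below reads cost per video as $0.0057, $0.0077, $0.0023, $0.0036, and $0.0178 (an interpretation of the table) and treats F1 as a fraction. The function names are hypothetical.

```python
# (cost per video in USD, F1 as a fraction), read from the ranking table.
MODELS = {
    "gemini-3.1-flash-lite": (0.0057, 0.854),
    "gemini-3-flash": (0.0077, 0.851),
    "qwen-3.5-flash": (0.0023, 0.846),
    "gpt-5.4-nano": (0.0036, 0.824),
    "claude-haiku-4.5": (0.0178, 0.811),
}


def f1_per_dollar(cost, f1):
    """F1 (as a fraction) bought per dollar of per-video cost."""
    return f1 / cost


def pareto_frontier(models):
    """Models not dominated by any other on (lower cost, higher F1)."""
    frontier = []
    for name, (cost, f1) in models.items():
        dominated = any(
            c <= cost and f >= f1 and (c < cost or f > f1)
            for other, (c, f) in models.items()
            if other != name
        )
        if not dominated:
            frontier.append(name)
    return sorted(frontier, key=lambda n: models[n][0])


qwen_efficiency = f1_per_dollar(*MODELS["qwen-3.5-flash"])
frontier = pareto_frontier(MODELS)
cost_ratio = MODELS["gemini-3.1-flash-lite"][0] / MODELS["qwen-3.5-flash"][0]
```

With these rounded inputs qwen-3.5-flash comes out near 368 F1 per dollar, close to the 364 quoted above; the gap is presumably rounding in the tabulated cost. Only qwen-3.5-flash and gemini-3.1-flash-lite sit on the frontier, since every other model is both more expensive and less accurate than one of those two.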

> Limitations

  • Dataset size. 65 videos is enough to surface broad patterns but too small for fine-grained statistical claims. Bootstrap CI widths of ~12–16 pp reflect this — differences smaller than that may be noise.
  • LLM-as-ground-truth. The gold GT itself is produced by AI models, not human annotators. While the multi-model consensus protocol with Krippendorff's α = 0.856 suggests high internal reliability, systematic biases shared by all reference models would be invisible.
  • Single-pass evaluation. Each model runs once per video at temperature 0.3 with a uniform reasoning budget (1024 thinking tokens) and a 4096-token output cap. No ensembling or multi-pass aggregation is used.
  • Pricing volatility. API costs are based on published rates at time of evaluation and may change.