
Claude Computer Use: An Ad Creative QA Agent That Audits Your Meta Ads Library

March 2, 2025


Last quarter a junior media buyer on one of my retainer accounts sent a Slack message at 11:47 PM: "I think 23 of our live ads might be flagged. Brand wants them down by morning." The Meta rep was offline. She had 4 hours, ~80 ads in flight, and a checklist she'd been running on autopilot for two months. By 3:30 AM she'd reviewed 41 ads and missed the one that was actually a policy violation — a "100% guaranteed results" claim on a lead-gen form (a Meta ad format that captures user info before redirecting to a site). The ad stayed up another 36 hours, cost us 14 conversions from people who bounced when Meta finally took it down, and the post-mortem was the same sentence I keep hearing from every team: "we almost caught it."

I built a Computer Use agent the next week. It does not replace the reviewer. It does the boring part of the reviewer's job in 12 minutes, and the human reviewer's job shrinks from "scan 200 ads" to "look at 12 red flags and 23 yellow flags and make 6 calls." That ratio is the actual product.

Why "human review" stops scaling around ad #80

The job most teams call "ad QA" is two jobs stapled together:

Policy compliance (hard fail) — claims Meta's ad review will reject, restricted categories, missing disclaimers, before/after framing, the 30 banned phrases that get flagged at a rate the platform publishes but nobody reads.
Brand consistency (soft fail) — logo placement, color palette match, voice/tone, mandatory assets the brand team required ("always include the 10% off sticker", "never use stock people in suits", "headline must reference the campaign name").

A focused reviewer can do both for the first 60–80 ads. Past 80, one of two things happens: they stop flagging the soft stuff (most common), or they slow to ~5 ads per hour and miss deadlines (second most common). The post-mortem of every banned ad I've seen in 12 years looks the same: it was a Tuesday afternoon, the reviewer was tired, and "guaranteed" was sitting in the 8th line of the primary text.

The job is perfect for an agent. It's bounded, it's visual, it's repetitive, and the rules are stable enough that the false-positive rate is tolerable.

The 2-axis rubric I use to brief the agent

I do not let the agent invent its own checklist. I give it the same 2-axis rubric every time, with the brand-specific items filled in by the account lead before each run. It looks like this:

Axis	Severity	Example flags
Meta policy	RED (will be rejected)	"guaranteed / best in class / #1 / 100%"; before/after claims; weight-loss / financial-gain claims without disclaimer; misleading health claims; "free" tied to a paid offer
Meta policy	YELLOW (might be rejected)	Unclear landing page match; adult-tone language; user-incentivized reviews; missing "results not typical" on a finance ad
Brand consistency	RED (ship blocker)	Logo missing; wrong product shown; forbidden color (off-palette hex); required disclaimer missing
Brand consistency	YELLOW (review with brand lead)	Tone drift ("we" vs. the brand's preferred "I"); CTA (call-to-action) phrasing not on the approved list; secondary headline too long for the format

The RED rows are the agent's hard pass. The YELLOW rows are the human reviewer's queue. In the runs I have data for, the agent flagged ~12% of ads RED and ~18% YELLOW — meaning the human reviewer's job shrinks to about 30% of the original set, and the median decision time on those 30% is ~15 seconds because the agent has already pasted the offending line into the output row.
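The arithmetic behind that 30% figure is easy to sanity-check. Here is a minimal sketch, assuming the ~12% RED and ~18% YELLOW rates and the ~15-second median decision time quoted above. The function name and the 200-ad example are illustrative, and the sketch treats RED and YELLOW ads as separate groups, which will slightly overcount when one ad carries both.

```python
def review_workload(total_ads, red_rate=0.12, yellow_rate=0.18, seconds_per_decision=15):
    """Estimate how many ads a human still has to look at, and for how long."""
    red = round(total_ads * red_rate)
    yellow = round(total_ads * yellow_rate)
    queue = red + yellow  # assumes an ad is counted once, as RED or as YELLOW
    minutes = queue * seconds_per_decision / 60
    return {"red": red, "yellow": yellow, "queue": queue, "minutes": minutes}


print(review_workload(200))
```

For a 200-ad library this works out to a queue of 60 ads and about 15 minutes of decision time, before any time spent actually fixing the flagged ads.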

The build (3 layers, no surprises)

This post is not the one where I walk you through Docker. I covered the headless-browser setup (a browser that runs without a visible window) in the SERP-brief Computer Use post; the container image, the xdotool dispatcher, and the screenshot-to-tool-result loop are identical. What I want to focus on is the three layers that make the ad QA version different from a content brief agent.

Layer 1 — Where the agent gets the ad creative. The Meta Ads Manager "preview" link is the input. The agent opens it, the ad renders, the screenshot is the ground truth. I do not feed it creative files directly — most account teams work in Ads Manager, and the preview is what Meta's reviewer will see, so it's the right reference image. (This means the agent only catches what Meta's reviewer will catch on a normal day; if Meta's reviewer is in a particularly strict mood, you'll still get rejects. Don't blame the agent.)

Layer 2 — The 2-axis rubric is loaded as a system prompt, not as a tool. The rubric is the role definition. The agent's job is to be a strict, slightly paranoid creative reviewer applying a specific checklist. Telling it to "review these ads for issues" produces vague output. Telling it "you are a Meta ad reviewer applying this specific rubric, and you MUST return one CSV row per flag with ad_id, axis, severity, offending_line, and fix" produces structured output you can paste into a Slack thread.

Layer 3 — Output is a CSV, not prose. A single CSV with columns: ad_id, axis, severity, offending_line, fix. One row per flag. Ads that pass cleanly are not in the CSV: silence means green. This format is boring on purpose: a creative director at 7 AM does not want to read 80 paragraphs; they want a spreadsheet they can sort.
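One practical wrinkle with a CSV of ad copy is quoting: offending lines and fixes often contain commas and quotation marks. The sketch below is not the agent's own code. It is a hypothetical helper showing how the five-column layout (using the offending_line column name from the system prompt) could be written with Python's standard csv module, which handles the quoting. The ad ID and copy in the example are made up.

```python
import csv
import io
from dataclasses import dataclass, astuple

COLUMNS = ["ad_id", "axis", "severity", "offending_line", "fix"]


@dataclass
class Flag:
    ad_id: str
    axis: str            # "policy" or "brand"
    severity: str        # "RED" or "YELLOW"
    offending_line: str  # exact text from the ad
    fix: str             # one-sentence suggestion


def write_flags(flags, out):
    """Write a header plus one row per flag; csv.writer quotes commas and quotes."""
    writer = csv.writer(out)
    writer.writerow(COLUMNS)
    for flag in flags:
        writer.writerow(astuple(flag))


# Hypothetical flag, for illustration only.
buffer = io.StringIO()
write_flags(
    [Flag("1001", "policy", "RED",
          "Guaranteed results, or your money back",
          'Remove "guaranteed".')],
    buffer,
)
print(buffer.getvalue())
```

Because the writer wraps any field containing a comma or a quote in double quotes, the file still opens as five clean columns in a spreadsheet.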

The system prompt I actually run

This is the version shipping in production on three accounts. Strip the brand-specific lines and it generalizes:

You are a senior ad creative reviewer. You are reviewing ads in the Meta
Ads Manager. Apply the rubric below, not your own judgment.

Inputs you will receive:
- A Meta Ads Manager preview URL
- The brand's "BIBLE.md" file (logo position, palette, voice adjectives,
  required disclaimers, forbidden assets, mandatory campaign phrases)

For each ad on the page, do the following:
1. Screenshot the ad card.
2. Read the primary text, headline, description, and any visible image.
3. Check the ad against the rubric below.
4. For each flag on the ad, append one CSV row to /review/flags.csv with:
   ad_id, axis (policy|brand), severity (RED|YELLOW), offending_line,
   fix (a 1-sentence suggestion).

Rubric — POLICY (RED):
- Any of: "guaranteed", "guarantees", "100%", "#1", "best in class",
  "the best", "lowest price", "risk-free", "no risk" outside of clearly
  financial products.
- Before/after framing, or weight-loss / financial-gain claims without
  the standard disclaimer.
- Landing page mismatch (ad promises X, page shows Y).
- "Free" attached to a paid offer.
- Adult-tone language, sensitive categories, political content.

Rubric — POLICY (YELLOW):
- User-incentivized reviews ("write a review and get 20% off").
- Health/finance claims that need a "results not typical" line and
  do not have one.
- Pricing claims that should link to a price page but do not.

Rubric — BRAND (RED):
- Logo missing or in an unapproved position.
- Off-palette color (the brand's hex list — its exact color codes — is in BIBLE.md).
- Mandatory disclaimer from BIBLE.md is missing.
- A product the brand does not sell is pictured.

Rubric — BRAND (YELLOW):
- Voice drift: the BIBLE.md voice adjectives are "warm, plain, no
  jargon"; the ad uses jargon.
- CTA phrasing not on the approved list.
- The secondary headline exceeds the format's character limit.

Rules:
- One CSV row per flag, not per ad. An ad can have multiple rows.
- If an ad has zero flags, do not write a row for it.
- The `offending_line` field must be the exact substring from the ad
  copy, in quotes, so the human reviewer can find it in 2 seconds.
- The `fix` field is a 1-sentence suggestion, not a rewrite.
- When you have processed every ad on the page, stop. Do not start
  a second pass.
- If a CAPTCHA appears, return "BLOCKED" and stop.

Two details earn their keep. First, the rule "if an ad has zero flags, do not write a row for it" is what keeps the CSV scannable. Without it, the agent writes 80 confirmation rows and the reviewer is back to skimming. Second, the rule that offending_line must be the exact substring is what turns this from "AI noise" into a "human-speed tool": the reviewer can Ctrl-F the ad copy in 2 seconds instead of re-reading the whole ad to find the flagged claim.
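The exact-substring rule is also cheap to verify mechanically before the CSV reaches a human. The following is a minimal sketch of such a check, not part of the production setup. It assumes you have the ad copy as a plain string, and it assumes that purely visual flags are written in parentheses, as in "(no logo on the carousel card 3)", so they are exempt from the substring test. The function name and the second sentence of the sample copy are hypothetical.

```python
VALID_AXES = {"policy", "brand"}
VALID_SEVERITIES = {"RED", "YELLOW"}


def validate_row(row, ad_copy):
    """Return a list of problems with one flag row; an empty list means it passes."""
    problems = []
    if row["axis"] not in VALID_AXES:
        problems.append(f"unknown axis: {row['axis']!r}")
    if row["severity"] not in VALID_SEVERITIES:
        problems.append(f"unknown severity: {row['severity']!r}")
    line = row["offending_line"]
    # Assumption: parenthesized entries describe a visual issue, not ad copy.
    is_visual_note = line.startswith("(") and line.endswith(")")
    if not is_visual_note and line not in ad_copy:
        problems.append("offending_line is not an exact substring of the ad copy")
    return problems


ad_copy = "Guaranteed 3x ROI in 30 days. Start your trial today."
good = {"axis": "policy", "severity": "RED",
        "offending_line": "Guaranteed 3x ROI in 30 days"}
paraphrased = {"axis": "policy", "severity": "RED",
               "offending_line": "guarantees triple ROI"}

print(validate_row(good, ad_copy))         # []
print(validate_row(paraphrased, ad_copy))  # one problem: not an exact substring
```

A row that fails this check is a sign the agent paraphrased instead of quoting, which is exactly the behavior the rule is meant to prevent.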

What the output looks like in practice

A real run on a 78-ad account last month produced a CSV with 41 rows. The first 6 looked like this:

ad_id,axis,severity,offending_line,fix
104293817,policy,RED,"#1 rated email tool for SMBs","Drop the ""#1"" superlative; ""Top-rated"" is generally accepted."
104293821,policy,YELLOW,"Loved by 50,000 marketers",Add a footnote linking to a verifiable review source.
104293842,brand,RED,(no logo on the carousel card 3),"Re-export the carousel with the BIBLE.md logo on cards 1, 3, 5."
104293842,brand,YELLOW,"Our proprietary AI engine","BIBLE.md voice is ""plain, no jargon""; replace with ""our built-in model""."
104293855,policy,RED,"Guaranteed 3x ROI in 30 days","Remove ""guaranteed""; Meta rejects this exact phrase > 90% of the time in my runs."
104293861,brand,YELLOW,"Limited time — ends Friday",Confirm the campaign's hard end date with the brand lead; the ad says Friday but the brief says Sunday.

Notice what is not in the output: a paragraph of "the ad is generally well-written, but..." preamble. There is no preamble. There is no aesthetic feedback. The agent does not get to have an opinion on the creative — it gets to have an opinion on the rubric. The aesthetic feedback is what the human reviewer is for, and that part of the job is the part the reviewer enjoys.
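Since the output is a plain CSV, the reviewer's queue can be ordered with a few lines of code instead of manual sorting. This is a hedged sketch of one way to do it: group rows by ad_id and put any ad with at least one RED flag first. The function name is hypothetical, and the sample rows are taken from the excerpt above with the fix text shortened.

```python
import csv
import io
from collections import defaultdict

SEVERITY_ORDER = {"RED": 0, "YELLOW": 1}


def triage(csv_text):
    """Group flag rows by ad and order ads so any ad with a RED flag comes first."""
    by_ad = defaultdict(list)
    for row in csv.DictReader(io.StringIO(csv_text)):
        by_ad[row["ad_id"]].append(row)

    def worst(ad_id):
        # Lowest number wins: an ad with one RED row sorts as RED.
        return min(SEVERITY_ORDER[r["severity"]] for r in by_ad[ad_id])

    ordered = sorted(by_ad, key=lambda ad_id: (worst(ad_id), ad_id))
    return [(ad_id, by_ad[ad_id]) for ad_id in ordered]


# Rows from the excerpt above, with shortened fix text.
sample = '''ad_id,axis,severity,offending_line,fix
104293821,policy,YELLOW,"Loved by 50,000 marketers",Add a verifiable source.
104293842,brand,YELLOW,"Our proprietary AI engine",Use plainer wording.
104293842,brand,RED,(no logo on the carousel card 3),Re-export with the logo.
104293855,policy,RED,"Guaranteed 3x ROI in 30 days",Remove the guarantee claim.
'''

for ad_id, rows in triage(sample):
    print(ad_id, [r["severity"] for r in rows])
```

Ad 104293842 sorts into the RED group even though one of its two rows is YELLOW, so the reviewer sees every flag for a blocked ad in one place.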

Where the agent earns its keep (and where it doesn't)

It earns its keep on the night-shift review: 200 ads at 11 PM before a launch, or a 14-day mid-campaign audit, or any moment when the question is "is there anything in our library that will get us shut down tomorrow." In the last 4 months, my account teams have used it on 17 separate occasions and the agent's RED flags have lined up with what Meta's reviewers actually catch about 85% of the time. The other 15% is a known false-negative floor — Meta's human reviewers are still better at edge cases (a clever pun that reads as a health claim to a tired reviewer, an image of a person's midsection that gets flagged as a weight-loss context).
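If you want to track a figure like that 85% on your own accounts, one reading of it is: of the ads Meta actually rejected, what share had the agent already flagged RED? The sketch below computes that under this assumption. The function name and the ad IDs are hypothetical, and how you collect the list of rejected ads is left out.

```python
def red_flag_agreement(agent_red, meta_rejected):
    """Share of Meta-rejected ads the agent had flagged RED, plus the misses."""
    agent_red = set(agent_red)
    meta_rejected = set(meta_rejected)
    if not meta_rejected:
        return None, []  # nothing was rejected, so the rate is undefined
    caught = meta_rejected & agent_red
    missed = sorted(meta_rejected - agent_red)
    return len(caught) / len(meta_rejected), missed


# Hypothetical IDs, for illustration only.
rate, missed = red_flag_agreement(
    agent_red=["a1", "a2", "a3", "a7"],
    meta_rejected=["a1", "a2", "a3", "a9"],
)
print(rate, missed)
```

The list of misses is the useful part: each one is a candidate edge case to write into the rubric or to leave, knowingly, for the human reviewer.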

It does not earn its keep on:

The first 3 ads of a campaign, where a human eyeball beats it for finding big creative direction problems.
Ads that depend on visual judgment ("does this image of food look appetizing"), which is not a checklist problem.
Anything where the brand's BIBLE.md is not maintained. Garbage in, garbage out: if your BIBLE.md is a stale Google Doc from 2023, the agent will enforce 2023's brand rules on today's ads.

The honest summary: the agent does not replace a reviewer. It replaces the 4-hour "scan 200 ads and try not to fall asleep" part of the reviewer's job. A reviewer with this agent finishes the same audit in 35 minutes, catches ~10% more RED flags than the unaided baseline (because the agent does not get tired at ad #140), and spends the time they saved on the parts of the job only a human can do: telling a creative director that the campaign's positioning is off, and rewriting the one bad ad in the set instead of leaving it on autopilot.

The post-mortem from that 11:47 PM Slack message would have read differently with this agent. The buyer would have started her evening with a short CSV of flags instead of a checklist covering ~80 ads. The "guaranteed" ad would have been a RED row near the top of that CSV, not lost among the 41 ads she had time to look at. And the 14 lost conversions would still be 14, but the next set of 14, the ones from the next campaign, would not be lost.
