A/B Test 200 Ad Creatives in 9 Days: The Production + Ranking Pipeline I Use

On day 9 we had 200 ads in the matrix, 41 already killed, 14 scaled into fresh ad sets, and 3 winners that hit 4.1x our target ROAS (return on ad spend) by day 6. The campaign was for a direct-to-consumer (DTC) skincare brand spending $4,200/day across Meta. By the 9-day mark, the system had already told us which 14 to double down on and which 41 to never look at again. Here's the exact production + ranking pipeline that got us there.

This wasn't a one-off stunt. The same 5-stage pipeline is what I now run on every account I take on past $1,500/day. The math is boring on purpose: 200 isn't a magic number, it's the smallest number that lets Meta's auction (the real-time ad buying system that decides which ad wins each impression) sort signal from noise at the budget I'm spending. In my experience, below 50 variants every "winner" is mostly luck. Above 50, signal starts to emerge, and by ~150-200 the top decile is genuinely predictive. The 9 days are the part that actually matters: that's the calendar cost of getting from a blank brief to a defended, data-backed shortlist of winners. Fewer than 9 days means you're either cutting production corners or skipping ranking, both of which cost you more than they save.

The math behind 200

Before the pipeline, the angle matrix. 200 ads is not 200 random ideas — it's a structured Cartesian product (every combination of a fixed set of variables). For this skincare client the matrix was:

  • 5 audience segments: 25-34 acne-prone, 35-44 anti-aging curious, 18-24 routine-builders, 25-44 men, 40-54 re-purchasers
  • 4 copy angles: problem-aware, solution-aware, social proof, urgency
  • 5 visual concepts: clean clinical, lifestyle morning routine, ingredient close-up, before/after, UGC (user-generated content) mirror
  • 2 formats: static image and 6-second video

5 × 4 × 5 × 2 = 200. Every cell in the matrix is a unique ad. Some cells are obvious duds (anti-aging × UGC mirror × men 40-54 is going to be a stretch) — the ranking step exists to surface that fact without spending budget on it.
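To make the Cartesian product concrete, here is a minimal Python sketch that enumerates the matrix. The segment, angle, visual, and format labels come from the lists above; the `cell_id` naming scheme (`C001` to `C200`) and the dictionary keys are assumptions for illustration, not something the sheet requires.

```python
from itertools import product

SEGMENTS = [
    "25-34 acne-prone",
    "35-44 anti-aging curious",
    "18-24 routine-builders",
    "25-44 men",
    "40-54 re-purchasers",
]
ANGLES = ["problem-aware", "solution-aware", "social proof", "urgency"]
VISUALS = [
    "clean clinical",
    "lifestyle morning routine",
    "ingredient close-up",
    "before/after",
    "UGC mirror",
]
FORMATS = ["static image", "6-second video"]


def build_matrix():
    """Return one dict per cell: every combination of the four variables."""
    combos = product(SEGMENTS, ANGLES, VISUALS, FORMATS)
    return [
        {
            "cell_id": f"C{i:03d}",  # hypothetical ID scheme
            "segment": segment,
            "angle": angle,
            "visual": visual,
            "format": fmt,
        }
        for i, (segment, angle, visual, fmt) in enumerate(combos, start=1)
    ]


matrix = build_matrix()
print(len(matrix))  # 200
```

Each dict maps to one row in the sheet, so the list can be pasted or exported straight into column A.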

I keep the matrix in a Google Sheet with one row per ad. Column A is the cell, B is the primary text draft, C is the headline, D is the visual prompt, E is the asset link, F is the predicted score, G is the actual CTR (click-through rate) at day 3, H is the actual ROAS, I is the verdict (kill / hold / scale). By the end of week 2, that sheet is the single source of truth for what the brand should run for the next 60 days.
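If you would rather keep the sheet in code and import it as CSV, the nine columns can be sketched as a small dataclass. The field names below are my own shorthand for columns A through I, not anything Google Sheets requires, and the sample row is made up.

```python
import csv
import io
from dataclasses import asdict, dataclass, fields
from typing import Optional


@dataclass
class AdRow:
    cell: str                                # A: matrix cell
    primary_text: str = ""                   # B: primary text draft
    headline: str = ""                       # C: headline
    visual_prompt: str = ""                  # D: visual prompt
    asset_link: str = ""                     # E: asset link
    predicted_score: Optional[float] = None  # F: predicted score
    ctr_day3: Optional[float] = None         # G: actual CTR at day 3
    roas: Optional[float] = None             # H: actual ROAS
    verdict: str = ""                        # I: kill / hold / scale


def to_csv(rows):
    """Serialize rows to CSV text; unfilled numeric columns become empty cells."""
    buf = io.StringIO()
    writer = csv.DictWriter(buf, fieldnames=[f.name for f in fields(AdRow)])
    writer.writeheader()
    for row in rows:
        writer.writerow(asdict(row))
    return buf.getvalue()


# Hypothetical example row: only the cell and copy are known before launch.
sheet_csv = to_csv([AdRow(cell="C001", headline="24-hour hydration")])
```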

Stage 1 — Day 1: Lock the brief and the angle matrix (3 hours)

The single biggest predictor of a bad 200 is a vague brief. I refuse to start production until these six things are pinned down, in writing, in a doc the client can't wiggle out of:

  1. One product, one offer. Not "the whole catalog." Pick the SKU (stock keeping unit) you're pushing.
  2. The single persona you're targeting. Not five. The matrix has 5 segments, but the brief's voice targets the primary one.
  3. The single promise the product has to deliver on. One sentence. The whole brief hangs off it.
  4. The one fact that makes the promise believable. A number, a study, a name.
  5. The single KPI that decides a winner. For e-com it's ROAS, for SaaS (software as a service) it's trial signups, for lead gen it's qualified leads. Pick one and stop moving the goalposts.
  6. The budget floor. What's the minimum daily spend per cell that gives the auction a fair shot? For Meta in 2025, that's roughly $20/day per ad. Below that, the learning phase never exits cleanly and the data is junk.
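The budget floor in item 6 is worth sanity-checking against the account's daily spend before the matrix is locked. A minimal sketch, assuming the roughly $20/day per-ad floor quoted above and a budget that spreads evenly across ads (the auction will not actually split it evenly, so treat the result as a lower bound on what you need):

```python
def min_daily_budget(num_ads: int, floor_per_ad: float = 20.0) -> float:
    """Smallest daily budget that gives every ad the per-ad floor."""
    return num_ads * floor_per_ad


def max_ads_at_floor(daily_budget: float, floor_per_ad: float = 20.0) -> int:
    """Largest number of ads a daily budget can carry at the per-ad floor."""
    return int(daily_budget // floor_per_ad)


needed = min_daily_budget(200)      # 4000.0
capacity = max_ads_at_floor(4200)   # 210
```

At $4,200/day, a 200-ad matrix sits just inside the floor with about $200/day of headroom. If `needed` comes out above the daily budget, shrink the matrix before you start production.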

The angle matrix comes out of the brief, not the other way around. If the brief is concrete, the matrix falls out of it in an hour. If you're spending two days arguing about angles, the brief is wrong.

Stage 2 — Days 2-4: Production sprint (copy + image + video)

Three parallel workstreams, three people (or one person with three browser tabs and a lot of coffee). The order matters: copy first, because image prompts are downstream of the headline, and video is downstream of the static still.

Copy (Anyword + Claude, ~6 hours total): I don't write 200 different copy decks. I write 4, one per angle, and use Anyword to expand each into 25 variants scored by PPS (Predictive Performance Score). Then I take the top 10 per angle (40 total) and run them through Claude to tighten, de-duplicate, and split into primary text / headline / description, trimmed to fit Meta's recommended length for each field. The 40 I keep go into the sheet. The other 60 get binned without ceremony, because Anyword's top 10 per angle is consistently the strongest material.
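The "top 10 per angle" cut is a plain group-and-sort. The sketch below assumes the variants have been exported by hand into a list of dicts with an `angle` and a `pps` field; those field names and the random scores are placeholders, and no Anyword API is involved.

```python
import random
from collections import defaultdict

ANGLES = ["problem-aware", "solution-aware", "social proof", "urgency"]


def top_k_per_angle(variants, k=10):
    """Keep the k highest-PPS variants within each copy angle."""
    by_angle = defaultdict(list)
    for variant in variants:
        by_angle[variant["angle"]].append(variant)
    kept = []
    for group in by_angle.values():
        kept.extend(sorted(group, key=lambda v: v["pps"], reverse=True)[:k])
    return kept


# Synthetic stand-in for the 4 x 25 expanded variants (scores are made up).
rng = random.Random(0)
variants = [
    {"angle": angle, "text": f"{angle} variant {i}", "pps": rng.randint(40, 95)}
    for angle in ANGLES
    for i in range(25)
]
kept = top_k_per_angle(variants)
print(len(variants), len(kept))  # 100 40
```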

Images (Midjourney + AdCreative.ai, ~10 hours total): For each of the 5 visual concepts, I write 4 prompts, one per copy angle. That's 20 prompts × 2 formats = 40 prompts. Midjourney v6.1 in fast mode returns 4 variations per prompt in roughly 30 seconds, so 40 prompts × 4 = 160 images, of which I keep the 100 best. AdCreative.ai is what gets me from 100 candidates to 100 polished Meta-spec 1080×1080 assets: its batch render, brand color pinning, and headline overlay templates save roughly 6 hours of Photoshop per sprint.

Videos (Runway Gen-3 + HeyGen, ~8 hours total): This is where most teams fall apart. Static is easy; video is where the production schedule goes to die. I don't build 6-second videos from scratch. I take the top 20 static stills and animate them in Runway Gen-3 with a 5-word motion prompt ("slow zoom, soft light, eye-level"). 20 stills × 2 motion variants = 40 video ads. For the UGC cells, I batch-record 5 voiceover scripts with HeyGen's stock avatars at 9:16 and crop to 1:1: 5 scripts × 4 copy angles = 20 UGC videos. Together with the 40 Runway videos, that's 60 videos toward the 100 video slots in the matrix.

Day 4 evening: all 200 assets in the Google Sheet, all copy in, all cells filled. This is the moment the team is most tempted to ship. Don't.

Stage 3 — Day 5: Pre-launch ranking (4 hours)

This is the step most teams skip, and it's the difference between "I made 200 ads" and "I have a defendable shortlist of 14 winners by day 9."

I run two filters before a single dollar goes live.

Filter 1 — Anyword PPS re-score on final copy. Even though I expanded angles in Anyword earlier, the final tightened copy from Claude is different from the candidate copy. Re-score the final 200 in batches. Sort by PPS descending. The top 60-80 typically cluster between 75-95. Anything below 65 gets a yellow flag — it stays in the test, but I label it so I can audit it later.
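Filter 1 is mechanical enough to script once the re-scored PPS values are back in the sheet. A minimal sketch, assuming each ad is a dict with a `pps` field copied over by hand; the 65 cutoff is the yellow-flag threshold described above, and the sample scores are invented.

```python
def rank_and_flag(ads, yellow_below=65):
    """Sort by PPS descending and yellow-flag anything under the cutoff.

    Flagged ads stay in the test; the flag only marks them for a later audit.
    """
    ranked = sorted(ads, key=lambda ad: ad["pps"], reverse=True)
    return [
        {**ad, "flag": "yellow" if ad["pps"] < yellow_below else ""}
        for ad in ranked
    ]


# Hypothetical scores for four cells.
sample = [
    {"cell": "C001", "pps": 82},
    {"cell": "C002", "pps": 58},
    {"cell": "C003", "pps": 91},
    {"cell": "C004", "pps": 65},
]
ranked = rank_and_flag(sample)
```

Note that a score of exactly 65 is not flagged in this version; whether the boundary is inclusive is a judgment call.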

Filter 2 — Human review cut (~3 hours). I sit with the asset sheet and the creative director and we apply three rules in order:

  1. Kill near-duplicates. If two cells produce visually similar ads, kill the lower-PPS one. The auction can't tell them apart, and neither will the user.
  2. Kill anything that violates the brief. If the promise is "24-hour hydration" and the headline is "transform your skin in 30 days," that ad is testing a different product. Bin it.
  3. Mark the wildcards. I deliberately keep 8-12% of the matrix despite low scores, as a control for whether the scoring model is wrong. I star these so I can audit them at day 3.

After this, roughly 175 ads go live. The 25 we killed on the basis of pre-flight checks saved us about $2,000 of wasted spend (roughly 25 ads × $20/day × 4 test days) and 48 hours of confused data.
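Rules 1 and 3 of the human review can be partly scripted once the reviewers have marked which ads look alike. The sketch below assumes the near-duplicate pairs are judged by eye and typed in by hand (nothing here detects visual similarity), and it picks wildcards as the lowest-scored survivors at a 10% share, the middle of the 8-12% range. All cell IDs and scores are invented.

```python
import math


def kill_near_duplicates(pps_by_id, duplicate_pairs):
    """For each human-flagged pair, kill the lower-PPS ad (the second on a tie)."""
    killed = set()
    for a, b in duplicate_pairs:
        if a in killed or b in killed:
            continue  # one of the pair is already gone
        killed.add(a if pps_by_id[a] < pps_by_id[b] else b)
    return killed


def pick_wildcards(pps_by_id, killed, share=0.10):
    """Star the lowest-scored surviving ads, sized as a share of the full matrix."""
    survivors = sorted(
        (ad_id for ad_id in pps_by_id if ad_id not in killed),
        key=lambda ad_id: pps_by_id[ad_id],
    )
    return survivors[: math.ceil(len(pps_by_id) * share)]


# Hypothetical mini-matrix of 10 ads.
scores = [88, 71, 64, 90, 59, 77, 83, 69, 95, 74]
pps = {f"C{i:03d}": s for i, s in enumerate(scores, start=1)}
pairs = [("C001", "C002"), ("C004", "C006")]

killed = kill_near_duplicates(pps, pairs)
wildcards = pick_wildcards(pps, killed)
```

Rule 2, the brief check, stays manual: whether a headline is testing a different promise is a reading judgment, not a score.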

Stage 4 — Days 6-9: Test launch with Advantage+ Campaign structure

For Meta, the launch structure matters as much as the creative. I run every ad that survived the Stage 3 cut inside a single Advantage+ Shopping Campaign (ASC) with the following structure:

  • 1 campaign (ASC, $4,200/day budget, lowest-cost bid)
  • 5 ad sets, one per audience segment. Each ad set gets all the live ads assigned via dynamic ad assignment.
  • No audience overlap rules: let ASC do its job.
  • No manual CBO (Campaign Budget Optimization) caps: give the algorithm room.

Why one campaign, not five? The whole point of 200 ads is to let the auction tell you which combination of angle × visual × format × audience wins. Fragmenting into 5 campaigns with separate budgets re-introduces human bias and the budget caps starve the long tail.

The first 48 hours of the test are noise. The auction is in learning phase (the period when Meta's delivery system is calibrating who to show each ad to — typically needs ~50 conversions per ad set per week to exit cleanly), creative is still being indexed, and frequency is climbing. Don't make any decisions in the first 48 hours. Don't even open the dashboard more than once a day.
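The ~50 conversions per ad set per week figure translates directly into a ceiling on the cost per acquisition an ad set can carry and still exit learning. A rough sketch of that arithmetic; the $1,000/day spend and $40 CPA in the example are hypothetical, and real delivery will not spend a perfectly even amount each day.

```python
def weekly_conversions(daily_spend: float, cpa: float) -> float:
    """Expected conversions per week at a given daily spend and CPA."""
    return daily_spend * 7 / cpa


def clears_learning_phase(daily_spend: float, cpa: float, threshold: int = 50) -> bool:
    """True if the ad set should reach the weekly conversion threshold."""
    return weekly_conversions(daily_spend, cpa) >= threshold


def max_cpa_for_exit(daily_spend: float, threshold: int = 50) -> float:
    """Highest CPA at which the ad set still reaches the threshold in a week."""
    return daily_spend * 7 / threshold


# Hypothetical ad set: $1,000/day at a $40 CPA.
expected = weekly_conversions(1000, 40)   # 175.0
ceiling = max_cpa_for_exit(1000)          # 140.0
```

If the CPA you expect is above `ceiling`, the ad set is unlikely to leave learning inside the test window, and the early data deserves even less trust than usual.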

Stage 5 — Days 8-9: The 3-day kill and the 3-day scale

By day 8 (3 full days of data on most cells), the kill rules can run. I use the same three rules every time, in the same order:

Rule 1: Kill any ad with 0 purchases after 3,000 impressions. The auction has had a fair shot. The creative isn't connecting. Bin it. This single rule killed 41 of the 200 in the skincare test.

Rule 2: Kill any ad where CPA (cost per acquisition) is more than 2x the target after 1,500+ impressions. Don't even look at CTR, because at this volume CTR lies. CPA doesn't.

Rule 3: Flag the top 5% by ROAS for scale. Anything at 2x the median ROAS gets pulled out of ASC and rebuilt as a manual ad with lookalike audiences (LAL) and stacked interests. The original ad keeps running in ASC; the new ad gets a fresh $400/day cap to test scale.
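Because the three rules run in a fixed order, they fit in one small function over an export of per-ad stats. This is a sketch under stated assumptions: the field names are invented rather than taken from Meta's reporting export, Rule 3 is implemented as the 2x-median threshold, and for an ad with zero purchases the CPA is treated as at least its spend so far, so Rule 2 can still fire before Rule 1's 3,000-impression mark.

```python
from statistics import median


def roas(ad):
    """Revenue divided by spend; 0.0 for ads that have not spent yet."""
    return ad["revenue"] / ad["spend"] if ad["spend"] > 0 else 0.0


def apply_rules(ads, target_cpa):
    """Return {ad id: 'kill' | 'scale' | 'hold'}, applying the rules in order."""
    med = median(roas(ad) for ad in ads) if ads else 0.0
    verdicts = {}
    for ad in ads:
        # Lower bound on CPA: with zero purchases it is at least the spend so far.
        cpa_floor = ad["spend"] / max(ad["purchases"], 1)
        if ad["impressions"] >= 3000 and ad["purchases"] == 0:
            verdict = "kill"   # Rule 1
        elif ad["impressions"] >= 1500 and cpa_floor > 2 * target_cpa:
            verdict = "kill"   # Rule 2
        elif med > 0 and roas(ad) >= 2 * med:
            verdict = "scale"  # Rule 3
        else:
            verdict = "hold"
        verdicts[ad["id"]] = verdict
    return verdicts


# Hypothetical day-8 export with a $30 target CPA.
ads = [
    {"id": "a", "impressions": 3500, "purchases": 0, "spend": 60, "revenue": 0},
    {"id": "b", "impressions": 2000, "purchases": 1, "spend": 90, "revenue": 180},
    {"id": "c", "impressions": 4000, "purchases": 6, "spend": 120, "revenue": 720},
    {"id": "d", "impressions": 2500, "purchases": 3, "spend": 80, "revenue": 160},
    {"id": "e", "impressions": 1000, "purchases": 1, "spend": 25, "revenue": 50},
]
verdicts = apply_rules(ads, target_cpa=30)
```

The verdicts go straight into column I of the sheet. The median here is taken over every ad in the export, killed ones included; computing it over survivors only would raise the bar for "scale".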

By the end of day 9, the sheet looks like this for the skincare client: 41 killed, 145 still running, 14 promoted to scale tests. Three of those 14 hit the 4x ROAS target within another week. Two of them became evergreen controls the brand still runs today.

What 9 days is actually buying you

It's not buying you volume for its own sake. It's buying you defensibility: a kill list backed by 3+ days of auction data, and a shortlist backed by enough signal that you can argue for the budget reallocation in front of a skeptical CMO (Chief Marketing Officer) or a finance lead.

The pipeline also buys you something harder to put a number on: creative intelligence. By day 9, you know that for this brand, problem-aware copy beats solution-aware by 2.3x. You know that UGC mirror underperforms clinical imagery for women 35+, but beats it for men 25-34. You know that the ingredient close-up concept wins for re-purchasers but loses for first-time buyers. None of that knowledge was on the brief. All of it came from running the matrix.

The trap I want to warn you about: the pipeline works. It will produce winners. It will also produce a queue of "almost winners" that the CMO will want to keep running because the ROAS is "good enough." Kill them anyway. The cost of running a 1.4x ROAS ad when you have a 3.1x winner in the same ad set is not zero — it's the spend that should be going to the winner's impressions. Every soft hold is a tax on your top decile.
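The "tax" is easy to put a number on when you need to make the case to the CMO. A back-of-envelope sketch, assuming (optimistically) that the winner would hold its ROAS on the extra spend; in practice returns usually diminish as an ad scales, so read the result as an upper bound. The $100/day figure is hypothetical.

```python
def daily_opportunity_cost(spend: float, soft_roas: float, winner_roas: float) -> float:
    """Revenue forgone per day by leaving `spend` on the weaker ad."""
    return spend * (winner_roas - soft_roas)


# Hypothetical: $100/day left on a 1.4x ad while a 3.1x winner is available.
tax = daily_opportunity_cost(100, soft_roas=1.4, winner_roas=3.1)
print(round(tax, 2))  # 170.0
```

That is roughly $170/day of revenue given up for every $100/day parked on the soft hold.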

If you only run 200 in 9 days once a year, the pipeline pays for itself on the first campaign. If you run it every 30 days, it becomes the engine that compounds — the angle matrix gets sharper, the kill rules get tighter, the production gets faster. By month 6 you're shipping 200 in 6 days with the same team and the same budget, and your top decile ROAS is structurally higher than any team still running 5-variant tests.

That's the only A/B testing I trust: a pipeline that gives the auction enough material to actually tell you something, and a ranking system that lets you listen to it.