Ollama + Llama 3.3: 100 Ad Copy Variants/Hour at $0 + a Predicted-CTR Ranker
Contents
A mid-sized DTC (Direct-to-Consumer, 直面消费者) brand I worked with last quarter burned $4,180 on a GPT-4o ad-copy generation workflow in a single month. That worked out to about $0.04 per variant, which doesn't sound like much until you realize the workflow was producing 1,000+ variants a month to keep a Meta Advantage+ pipeline fed. The same month, I ran an identical workflow on my M3 Max laptop using Ollama + Llama 3.3 70B. Total cost: $0. The variants weren't quite as good — I'd peg them at 80-85% of GPT-4o quality on ad copy. But the cull rate was the same: 90% get thrown away anyway.
That 80-85% gap is fine when you're going to cull 90% of the output. It's not fine for long-form content. This is exactly the kind of job where local models earn their electricity bill.
Here's the full setup — the model, the two prompts that do the real work, the Python wrapper, and the math that makes this defensible at any team size.
The hardware that makes this practical
The realistic local setup for this job is a Mac with at least 48 GB of unified memory. I'm running on an M3 Max with 64 GB, but the 96 GB M3 Max and the M2 Ultra 192 GB both work. On my M3 Max, Llama 3.3 70B at Q4_K_M quantization (a 4-bit weight compression scheme that shrinks the model to ~40 GB on disk with a small quality hit) — fully loaded into unified memory — runs at about 12 tokens per second. That's slow if you're used to cloud APIs, but it's deterministic, free, and there's no rate limit.
The setup is two commands. Install Ollama from ollama.com, then:
ollama pull llama3.3:70b-instruct-q4_K_M
ollama serveOllama runs a local OpenAI-compatible API (an HTTP endpoint that accepts the same request shape as OpenAI's chat completions) at http://localhost:11434. The model is roughly 40 GB, takes 6-8 minutes to download on a normal connection, and stays resident in RAM between requests.
For Windows or Linux with an NVIDIA card, the same model runs on a 24 GB consumer card (RTX 4090, etc.) with partial CPU offload — about 8-10 tok/s — or fully on a single 48 GB card (RTX A6000, used 3090 pair, etc.) at 15-25 tok/s. The trade-off is hardware cost, not capability. The same code below works against either setup — only the model name and the URL change.
The variant-generation prompt
The prompt that does the heavy lifting. I start from a single seed ad — one headline, one body, one CTA (Call-To-Action, 行动号召) — and ask for 100 distinct variants. The prompt makes Llama 3.3 walk a structured Cartesian product (every combination of a fixed set of variables) across three dimensions: hook angle, emotional tone, and specificity level.
You are a direct-response copywriter who specializes in paid social.
Given the SEED AD below, generate exactly {n} distinct variants. Vary systematically across three dimensions:
1. HOOK ANGLE — {hook_angles}
2. EMOTIONAL TONE — {tones}
3. SPECIFICITY LEVEL — concrete number/stat, named customer, or vivid sensory detail
Output exactly {n} variants, one per line, in this exact format:
HEADLINE: <8-12 words, no clickbait, no emoji>
BODY: <15-25 words, one idea, conversational, no exclamation marks>
CTA: <2-4 words, action verb, specific>
SEED AD:
Headline: {seed_headline}
Body: {seed_body}
CTA: {seed_cta}In production I feed the three dimensions as concrete lists — hook_angles: "question, stat, contrarian, story, list", tones: "urgent, calm, playful, defiant, intimate", specificity_levels: "3-4 per variant, no abstract claims". Llama 3.3 honors the combinatorial ask maybe 70% of the time, and the other 30% quietly duplicates an earlier angle — those duplicates get filtered downstream by exact-headline match.
For 100 variants at ~120 output tokens each, the model takes 15-20 minutes end-to-end on my M3 Max. At $0 per run, you can afford to do it overnight on a cron (a scheduled task that runs the script at a fixed time) and wake up to a full deck. The same 100 variants on GPT-4o would take about 90 seconds and cost ~$4.
The predicted-CTR ranker prompt
The second pass is the part most teams skip, and it's the part that makes this whole pipeline defensible. The same local model (or any model exposing the same chat-completions API) takes each variant and returns a 0-100 score across three weighted dimensions.
Score this ad variant on three dimensions. Be ruthless. Most variants score 30-60.
A. HOOK STRENGTH (0-40) — does the headline stop the scroll in the first 200ms?
- 0-10: generic, would blend with 50 others
- 11-20: clear but unremarkable
- 21-30: specific, earns the next second of attention
- 31-40: forces a closer read
B. VALUE CLARITY (0-30) — within 5 seconds, can a cold reader explain the offer?
- 0-10: vague or buried
- 11-20: clear but takes work
- 21-30: immediate, no re-read needed
C. CTA SPECIFICITY (0-30) — is the action verb specific, the next step unambiguous?
- 0-10: "Learn more" or generic
- 11-20: clear but soft
- 21-30: action verb + concrete next step
Variant to score:
Headline: {headline}
Body: {body}
CTA: {cta}
Output exactly three integers on one line, comma-separated, then the weighted total:
A, B, C, TOTALI keep the top 10-15 by TOTAL. In practice the top 10 cluster between 70-85. Anything below 50 is a hard cull — those variants get binned without a second look. The score correlates with real Meta CTR (Click-Through Rate, 点击率) about 60-70% of the time on my campaigns. That's not perfect, but it's a strict improvement over ranking by gut feel, and it costs about 4 minutes of local inference to score all 100 variants.
The reason I keep the prompt short and the rubric explicit: Llama 3.3 70B follows a numeric rubric more reliably than it follows "rank these 100 from best to worst." Asking it to compare 100 items produces wild ordering. Asking it to score one item against a fixed 100-point rubric produces stable, comparable numbers.
The Python wrapper
One file, three functions: generate, score, export. No frameworks, no extra dependencies beyond requests and csv.
pythonimport requests, csv
OLLAMA_URL = "http://localhost:11434/api/generate"
MODEL = "llama3.3:70b-instruct-q4_K_M"
GENERATION_PROMPT = """You are a direct-response copywriter ...
{rest of prompt above}
"""
SCORING_PROMPT = """Score this ad variant on three dimensions ...
{rest of prompt above}
"""
def generate_variants(seed, dimensions, n=100):
prompt = GENERATION_PROMPT.format(n=n, **seed, **dimensions)
r = requests.post(OLLAMA_URL, json={
"model": MODEL, "prompt": prompt, "stream": False,
"options": {"temperature": 0.9, "num_predict": 120 * n},
})
return parse_variants(r.json()["response"], n)
def score_variant(variant):
prompt = SCORING_PROMPT.format(**variant)
r = requests.post(OLLAMA_URL, json={
"model": MODEL, "prompt": prompt, "stream": False,
"options": {"temperature": 0.0, "num_predict": 30},
})
a, b, c, total = map(int, r.json()["response"].strip().split(","))
return {"hook": a, "value": b, "cta": c, "total": total}
def export_csv(variants, scores, path="variants.csv"):
with open(path, "w", newline="") as f:
w = csv.DictWriter(f, fieldnames=[
"rank", "total_score", "hook", "value", "cta",
"headline", "body", "cta_text",
])
w.writeheader()
for i, (v, s) in enumerate(
sorted(zip(variants, scores), key=lambda x: -x[1]["total"]), 1
):
w.writerow({"rank": i, **s, **v})
if __name__ == "__main__":
seed = {"seed_headline": "...", "seed_body": "...", "seed_cta": "..."}
dims = {"hook_angles": "...", "tones": "...", "specificity_levels": "..."}
variants = generate_variants(seed, dims)
scores = [score_variant(v) for v in variants]
export_csv(variants, scores)The full pipeline runs in one command. On my M3 Max, 100 variants generated + scored + exported takes about 25 minutes of unattended time. The output is a CSV sorted by predicted-CTR score, with the top 10 highlighted. That CSV imports directly into Meta Ads Manager via bulk upload or into Google Ads Editor.
Two failure modes I've actually hit, so you don't have to:
num_predicttoo low. If you setnum_predictto the model's default 512, the generation step truncates around variant 60-70. Set it to120 * nfor safety.- Score returns non-integer or extra text. Llama 3.3 occasionally wraps the score in prose. If the parser fails, retry once with
temperature: 0.0and a stricter system message — the second pass almost always returns clean integers.
The math that makes this defensible
The economics are the part that wins the argument with your boss.
- Per-variant cost on GPT-4o: ~$0.04 (input + output tokens, ~250 tokens per variant).
- Per-variant cost on Llama 3.3 70B local: $0.00 (electricity only — call it $0.02 amortized if you count the laptop).
- Same 100-variant run: $4.00 on GPT-4o vs $0.00 on local.
- Same 1,000-variant run (one month for a medium DTC account): $40.00 on GPT-4o vs $0.00 on local.
For a brand running this weekly across 4-5 campaigns, the GPT-4o bill is $160-200/month. The Mac Studio is a one-time $5,500-8,000 purchase. The math crosses break-even in 27-50 months, which sounds slow — but the marginal cost is zero from there on, and you can run 10,000 variants a month without flinching. For an agency running this across 20 client accounts, the same Mac pays for itself in 3-4 months.
The quality gap — 80-85% as good as GPT-4o on ad copy — is the part I want to be honest about. For ad copy specifically, the cull rate dominates. You're going to throw 90% away anyway. The marginal variant doesn't need to be Claude-tier; it needs to be plausible and varied. For long-form content, blog posts, customer-facing brand copy, and anything that goes directly to a CMO's eyes, the calculus flips — pay for the better model.
What I keep GPT-4o for
Anything where the output has to be 90%+ as good as the best model: long-form articles, brand voice work, strategic positioning, anything that goes to a human reviewer who'll read every word. The 80-85% gap matters there.
Anything where the output is going to be culled: variant generation, A/B test copy, email subject lines, internal naming, persona rewrites — local Llama 3.3 wins on economics and loses on nothing that matters.
That's the rule I run every job against now: what's the cull rate, and does the cost-per-variant justify paying 4 cents a pop? When the cull rate is high and the variant count is high, the answer is no. Local wins. The single biggest shift in my AI-stack economics in 2025 wasn't a better model — it was learning which jobs to stop paying cloud rates for.
The Ollama/Llama 3.3 stack is ugly in production in ways Claude and GPT-4o are not — the rate-limit story is "wait for your laptop," the quality story is "good enough, not great," the support story is "you." But for this specific job — high-volume, high-cull, low-stakes ad copy — the math is brutal enough that the boss will care. And once you have the local stack running for this, the next four jobs that benefit from it (subject lines, naming, list cleaning, bulk re-classification) basically fall out of the same script.