I Built a Brand-Safe Email Subject Line Tester with Llama 3.2 Running Locally — No API, No Cost, No Data Leak

A compliance lead killed a Black Friday subject line in 14 seconds last November. The line was "LAST CHANCE: 80% OFF (closes at midnight)" — bold, urgent, a textbook ecommerce pattern. The reason it died wasn't the discount. It was the parentheses pattern, the all-caps opener, and the "last chance" trigger stacked with a specific time pressure. It looked, in her words, "exactly like the phishing templates we spent Q2 reporting to the FTC."

She was right. The send went out with a softer subject line and the campaign still cleared plan. But I walked out of that meeting with a problem: I had no tooling to catch that pattern before it landed in her inbox. My "gut check" had missed it. The agency's senior copywriter had missed it. The ChatGPT prompt I'd been using to generate subject lines had also missed it, because I hadn't told it to look.

I built a fix over the next two weekends. It runs entirely on my laptop, costs nothing, sends nothing to the cloud, and scores 100 subject lines against a 7-criterion brand-safety rubric in about 90 seconds. The model is Llama 3.2 3B, the runtime is Ollama, and the script is about 70 lines of Python. Here's the whole thing.

Why local, specifically

I want to be clear about what this solves, because "local LLM" gets pitched for a lot of things, and most of them aren't actually compelling. A subject line tester is one of the cases where local genuinely wins on three grounds at once:

1. Privacy. When you paste 100 candidate subject lines into ChatGPT's web UI, those lines hit OpenAI's servers. For a DTC brand, that's a list of unreleased product positioning, pricing language, and launch cadence. For a B2B company, that's worse — "Q3 platform migration playbook" in a subject line tells a competitor exactly what's on your roadmap. For a regulated industry (finance, healthcare, legal), it can be a documented compliance event. Local inference means the prompt never leaves the laptop.

2. Cost. A 100-line scoring run on GPT-4o is roughly $0.15–0.30 per call. On Claude Sonnet 4.5 it's similar. That doesn't sound like much until you realize a real email team will run this 3–5 times per send cycle, and the team will start running it on every draft just to be safe. I had a client burn $400 in a month running a comparable prompt because the workflow was too frictionless to budget. Local: $0 forever, after the $0 of the model itself.

3. Consistency. This is the one nobody talks about. Cloud LLMs change underneath you. GPT-4o's scoring in March will not match GPT-4o's scoring in June after the next model update. I had a workflow last year where I tuned a prompt against GPT-4o-0513, and a quiet model refresh in August made my "safe" prompts start flagging safe lines as risky. A local Llama 3.2 stays on whatever build you pulled until you explicitly upgrade. The rubric I tune in February is the rubric I run in November.

The honest tradeoff: the 3B model is dumber than GPT-4o. It will occasionally mis-score nuance, and it doesn't have the broad world knowledge to catch every coded risk. For brand-safety, that tradeoff is fine — the rules are well-defined and structured, which is exactly where small models do well.

What you need

Hardware-wise, this is a low bar. Llama 3.2 3B runs comfortably on a 2020 MacBook Air with 16GB RAM, and the 1B model runs on basically anything made in the last five years. I tested both; I'll cover the difference at the end.

Software:

  • Ollama — the local model runtime. Install from ollama.com (one binary, ~250MB).
  • Llama 3.2 3B Instruct — pulled via ollama pull llama3.2:3b. The download is 2.0GB.
  • Python 3.10+ with the ollama and pydantic libraries. pip install ollama pydantic.

That's it. No Docker, no vector store, no LangChain, no GPU. The whole stack is roughly 2.3GB on disk.

One naming note: the model is llama3.2:3b, not llama-3.2-3b and not llama3.2-3b. Ollama's tag format is strict. If you get a "model not found" error, that's the cause 90% of the time.

The brand-safety rubric

The whole tester is a structured prompt that asks Llama 3.2 to score each subject line against a 7-criterion rubric and return JSON. The rubric is the only part that's actually mine — the rest is plumbing. Tune the rubric to your brand; the script stays the same.

Here's the version I use. It's conservative on purpose; you can dial individual criteria up or down.

| # | Criterion | Pass condition |
|---|-----------|----------------|
| 1 | All-caps opener | First word is not in all-caps (e.g., "LAST CHANCE…" fails) |
| 2 | Punctuation stacking | No !!!, ???, $$$, or * emphasis. Single ? or ! is fine. |
| 3 | Spam-trigger words | None of: free, guaranteed, risk-free, winner, congratulations, cash, prize, urgent |
| 4 | Aggressive scarcity | No "last chance" + specific time in same line, no "only X left" with X ≤ 10 |
| 5 | False personalization | No fake {first_name} or "Dear customer"; only real merge tags you support |
| 6 | Misleading claim pattern | No "[Brand] verified" / "Account suspended" / "[Bank] alert" (patterns phishing filters flag) |
| 7 | Brand-voice fit | Reads like your actual voice. (This is the soft criterion: scored 1–5, the others are binary.) |

Criteria 1–6 are binary (pass/fail). Criterion 7 is graded 1–5, and anything below 3 counts as a fail (the prompt tells the model to apply that conversion). A subject line passes overall only if it passes all 7.
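The pass rule above is mechanical, so it can be enforced in code instead of relying on the model's own arithmetic. A minimal sketch, assuming each binary criterion is reported as True when the line passes; `overall_pass` and `VOICE_FIT_MIN` are hypothetical names used only for illustration:

```python
VOICE_FIT_MIN = 3  # assumption: criterion 7 scores below 3 count as a fail


def overall_pass(binary_results: list[bool], voice_fit: int) -> bool:
    """Return True only if criteria 1-6 all pass and voice fit is at least 3."""
    if len(binary_results) != 6:
        raise ValueError("expected one result for each of criteria 1-6")
    return all(binary_results) and voice_fit >= VOICE_FIT_MIN


if __name__ == "__main__":
    print(overall_pass([True] * 6, 4))
    print(overall_pass([True] * 6, 2))
```

Recomputing the verdict this way means a 3B model that scores the criteria correctly but sets the final flag wrong still produces the right overall result.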

You might notice that criteria 1, 2, and 6 line up with the deceptive-subject-line rule in the FTC's CAN-SPAM guidance and with patterns the major inbox providers (Gmail, Outlook) are known to filter on. Criterion 3 is a tighter version of Mailchimp's and Klaviyo's spam-trigger wordlists. Criterion 4 is the pattern that killed my Black Friday line. Criteria 5 and 7 are my additions. The "false personalization" one in particular is something I've seen damage sender reputation when AI-generated subject lines use templated tokens that don't actually get replaced.

The script

Save this as test_subjects.py and run it against any .txt file of subject lines (one per line):

import json
import sys

import ollama
from pydantic import BaseModel, ValidationError

class Score(BaseModel):
    line: str
    criteria_1_caps_opener: bool
    criteria_2_punct_stack: bool
    criteria_3_spam_triggers: bool
    criteria_4_aggressive_scarcity: bool
    criteria_5_false_personalization: bool
    criteria_6_misleading_pattern: bool
    criteria_7_voice_fit: int
    passed: bool
    flags: list[str]
    rewrite: str

RUBRIC = """You are a brand-safety reviewer for outbound marketing email.
Score each subject line against these 7 criteria:

1. All-caps opener: FAIL if the first word is ALL CAPS (e.g. "LAST CHANCE", "WIN", "FREE").
2. Punctuation stacking: FAIL if the line contains "!!!", "???", "$$$", or more than one "!" or "?" together. Single "!" or "?" is fine.
3. Spam-trigger words: FAIL if the line contains any of: free, guaranteed, risk-free, winner, congratulations, cash, prize, urgent (case-insensitive).
4. Aggressive scarcity: FAIL if the line combines "last chance" with a specific time ("midnight", "today", "in X hours"), OR uses "only N left" where N is a number ≤ 10.
5. False personalization: FAIL if the line uses a literal "{first_name}" or "Dear customer" pattern, OR a generic "Hi there" without a real personalization token.
6. Misleading claim pattern: FAIL if the line mimics account/security/banking alert patterns: "[Brand] verified", "Account suspended", "[Bank] alert", "Action required", "Confirm your", "Your order is ready" (when no order was placed).
7. Brand-voice fit: Score 1-5. 1 = off-brand (slang, manipulative), 3 = neutral, 5 = on-brand. Convert anything below 3 to a FAIL.

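For each line, return a JSON object with exactly these keys: line (string); criteria_1_caps_opener, criteria_2_punct_stack, criteria_3_spam_triggers, criteria_4_aggressive_scarcity, criteria_5_false_personalization, criteria_6_misleading_pattern (booleans, true = the line PASSES that criterion); criteria_7_voice_fit (int 1-5); `passed` (boolean, true only if all 7 pass); `flags` (list of which criteria failed); `rewrite` (string, a brand-safe rewrite of the same intent).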

Return JSON only. No prose, no markdown fences."""

def score_line(line: str) -> Score:
    response = ollama.chat(
        model="llama3.2:3b",
        messages=[
            {"role": "system", "content": RUBRIC},
            {"role": "user", "content": f"Score this subject line: {line}"}
        ],
        format="json",
        options={"temperature": 0.1}
    )
    try:
        data = json.loads(response["message"]["content"])
        return Score(**data)
    except (json.JSONDecodeError, ValidationError):
        # Malformed JSON or missing/mistyped fields: fail closed so a human reviews the line.
        return Score(
            line=line,
            criteria_1_caps_opener=False,
            criteria_2_punct_stack=False,
            criteria_3_spam_triggers=False,
            criteria_4_aggressive_scarcity=False,
            criteria_5_false_personalization=False,
            criteria_6_misleading_pattern=False,
            criteria_7_voice_fit=3,
            passed=False,
            flags=["parse_error"],
            rewrite=line
        )

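if __name__ == "__main__":
    if len(sys.argv) != 2:
        sys.exit("usage: python test_subjects.py <file.txt>")
    # One subject line per row; blank rows are skipped.
    with open(sys.argv[1], encoding="utf-8") as f:
        lines = [row.strip() for row in f if row.strip()]
    results = [score_line(line) for line in lines]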
    for r in results:
        verdict = "PASS" if r.passed else f"FAIL ({', '.join(r.flags)})"
        print(f"[{verdict}] {r.line}")
        if not r.passed:
            print(f"  → Rewrite: {r.rewrite}")

Two things worth flagging about the code.

First, format="json" is an Ollama-specific flag that constrains Llama 3.2 to output valid JSON. Without it, the model occasionally wraps the response in markdown fences or adds a leading "Here's the scoring:" sentence that breaks the parser. With it, the parse-failure rate on 100-line batches dropped from roughly 12% to under 2% in my testing.
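The two failure shapes described above (markdown fences, a leading sentence) can also be absorbed on the parsing side, which may help with the last couple of percent. A small sketch of a tolerant parser; `extract_json_object` is a hypothetical helper, not something Ollama provides:

```python
import json


def extract_json_object(text: str) -> dict:
    """Parse the span from the first '{' to the last '}' in the model output.

    This skips markdown fences and lead-in prose. It is a heuristic: prose that
    itself contains braces before the object will still break it.
    """
    start = text.find("{")
    end = text.rfind("}")
    if start == -1 or end <= start:
        raise ValueError("no JSON object found in model output")
    return json.loads(text[start:end + 1])


if __name__ == "__main__":
    raw = 'Here is the scoring:\n```json\n{"line": "x", "passed": true}\n```'
    print(extract_json_object(raw))
```

Since `json.JSONDecodeError` is a subclass of `ValueError`, a caller can catch `ValueError` alone to cover both the "nothing found" and the "found but invalid" cases.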

Second, the temperature: 0.1. For scoring tasks, you want low variance — the same line should score the same way on every run. Temperature 0 is sometimes cited as the "right" value, but in my testing Llama 3.2 occasionally produces malformed JSON at exactly 0, so I use 0.1 as a sweet spot. Reproducible enough for real use, robust enough to not need a retry loop.
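Whether a given temperature is "reproducible enough" is something you can measure rather than assume. A rough sketch: score the same line several times and report how often the verdict matches the most common one. `agreement_rate` is a hypothetical helper, and `score_fn` stands in for any callable that returns a pass/fail verdict:

```python
from collections import Counter
from typing import Callable


def agreement_rate(score_fn: Callable[[str], bool], line: str, runs: int = 5) -> float:
    """Score `line` `runs` times; return the share of runs matching the modal verdict."""
    if runs < 1:
        raise ValueError("runs must be at least 1")
    verdicts = Counter(score_fn(line) for _ in range(runs))
    return verdicts.most_common(1)[0][1] / runs


# With the script above, something like:
#   agreement_rate(lambda s: score_line(s).passed, "LAST CHANCE: 80% OFF (closes at midnight)")
```

A result of 1.0 means every run agreed. Anything lower on a line you consider clear-cut is a hint that the temperature or the rubric wording needs another look.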

Real output from the Black Friday run

I ran a 50-line batch through the script last November to sanity-check it before relying on it for a real send. Eleven of the 50 were flagged. Here's a sample of what got caught:

| Subject line | Verdict | Flags | Model's rewrite |
|--------------|---------|-------|-----------------|
| LAST CHANCE: 80% OFF (closes at midnight) | FAIL | caps, scarcity | "80% off ends at midnight — your early access link inside" |
| ACT NOW!!! Limited spots!!! | FAIL | caps, punct, scarcity | "A few spots left for Friday's workshop" |
| Free guide: 7 AI prompts that work | FAIL | spam_trigger | "The 7 AI prompts I've been using this quarter" |
| {first_name}, your cart misses you | FAIL | false_personalization | "Your saved items are still here" |
| [URGENT] Verify your account now | FAIL | caps, spam, misleading | "Quick check-in on your subscription settings" |
| Hi there! Big news inside :) | FAIL | voice_fit=2 | "Quick update on what's new this week" |
| The 48-hour AI tool stack | PASS | | |
| Why we're not selling a course this week | PASS | | |

Those last two rows are worth pausing on. Two real subject lines from a real send (which I covered in an earlier post about 100 subject lines) pass the rubric cleanly. The tester is a guardrail, not a creativity killer: it doesn't flag the contrarian or the unexpected, only the patterns that look like the phishing templates my compliance lead was reading all year.

The 11 flagged lines are ones I would otherwise have shipped, and that's the real win. I had the Black Friday line in my hand when compliance killed it. The tester is the thing that should have caught it three days earlier.

What this can't do

Honest list, because every "AI tool" pitch oversells:

It doesn't know your brand. Criterion 7 is the soft one. The model can detect "this reads like a manipulative growth-hacker voice" because it has seen many of those, but it can't tell you whether "Should you learn ChatGPT, Claude, or Gemini first?" matches your brand voice specifically. For that, you need either a real sample of past subject lines to use as few-shot examples, or a human re-rank of the PASS set. I do both — paste 5 of your best-performing historical subject lines into the prompt and the voice-fit score gets noticeably better.
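Adding those historical lines to the prompt can be a one-function change. A minimal sketch, assuming the examples are appended to the end of the rubric text; `with_voice_examples` and the wording of the added instruction are illustrative, not the exact prompt I run:

```python
def with_voice_examples(rubric: str, examples: list[str]) -> str:
    """Append past subject lines to the rubric as on-brand references for criterion 7."""
    cleaned = [e.strip() for e in examples if e.strip()]
    if not cleaned:
        return rubric
    bullets = "\n".join(f'- "{e}"' for e in cleaned)
    return (
        f"{rubric}\n\n"
        "Examples of subject lines that match our brand voice "
        "(use them only as a reference for criterion 7):\n"
        f"{bullets}"
    )


if __name__ == "__main__":
    past_winners = [
        "The 48-hour AI tool stack",
        "Why we're not selling a course this week",
    ]
    print(with_voice_examples("...rubric text...", past_winners))
```

The returned string would replace RUBRIC as the system message. Keeping the examples in a separate list means the rubric itself stays identical across brands.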

It can't catch everything. A cleverly worded line that uses "final hours" instead of "last chance" will slip past criterion 4. A line that uses "Action required" without a verb-mimicking-phishing pattern will slip past criterion 6. The model is a fast first-pass filter, not a compliance team. The 90-second runtime means you can re-run after every change to the rubric, but you should still have a human review the final shortlist.
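For the "final hours" gap specifically, a cheap deterministic check can run next to the model and catch known synonyms. A sketch under stated assumptions: the phrase list and the time words below are illustrative and incomplete, and `scarcity_flag` is a hypothetical helper:

```python
import re

# Assumption: a small, hand-maintained list of "last chance" equivalents.
SCARCITY_PHRASES = ["last chance", "final hours", "final call", "last call"]
# Assumption: time references treated as "specific" for criterion 4.
TIME_WORDS = re.compile(r"\b(midnight|today|tonight|in \d+ hours?)\b")


def scarcity_flag(line: str) -> bool:
    """True if the line pairs a scarcity phrase with a specific time reference."""
    lowered = line.lower()
    has_phrase = any(phrase in lowered for phrase in SCARCITY_PHRASES)
    has_time = TIME_WORDS.search(lowered) is not None
    return has_phrase and has_time


if __name__ == "__main__":
    for candidate in [
        "LAST CHANCE: 80% OFF (closes at midnight)",
        "Final hours: sale closes at midnight",
        "The 48-hour AI tool stack",
    ]:
        print(scarcity_flag(candidate), candidate)
```

This only covers phrasings someone has already thought of, which is the usual limit of rule lists. It is useful as a cross-check on the model, not a replacement for it.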

3B vs 1B. I tested both. The 3B model catches roughly 8% more flags on my test set (mostly nuance cases in criterion 7), runs at about 12 tokens/second on a 2020 MacBook Air, and uses about 3.5GB of RAM. The 1B model misses more, runs at 30+ tokens/second, and uses 1.5GB. For an email team running this dozens of times a day, 3B is the right default. For someone running it on a 5-year-old laptop with 8GB RAM, 1B is the realistic choice — just expect to re-rank by hand more often.
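To get tokens-per-second numbers for your own hardware, Ollama's chat response includes generation counters you can divide. A small sketch, assuming the response exposes `eval_count` (tokens generated) and `eval_duration` (nanoseconds) as described in Ollama's API documentation; confirm the field names against the version you have installed. `tokens_per_second` is a hypothetical helper:

```python
def tokens_per_second(response) -> float:
    """Generation speed computed from eval_count and eval_duration (nanoseconds)."""
    count = response["eval_count"]
    duration_ns = response["eval_duration"]
    if not duration_ns:
        raise ValueError("response has no usable eval_duration")
    return count / (duration_ns / 1e9)


if __name__ == "__main__":
    # Stand-in for a real response: 120 tokens generated in 10 seconds.
    fake_response = {"eval_count": 120, "eval_duration": 10_000_000_000}
    print(tokens_per_second(fake_response))
```

Running the same batch through llama3.2:3b and llama3.2:1b and comparing this number gives a like-for-like speed figure for the machine you actually have.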

The rubric drifts. Spam patterns and inbox provider heuristics change. The "aggressive scarcity" rule I have today will need to evolve as Gmail tightens its 2025 sender requirements. Plan to revisit the rubric quarterly. The good news: because the model is local and pinned, you can A/B test a new rubric version against your historical data and roll back if it regresses.
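The A/B comparison can be as simple as scoring the same labelled lines with both rubric versions and listing where the new one got worse. A minimal sketch, assuming you keep human pass/fail labels for a set of historical lines; `compare_rubrics` and its input shape are hypothetical:

```python
def compare_rubrics(
    labels: dict[str, bool],
    old: dict[str, bool],
    new: dict[str, bool],
) -> dict:
    """Compare two rubric versions against human labels.

    Each dict maps subject line -> passed. Only lines present in all three
    are counted. A regression is a line the old rubric got right and the
    new rubric gets wrong.
    """
    shared = [s for s in labels if s in old and s in new]
    if not shared:
        raise ValueError("no subject lines are present in all three inputs")
    old_correct = sum(old[s] == labels[s] for s in shared)
    new_correct = sum(new[s] == labels[s] for s in shared)
    regressions = [s for s in shared if old[s] == labels[s] and new[s] != labels[s]]
    return {
        "n": len(shared),
        "old_accuracy": old_correct / len(shared),
        "new_accuracy": new_correct / len(shared),
        "regressions": regressions,
    }
```

Looking at the regression list matters more than the headline accuracy: two versions can score the same overall while the new one newly fails a line you care about.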

The reframe

Brand-safety tooling, until very recently, was either a human reviewer (slow, expensive, didn't scale) or a rule-based regex system (fast, cheap, missed everything clever). Cloud LLMs added a third option but introduced the privacy and consistency problems above.

Local small models are the fourth option, and for this specific task they're the right one. The rules are well-defined. The reasoning is shallow. The volume is high. None of the things GPT-4o is uniquely good at — long-context reasoning, ambiguous judgment, multi-step planning — are actually required to score 100 subject lines against 7 yes-or-no questions. The 3B model is more than capable, and the privacy, cost, and stability wins are real and immediate.

The Black Friday line is still the test case I think about. Fourteen seconds from a compliance lead who has seen every phishing template of the last decade. My tester would have flagged it in about a second, sitting on my laptop, with the prompt never leaving the machine. That's the bar I wanted, and it's the bar I have now.

If you build one for your brand, change the rubric. That's the part that's yours. The script is the same.