AI Tools

Self-Host Mistral Small 24B for Ad Copy: Full Setup + A Blind Benchmark Against GPT-4o

Self-Host Mistral Small 24B for Ad Copy: Full Setup + A Blind Benchmark Against GPT-4o
Contents

$312. That's what one client cost me in OpenAI bills last month, and most of it was ad copy — primary texts, headlines, and RSA (Responsive Search Ad, 动态搜索广告) descriptions for an account pushing ~$4,200/day on Meta and Google. I wasn't going to fire GPT-4o, but I wanted to know if a $0.60/watt GPU sitting in my closet could match its output for the parts of the job where I was burning the most tokens: variations at scale.

Mistral Small 3 (the 24B release, January 2025) was the first open-weight model I'd seen in a while that was actually positioned for "one consumer GPU, no quantization gymnastics." Mistral's own pitch was that it runs on an RTX 4090 or a 32GB-RAM laptop. That was the trigger. I ordered a second 4090 for an old Threadripper build I had lying around, and ran the same brief through both models, blind-rated.

This is the actual setup I landed on, the prompt template I use for ad copy, the result of the blind A/B, and the cost math that made me keep GPT-4o for some clients and switch to self-hosted Mistral for others.

What you actually need to run it

The marketing for "runs on a 4090" is technically true and practically misleading. Here's what the realistic spec table looks like for Mistral-Small-24B-Instruct-2501 (and its March 2025 update, Small 3.1, which is the same 24B with a 128k context window and Apache 2.0 license):

Quantization (a technique that compresses model weights to use less VRAM) File size Min VRAM (video RAM) Practical use
FP16 (full precision) ~47 GB 48 GB 2× RTX 4090 or A6000
Q8_0 ~26 GB 28 GB 1× RTX 4090 (24 GB) — tight
Q6_K ~22 GB 24 GB 1× RTX 4090, comfortable
Q4_K_M ~17 GB 20 GB 1× RTX 3090 / 4070 Ti SUPER
Q3_K_L ~14 GB 16 GB 1× RTX 4060 Ti 16GB
Q2_K ~12 GB 14 GB Edge case, quality drops

The 4090 sweet spot is Q6_K. You use the full 24GB of VRAM, generation sits at roughly 18-22 tokens/second on a single card, and quality loss vs FP16 is below what I could detect in a blind read. Q4_K_M is the answer if you're on a 3090 or 4070 Ti SUPER.

For RAM-only inference on a Mac or a desktop without a discrete GPU, the Ollama MLX build of Small 3.1 fits in 32GB unified memory but you'll be at 4-7 tokens/second. Fine for testing prompts, not for batch-producing 200 ad variants in an afternoon.

Two setups: Ollama for laptops, vLLM for the server

I run both, on different machines, for different jobs. Picking the wrong one costs you hours.

Ollama (MacBook Pro M3 Max, 64GB): This is my prompt-iteration machine. Install is one line, no Python environment to fight with.

bash# macOS / Linux
curl -fsSL https://ollama.com/install.sh | sh
ollama pull mistral-small:24b-instruct-2501-q6_K

The Ollama library exposes it as an OpenAI-compatible endpoint at http://localhost:11434/v1, which means every tool I already use (LangChain, LlamaIndex, my own scripts) just points at it like it's GPT-4o, no code changes. First-token latency on the M3 Max is around 1.2 seconds for a typical ad-copy prompt; full 80-token response in 6-8 seconds. I use this for everything that doesn't need parallelism: prompt engineering, reviewing a small batch, sanity-checking before I commit to a 500-variant sprint.

vLLM (Linux box, 2× RTX 4090, Threadripper 3970X): This is the production machine. vLLM is a high-throughput inference engine (it batches incoming requests automatically to keep the GPU busy) and the difference is night and day for batch work. Where Ollama serves one user at a time, vLLM batches requests and pushes the same 4090 to 1,200-1,800 tokens/second aggregate throughput at concurrency 8.

bash# vLLM with the official Mistral Small 3.1 build
pip install vllm
vllm serve mistralai/Mistral-Small-3.1-24B-Instruct-2503 \
  --quantization awq-q4 \
  --max-model-len 8192 \
  --tensor-parallel-size 2 \
  --gpu-memory-utilization 0.92

AWQ (Activation-aware Weight Quantization, 激活感知权重量化) Q4 is what I run on the server because I'm not VRAM-constrained and AWQ has better kernel support on Hopper/Ada (NVIDIA's recent GPU architectures) than GGUF (a quantization format Ollama uses). Output quality is indistinguishable from Q6_K at ad-copy prompt lengths. If you're on a single 4090, drop --tensor-parallel-size 1 and --quantization awq-q4 — it'll fit.

The OpenAI-compatible server comes up on :8000 by default. Point any ad-copy tool that talks to OpenAI at http://your-server:8000/v1 and it just works.

The ad-copy prompt I actually use

The first three versions of this prompt I tried were "write 10 Google Ads headlines for a DTC skincare brand." The output was generic mush. The version that started producing useful work has four things bolted on:

textYou are a senior direct-response copywriter (直效营销文案) who has written
$50M+ in paid social and search. You write for performance, not vibes.

Product: {{product_name}}
Offer: {{offer}}
Target audience: {{persona}}
Tone: {{tone}}  # e.g. clinical-authoritative, warm-confessional, urgent
Channel: {{channel}}  # meta_primary_text, google_rsa_headline, linkedin_intro
Max length: {{max_chars}} characters
Forbidden: {{banned_phrases}}  # e.g. "revolutionary", "game-changing", emoji

For each variant:
1. Lead with the strongest specific benefit, not a generic claim
2. Use a number or named proof point in the first 8 words
3. One CTA (Call To Action, 行动号召) verb, not "click here to learn more"
4. Avoid second-person "you" in the opening 4 words if a pain-point pattern is stronger
5. Output as JSON: {"variants": [{"primary": "...", "headline": "...", "angle": "..."}]}

Generate {{n}} variants. Vary the angle across variants — do not just
paraphrase the same idea. Cover at least 3 distinct psychological hooks
from this list: social proof, loss aversion, curiosity gap, identity,
specificity, contrarian.

Two things that mattered more than the model: (1) the angle list at the end — without it, every variant came back paraphrased; (2) the "forbidden" field — banning the same five generic phrases eliminated 80% of the "revolutionary, game-changing" slop both models loved to default to.

I keep a per-client version of this in a Notion page. Switching from one DTC (Direct-To-Consumer, 直接面向消费者) brand to a B2B SaaS client is a 30-second edit, not a re-prompt.

The blind benchmark

I generated 50 ad-copy briefs across the same five real client accounts — three DTC e-com, one B2B SaaS, one local services business. For each brief I ran the prompt twice: once against gpt-4o-2024-08-06 (the production model at the time), once against Mistral Small 3.1 on my vLLM server. Identical temperature (0.7), identical top-p (0.9), identical prompt text. I randomized output order, stripped the model name, and had a senior marketer who'd never seen the outputs rank them 1-5 on four criteria:

  • Hook strength — does the first line stop a thumb?
  • Specificity — concrete numbers, named ingredients, real objections vs vague claims
  • Channel fit — would I actually run this in the placement it claims?
  • Originality — is this the same angle as the other 9 variants, or a different one?

50 briefs × 4 criteria × 2 raters = 400 ratings. Here's what came out:

Metric GPT-4o Mistral Small 3.1 (local) Gap
Hook strength (avg /5) 4.1 3.7 -0.4
Specificity 4.3 3.4 -0.9
Channel fit 4.0 3.9 -0.1
Originality 3.5 3.8 +0.3
Overall preference (paired blind, % of pairs) 54% 42% 4% tied

Translation: GPT-4o is still the better ad-copy model. Mistral Small 3.1 was rated equal or better on channel fit and originality, and worse on specificity — which tracks with what I see qualitatively. Mistral is more creative and less concrete. For "introduce a new angle" or "give me 10 hooks I haven't tried," it's competitive. For "name three specific objections this audience has about retinol and address each one," GPT-4o wins by a real margin.

That's the finding I actually use.

The cost math that decided the rollout

Here's where self-hosting eats the API's lunch. I run roughly 1,800 ad-copy generations per month for the small clients — say 600 input tokens + 350 output tokens per generation on average.

GPT-4o cost:

  • Input: 1,800 × 600 / 1,000,000 × $2.50 = $2.70
  • Output: 1,800 × 350 / 1,000,000 × $10.00 = $6.30
  • Total: $9.00/month for raw tokens

That $9 is not the real cost. OpenAI charges ~$0 when you're below 1M tokens/day, but I also use GPT-4o for 5 other things on the same account — strategy summaries, brief expansions, image prompts, occasional analysis. Ad copy is maybe 40% of total GPT-4o spend. Total bill for the account last month was $312. Of that, $112 was ad copy.

Self-hosted cost (2× 4090 box):

  • Hardware amortized over 3 years: ~$3,800 / 36 months = $106/month
  • Power: 2× 4090 at ~300W each + system = ~700W, 24/7 → ~$90/month at $0.18/kWh
  • Total: ~$196/month, all-you-can-eat

Break-even: 1,800 generations/month × current pricing puts me at $112 GPT-4o vs ~$196 self-hosted. GPT-4o is still cheaper at my current volume.

That changes at scale. At 5,000 generations/month the API bill hits $311 and the self-hosted box is still $196. At 10,000, the API is $622 and the box is the same. So I keep GPT-4o for the small clients and route the heavy-batch work (the 500-variant ad sprints, the keyword-expansion-to-copy loops) to the local box. The local box earns its keep on two clients; the others use the API.

There's a third path I should mention: OpenRouter's hosted Mistral Small at roughly $0.20/M input, $0.60/M output. No hardware, no setup, same model. For someone who's just curious or whose volume is below break-even, that's the move. You lose the data-privacy argument but keep the cost saving.

What I'd skip if I were starting over

Three things cost me more time than they saved.

First, the "I need 7B / 13B / 24B comparison." I did it. The 7B models are not close on ad copy — specificity collapses. The 13Bs are usable. The 24B is the first tier where the output is good enough to use without heavy human rewriting. Start at 24B. Don't spend a week on the smaller variants.

Second, the LM Studio detour. LM Studio is a great GUI (graphical user interface) for trying models, but its inference backend (llama.cpp with a forked quantization path) is materially slower than vLLM at the same quantization. I lost a day. If you want a GUI, use Ollama. If you want throughput, use vLLM. Pick one.

Third, fine-tuning. I tried LoRA (Low-Rank Adaptation, 一种参数高效微调方法) fine-tuning Mistral Small on 800 winning ads from a past client. It did not move the blind-rating needle. The generic base model + better prompts beat the fine-tune. Fine-tuning is a 2026 problem for ad copy, not a 2025 one. The prompt template above is doing 80% of the work.

The verdict I keep coming back to

GPT-4o is still the better ad-copy model on the dimensions that matter most — specificity and hook strength. For a small account, just use the API. For an agency doing batch production across many clients, or for anyone with data-privacy requirements (medical, financial, legal), self-hosting Mistral Small 24B is now a real option, not a science project. The model is good enough, the hardware is reasonable, and vLLM makes the throughput problem go away.

I'm running both. The local box handles the bulk generation; GPT-4o handles the final selection and the work where I genuinely need the better model. The $312 line item on last month's invoice is now closer to $190, and the second 4090 is paid for by the end of Q1.

If you only take one thing from this: don't replace the API. Add the local model as a layer underneath it. The two together are cheaper and faster than either alone.