Self-Host Llama 3.3 70B for Marketing: Docker + Ollama + 4 Prompts That Justify It
A client opened her laptop on a Wednesday morning in March, looked at the previous month's OpenAI bill, and said "we paid $11,400 to classify support tickets." That sentence is the reason this post exists.
The team had been routing every Zendesk ticket through gpt-4o to assign a category and a sentiment tag. Twelve thousand tickets a month, ~$0.95 per 100 in, ~$2.80 per 100 out. Add the ad-copy rewrites, the meta-description generation, and a few other "just run it through GPT" jobs, and you get a five-figure monthly bill for work that, by volume, is mechanical. They asked me: "Should we just run our own model?"
The answer, as it usually is in marketing-tech, is: it depends. For most teams the answer is no. For a specific handful of jobs, the answer is yes, and the math is brutal enough that the boss will care. This post is the long version of that conversation — hardware reality, the actual Docker Compose file, four prompts with real output, and a candid list of jobs where you should keep paying OpenAI.
The hardware reality check first
Llama 3.3 70B (Meta released it December 6, 2024, and says it performs on par with the older Llama 3.1 405B) is a large language model with 70 billion parameters. In full FP16 (16-bit floating point, the unquantized baseline) it needs ~140 GB of VRAM (video memory on the GPU) for the weights alone. You do not have that. Almost nobody has that. The realistic option for a single workstation is Q4_K_M quantization, a compression technique that reduces each weight to roughly 4 to 5 bits, costing some quality but shrinking the file to about 42-43 GB.
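The two memory figures above follow from simple arithmetic: parameter count times bits per weight. The sketch below reproduces them. The 4.85 bits-per-weight average for Q4_K_M is an assumption (K-quants mix block types, so the average sits a little above 4), and the function name is hypothetical. It counts weights only, not the KV cache or runtime overhead.

```python
def weight_size_gb(params_billion: float, bits_per_weight: float) -> float:
    """Approximate size of the weights alone, in decimal gigabytes."""
    return params_billion * bits_per_weight / 8


# FP16: 16 bits per weight.
fp16_gb = weight_size_gb(70, 16)

# Q4_K_M: ASSUMED average of ~4.85 bits per weight.
q4_gb = weight_size_gb(70, 4.85)

print(f"FP16:   {fp16_gb:.0f} GB")
print(f"Q4_K_M: {q4_gb:.1f} GB")
```

Under those assumptions the quantized weights land at a little over 42 GB, which matches the file size quoted above.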
That still doesn't fit on a 24 GB consumer card. You have five hardware paths worth considering:
| Setup | Quant | VRAM/RAM | Throughput (tok/s) | Approx. cost (USD) |
|---|---|---|---|---|
| 2x RTX 3090 (used) | Q4_K_M | 48 GB VRAM | 8-15 | $1,400-1,800 |
| 1x A100 80GB PCIe (used) | Q4_K_M | 80 GB VRAM | 18-25 | $8,000-11,000 |
| Mac Studio M2 Ultra 192GB | Q4_K_M | 192 GB unified | 15-20 | $5,500-8,000 |
| 1x RTX 4090 24GB | partial offload | 24 GB VRAM + 64 GB RAM | 4-8 | $1,800-2,200 |
| 4x RTX 4090 | Q4_K_M | 96 GB VRAM | 20-30 | $7,000-8,500 |
A few of these numbers are worth underlining. The RTX 4090 is the most popular card on the planet right now and it cannot run Llama 3.3 70B in Q4_K_M on a single GPU. You can do partial offload — push ~25 of the 80 layers to the 24 GB card and let the rest run on system RAM — but you'll see 4-8 tokens per second, which is fine for a chat session and punishing for a 10,000-record batch job. A single A100 80GB or two used 3090s is the sweet spot for the marketing use cases I'm about to describe.
CPU-only inference (running the model on the main processor instead of the GPU) is technically possible. It's 1-3 tokens per second. Don't.
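To put the throughput column in batch-job terms, here is a rough sketch of how long a 10,000-record job takes at different speeds. It assumes about 45 generated tokens per record (an illustrative figure, not a measurement), sequential processing, and it ignores prompt-processing time, so treat the results as a lower bound.

```python
def batch_hours(records: int, tokens_per_record: int, tokens_per_second: float) -> float:
    """Hours needed to generate output for every record, one after another."""
    return records * tokens_per_record / tokens_per_second / 3600


# Throughput figures taken from the ranges discussed above.
for label, speed in [("A100 80GB", 20), ("RTX 4090 partial offload", 6), ("CPU only", 2)]:
    hours = batch_hours(10_000, 45, speed)  # 45 tokens/record is an ASSUMPTION
    print(f"{label:26s} {speed:>3} tok/s  ->  {hours:5.1f} h")
```

At 20 tok/s the job fits in a night. At partial-offload speeds it takes most of a day, and on CPU it runs for several days.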
The actual Docker Compose file
The most popular way to run Llama 3.3 70B locally in 2025 is Ollama — an open-source tool that wraps llama.cpp (a CPU/GPU inference engine) and serves an OpenAI-compatible API (an HTTP interface that follows the same shape as OpenAI's). Ollama runs as a single binary or in Docker. For a marketing team that needs a stable, restart-proof setup, Docker is the right answer.
Here's the smallest production-grade file. As written it publishes port 11434 on the host, so other machines on the network can reach the API. Remove the ports mapping to keep the API reachable only from other containers on the same Docker network, and put a reverse proxy (a web server that sits in front of Ollama to handle authentication, rate limiting, and HTTPS) in front of it for anything internet-facing.
```yaml
# docker-compose.yml
services:
  ollama:
    image: ollama/ollama:0.5.7
    container_name: ollama
    restart: unless-stopped
    ports:
      - "11434:11434"  # remove this line if using a reverse proxy
    volumes:
      - ollama_models:/root/.ollama
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: all
              capabilities: [gpu]
    healthcheck:
      # `ollama list` queries the local API; the image may not ship curl
      test: ["CMD", "ollama", "list"]
      interval: 30s
      timeout: 10s
      retries: 5
      start_period: 90s  # grace period before failed checks count
volumes:
  ollama_models:
```

Two non-obvious details. First, count: all under the nvidia driver makes the container see every GPU on the host; change it to count: 1 if you want to leave one card free for other workloads. Second, start_period: 90s is a grace period during which failed health checks do not count toward the retry limit, which gives the server time to come up on a cold boot. Note that the container does not download the model weights (the parameter files that make up the trained model) on its own. That is the manual pull in the next step, and for Llama 3.3 70B it is about 43 GB coming down from the Ollama registry.
Once the container is up, pull the model:
```bash
docker exec -it ollama ollama pull llama3.3:70b-instruct-q4_K_M
```

That's the model file sitting on disk. The API is live. From any other machine on the network:
```bash
curl http://your-host:11434/api/generate -d '{
  "model": "llama3.3:70b-instruct-q4_K_M",
  "prompt": "Classify this support ticket into one of: billing, bug, feature_request, how_to, other. Reply with only the category name. Ticket: my export to CSV is broken since the update",
  "stream": false
}'
```

The response comes back as JSON with a response field containing the model's answer, plus timing fields such as eval_count (generated tokens) and eval_duration (in nanoseconds), from which you can compute tokens per second. On an A100 80GB you'll see ~20 tok/s for a 50-token generation.
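The same call from Python, as a minimal standard-library sketch. The host name is the placeholder from the curl example, and the helper names are hypothetical. The speed calculation relies on the eval_count and eval_duration fields that Ollama includes in a non-streaming reply.

```python
import json
import urllib.request

OLLAMA_URL = "http://your-host:11434/api/generate"  # placeholder host, replace with yours


def tokens_per_second(response: dict) -> float:
    """Generation speed from Ollama's timing fields (eval_duration is in nanoseconds)."""
    duration_ns = response.get("eval_duration", 0)
    if duration_ns <= 0:
        return 0.0
    return response.get("eval_count", 0) / (duration_ns / 1e9)


def generate(prompt: str, model: str = "llama3.3:70b-instruct-q4_K_M",
             url: str = OLLAMA_URL, timeout: float = 300.0) -> dict:
    """POST a non-streaming request to /api/generate and return the parsed JSON."""
    payload = json.dumps({"model": model, "prompt": prompt, "stream": False}).encode()
    request = urllib.request.Request(
        url, data=payload, headers={"Content-Type": "application/json"}
    )
    with urllib.request.urlopen(request, timeout=timeout) as reply:
        return json.load(reply)
```

Calling generate(...) with the classification prompt returns a dict. result["response"] holds the category and tokens_per_second(result) gives the throughput to log.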
The four marketing jobs where the math flips
The reason most teams shouldn't self-host is that the cloud API (a remote LLM service like OpenAI or Anthropic that you pay per token) is genuinely cheap at low volume and someone else's infrastructure problem at any volume. $0.15 per million input tokens for gpt-4o-mini is a rounding error when you only process 200 tickets a day. The math flips when at least one of three things is true:
- Volume is high enough that the per-token cost matters. Five-figure monthly API bills are the trigger.
- The data cannot leave your network. Healthcare, financial services, legal, anything with PII (personally identifiable information — names, emails, phone numbers) under GDPR or HIPAA (US health-data privacy law) jurisdiction.
- The job is latency-tolerant. A 10k-record batch that takes 4 hours overnight on a self-hosted box is fine. A 10k-record batch a customer is waiting on in real time is not.
The four jobs that hit at least one of these triggers, with the prompts I actually use:
Prompt 1: Bulk support-ticket classification
The team that prompted this post runs 12,000 Zendesk tickets a month through classification. Their API bill was $11,400 because they were using gpt-4o (not mini) for accuracy, and they were extracting structured fields — category, sentiment, product area, customer-tier signal — not just a label.
The Ollama version uses the /api/chat endpoint with a structured system prompt:
```text
SYSTEM:
You are a support-ticket classifier. For each ticket, output a JSON object with exactly these fields:
- "category": one of [billing, bug, feature_request, how_to, account, other]
- "sentiment": one of [positive, neutral, negative, angry]
- "product_area": one of [dashboard, api, mobile, billing, integrations, other]
- "tier_signal": one of [enterprise, smb, self_serve, unknown]
- "needs_human": boolean — true if the ticket contains words from [legal, lawyer, gdpr, lawsuit, cancel, refund, chargeback, urgent, escalation]
Output only the JSON. No prose, no markdown.

USER:
Classify this ticket:
"I just realized we've been double-charged for the API plan this month and the dashboard doesn't show the new usage limits. I'm the admin for a 40-seat account. Need this resolved before our board meeting Friday."
```

Real output from Llama 3.3 70B Q4_K_M on an A100:
```json
{"category":"billing","sentiment":"negative","product_area":"billing","tier_signal":"enterprise","needs_human":true}
```

Cost on gpt-4o: ~$0.018 per ticket. Cost on self-hosted Llama 3.3 70B at $0.13/kWh and ~20 tok/s: electricity only, about $0.0007 per ticket. At 12,000 tickets a month, the API cost is $216/month. The self-hosted cost is electricity plus hardware amortization: for an A100 80GB at $10,000 amortized over 36 months, that's about $278/month in hardware plus ~$8 in electricity. On this one job alone the card takes about 48 months to pay for itself ($10,000 divided by the ~$208 saved each month), so classification by itself does not justify the box. The three other jobs below are what close the gap.
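A 70B model will occasionally return a value outside the allowed lists or wrap the JSON in prose, so it helps to validate every reply before it reaches Zendesk. The sketch below checks a reply against the field lists from the system prompt above. The function name and error messages are illustrative, not part of any library.

```python
import json

# Allowed values copied from the system prompt above.
ALLOWED = {
    "category": {"billing", "bug", "feature_request", "how_to", "account", "other"},
    "sentiment": {"positive", "neutral", "negative", "angry"},
    "product_area": {"dashboard", "api", "mobile", "billing", "integrations", "other"},
    "tier_signal": {"enterprise", "smb", "self_serve", "unknown"},
}


def parse_classification(raw: str) -> dict:
    """Parse the model's reply; raise ValueError if it breaks the schema."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError as exc:
        raise ValueError(f"not valid JSON: {exc}") from exc
    if not isinstance(data, dict):
        raise ValueError("expected a JSON object")
    for field, allowed in ALLOWED.items():
        value = data.get(field)
        if not isinstance(value, str) or value not in allowed:
            raise ValueError(f"{field}={value!r} is not an allowed value")
    if not isinstance(data.get("needs_human"), bool):
        raise ValueError("needs_human must be a boolean")
    return data
```

Rows that raise ValueError can be retried once or routed to a human queue. Ollama's API also accepts a format field set to "json" on the request, which reduces (but does not eliminate) malformed replies.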
Prompt 2: Private competitive-intel summarization
This is the one a marketing team will not put through a public API no matter what the price is. Competitive intel is the SEO teardowns of your three closest competitors, the quarterly earnings transcripts, the Glassdoor reviews, the patent filings. The output is a battle card your sales team uses. The input is "things that, if they leak, are catastrophic."
The model does not need to be brilliant at this. It needs to be competent and to not phone home. Llama 3.3 70B at Q4_K_M is roughly the level of gpt-4o-mini for structured summarization, which is enough.
The prompt:
```text
SYSTEM:
You are a competitive analyst. Read the following 4 sources about [Competitor X] and produce a battle card with these sections:
1. "Positioning" — one sentence on how they describe themselves
2. "Pricing model" — what we know, what's rumored, what we don't know
3. "Recent moves" — last 90 days, with source citation
4. "Vulnerabilities" — claims their customers complain about
5. "Our counter" — one sentence on how we beat them on this
If a section has no evidence in the sources, write "insufficient data" — do not invent.

USER:
Source 1: [G2 reviews, last 90 days, 47 reviews, attached]
Source 2: [Their Q3 earnings call transcript, attached]
Source 3: [Ahrefs organic keyword overlap with our site, attached]
Source 4: [Three Glassdoor reviews from former AE (Account Executive) and SE (Sales Engineer) roles, attached]
```

The inference runs entirely inside the Docker container, so the raw sources never leave the box. Only the finished battle card goes out, in this case to a Notion database. No third-party model provider ever sees the inputs, so there is no vendor data-handling trail to audit.
Prompt 3: Overnight batch SEO meta-description generation
This is the volume play. A mid-size e-commerce site has 10,000 product pages. Each one needs a unique meta description under 155 characters. Doing this through gpt-4o-mini at $0.15/M input + $0.60/M output costs roughly $7.50 per run. Doing it through self-hosted Llama 3.3 70B at ~20 tok/s on an A100 takes about 6 hours for 10,000 records and costs about $0.80 in electricity. You run it at 1am, you wake up to a finished CSV.
The prompt per row:
```text
SYSTEM:
You write meta descriptions for SEO. Rules:
- 140-155 characters including spaces
- Include the target keyword once, naturally
- No quotation marks, no em-dashes, no emoji
- End with a soft call-to-action verb (discover, shop, learn, compare, find)
- Do not start with the brand name

USER:
URL slug: /products/leather-laptop-sleeve-15-inch-brown
Product name: Heritage Leather Sleeve — 15"
Target keyword: leather laptop sleeve 15 inch
Existing H1: Premium Full-Grain Leather, Tailored Fit
```

Sample output:
```text
"Hand-stitched full-grain leather laptop sleeve sized for 15-inch MacBooks and ultrabooks. Discover a slim, durable carry that ages beautifully."
```

The real cost driver here is engineering time, not the LLM (large language model). The model is the cheap part. The wrapping is most of the work: pulling product data from the CMS (content management system, the backend that stores your site content), batching 200 records at a time, writing the results back, and logging failures. You'd write that wrapping code whether the LLM costs $0.80 or $7.50. Once you have the pipeline, the LLM cost is the line item you can shrink to almost zero.
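Two pieces of that wrapping are easy to sketch: splitting rows into batches of 200 and checking each generated description against the mechanical rules in the prompt. The helper names below are hypothetical. The check covers only length, quotation marks, em-dashes, and the brand-name rule; the call-to-action and keyword rules need a human or a second pass.

```python
from typing import Iterable, Iterator, Optional

EM_DASH = "\u2014"
QUOTE_CHARS = ('"', "\u201c", "\u201d")  # straight and curly double quotes


def batched(rows: Iterable[dict], size: int = 200) -> Iterator[list]:
    """Yield consecutive lists of at most `size` rows."""
    batch = []
    for row in rows:
        batch.append(row)
        if len(batch) == size:
            yield batch
            batch = []
    if batch:
        yield batch


def meta_description_problems(text: str, brand: Optional[str] = None) -> list:
    """Return the rule violations found in one description (empty list = passes)."""
    problems = []
    if not 140 <= len(text) <= 155:
        problems.append(f"length {len(text)} outside 140-155")
    if any(q in text for q in QUOTE_CHARS):
        problems.append("contains quotation marks")
    if EM_DASH in text:
        problems.append("contains an em-dash")
    if brand and text.lower().startswith(brand.lower()):
        problems.append("starts with the brand name")
    return problems
```

Descriptions that come back with a non-empty problem list go into the failure log and get regenerated on the next pass, so one bad row never blocks the batch.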
Prompt 4: PII-redacted email-list cleaning
Every marketing team has one. The 80,000-row list that came from an event, an acquisition, a partner — full of duplicates, stale roles, multiple email addresses per person, and a non-zero number of rows with names and phone numbers you are about to email at scale. The legal team wants the PII stripped before it goes into the marketing automation platform. The marketing team wants the list cleaned and segmented. Both happen in the same pass.
The model is doing two jobs: PII detection and record normalization. Llama 3.3 70B is competent at both with a structured prompt:
```text
SYSTEM:
You are a list-cleaning assistant. For each row, output a JSON object with:
- "email": canonical form (lowercased, gmail dots removed if applicable)
- "first_name": the part before the @ in the email if no name field
- "company": clean company name, strip "Inc", "LLC", "Ltd" unless required
- "title": normalized to one of [Founder, CEO, CTO, CMO, VP Marketing, Director Marketing, Marketing Manager, Other Marketing, Other]
- "pii_flags": array of strings from [phone, address, ssn_pattern, credit_card_pattern, dob]. Empty array if none.
- "redacted_row": the original row with all PII fields replaced by "[REDACTED]"
Output only JSON.

USER:
Row to clean:
{"name":"Sarah Johnson","email":"Sarah.Johnson@gmail.com","company":"Acme Inc.","title":"VP, Marketing","phone":"415-555-0123","signup_date":"2024-03-15"}
```

Output:
```json
{
  "email":"sarahjohnson@gmail.com",
  "first_name":"Sarah",
  "company":"Acme",
  "title":"VP Marketing",
  "pii_flags":["phone"],
  "redacted_row":{"name":"Sarah Johnson","email":"sarahjohnson@gmail.com","company":"Acme Inc.","title":"VP, Marketing","phone":"[REDACTED]","signup_date":"2024-03-15"}
}
```

Running this through a hosted LLM means 80,000 rows of customer PII transit OpenAI's or Anthropic's API. Even with their data-handling promises, that's an entry your CISO (Chief Information Security Officer) will want off the risk register. Self-hosted means the data never leaves the network. The cost is the same order of magnitude as Prompt 3.
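Because the model's output is what legal signs off on, it is worth cross-checking it with deterministic code. The sketch below recomputes the canonical email and scans a row for phone-like and SSN-like patterns, so a row where the regex finds something the model missed can be held back. The regexes are deliberately simple, US-style assumptions and the function names are hypothetical. This is a safety net, not a complete PII detector.

```python
import re

# ASSUMED patterns: US-style phone (415-555-0123) and SSN (123-45-6789) formats only.
PATTERNS = {
    "phone": re.compile(r"\b\d{3}[-.\s]\d{3}[-.\s]\d{4}\b"),
    "ssn_pattern": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
}


def canonical_email(email: str) -> str:
    """Lowercase the address and drop dots from the local part of gmail.com addresses."""
    cleaned = email.strip().lower()
    if "@" not in cleaned:
        raise ValueError(f"not an email address: {email!r}")
    local, _, domain = cleaned.rpartition("@")
    if domain == "gmail.com":
        local = local.replace(".", "")
    return f"{local}@{domain}"


def regex_pii_flags(row: dict) -> list:
    """Flags whose pattern matches any string value in the row, sorted by name."""
    values = [v for v in row.values() if isinstance(v, str)]
    return sorted(
        flag for flag, pattern in PATTERNS.items()
        if any(pattern.search(v) for v in values)
    )


def missed_flags(model_flags: list, row: dict) -> list:
    """Flags the regex pass found that the model did not report."""
    return [f for f in regex_pii_flags(row) if f not in model_flags]
```

For the sample row above, canonical_email reproduces the model's email field and the regex pass agrees with the single phone flag, so missed_flags returns an empty list.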
The break-even math, plain
A single refurbished A100 80GB card costs roughly $9,000-11,000. Add a server-class motherboard, 128 GB of system RAM, an NVMe SSD (fast solid-state drive) for the model, a 1000W PSU (power supply unit), and a 4U case: you're at $13,000-15,000 all-in. Amortize that over 36 months: $360-420/month.
Electricity for an A100 under sustained inference: ~250-300W average. Even running 24/7, that is ~$24-28/month at $0.13/kWh. Call it $50 with the rest of the box.
Colocation (renting rack space in a data center for the hardware) or a proper office with cooling: $0-200/month depending on what you have.
Total realistic monthly cost: $400-700.
At that cost, the model is "free" up to roughly 50 million generated tokens a month. Beyond that, you're GPU-bound (limited by how fast the GPU can process tokens) and you'd need a second box. A team running the four jobs above will produce maybe 5-15 million tokens a month, well inside the free zone.
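The monthly total and the token ceiling can be reproduced in a few lines. This sketch uses the figures from the paragraphs above ($13,000-15,000 hardware, 36 months, $50 electricity, $0-200 hosting, ~20 tok/s). The function names are hypothetical, and the capacity figure assumes the GPU generates around the clock for a 30-day month.

```python
SECONDS_PER_DAY = 86_400


def amortized_monthly(hardware_usd: float, months: int = 36) -> float:
    """Straight-line hardware cost per month."""
    return hardware_usd / months


def monthly_cost(hardware_usd: float, electricity_usd: float = 50.0,
                 hosting_usd: float = 0.0, months: int = 36) -> float:
    """Hardware amortization plus electricity plus colocation or office cooling."""
    return amortized_monthly(hardware_usd, months) + electricity_usd + hosting_usd


def monthly_token_capacity(tokens_per_second: float, days: int = 30) -> int:
    """Generated tokens per month if the GPU never idles."""
    return int(tokens_per_second * SECONDS_PER_DAY * days)


low = monthly_cost(13_000)                       # cheapest build, no hosting fee
high = monthly_cost(15_000, hosting_usd=200.0)   # priciest build, colocated
print(f"monthly cost: ${low:,.0f} to ${high:,.0f}")
print(f"capacity at 20 tok/s: {monthly_token_capacity(20):,} tokens/month")
```

That works out to about $411 to $667 a month, which the text rounds to $400-700, and a ceiling of 51,840,000 generated tokens, which is where the "roughly 50 million" figure comes from.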
The break-even volume against gpt-4o-mini at the workloads above is roughly 8-12 million tokens per month. The break-even against gpt-4o is closer to 3-5 million. The honest conclusion: if your monthly API bill is under $1,000, self-hosting is a hobby. If it's over $5,000 and the data is anything sensitive, the conversation is worth having.
When you should absolutely not bother
Some jobs, you keep paying the API. Self-hosting is the wrong answer when:
- Your total volume is under 5 million tokens a month. The API is cheaper than the electricity.
- You need a frontier model for the task. Llama 3.3 70B Q4_K_M is roughly gpt-4o-mini-class. If your job needs Claude Opus or o1-class reasoning, self-hosting is not the answer; wait for a better open-weight model.
- Your engineering team is small and not interested in infrastructure. Self-hosting is a part-time job: OS updates, driver updates, model upgrades, monitoring. If nobody on the team wants that, you'll be off the API and back on the API within six months.
- The job is latency-sensitive and you can't afford queueing. One user waiting on a chatbot needs sub-second response. A self-hosted 70B model at 20 tok/s with a single user is fine; with three concurrent users on a single GPU, it is not. The API wins on multi-user latency every time.
- You need multimodal (text + images + audio) support. Llama 3.3 70B is text-only. Multimodal self-hosting is a different and much harder conversation.
The two jobs that are basically never worth self-hosting: short ad-hoc copy generation (under 50 calls a day, just use the API), and anything that requires the absolute best model. The four jobs above are the long tail where volume, privacy, or both turn the math around.
A practical starting point
If the math for your team says yes, do not start with a 70B model. Start with a smaller model on the same hardware — llama3.1:8b or qwen2.5:14b — and get the Docker, the network setup, the monitoring, and the batch job orchestration working first. Those are the parts that will eat your time. The model swap from 8B to 70B is one line in a docker exec command. The infrastructure that runs the model is the part that takes a month to get right.
Buy the A100 used from a reputable refurbisher with a 12-month warranty. Make sure it is the 80GB version; the 40GB version cannot hold the ~43 GB of Q4_K_M weights. Set up Prometheus (a monitoring tool that collects metrics) and Grafana (a dashboard tool for visualizing those metrics) on day one. You want to see tokens per second, VRAM utilization, and request queue depth from the start, not after the third mystery outage. Run the first three weeks of jobs in parallel with the API and diff the outputs. The diff is the thing that tells you whether the self-hosted model is good enough for the work.
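For structured jobs like the ticket classifier, the diff can be as simple as a per-field agreement rate between the two runs. The sketch below assumes both runs produced one dict per record, in the same order. The function name is hypothetical, and what counts as "good enough" agreement is a judgment call for your team.

```python
def field_agreement(api_rows: list, local_rows: list, fields: list) -> dict:
    """Share of records on which the two runs agree, per field."""
    if len(api_rows) != len(local_rows):
        raise ValueError("both runs must cover the same records in the same order")
    if not api_rows:
        raise ValueError("nothing to compare")
    total = len(api_rows)
    return {
        field: sum(a.get(field) == b.get(field) for a, b in zip(api_rows, local_rows)) / total
        for field in fields
    }


# Tiny illustrative example (made-up rows, not real results).
api = [{"category": "billing", "sentiment": "negative"},
       {"category": "bug", "sentiment": "neutral"}]
local = [{"category": "billing", "sentiment": "angry"},
         {"category": "bug", "sentiment": "neutral"}]
print(field_agreement(api, local, ["category", "sentiment"]))
```

Fields with low agreement are the ones to read by hand. Sometimes the local model is wrong, and sometimes the two models just draw the line between "negative" and "angry" differently.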
And keep the API account open. You will absolutely have a job that the self-hosted model can't do well, and the right answer is "send this one to Claude, send the other 12,000 to the local box." The goal is not to delete the API. The goal is to stop seeing five-figure bills for work a workstation can do in a closet.