UTM Hygiene Audit: Broken, Duplicated, Cannibalizing Tags (1,000 URLs, Claude)

Last quarter, a client handed me a Google Sheet with 1,037 tagged landing-page URLs exported from Google Search Console (GSC) and Google Analytics 4 (GA4) and said, "Tell me what's wrong with our UTMs." (UTMs are Urchin Tracking Module parameters: the utm_source, utm_medium, and utm_campaign tags appended to URLs for attribution.) Two hours later, Claude had flagged 312 of them across three categories, and we had a canonical UTM sheet that I have used as the reference for every campaign since.

The workflow is the part I want to share, because auditing UTM hygiene by eye across a thousand rows is exactly the kind of work that makes people quit halfway and pretend the data is fine.

The export

Two files, both last 90 days:

  • GSC export: page, clicks, impressions, query. The UTM string is already inside page.
  • GA4 export: landing page, sessions, source, medium, campaign, utm_content.

I joined them on the URL string in Sheets (a SPLIT on ? plus a VLOOKUP), then exported the joined sheet as audit-input.csv — one row per tagged URL, with the UTM parameters parsed into separate columns.
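I did the parsing in Sheets, but the same step can be sketched in Python for anyone who prefers a script. This is a minimal illustration, not the formula I used: the function name parse_utms and the output column names are assumptions, and it only relies on the standard library's urllib.parse.

```python
from urllib.parse import urlsplit, parse_qsl

# The five standard UTM parameter names; anything else is left out of the columns.
UTM_KEYS = ("utm_source", "utm_medium", "utm_campaign", "utm_content", "utm_term")

def parse_utms(url: str) -> dict:
    """Split a tagged URL into its base page plus one column per UTM parameter."""
    parts = urlsplit(url)
    # keep_blank_values=True so "utm_source=" shows up as an empty value, not a missing key
    params = dict(parse_qsl(parts.query, keep_blank_values=True))
    row = {"page": f"{parts.scheme}://{parts.netloc}{parts.path}"}
    for key in UTM_KEYS:
        row[key] = params.get(key, "")
    return row
```

Calling parse_utms on each URL and writing the resulting dicts with csv.DictWriter gives a file shaped like audit-input.csv. Misspelled parameter names simply come back as empty columns here, which is itself a useful signal.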

The three things Claude actually found

1. Broken UTMs (typos and structural errors). 47 URLs had parameter names or URL structures that GA4 silently ignores: utm-so instead of utm_source, a capitalized UTM_Medium, two ? in the same URL, percent-encoding bugs. The page still loads. The traffic shows up as "Unassigned." Nobody notices for months.

2. Duplicate UTMs (case + spelling variants). This was the bulk — 218 URLs. The same source tagged as google, Google, Goog, and googel across different campaigns. Same medium tagged as cpc, CPC, paid, paidsearch, and paid-search. GA4 treats every variant as a different channel, so the "Paid Search" line in your report is showing maybe 60% of the truth.

3. Cannibalization (utm_source overlapping organic). 47 URLs had utm_source=google + utm_medium=organic — usually pasted by a well-meaning marketer who thought "Google is Google." GA4 honors the explicit UTM and reclassifies what would have been organic search as a paid-ish line. Your "Organic" report is artificially low; your "Paid" is artificially high. Multiply that across 47 URLs and your attribution model is lying to you.
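The mechanical parts of these three categories can be sketched as plain rules. This is an illustration of the checks, not what Claude ran. The variant maps are hypothetical examples built from the values mentioned above, the choice of cpc as the canonical paid medium is my assumption, and percent-encoding bugs are not covered.

```python
from urllib.parse import urlsplit, parse_qsl

VALID_KEYS = {"utm_source", "utm_medium", "utm_campaign", "utm_content", "utm_term"}

# Hypothetical variant -> canonical maps; real ones would come from your own taxonomy.
SOURCE_VARIANTS = {"goog": "google", "googel": "google"}
MEDIUM_VARIANTS = {"paid": "cpc", "paidsearch": "cpc", "paid-search": "cpc"}

def classify(url: str) -> tuple:
    """Return (label, detail) where label is BROKEN, CANNIBALIZE, DUPLICATE, or OK."""
    # Structural error: a second "?" where an "&" should be.
    if url.count("?") > 1:
        return ("BROKEN", "more than one '?' in the URL")
    params = parse_qsl(urlsplit(url).query, keep_blank_values=True)
    # Anything that looks like a UTM key but is not an exact lowercase match is broken.
    for key, _ in params:
        if key.lower().replace("-", "_").startswith("utm") and key not in VALID_KEYS:
            return ("BROKEN", f"unrecognized parameter name {key!r}")
    values = dict(params)
    source = values.get("utm_source", "")
    medium = values.get("utm_medium", "")
    # Explicit tag that collides with organic search classification.
    if source.lower() == "google" and medium.lower() == "organic":
        return ("CANNIBALIZE", "utm_source=google with utm_medium=organic")
    # Case and spelling variants of a canonical value.
    canon_source = SOURCE_VARIANTS.get(source.lower(), source.lower())
    canon_medium = MEDIUM_VARIANTS.get(medium.lower(), medium.lower())
    if (canon_source, canon_medium) != (source, medium):
        return ("DUPLICATE", f"{source}/{medium} -> {canon_source}/{canon_medium}")
    return ("OK", "")
```

Rules like these only catch variants that are already in the maps. A misspelling nobody has listed yet passes as OK, so a rule-based pass is a cross-check on the annotated file, not a replacement for it.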

The prompt I actually used

I gave Claude the CSV plus a short brief:

You are auditing a UTM taxonomy. For each row, classify into one of: BROKEN (typo/structural), DUPLICATE (variant of a canonical value), CANNIBALIZE (utm_source/medium conflicts with organic classification), or OK. Output a new column with the label and a second column with the canonical replacement (e.g. Google → google, paid-search → paidsearch). Do not guess. If a row is OK, leave both columns empty. Return the full file.

The key instruction was the last one — "return the full file." Without it, Claude summarizes instead of operating on every row. With it, the 1,037 rows come back annotated. I review the CANNIBALIZE and BROKEN buckets manually; the DUPLICATE bucket is a 30-second scan because the replacements are mechanical.
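Since the whole point is that every row comes back, it is worth confirming that mechanically before reviewing anything. A minimal sketch of that check follows. The column names url and label are assumptions about how the files are laid out, and the function takes CSV text (read the files however you like).

```python
import csv
import io

# Labels the brief allows; OK rows may also come back with an empty label.
ALLOWED = {"", "BROKEN", "DUPLICATE", "CANNIBALIZE", "OK"}

def check_annotated(original_csv: str, annotated_csv: str,
                    key: str = "url", label_col: str = "label") -> list:
    """Compare the input CSV text with the annotated CSV text; return a list of problems."""
    orig = list(csv.DictReader(io.StringIO(original_csv)))
    ann = list(csv.DictReader(io.StringIO(annotated_csv)))
    problems = []
    if len(orig) != len(ann):
        problems.append(f"row count changed: {len(orig)} -> {len(ann)}")
    # URLs present in the input but absent from the annotated output.
    missing = {r.get(key) for r in orig} - {r.get(key) for r in ann}
    if missing:
        problems.append(f"{len(missing)} URLs missing from annotated file")
    # Labels outside the set the brief asked for.
    bad = sorted({(r.get(label_col) or "") for r in ann} - ALLOWED)
    if bad:
        problems.append(f"unexpected labels: {bad}")
    return problems
```

An empty list means the row count matches, no URL was dropped, and every label is one the brief asked for. It says nothing about whether the labels are right, which is what the manual review is for.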

The canonical sheet

Three tabs. sources (12 approved values, lowercase, no .com), mediums (8 values, matched to GA4 default channel groupings), campaigns (one row per campaign with owner, start date, expected source/medium). Every new campaign starts here. The audit CSV is the source of truth for what was — the canonical sheet is the source of truth for what should be.

The two together close the loop. Without the canonical sheet, the audit is a one-time cleanup. With it, the next 1,000 URLs take ten minutes to validate, not ten hours.
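To make the validation step concrete, here is a rough sketch of checking one new URL's tags against the three tabs. The values in SOURCES, MEDIUMS, and CAMPAIGNS are hypothetical stand-ins, not the contents of my actual sheet, and in practice you would load them from the sheet export.

```python
# Hypothetical stand-ins for the three tabs of the canonical sheet.
SOURCES = {"google", "newsletter"}
MEDIUMS = {"cpc", "email"}
# campaign name -> (expected source, expected medium)
CAMPAIGNS = {"spring_sale": ("google", "cpc")}

def validate(row: dict) -> list:
    """Return a list of problems for one parsed URL row; an empty list means it passes."""
    errors = []
    source = row.get("utm_source", "")
    medium = row.get("utm_medium", "")
    campaign = row.get("utm_campaign", "")
    # Exact match only: "Google" is not "google".
    if source not in SOURCES:
        errors.append(f"utm_source {source!r} is not in the sources tab")
    if medium not in MEDIUMS:
        errors.append(f"utm_medium {medium!r} is not in the mediums tab")
    expected = CAMPAIGNS.get(campaign)
    if expected is None:
        errors.append(f"campaign {campaign!r} is not in the campaigns tab")
    elif (source, medium) != expected:
        errors.append(f"campaign {campaign!r} expects source/medium {expected}")
    return errors
```

The matching is deliberately exact and case-sensitive. The canonical sheet only stays canonical if a capitalized or hyphenated variant fails the check instead of being quietly accepted.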