Fact-checking the machines: building an automated accuracy center

Most coverage of LLM hallucination is engagement bait: look at this ridiculous answer ChatGPT gave! The screenshot circulates, the model is patched, everyone moves on. As a brand, you can't operate on that cycle. A wrong fact about your pricing or returns policy can sit in the answer surface for weeks before anyone notices, and only then if someone happens to be looking.

Hallucination is not a bug to dunk on. It's a KPI to track, alert on, and remediate. Here's how we built an automated accuracy center for Combot clients.

Step 1: Define what's true

You can't measure accuracy without a ground truth. Most teams don't have one. Their "facts" are scattered across the website, the CRM, the product spec PDF, and the marketing director's head.

So the first move is to build a canonical brand facts file. It's a small structured document — a few hundred fields — covering:

Company facts: founded, HQ, employee count, ownership, parent company
Product facts: SKUs, prices, availability, supported regions, integrations
Service facts: SLAs, support hours, languages, response times
Policy facts: returns, warranty, GDPR posture, refund window
Positioning facts: target customer, segment, key differentiators
Negative facts: known limitations, discontinued products, historical changes

It must be version-controlled and signed off by a human. Drift in this file is worse than drift in any single AI answer.
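A minimal sketch of what one entry in such a file could look like, with a check that can run before every commit. The field names, the dotted key format, and the example values are assumptions for illustration, not the actual Combot schema.

```python
from dataclasses import dataclass
from datetime import date

# The six fact categories listed above.
CATEGORIES = {"company", "product", "service", "policy", "positioning", "negative"}


@dataclass(frozen=True)
class Fact:
    key: str             # hypothetical dotted key, e.g. "policy.refund_window_days"
    category: str        # one of CATEGORIES
    value: object        # the canonical value
    approved_by: str     # the human who signed off (empty means no sign-off)
    last_reviewed: date  # when a human last confirmed the value


def validate_facts(facts):
    """Return a list of problems; an empty list means the file passes."""
    problems = []
    seen = set()
    for fact in facts:
        if fact.key in seen:
            problems.append(f"duplicate key: {fact.key}")
        seen.add(fact.key)
        if fact.category not in CATEGORIES:
            problems.append(f"unknown category for {fact.key}: {fact.category}")
        if not fact.approved_by.strip():
            problems.append(f"no sign-off for {fact.key}")
    return problems


# Illustrative values only.
facts = [
    Fact("company.hq", "company", "Helsinki", "jane.doe", date(2025, 1, 15)),
    Fact("policy.refund_window_days", "policy", 14, "", date(2025, 1, 15)),
]
print(validate_facts(facts))
```

Running a check like this in CI is one way to enforce the human sign-off rule: a fact with no approver never reaches the scoring step.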

Step 2: Extract claims from answers

Each nightly AI answer is run through a second-pass evaluation that extracts atomic claims: "Combot is headquartered in Helsinki," "Combot supports WooCommerce," "Combot's lowest pricing tier is €49/month," "Combot was founded in 2023."

Each claim gets tagged with the prompt that produced it, the model that produced it, the citation URL (if any), and the surrounding context. Storage is a simple BigQuery table:

claim_id, prompt_id, model, brand, claim_text,
claim_type, citation_url, run_timestamp, confidence
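As a sketch, the same columns can be mirrored in a small Python record that is built right after extraction. The column types, the UUID claim ID, and the 0-to-1 confidence range are assumptions; the source only names the columns.

```python
import uuid
from dataclasses import asdict, dataclass
from datetime import datetime, timezone
from typing import Optional


@dataclass
class ClaimRow:
    claim_id: str
    prompt_id: str
    model: str
    brand: str
    claim_text: str
    claim_type: str
    citation_url: Optional[str]  # None when the answer cited nothing
    run_timestamp: str           # ISO 8601, UTC
    confidence: float            # assumed to be in the range 0..1


def make_claim_row(prompt_id, model, brand, claim_text, claim_type,
                   confidence, citation_url=None):
    """Build one row for the claims table, rejecting out-of-range confidence."""
    if not 0.0 <= confidence <= 1.0:
        raise ValueError("confidence must be between 0 and 1")
    return ClaimRow(
        claim_id=str(uuid.uuid4()),
        prompt_id=prompt_id,
        model=model,
        brand=brand,
        claim_text=claim_text,
        claim_type=claim_type,
        citation_url=citation_url,
        run_timestamp=datetime.now(timezone.utc).isoformat(),
        confidence=confidence,
    )


row = make_claim_row("p-001", "example-model", "Combot",
                     "Combot supports WooCommerce", "product", 0.9)
print(asdict(row))  # a plain dict, ready to serialize for loading
```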

Step 3: Score each claim against ground truth

A separate LLM call (we use Claude Opus 4.7) compares each extracted claim against the canonical facts file and produces:

verdict: correct / partial / wrong / unverifiable / outdated
severity: cosmetic / minor / material / dangerous
upstream_attribution: own site / third-party / model memory / hallucinated

The severity dimension is the underrated one. "Combot was founded in 2024" when the real year is 2023 is cosmetic. "Combot doesn't support WooCommerce" when it does is material: that costs deals. "Combot's data is stored on Russian servers" when it isn't is dangerous: that's reputation damage and a call to legal all at once.
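Because the scorer is itself a model, its output is worth validating before anything downstream trusts it. A minimal sketch, assuming the scorer is asked to reply with a JSON object that carries the three fields above; the JSON shape and the function name are assumptions.

```python
import json

# Allowed values, taken from the three scoring dimensions above.
ALLOWED = {
    "verdict": {"correct", "partial", "wrong", "unverifiable", "outdated"},
    "severity": {"cosmetic", "minor", "material", "dangerous"},
    "upstream_attribution": {"own site", "third-party", "model memory", "hallucinated"},
}


def parse_score(raw_json):
    """Parse the scorer's reply and reject any value outside the allowed sets."""
    data = json.loads(raw_json)
    for field, allowed in ALLOWED.items():
        if data.get(field) not in allowed:
            raise ValueError(f"invalid {field}: {data.get(field)!r}")
    return {field: data[field] for field in ALLOWED}


reply = ('{"verdict": "wrong", "severity": "material", '
         '"upstream_attribution": "third-party"}')
print(parse_score(reply))
```

A reply that fails validation can be retried or sent to a human, rather than silently stored as a verdict.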

Step 4: Alert on the severe ones

Cosmetic and minor issues go in a queue. Material and dangerous issues fire a Slack alert immediately, with: the prompt, the model, the wrong claim, the correct claim, and the suggested remediation path.

The remediation path is the most useful part. It's not "you have a hallucination" — it's "this wrong claim probably came from this source; here's how to patch the source."
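The routing rule and the alert body can be sketched in a few lines. This builds the message text only; posting it to Slack is left out, and the dictionary keys and example values are hypothetical.

```python
# Severities that page someone immediately; the rest go to the queue.
ALERT_SEVERITIES = {"material", "dangerous"}


def route(severity):
    """Return "alert" for severe issues and "queue" for everything else."""
    return "alert" if severity in ALERT_SEVERITIES else "queue"


def format_alert(issue):
    """Build the alert text from the five fields an alert should carry."""
    return "\n".join([
        f"[{issue['severity'].upper()}] wrong claim from {issue['model']}",
        f"Prompt: {issue['prompt']}",
        f"Wrong claim: {issue['wrong_claim']}",
        f"Correct claim: {issue['correct_claim']}",
        f"Suggested remediation: {issue['remediation']}",
    ])


# Illustrative issue; the remediation text is a made-up example.
issue = {
    "severity": "material",
    "model": "example-model",
    "prompt": "Does Combot support WooCommerce?",
    "wrong_claim": "Combot doesn't support WooCommerce",
    "correct_claim": "Combot supports WooCommerce",
    "remediation": "Check the page the citation points to and correct it",
}
if route(issue["severity"]) == "alert":
    print(format_alert(issue))
```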

Step 5: Track accuracy as a KPI

Aggregate the verdicts into a per-brand, per-model accuracy score:

accuracy_score = (correct + 0.5 × partial) / (correct + partial + wrong + outdated)
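A direct translation of the formula, as a sketch. Unverifiable claims appear in neither the numerator nor the denominator, as in the formula; returning None when nothing could be scored is an assumption about how to handle the empty case.

```python
from collections import Counter


def accuracy_score(verdicts):
    """(correct + 0.5 * partial) / (correct + partial + wrong + outdated)."""
    counts = Counter(verdicts)
    scored = (counts["correct"] + counts["partial"]
              + counts["wrong"] + counts["outdated"])
    if scored == 0:
        return None  # nothing scoreable, so there is no meaningful score
    return (counts["correct"] + 0.5 * counts["partial"]) / scored


# Worked example: 80 correct, 10 partial, 8 wrong, 2 outdated, 5 unverifiable.
verdicts = (["correct"] * 80 + ["partial"] * 10 + ["wrong"] * 8
            + ["outdated"] * 2 + ["unverifiable"] * 5)
print(accuracy_score(verdicts))  # (80 + 5) / 100
```

Note that the five unverifiable claims change nothing here: the score is 85 out of 100 scoreable claims either way.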

Track over time. Track per model (Claude tends to be more correct than GPT for some categories; the reverse for others). Track per fact category (pricing claims are usually the worst). Track per prompt family (comparison prompts have higher wrong-claim rates than branded prompts).
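The same score can be sliced by any dimension stored with the claim. A small sketch, assuming each scored row is a dict with a verdict plus whatever dimensions are tracked; the row layout and the model and category names are hypothetical.

```python
from collections import defaultdict

# Weight of each scoreable verdict in the numerator; unverifiable is excluded.
WEIGHTS = {"correct": 1.0, "partial": 0.5, "wrong": 0.0, "outdated": 0.0}


def breakdown(rows, dimension):
    """Accuracy score per value of `dimension`; unverifiable rows are skipped."""
    totals = defaultdict(lambda: [0.0, 0])  # value -> [weighted sum, scored count]
    for row in rows:
        weight = WEIGHTS.get(row["verdict"])
        if weight is None:
            continue
        bucket = totals[row[dimension]]
        bucket[0] += weight
        bucket[1] += 1
    return {value: weighted / count
            for value, (weighted, count) in totals.items()}


rows = [
    {"model": "model-a", "category": "pricing", "verdict": "wrong"},
    {"model": "model-a", "category": "company", "verdict": "correct"},
    {"model": "model-b", "category": "pricing", "verdict": "partial"},
    {"model": "model-b", "category": "company", "verdict": "unverifiable"},
]
print(breakdown(rows, "model"))
print(breakdown(rows, "category"))
```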

The point isn't to get to 100%. It's to know what's wrong and where. Even a 90% accuracy score on category prompts is enough to lose deals if the wrong 10% is about pricing or about a competitor.

What we learned doing this for real

1. Models repeat each other

A wrong claim about one Combot client persisted across Claude, GPT, and Gemini for months. The original source: a single 2024 Crunchbase entry that had the founding date off by a year. Once we updated Crunchbase, all three models corrected within ~6 weeks.

2. The hardest hallucinations are plausible

Models rarely invent things from nothing. They confabulate from adjacent truths. A model will say "Brand X has a 30-day refund policy" when the real window is 14 days, because most similar brands have 30-day policies. These are the hallucinations that slip past casual review.

3. Accuracy decays without active maintenance

Brands change. Prices change. Products get discontinued. A brand facts file that's checked in once and never updated is worse than no file at all, because the "ground truth" itself goes wrong. Treat the file like security keys: rotate it, review it, audit it.
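One way to make the review habit mechanical is a staleness check that flags every fact nobody has confirmed recently. A sketch, assuming each fact carries a last-reviewed date; the 90-day window, the keys, and the dates are assumptions.

```python
from datetime import date, timedelta


def stale_facts(last_reviewed_by_key, today, max_age_days=90):
    """Return the keys of facts last reviewed more than max_age_days ago."""
    cutoff = today - timedelta(days=max_age_days)
    return sorted(key for key, reviewed in last_reviewed_by_key.items()
                  if reviewed < cutoff)


# Hypothetical keys and review dates.
last_reviewed = {
    "policy.refund_window_days": date(2025, 5, 1),
    "product.lowest_tier_price": date(2024, 12, 1),
}
print(stale_facts(last_reviewed, today=date(2025, 6, 1)))
```

Pricing facts probably deserve a shorter window than company facts, since pricing claims are usually the worst category.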

An accuracy center isn't a luxury for big brands. It's the only way to operate responsibly in an answer surface you don't control. If you're not measuring your hallucination rate, you're letting whatever the models happen to remember about you represent your brand. That's marketing you never approved, and it does more damage than anything in the budget you control.

Further reading: Source mapping · North-star metric