Source mapping: tracing an LLM's answer back to its roots

When an AI search engine cites a URL, that URL is not the source. It's a witness. The actual source is whoever wrote the paragraph the model paraphrased, and whoever wrote the page that paragraph drew on before that. If you want to influence AI answers, you have to chase the citation upstream, not stop at the link.

We call this source mapping, and it's the most under-discussed part of AI visibility work in 2026.

The pattern

A typical AI answer for a category prompt has 3–5 citations. They tend to fall into four buckets:

The brand's own site — only if the brand has strong, indexable, on-topic content
Editorial third parties — listicles, comparison sites, magazine reviews
Aggregators — G2, Capterra, Trustpilot, TripAdvisor, industry-specific directories
Forums & community — Reddit, Stack Overflow, Slack-style industry communities, niche forums
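
A rough first pass at sorting citation URLs into these four buckets can be done by hostname alone. The sketch below is only an illustration: the domain sets, the bucket labels, and the function name bucket_for are assumptions, and treating every unrecognised host as editorial is a simplification that a real pipeline would refine.

```python
from urllib.parse import urlparse

# Hypothetical starter lists; a real mapping would be maintained per category.
AGGREGATORS = {"g2.com", "capterra.com", "trustpilot.com", "tripadvisor.com"}
COMMUNITY = {"reddit.com", "stackoverflow.com"}


def bucket_for(url: str, brand_domain: str) -> str:
    """Return one of: brand, aggregator, community, editorial."""
    host = (urlparse(url).hostname or "").lower()
    if host.startswith("www."):
        host = host[4:]

    def matches(domain: str) -> bool:
        # Match the domain itself or any subdomain of it.
        return host == domain or host.endswith("." + domain)

    if matches(brand_domain):
        return "brand"
    if any(matches(d) for d in AGGREGATORS):
        return "aggregator"
    if any(matches(d) for d in COMMUNITY):
        return "community"
    # Assumption: anything not recognised is treated as editorial.
    return "editorial"
```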

The cited URL is the answer's surface. The interesting question is: what was the source of the claim in that cited URL? Often it's a Reddit thread from 2023, an analyst blog post from a single consultant, or one well-written comparison article that everyone else paraphrased. Find that, and you've found the real lever.

The mapping procedure

For each high-value prompt, Combot runs this nightly:

Capture the AI answer + every citation URL
Fetch each citation page; extract the paragraph that contains the claim associated with the citation
Run a second-pass LLM evaluation that classifies the paragraph as an original claim, a paraphrase, or an aggregation from an external source
For paraphrases, extract the upstream link (if present) or generate a search query that finds the likely upstream
Walk this graph 1–2 levels deep, recording the chain
Store edges in a citation graph: (prompt, cited_url, claim, source_url, source_type, sentiment)
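
The last two steps can be sketched as a small data structure plus a depth-limited walk. This is a minimal illustration, not the actual implementation: the Edge fields mirror the tuple above, but the text does not define what source_type holds, so it is treated as a free-form label here. The find_upstream callable is a hypothetical stand-in for the fetch, classify, and search steps, and is expected to return the upstream URL or None.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass(frozen=True)
class Edge:
    prompt: str
    cited_url: str
    claim: str
    source_url: str
    source_type: str  # free-form label; not defined in the text
    sentiment: str


def walk_chain(start_url: str,
               find_upstream: Callable[[str], Optional[str]],
               max_depth: int = 2) -> List[str]:
    """Follow upstream links from start_url for at most max_depth hops."""
    chain = [start_url]
    seen = {start_url}
    current = start_url
    for _ in range(max_depth):
        upstream = find_upstream(current)
        # Stop when there is no upstream or the chain loops back on itself.
        if upstream is None or upstream in seen:
            break
        chain.append(upstream)
        seen.add(upstream)
        current = upstream
    return chain


def make_edge(prompt: str, cited_url: str, claim: str, source_type: str,
              sentiment: str,
              find_upstream: Callable[[str], Optional[str]]) -> Edge:
    """Record the deepest page reached as the source of the claim."""
    chain = walk_chain(cited_url, find_upstream)
    return Edge(prompt, cited_url, claim, chain[-1], source_type, sentiment)


# Usage with a hard-coded link table instead of real fetching.
links = {"https://editorial.example/best-tools": "https://forum.example/thread/123"}
edge = make_edge("best tools for X", "https://editorial.example/best-tools",
                 "Tool A is cheapest", "community", "positive", links.get)
```

Capping the walk at two hops follows the 1–2 levels mentioned above, and the seen set prevents two pages that cite each other from looping forever.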

The output is a per-brand source graph. You can query it: "For prompts where we're recommended, what sources tend to surface? For prompts where a competitor wins, what's different about their source set?"
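
A query like that might look as follows. This is a sketch under assumptions: the edge tuple described earlier does not record which brand an answer recommended, so the winners mapping (prompt to recommended brand) is a hypothetical extra input, and edges are reduced to (prompt, source_url) pairs for brevity.

```python
from collections import Counter
from typing import Dict, Iterable, List, Tuple


def sources_by_outcome(edges: Iterable[Tuple[str, str]],
                       winners: Dict[str, str],
                       brand: str) -> Tuple[Counter, Counter]:
    """Count source URLs separately for prompts we win and prompts others win."""
    ours, theirs = Counter(), Counter()
    for prompt, source_url in edges:
        winner = winners.get(prompt)
        if winner is None:
            continue  # no recorded outcome for this prompt
        target = ours if winner == brand else theirs
        target[source_url] += 1
    return ours, theirs


def only_in(a: Counter, b: Counter) -> List[str]:
    """Sources that appear in a but never in b."""
    return sorted(set(a) - set(b))
```

Calling only_in(theirs, ours) lists the sources that surface only when a competitor wins, which is usually the more actionable half of the comparison.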

What we've learned from running this

Three patterns show up consistently:

1. A small number of sources do disproportionate work

In most categories, 5–10 source pages drive the framing for 50%+ of citations. They're the comparison articles, the deep reviews, the well-engaged Reddit threads. If those sources misrepresent your brand, you're going to look misrepresented across half the AI surface. If they recommend you, you're going to look recommended.
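
To check how concentrated a given category is, one option is to count how many of the most frequent source pages are needed to cover a chosen share of citations. The function below is an illustrative sketch; the name sources_covering and the 0.5 default share are assumptions, and the input is simply the list of source_url values from the graph, one per citation.

```python
from collections import Counter
from typing import Iterable, List


def sources_covering(source_urls: Iterable[str], share: float = 0.5) -> List[str]:
    """Return the most frequent sources that together reach the given share."""
    counts = Counter(source_urls)
    total = sum(counts.values())
    if total == 0:
        return []
    covered = 0
    top: List[str] = []
    # most_common() yields sources from most to least cited.
    for url, n in counts.most_common():
        top.append(url)
        covered += n
        if covered / total >= share:
            break
    return top
```

If the returned list is short relative to the number of distinct sources, a handful of pages is doing most of the framing, and those pages are the ones worth reading closely.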

2. Aggregators leak through

G2 / Capterra / Trustpilot reviews don't usually get cited directly, but the editorial content that does get cited routinely paraphrases the average rating and the negative themes. So your aggregator hygiene matters even when aggregators aren't the direct source.

3. Old Reddit threads outweigh new website pages

A 2023 Reddit thread with 200 upvotes will be cited or paraphrased far more often than your beautifully redesigned 2026 product page. The model trusts engagement more than recency. This is fixable — but it means SEO content production is not enough. You also need to invest in the community surfaces where the claims originate.

The remediation playbook

Once you have a source map, the remediation work falls into three buckets:

Repair sources you control. Your About page, product pages, FAQs, documentation. Make them the most authoritative source for the claims you want attached to your brand. That means schema markup, fresh content, and citations of your own sources.
Influence third-party sources you can reach. Reach out to the comparison sites and review aggregators that surface in your top sources. Provide them updated facts. Pitch them new angles. This is PR, not SEO.
Participate in community sources you previously ignored. If a 2023 Reddit thread misrepresents your pricing, the fix is not to ask Reddit to delete it. The fix is to have a current, well-engaged thread with the correct framing that the model weights more heavily.

AI search isn't a black box. It cites its sources. The work is following those citations upstream and treating the resulting graph as the actual surface you optimise against. The brand's own website is one node in that graph — important, but not the whole picture.
