Optimising for Memory: how brands earn durable presence in LLM training data

[Figure: the 3×2 GEO matrix. Each cell is one chapter of the series; this post is the highlighted cell.]

Memory is the slowest part of AI visibility to move, and the hardest to fake. Search can find a new page next week. Fetch can read a fixed URL today. Memory changes when enough durable public evidence is absorbed into future model training. For high-Memory prompts, Combot treats this as the long-arc work: entity discipline, independent sources, structured identifiers, and community consensus.

1. Why Memory matters

Memory is the parametric layer: knowledge baked into model weights during training. It is what answers when tools are off, when search fails, or when the prompt asks for broad category judgement rather than a current fact. Combot's research frames high-level category discovery as heavily Memory-led; the model synthesises its learned semantic graph instead of quoting a single live result.

That makes Memory the durable layer. It is slow to earn, slow to lose, and strategically different from Search or Fetch. The foundation post, AI Knowledge Modes, separates these modes because one vague "AI visibility" score cannot tell a team whether to fix crawler access, URL readability, or the brand's public entity footprint.

The consequence for SEO teams is uncomfortable: Memory does not reward last-click tactics. It rewards repeated, stable, independent descriptions of the brand across sources that are likely to survive crawling, filtering, deduplication and model-release cycles. The work looks less like a launch campaign and more like institutional record keeping.

2. The four inputs to Memory

The safe playbook starts with inputs we can actually name. Combot's research identifies four: broad web corpora such as Common Crawl, curated encyclopaedic sources such as Wikipedia's organisation notability guidance, structured entity stores such as Wikidata, and canonical secondary coverage that describes the brand independently.

Those inputs do different jobs. Owned pages state the canonical facts. Independent sources prove that the facts are not merely self-asserted. Wikipedia and Wikidata help reconcile the entity. Community discussion supplies the language real users apply to the product, even when that language is uncomfortable.

Source type	Memory value	Can you directly control it?
Owned evergreen pages	Consistent factual language for crawled corpora	✓
Independent secondary sources	Notability and third-party corroboration	✗
Wikipedia / Wikidata	Entity disambiguation and durable identifiers	✗
Community discussion	Language and sentiment models may later absorb	✗

3. Wikipedia is load-bearing

Wikipedia cannot be hacked into existence. Wikipedia's notability guideline for organisations says an organisation is generally notable only when it has significant coverage in reliable, independent secondary sources. Trivial, incidental, paid, or self-authored material is not enough. That is why the Memory playbook starts outside Wikipedia: earn serious coverage first, then let the encyclopaedic layer reflect reality.

Wikidata is the structured companion. Its introduction describes Wikidata as a structured knowledge base, and its Q identifiers are useful because they give an entity a stable machine-readable node. The goal is not to spam an identifier everywhere. The goal is to make it unambiguous that the website, Wikipedia/Wikidata entity, social profiles, and recognised secondary sources all describe the same organisation.

For large brands, the practical audit is simple: do the Wikipedia article, Wikidata item, corporate site, investor profile, LinkedIn page and major third-party profiles agree on the organisation name, parent entity, market, product categories and official URL? If they do not, a model has to guess which node is authoritative.
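The audit above can be sketched as a field-by-field comparison. This is a minimal illustration, assuming the values have already been collected by hand or from exports; the source names, field names and sample values are all hypothetical.

```python
from collections import defaultdict

# Hypothetical, hand-collected profile data. Field names are assumptions.
PROFILES = {
    "corporate_site": {
        "name": "Example Corp",
        "official_url": "https://example.com/",
        "parent": "Example Holdings",
    },
    "wikidata": {
        "name": "Example Corp",
        "official_url": "https://example.com",
        "parent": "Example Holdings",
    },
    "linkedin": {
        "name": "Example Corporation",
        "official_url": "https://www.example.com/",
        "parent": "Example Holdings",
    },
}


def normalise(field, value):
    """Light normalisation so trivial formatting differences are not flagged."""
    value = value.strip().lower()
    if field == "official_url":
        value = value.removeprefix("https://").removeprefix("http://")
        value = value.removeprefix("www.").rstrip("/")
    return value


def find_conflicts(profiles):
    """Return {field: {normalised_value: [sources]}} for fields where sources disagree."""
    seen = defaultdict(lambda: defaultdict(list))
    for source, fields in profiles.items():
        for field, value in fields.items():
            seen[field][normalise(field, value)].append(source)
    return {field: dict(values) for field, values in seen.items() if len(values) > 1}


if __name__ == "__main__":
    for field, variants in find_conflicts(PROFILES).items():
        print(field, variants)
```

In this made-up sample only the organisation name disagrees, because the URL variants collapse to the same host after normalisation. Requires Python 3.9 or later for `str.removeprefix`.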

4. Evergreen content velocity

Memory rewards evergreen evidence, not campaign spikes. Common Crawl publishes large web crawl archives, but a crawl is only an input to a possible future training set. The delay from publishing to durable Memory is therefore long. Combot's working estimate is a lag of six to twelve months for frontier-model Memory effects, depending on crawl inclusion, deduplication, training, and release cycles.

Use server logs to make this less mystical. Track CCBot and other training crawlers, then connect crawl evidence to future Memory probes. The operational companion is our server logs and AI visibility post: logs cannot prove a model learned you, but they can prove whether the public evidence was available to be learned.
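A rough sketch of the log step, assuming access logs in the common "combined" format. CCBot is the crawler named above; GPTBot and ClaudeBot are added here only as illustrative tokens, and the sample lines are invented. Check each operator's documentation for current user-agent strings, and remember that user agents can be spoofed.

```python
import re
from collections import Counter

# Substrings to look for in the User-Agent header (illustrative list, see lead-in).
CRAWLER_TOKENS = ("CCBot", "GPTBot", "ClaudeBot")

# Combined Log Format:
# ip ident user [time] "METHOD path proto" status bytes "referer" "user-agent"
LOG_LINE = re.compile(
    r'^\S+ \S+ \S+ \[(?P<time>[^\]]+)\] "(?P<method>\S+) (?P<path>\S+)[^"]*" '
    r'(?P<status>\d{3}) \S+ "[^"]*" "(?P<agent>[^"]*)"'
)


def crawler_hits(lines):
    """Count successful (2xx) fetches per (crawler token, path)."""
    hits = Counter()
    for line in lines:
        match = LOG_LINE.match(line)
        if not match or not match["status"].startswith("2"):
            continue  # unparseable line, or the crawler did not get the content
        agent = match["agent"].lower()
        for token in CRAWLER_TOKENS:
            if token.lower() in agent:
                hits[(token, match["path"])] += 1
    return hits


# Invented sample lines for illustration only.
SAMPLE = [
    '203.0.113.5 - - [10/Mar/2025:06:12:01 +0000] "GET /about HTTP/1.1" 200 5120 "-" "CCBot/2.0 (https://commoncrawl.org/faq/)"',
    '203.0.113.9 - - [10/Mar/2025:06:13:44 +0000] "GET /pricing HTTP/1.1" 404 312 "-" "CCBot/2.0 (https://commoncrawl.org/faq/)"',
    '198.51.100.7 - - [10/Mar/2025:06:15:02 +0000] "GET /about HTTP/1.1" 200 5120 "-" "Mozilla/5.0"',
]

if __name__ == "__main__":
    for (crawler, path), count in crawler_hits(SAMPLE).items():
        print(crawler, path, count)
```

Counting only 2xx responses is a deliberate choice: a crawler that received a 404 or a block page saw no evidence worth learning.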

This is also why stale evergreen pages are a Memory problem, not just a conversion problem. If the best crawlable explanation of your category is two years old, future models may learn the old framing. Refresh the canonical pages when facts change, but avoid fake freshness: new dates without new evidence do not create better training material.

5. Reddit and Hacker News density

Reddit and Hacker News are important community surfaces, but specific deal terms and percentage claims about them are hard to verify, so we do not publish them as facts here. Keep the lesson and drop the false precision: community language matters because it creates repeated, independent descriptions of how people evaluate a product, support experience, pricing model, or technical architecture.

For a brand team, the practical work is not manipulating communities. It is reducing the distance between marketing claims and public consensus. If a technical audience repeatedly describes your product in a way your site never acknowledges, the Memory problem is not schema. It is reputation and language drift.

6. schema.org sameAs anchoring

The on-site layer is schema.org Organization. Use JSON-LD to describe the organisation, then use sameAs to point at reference URLs that unambiguously identify the same entity. Schema.org describes sameAs as a URL of a reference page that indicates the item's identity, including examples such as Wikipedia, Wikidata, or an official website.

A clean homepage graph should include the organisation name, URL, logo, identifiers where appropriate, and sameAs links for stable public profiles. This does not force a model to believe you. It removes avoidable ambiguity when crawlers and training pipelines reconcile entities.

Keep the graph restrained. A short, accurate @graph that joins the organisation to its canonical URLs is stronger than a bloated schema payload with speculative properties. The job is disambiguation, not decoration.
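As an illustration of a restrained graph, the sketch below builds a small Organization node in Python and serialises it as a JSON-LD script tag. Every name, URL and the Wikidata Q identifier are placeholders, and `to_script_tag` is a hypothetical helper, not part of any library.

```python
import json

# Placeholder values throughout: swap in the organisation's real canonical URLs.
organization = {
    "@context": "https://schema.org",
    "@graph": [
        {
            "@type": "Organization",
            "@id": "https://example.com/#organization",
            "name": "Example Corp",
            "url": "https://example.com/",
            "logo": "https://example.com/logo.png",
            "sameAs": [
                "https://www.wikidata.org/wiki/Q00000000",  # placeholder Q-id
                "https://en.wikipedia.org/wiki/Example_Corp",
                "https://www.linkedin.com/company/example-corp",
            ],
        }
    ],
}


def to_script_tag(data):
    """Serialise a JSON-LD dict into a script tag for the page head."""
    body = json.dumps(data, indent=2, ensure_ascii=False)
    # Escape "</" so a stray value cannot close the script element early;
    # "\/" is a valid JSON escape for "/".
    body = body.replace("</", "<\\/")
    return f'<script type="application/ld+json">\n{body}\n</script>'


if __name__ == "__main__":
    print(to_script_tag(organization))
```

The node carries only identity properties and three stable reference URLs, which is the point: every entry in `sameAs` should be a page that unambiguously describes the same organisation.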

7. What does not help Memory

Paid press releases do not create Wikipedia-grade notability. Link buying does not create semantic consensus. AI-generated content farms do not create durable authority. If the public web receives the same shallow paragraph syndicated across dozens of low-trust pages, the best case is that it is ignored; the worst case is that it teaches the model a vague, low-quality association.

Memory optimisation is closer to evidence architecture than link building. Publish canonical evergreen pages, earn independent coverage, keep facts consistent, fix entity ambiguity, and let community feedback surface real product issues.

The same rule applies to AI-generated slop. More text is not more Memory if it adds no new independent evidence. A model training pipeline has every incentive to down-rank duplication, boilerplate and low-information pages. Optimise the density of true, corroborated statements instead.
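One way to spot duplication in your own footprint before anyone else does is a near-duplicate check. The sketch below uses Jaccard similarity over word shingles, a generic text-similarity technique; it is not a description of any vendor's filtering pipeline, which is not public. The page texts and the 0.8 threshold are assumptions for illustration.

```python
import re
from itertools import combinations


def shingles(text, size=3):
    """Return the set of overlapping word n-grams (shingles) in the text."""
    words = re.findall(r"[a-z0-9']+", text.lower())
    return {tuple(words[i:i + size]) for i in range(len(words) - size + 1)}


def jaccard(a, b):
    """Jaccard similarity of two sets; 0.0 when both are empty."""
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


def near_duplicates(pages, threshold=0.8):
    """Return (url_a, url_b, similarity) for page pairs at or above the threshold."""
    sets = {url: shingles(text) for url, text in pages.items()}
    results = []
    for u, v in combinations(sorted(sets), 2):
        score = jaccard(sets[u], sets[v])
        if score >= threshold:
            results.append((u, v, round(score, 2)))
    return results


# Invented page texts keyed by hypothetical URLs.
PAGES = {
    "https://example.com/about": "Example Corp builds durable widgets for industrial customers",
    "https://syndication.example/pr-1": "Example Corp builds durable widgets for industrial customers",
    "https://review.example/widgets": "An independent review of widget durability across five vendors",
}

if __name__ == "__main__":
    for pair in near_duplicates(PAGES):
        print(pair)
```

Pages that score near 1.0 against each other add volume without adding evidence; the independent review in the sample shares no shingles with the syndicated copy and is the kind of page worth earning.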

8. Anti-pattern: "we are in GPT-5.5 training data"

No agency can verify that claim from outside the model maker. The constraint is plain: OpenAI, Anthropic, and peers do not publish exact post-training data composition, weighting algorithms, or proprietary filtering blacklists. Saying a brand is "in the training data" is theatre unless the vendor says so.

Use observable language instead. You can verify inputs: crawlers hit the site, independent sources exist, schema is present, Wikidata is aligned. You can verify outputs: tools-off prompts recall or omit the brand. You cannot inspect the weights.

That wording matters with clients. The promise is not certainty about a hidden training set. The promise is disciplined evidence: better public inputs, cleaner entity links, and repeated tools-off output tests that show whether the model now recalls the brand unaided.

9. Measure your Memory presence

The next step is measurement. Run unaided prompts with tools off, score whether the brand appears in the top recommendations, and separate Memory from Search and Fetch. That is the only way to tell whether long-arc entity work is changing model behaviour.
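A minimal sketch of the scoring step, assuming the tools-off answers have already been collected and reduced to ordered lists of recommended brands. The model names, brand names and run data are hypothetical, and this is an illustration of binary inclusion scoring rather than Combot's production method.

```python
from statistics import mean, pvariance

# Hypothetical probe results: for each model, one ordered list of
# recommended brands per tools-off run.
RUNS = {
    "model_a": [
        ["Acme", "Example Corp", "Globex"],
        ["Acme", "Globex", "Initech"],
        ["Example Corp", "Acme", "Globex"],
        ["Acme", "Example Corp", "Initech"],
    ],
    "model_b": [
        ["Globex", "Initech", "Acme"],
        ["Globex", "Acme", "Example Corp"],
    ],
}


def inclusion_scores(runs, brand, top_n=3):
    """Binary score per run: 1 if the brand is in the first top_n names, else 0."""
    target = brand.casefold()
    return [
        int(target in (name.casefold() for name in run[:top_n]))
        for run in runs
    ]


def summarise(runs_by_model, brand):
    """Per-model inclusion rate and variance across repeated runs."""
    summary = {}
    for model, runs in runs_by_model.items():
        scores = inclusion_scores(runs, brand)
        summary[model] = {
            "rate": mean(scores),
            "variance": pvariance(scores),
            "runs": len(scores),
        }
    return summary


if __name__ == "__main__":
    for model, stats in summarise(RUNS, "Example Corp").items():
        print(model, stats)
```

Keeping the score binary and reporting it per model avoids averaging away the fact that one model may recall the brand consistently while another never does.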

Read Measuring Memory next. It turns this playbook into a repeatable probe: declarative top-three prompts, binary inclusion scoring, per-model variance, and Memory SOR in Pulse and Visibility.
