Measuring Memory: probing the parametric layer

The 3×2 GEO matrix. Each cell is one chapter of the series. You're reading the highlighted cell.

Memory is the hardest knowledge mode to measure because nothing visits your server at the moment it works. Search leaves queries and citations. Fetch leaves URL access patterns. Memory leaves only the answer: a model recalling a brand, category, fact, or association from its parametric weights. Measuring it means designing a closed-book exam for AI systems.

1. Why Memory measurement is harder than Search

Search measurement has observable artefacts. You can inspect search citations, crawl logs, and fetched URLs. Memory has none of that. If a model recommends Elisa, DNA, or Telia from its internal knowledge, there is no GPTBot hit to grep and no citation list to parse. The evidence is the generated recommendation itself.

That makes Memory both more valuable and more dangerous to measure casually. You are not testing whether the model can find a current page. You are testing whether the brand already exists strongly enough in the model's learned associations to appear without help.

2. The Combot methodology: tools off

Combot isolates Memory by turning off external tools. The probe is configured so that web_search, web_fetch, and URL-context paths cannot assist the answer. In API terms, the research baseline is: Anthropic with tool_choice: {"type": "none"}, OpenAI with tool_choice: "none", and Gemini with the function_calling_config mode set to NONE. In each case the request also carries no search, fetch, or grounding tools.

The point is not to mimic a consumer chat interface. The point is experimental control. If tools are available, the model may silently turn a Memory test into a Search test. Our internal measurement framework is explicit about this distinction: Memory is model-weight knowledge with no tools enabled; Search and Fetch are separate knowledge modes with separate failure modes.
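A minimal sketch of how a probe runner might enforce that control before any request is sent. Only the three setting values come from the baseline described above; the dictionary layout, the nested mode key for Gemini, and the function names build_probe_config and assert_tools_off are illustrative assumptions, not Combot's implementation or any vendor SDK call.

```python
# Hypothetical tools-off settings, keyed by vendor. Only the setting values
# come from the text; the layout and key nesting are assumptions.
TOOLS_OFF_SETTINGS = {
    "anthropic": {"tool_choice": {"type": "none"}},
    "openai": {"tool_choice": "none"},
    "gemini": {"function_calling_config": {"mode": "NONE"}},
}


def build_probe_config(vendor: str, prompt: str) -> dict:
    """Return a request-shaped dict for a Memory probe (illustrative only)."""
    if vendor not in TOOLS_OFF_SETTINGS:
        raise ValueError(f"no tools-off setting recorded for {vendor!r}")
    # An empty tools list makes the "nothing attached" intent explicit.
    config = {"vendor": vendor, "prompt": prompt, "tools": []}
    config.update(TOOLS_OFF_SETTINGS[vendor])
    return config


def assert_tools_off(config: dict) -> None:
    """Refuse to run a probe that could silently become a Search test."""
    if config.get("tools"):
        raise ValueError("Memory probe must not carry tool definitions")
    expected = TOOLS_OFF_SETTINGS[config["vendor"]]
    for key, value in expected.items():
        if config.get(key) != value:
            raise ValueError(f"{key} is not set to the tools-off value")
```

Running the guard on every probe, rather than trusting a shared default, keeps a stray tool definition from contaminating a Memory cell unnoticed.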

3. Declarative top-3 prompts and binary inclusion

Open-ended prompts are too noisy for board metrics. A model asked to "discuss the best providers" may mention a brand several times in one run and not at all in another, simply because the essay took a different path. Our internal cadence framework rejects that format in favour of declarative prompts: "Name your top 3 recommendations for X. Provide exactly one sentence of rationale for each."

That converts a volatile mention count into a binary inclusion score. If the brand appears in the top three, the cell scores 1. If it does not, the cell scores 0. This is deliberately strict. Memory visibility is not "did the model talk about us somewhere"; it is "did the model put us on the shortlist when unaided?"
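The scoring rule can be sketched in a few lines. This assumes the model returns its shortlist as a numbered or bulleted list, one recommendation per line, and that a brand is matched by a list of aliases; both are assumptions for illustration, and the function names are hypothetical.

```python
import re


def top3_items(answer: str) -> list[str]:
    """Return the text of the first three numbered or bulleted list items."""
    items = []
    for line in answer.splitlines():
        match = re.match(r"^\s*(?:\d+[.)]|[-*•])\s+(.*\S)", line)
        if match:
            items.append(match.group(1))
    return items[:3]


def inclusion_score(answer: str, brand_aliases: list[str]) -> int:
    """Score 1 if any alias appears as a whole word in the top three items."""
    for item in top3_items(answer):
        for alias in brand_aliases:
            pattern = rf"(?<!\w){re.escape(alias)}(?!\w)"
            if re.search(pattern, item, re.IGNORECASE):
                return 1
    return 0


# Made-up answer text, used only to exercise the scorer.
sample = (
    "1. Elisa: Broad network coverage.\n"
    "2. DNA: Competitive pricing.\n"
    "3. Telia: Strong roaming options.\n"
    "4. Another provider: Mentioned outside the shortlist."
)
print(inclusion_score(sample, ["Telia"]))             # 1
print(inclusion_score(sample, ["Another provider"]))  # 0
```

A fourth item or a mention in surrounding prose does not count, which is the strictness the binary score is meant to enforce.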

4. Per-model Memory variance

Parametric memory is not shared across vendors. Sonnet, GPT, Gemini, and Grok are trained on different corpora, tuned differently, and updated on different schedules. A brand can be obvious to one model and absent from another. Measuring only one model is therefore a sampling error, not a strategy.

Model family	Tools-off probe	Top-3 inclusion	What variance means
Sonnet	✓	Binary 1/0	Anthropic's learned associations recall or omit the brand.
GPT	✓	Binary 1/0	OpenAI's learned associations recall or omit the brand.
Gemini	✓	Binary 1/0	Google's model family recalls or omits the brand without grounding.
Grok	✓	Binary 1/0	A separate model family gives an additional consensus check.
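One way to make that variance visible is to aggregate the binary cells per model family. The sketch below assumes each cell is a (model, prompt_id, score) tuple; that shape and the function name are hypothetical.

```python
from collections import defaultdict


def inclusion_rate_by_model(cells):
    """Return the share of tools-off probes in which each model listed the brand.

    cells: iterable of (model, prompt_id, score) with score 0 or 1.
    """
    totals = defaultdict(lambda: [0, 0])  # model -> [hits, probes]
    for model, _prompt_id, score in cells:
        totals[model][0] += score
        totals[model][1] += 1
    return {model: hits / probes for model, (hits, probes) in totals.items()}
```

A large gap between two models on the same prompt set points to a per-model Memory gap rather than a weakness in the prompts themselves.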

5. Training-cutoff effects

Memory is a lagging indicator. A product launch, award, pricing change, or reputation event cannot reliably appear in a model's parametric layer until a future training cycle includes enough evidence. At zero months after a cutoff, new events should be treated as absent from Memory. At three and six months, Search may already reflect the change while Memory still trails. At twelve months, a new base model may shift the baseline abruptly.

This is why a Memory dip in Pulse is not, on its own, cause for panic. A Search or Fetch gain can be fast. Memory movement is slower and usually needs entity work, canonical secondary sources, and time.

6. AI SOR for Memory

Memory SOR is the Memory slice of Combot's AI Share of Recommendation formula. In the weighted version, each probe cell contributes the product of its prompt weight, model weight, channel weight, and is_recommended. For a Memory-only view, the channel is fixed to memory, and is_recommended is the binary top-3 inclusion score from the declarative prompt.

This matters commercially because not all prompts have equal value and not all models have equal importance. A high-intent category prompt should not carry the same weight as a low-value informational prompt. The metric is designed to produce a board-ready score while still allowing the Visibility matrix to show the underlying prompt, model, and mode cells.
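A rough sketch of the weighted calculation follows. The per-cell product is as described above. Dividing by the total available weight, so the score lands between 0 and 1, is an assumption made here for illustration; the Cell type and the memory_sor function are likewise hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Cell:
    prompt_weight: float
    model_weight: float
    channel: str         # "memory", "search", or "fetch"
    is_recommended: int  # binary top-3 inclusion, 0 or 1


def memory_sor(cells, channel_weight=1.0):
    """Weighted share of Memory cells in which the brand made the top three.

    ASSUMPTION: the score is normalised by total available weight. With the
    channel fixed to memory, channel_weight then cancels out of the ratio.
    """
    memory_cells = [c for c in cells if c.channel == "memory"]
    weights = [
        c.prompt_weight * c.model_weight * channel_weight for c in memory_cells
    ]
    total = sum(weights)
    if total == 0:
        return 0.0
    earned = sum(w * c.is_recommended for w, c in zip(weights, memory_cells))
    return earned / total
```

With a high-intent prompt weighted 3.0 and recalled, and an informational prompt weighted 1.0 and missed, the sketch yields 0.75 rather than the unweighted 0.5, which is the effect the weighting is meant to have.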

7. Brand recall versus brand recognition

Memory probes measure recall. The model receives a category or problem and must retrieve the brand unaided. Search and Fetch can test recognition: the model sees pages, citations, or URLs and decides whether the brand is relevant. Both are useful, but recall is the harder test.

For executives, this distinction is simple: if the AI has to look you up, you have a retrieval presence. If it names you without looking, you have Memory presence. Combot keeps those modes separate so the fix is clear. Memory problems need durable brand and entity work. Search problems need retrieval engineering. Fetch problems need URL readability.

8. Model-consensus detection

The strongest Memory signal is not one model listing the brand. It is several independent model families converging on the same recommendation set in a tools-off probe. Internally, this is model-consensus detection. Avoid the misleading term "model collusion": nothing is colluding. The systems are independently recalling the same shortlist.

Consensus is worth flagging because it separates a one-model quirk from a cross-corpus association. If Sonnet, GPT, and Gemini all produce the same top-three set without live retrieval, that shortlist has become part of the models' settled picture of the category. If one model diverges, the Visibility drill-down shows where the variance sits.
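Consensus detection can be sketched as a comparison of shortlists treated as unordered sets. The exact-set comparison, the case-insensitive normalisation, and the report fields below are assumptions for illustration; a production version would also need alias handling.

```python
def normalise(shortlist):
    """Treat a shortlist as an unordered, case-insensitive set of names."""
    return frozenset(name.strip().casefold() for name in shortlist)


def consensus_report(shortlists):
    """Compare tools-off top-3 sets across model families.

    shortlists: dict mapping model name -> list of recommended brand names.
    """
    if not shortlists:
        raise ValueError("at least one model shortlist is required")
    sets = {model: normalise(names) for model, names in shortlists.items()}
    counts = {}
    for s in sets.values():
        counts[s] = counts.get(s, 0) + 1
    # On a tie, the set seen first wins; a real system would flag ties.
    majority_set, majority_count = max(counts.items(), key=lambda kv: kv[1])
    divergent = sorted(m for m, s in sets.items() if s != majority_set)
    return {
        "full_consensus": majority_count == len(sets),
        "majority_set": sorted(majority_set),
        "divergent_models": divergent,
    }
```

The divergent_models list is the starting point for the drill-down: it names the model families whose recall differs from the shared shortlist.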

9. See Memory SOR in Combot

Combot surfaces this as a workflow, not a vanity chart. Pulse shows the high-level Memory SOR movement alongside Search and Fetch. Visibility lets the team drill into the exact prompt and model cells. Competitors shows whether rivals are winning the closed-book test. Monitoring and Alerts remain important for URL health and incident response, but they do not replace Memory measurement because Memory has no live URL to inspect.

The practical next step is to pair this measurement with the optimisation playbook. Use Memory probes to find where the model does not recall you, then use entity, source, and reputation work to change the training-data footprint future models learn from.

Paired post: Optimising Memory

Series: Optimising Memory · Optimising Search · Optimising Fetch · Measuring Memory · Measuring Search · Measuring Fetch

Foundation: AI Knowledge Modes — Memory · Search · Fetch