Measuring Search: instrumenting live AI retrieval

The 3×2 GEO matrix. Each cell is one chapter of the series. You're reading the highlighted cell.

Traditional SEO tells you where a page ranks in an index. Search-mode AI measurement tells you which retrieval backend a model queried, which pages were cited, and whether your brand survived the answer synthesis. Here is how to instrument your live Generative Engine Optimisation (GEO) tracking for Search mode.

1. Why Search measurement looks like SEO and is not

For two decades, measurement teams have focused on one core question: "Where do we rank for this query?" That positional mindset does not transfer to AI answers. Search-mode AI answers frequently use live retrieval from indexes, but the observable output is not a static Search Engine Results Page (SERP).

When measuring AI Search, the pertinent question becomes: "When the model is allowed to search, which system did it query, which URLs did it read or cite, and did the synthesised answer ultimately recommend us?" AI Search measurement starts where rank tracking ends: after retrieval, but before recommendation. A page can rank highly on Google and still not be cited by an LLM. Worse, it can be cited by the model and still fail to move the final recommendation in your favour if the contextual sentiment is weak.

This is a fundamental shift. Search mode is not Google organic ranking. The familiar part is prompt sampling; the unfamiliar part is the citation extraction layer. If you import standard SERP logic unchanged, your dashboards will fail to diagnose true generative visibility.

2. The probe instrumentation

The probe you run against AI APIs is not just a prompt runner; it is a retrieval recorder. If you only log the final text of the answer, you lose the part of the system that actually changed: the retrieval step. To measure Search mode reliably, you must enable the distinct search tools provided by each vendor API.

In our own deployed worker architecture, we explicitly configure these tools to log retrieval behaviour:

Anthropic: We push the Search tool as { type: "web_search_20260209", name: "web_search", max_uses: 5 }.
OpenAI: We use tools: [{ type: "web_search" }].
Gemini: Search relies on { google_search: {} }, and URL-fetch is enabled via { url_context: {} }.

We preserve the exact tool list sent, preserve the raw response blob, and extract the resulting citations into a normalised row model. Do not infer a cited URL from a search result unless the final answer or vendor metadata explicitly cites it. By recording whether the model actually decided to utilise the Search tool (which Anthropic and OpenAI decide dynamically based on the prompt), we build a complete picture of the retrieval event.
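The recording step described above can be sketched as follows. The three tool lists are the ones quoted in this section; the `SEARCH_TOOLS` map, the `buildProbeRecord` function, and every field name on the record are hypothetical and shown only to illustrate what "preserve the tool list and the raw blob" might look like in practice.

```javascript
// Tool lists as quoted in this section. The map itself is an illustrative assumption.
const SEARCH_TOOLS = {
  anthropic: [{ type: "web_search_20260209", name: "web_search", max_uses: 5 }],
  openai: [{ type: "web_search" }],
  gemini: [{ google_search: {} }, { url_context: {} }],
};

// Build one storable record per probe run. All field names are hypothetical.
function buildProbeRecord({ vendor, prompt, rawResponse, citations = [] }) {
  const tools = SEARCH_TOOLS[vendor];
  if (!tools) throw new Error(`Unknown vendor: ${vendor}`);
  return {
    vendor,
    prompt,
    // Deep copy so later edits to SEARCH_TOOLS cannot rewrite history.
    tools_sent: JSON.parse(JSON.stringify(tools)),
    // Raw blob kept verbatim so citations can be re-extracted later.
    raw_response: JSON.stringify(rawResponse),
    // Only URLs the answer or vendor metadata explicitly cites.
    citations,
    recorded_at: new Date().toISOString(),
  };
}
```

Keeping the raw response alongside the normalised citations means a later fix to an extractor can be replayed over old probes without re-running them.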

3. SOR-via-Search: the open-book exam

Share of Recommendation (SOR) is our foundational metric, but measuring it through the lens of Search provides tactical clarity. Search SOR is the per-prompt brand inclusion rate when the AI is allowed to use Search tools. If Memory SOR is the closed-book exam (measuring purely parametric knowledge), Search SOR is the open-book exam.

The baseline formula is straightforward:

search_sor = prompts_where_brand_is_recommended_with_search / prompts_run_with_search

However, we recommend a richer, weighted formula that assigns value to each prompt based on commercial intent and model market share:

weighted_search_sor = sum(prompt_weight * brand_included_with_search) / sum(prompt_weight)

Comparing Search SOR with Memory SOR shows where your visibility actually comes from. High Memory SOR coupled with high Search SOR indicates durable brand salience. Low Memory SOR with high Search SOR means your current web footprint is rescuing the answer: the live web is doing the heavy lifting. A brand that only appears when Search is enabled lacks parametric authority and remains fragile to crawling disruptions.
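As a minimal sketch, both formulas in this section reduce to a few lines. The row shape (`weight`, `brandIncluded`) and the function names are assumptions made for illustration; how `weight` is derived from commercial intent and model market share is left open here.

```javascript
// Each row is one prompt run with Search enabled.
// Hypothetical shape: { weight: number, brandIncluded: boolean }.
function searchSor(results) {
  if (results.length === 0) return null; // undefined for an empty sample
  const hits = results.filter((r) => r.brandIncluded).length;
  return hits / results.length;
}

function weightedSearchSor(results) {
  const totalWeight = results.reduce((sum, r) => sum + r.weight, 0);
  if (totalWeight === 0) return null; // avoid dividing by zero
  const hitWeight = results.reduce(
    (sum, r) => sum + (r.brandIncluded ? r.weight : 0),
    0
  );
  return hitWeight / totalWeight;
}

const sample = [
  { weight: 3, brandIncluded: true },
  { weight: 1, brandIncluded: false },
  { weight: 1, brandIncluded: false },
  { weight: 1, brandIncluded: true },
];
console.log(searchSor(sample));         // 2 of 4 prompts: 0.5
console.log(weightedSearchSor(sample)); // weight 4 of 6
```

In this sample the weighted figure is higher than the unweighted one because the brand was included on the heaviest prompt.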

4. Cited-URL extraction per vendor

Citation extraction is the measurement layer. Without it, Search mode is just a black box with a text transcript. A model can search ten times and cite nothing; that is an entirely different operational outcome from citing ten competitor pages.

The core rule is to extract URLs from structured vendor metadata first, using answer-body Markdown extraction only as a fallback. Here is how the formats diverge:

Vendor	Format and Location
Anthropic	Final text content blocks include `citations[]` with `type`, `url`, `title`, and `cited_text`.
OpenAI	Responses API `output[]` includes `web_search_call` and `message.content[].annotations[]` with `type: "url_citation"`.
Gemini	`groundingMetadata.groundingChunks[]` holds web URIs; `groundingSupports[]` maps text spans to chunk indices.

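To standardise this data, write extraction functions that normalise URLs before de-duplicating them: strip fragments (hashes), lowercase the scheme and host (paths can be case-sensitive, so leave their case alone), and standardise path forms such as trailing slashes. Below is a practical implementation snippet for extracting Anthropic citations safely: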

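// Extract cited URLs from an Anthropic Messages API response.
// pushCitation is the shared helper that is expected to normalise the URL,
// skip URLs already recorded in `seen`, and append the row to `out`.
function extractAnthropicCitations(response) {
  const out = [];
  const seen = new Set();
  // Tolerate a missing or malformed response instead of throwing.
  const blocks = Array.isArray(response?.content) ? response.content : [];

  for (const block of blocks) {
    // Blocks without citations[] (for example tool-use blocks) yield an empty list.
    const citations = Array.isArray(block?.citations) ? block.citations : [];
    for (const citation of citations) {
      if (!citation?.url) continue; // skip citations that carry no web URL
      pushCitation(out, seen, {
        url: citation.url,
        title: citation.title,
        // Prefer the exact cited passage; fall back to the surrounding block text.
        context: citation.cited_text || block.text || null,
        source_kind: "anthropic_text_citation"
      });
    }
  }
  return out;
}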
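The snippet above calls a `pushCitation` helper that is not shown. The sketch below gives one possible shape for it, along with hypothetical OpenAI and Gemini counterparts that read the fields listed in the table. The response field names come from the table (plus `start_index`/`end_index` and `segment.text`, which are assumed from the vendors' public response formats); `normalizeUrl`, the de-duplication rule, and the `source_kind` labels are assumptions, not the deployed implementation.

```javascript
// Drop the fragment and trim a trailing slash on non-root paths.
// The URL parser already lowercases scheme and host; path case is preserved.
function normalizeUrl(raw) {
  try {
    const u = new URL(String(raw));
    u.hash = "";
    if (u.pathname.length > 1 && u.pathname.endsWith("/")) {
      u.pathname = u.pathname.slice(0, -1);
    }
    return u.toString();
  } catch {
    return null; // unparsable input is skipped by the caller
  }
}

// Append one citation row per distinct normalised URL (assumed dedupe rule).
function pushCitation(out, seen, { url, title, context, source_kind }) {
  const normalized = normalizeUrl(url);
  if (!normalized || seen.has(normalized)) return;
  seen.add(normalized);
  out.push({ url: normalized, title: title ?? null, context: context ?? null, source_kind });
}

// OpenAI Responses API: url_citation annotations on message content parts.
function extractOpenAICitations(response) {
  const out = [];
  const seen = new Set();
  const items = Array.isArray(response?.output) ? response.output : [];
  for (const item of items) {
    if (item?.type !== "message") continue; // web_search_call items hold no citations
    const parts = Array.isArray(item.content) ? item.content : [];
    for (const part of parts) {
      const annotations = Array.isArray(part?.annotations) ? part.annotations : [];
      for (const a of annotations) {
        if (a?.type !== "url_citation") continue;
        const hasSpan = typeof part.text === "string" && Number.isInteger(a.start_index);
        const context = hasSpan ? part.text.slice(a.start_index, a.end_index) : null;
        pushCitation(out, seen, { url: a.url, title: a.title, context, source_kind: "openai_url_citation" });
      }
    }
  }
  return out;
}

// Gemini: only chunks referenced by a groundingSupports entry count as cited.
// Note: Gemini URIs may be redirect links; resolving them is out of scope here.
function extractGeminiCitations(response) {
  const out = [];
  const seen = new Set();
  const candidates = Array.isArray(response?.candidates) ? response.candidates : [];
  for (const candidate of candidates) {
    const meta = candidate?.groundingMetadata ?? {};
    const chunks = Array.isArray(meta.groundingChunks) ? meta.groundingChunks : [];
    const supports = Array.isArray(meta.groundingSupports) ? meta.groundingSupports : [];
    for (const support of supports) {
      const indices = Array.isArray(support?.groundingChunkIndices) ? support.groundingChunkIndices : [];
      for (const i of indices) {
        const web = chunks[i]?.web;
        if (!web?.uri) continue;
        pushCitation(out, seen, { url: web.uri, title: web.title, context: support.segment?.text ?? null, source_kind: "gemini_grounding_support" });
      }
    }
  }
  return out;
}
```

The Gemini extractor deliberately ignores grounding chunks that no support entry points at, in line with the earlier rule not to infer a citation from a bare search result.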

5. Backend index ownership

The backend index can differ substantially by vendor and surface, meaning a brand can be highly visible in one Search mode and utterly absent in another. Gemini natively utilises Google Search, grounding its answers in the same index that powers traditional Google organic results. OpenAI's current Responses-API web_search backend owner is not disclosed in the public tool docs; measure OpenAI Search behaviour directly rather than assuming the underlying index.

Anthropic's backend choices have drawn particular industry focus. Anthropic documents the Brave Search API for the Claude for Government MCP web search, and Brave positions its Search API as an agentic/RAG search backend. We treat Brave visibility as a practical signal, not a universal vendor guarantee. For more context on tracking this correlation, see our server logs analysis post.

Because the index backends vary, your measurement instrumentation must attribute citation share strictly by vendor. A visibility gap in Claude might require a distinct technical fix from a visibility gap in Gemini.
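One way to keep attribution per vendor is to compute citation share separately for each one rather than pooling all citations. The sketch below assumes citation rows shaped like `{ vendor, url }` and a hypothetical `brandHost` argument; both are illustrative and not taken from the deployed schema.

```javascript
// rows: one entry per extracted citation, e.g. { vendor: "gemini", url: "https://example.com/p" }.
// Returns { vendor: share } where share = brand citations / all citations for that vendor.
function citationShareByVendor(rows, brandHost) {
  const totals = new Map(); // vendor -> { total, brand }
  for (const { vendor, url } of rows) {
    let host;
    try {
      host = new URL(url).hostname;
    } catch {
      continue; // ignore rows whose URL cannot be parsed
    }
    const t = totals.get(vendor) ?? { total: 0, brand: 0 };
    t.total += 1;
    // Count the brand host itself and any of its subdomains.
    if (host === brandHost || host.endsWith(`.${brandHost}`)) t.brand += 1;
    totals.set(vendor, t);
  }
  const shares = {};
  for (const [vendor, t] of totals) shares[vendor] = t.brand / t.total;
  return shares;
}
```

A result such as a healthy share for one vendor next to zero for another points at a vendor-specific index or crawl problem rather than a content problem.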

6. Tying citations to the Monitoring URL DB

Every URL that an AI model cites should become a formally monitored asset. In our architecture, the urls_citations table stores one row per AI-answer URL citation and joins back to the per-probe cells captured for SOR. Alongside it, Monitoring maintains urls_current, urls_history, and urls_fetches for URL health, change-tracking, and raw fetch diagnostics.

When a Search-mode probe returns a cited URL, we join it against the monitoring database. That lets us correlate AI recommendation rates with underlying technical health: status code, indexability, AI fetchability, robots policy, Core Web Vitals verdict, citation counts, and raw fetch diagnostics. If a cited URL starts returning errors to an AI search crawler, the Monitoring and Alerts pipeline can flag the citation or health drop when the relevant probe and alert rules observe it.
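The join described above can be sketched in memory as follows. The table names `urls_citations` and `urls_current` come from this section; the column names (`status_code`, `indexable`, `ai_fetchable`) and the issue labels are hypothetical stand-ins for whatever the monitoring schema actually uses.

```javascript
// citations: rows shaped like urls_citations; health: rows shaped like urls_current.
// Column names and issue labels are illustrative assumptions.
function joinCitationsToHealth(citations, health) {
  const byUrl = new Map(health.map((row) => [row.url, row]));
  return citations.map((c) => {
    const h = byUrl.get(c.url) ?? null;
    const issues = [];
    if (!h) {
      issues.push("not_monitored"); // cited by a model but not yet a tracked asset
    } else {
      if (h.status_code >= 400) issues.push(`status_${h.status_code}`);
      if (h.indexable === false) issues.push("not_indexable");
      if (h.ai_fetchable === false) issues.push("not_ai_fetchable");
    }
    return { ...c, health: h, issues };
  });
}
```

Rows with a non-empty `issues` list are the candidates for an alert; a `not_monitored` row is a cue to add the URL to monitoring.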

7. Search-mode anomaly detection

Because Search mode relies on live retrieval, it is highly sensitive to website migrations, schema changes, and crawler blocks. A sudden drop in your cited-URL share can be a leading indicator of a coming drop in AI Share of Recommendation.

Anomaly detection should trigger an alert when a priority URL's citation frequency falls more than two standard deviations (2-sigma) below its recent baseline. These alerts must distinguish between a drop in raw ranking and a drop in model citations. By hooking your citation extraction directly into an alerting system, you can triage Search-mode failures before they impact your overall quarterly reporting.
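A minimal sketch of the 2-sigma rule, assuming citation counts are bucketed per period (for example per day or per probe batch) and that the baseline is simply all earlier periods in the window. The function name, the `minPeriods` guard, and the use of the sample standard deviation are assumptions; a production system may prefer a more robust baseline.

```javascript
// history: citation counts per period for one URL, oldest first.
// The last entry is the current period; everything before it is the baseline.
function citationDropAlert(history, { sigmas = 2, minPeriods = 5 } = {}) {
  if (history.length < minPeriods + 1) {
    return { alert: false, reason: "insufficient_history" };
  }
  const baseline = history.slice(0, -1);
  const current = history[history.length - 1];
  const mean = baseline.reduce((s, x) => s + x, 0) / baseline.length;
  // Sample variance (n - 1 denominator) over the baseline window.
  const variance =
    baseline.reduce((s, x) => s + (x - mean) ** 2, 0) / (baseline.length - 1);
  const sd = Math.sqrt(variance);
  const threshold = mean - sigmas * sd;
  // Note: with a perfectly flat baseline (sd = 0), any drop below the mean alerts.
  return { alert: current < threshold, current, mean, sd, threshold };
}

console.log(citationDropAlert([10, 12, 11, 9, 10, 11, 2]).alert);  // true
console.log(citationDropAlert([10, 12, 11, 9, 10, 11, 10]).alert); // false
```

This only covers the citation side; separating a citation drop from a ranking drop still requires comparing against whatever rank data you track for the same URL.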

8. The "search ratio"

We actively measure the "search ratio"—how often a vendor decides to use Search versus relying purely on Memory for the exact same prompt. Models like Claude and GPT-5.5 use heuristics to determine if a prompt requires live grounding.

Tracking the search_invoked boolean provides invaluable context. If a vendor's search ratio for a specific product category suddenly drops from 90% to 10%, it may indicate the model is now relying more on Memory for that prompt — but always validate against model release notes, the answer content itself, and repeated probes before assigning cause. If the cause holds up, the optimisation strategy for that category should pivot from Search-mode engineering to Memory-mode authority building.
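Deriving `search_invoked` from a raw response might look like the sketch below. It checks for Anthropic `server_tool_use` blocks named `web_search`, OpenAI `web_search_call` output items, and a non-empty `webSearchQueries` list in Gemini's grounding metadata. Treat these checks as assumptions to verify against each vendor's current response format; the function names and the probe row shape are hypothetical.

```javascript
// Returns true when the raw response shows that the vendor actually ran a search.
function searchInvoked(vendor, response) {
  switch (vendor) {
    case "anthropic":
      return (response?.content ?? []).some(
        (b) => b?.type === "server_tool_use" && b?.name === "web_search"
      );
    case "openai":
      return (response?.output ?? []).some((item) => item?.type === "web_search_call");
    case "gemini":
      return (response?.candidates ?? []).some(
        (c) => (c?.groundingMetadata?.webSearchQueries ?? []).length > 0
      );
    default:
      throw new Error(`Unknown vendor: ${vendor}`);
  }
}

// probes: rows for one vendor and prompt category, each with a search_invoked boolean.
function searchRatio(probes) {
  if (probes.length === 0) return null; // no data, no ratio
  return probes.filter((p) => p.search_invoked).length / probes.length;
}
```

Computing the ratio per vendor and per category, rather than globally, is what makes a swing like 90% to 10% visible in the first place.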

9. Anti-patterns in measurement

If you fail to instrument Search mode correctly, you will generate false confidence. Avoid these common anti-patterns:

Counting raw search invocations instead of citations: As noted earlier, an AI can execute five web_search tool calls and still cite zero pages. An invocation is an attempt; a citation is a success.
Ignoring vendor-specific metadata: Do not rely exclusively on regular expressions scraping Markdown links. Vendor APIs expose rich metadata, such as Gemini's groundingSupports and Anthropic's citations[] (with server_tool_use blocks recording the search calls themselves). Use the API contracts to extract high-fidelity citations.
Collapsing Search and Fetch measurement: They are fundamentally different surfaces. Search measurement evaluates discovery and index ranking; Fetch measurement evaluates raw URL parsability. Merging them obscures the root cause of a visibility failure.

10. Next steps: Fetch and Optimise

Search measurement exposes the gaps in your retrieval visibility, but it is only part of the generative pipeline. You must also optimise the content that the model reads, ensuring it is semantically complete and correctly chunked.

If a model successfully finds your URL via Search but fails to read it due to a Client-Side Rendering (CSR) block, your Search-mode success is wasted. Next, read Measuring Fetch to test whether the URLs Search discovers can actually be parsed — and revisit Optimising Search when the dashboards point you back at the crawl path.

Paired post: Optimising Search

Series: Optimising Memory · Optimising Search · Optimising Fetch · Measuring Memory · Measuring Search · Measuring Fetch

Foundation: AI Knowledge Modes — Memory · Search · Fetch