
From fetchability to trust: the technical SEO of language models

Most teams treat "AI visibility" as a content problem: to get better answers from the model, publish better content on the website. That's part of it. The bigger and more common failure mode is technical: the AI never sees your content in the first place, because its crawler can't or won't fetch it the way Googlebot can.

Three uncomfortable facts about the 2026 AI crawler landscape:

The result: many sites that rank fine in Google are functionally invisible inside ChatGPT, Claude, and Perplexity — not because their content is bad, but because the crawlers never made it past the shell.

Who's actually crawling you in 2026

The 2026 AI bot population for a typical mid-size e-commerce site:

| Crawler | Operator | Renders JS? | Notes |
|---|---|---|---|
| GPTBot | OpenAI | No | Training crawler. |
| OAI-SearchBot | OpenAI | Partial | Used to surface websites in ChatGPT search features. |
| ChatGPT-User | OpenAI (user-triggered) | No | On-demand fetches. |
| ClaudeBot | Anthropic | No | Model-development crawler. |
| Claude-SearchBot | Anthropic | No | Search-quality and indexing. |
| Claude-User | Anthropic (user-triggered) | No | User-directed fetch. |
| PerplexityBot | Perplexity | No | Search-results crawler. Perplexity-User handles user-triggered fetches. |
| Google-Extended | Google | n/a | Robots product token (not a separate HTTP UA); controls Gemini training and grounding eligibility. |
| Bytespider | ByteDance | No | Very large volume. |
| Meta-ExternalAgent | Meta | No | Growing. |
| CCBot | Common Crawl | No | Goes into many training corpora. |

Pattern: do not assume AI crawlers execute JavaScript. Most documented AI bot user-agents are not described as Googlebot-style renderers; test raw HTML and rendered DOM separately for each AI UA you care about. If your category pages are CSR (client-side rendered) React SPAs, they are likely invisible to the AI fetchers that do not render. Note that Google-Extended is a robots product token rather than a separate HTTP crawler; Gemini grounding draws on the same Google Search index that Googlebot populates.

The five technical failures we see most

1. Client-side rendered category / hub pages

Most of the well-known mid-2020s e-commerce platforms shipped category pages as CSR React apps. They look fine in a browser, they rank fine in Google (Googlebot renders), and they're effectively empty to GPTBot and the other AI fetchers that do not run JavaScript. Curl them with each AI bot user-agent (GPTBot, OAI-SearchBot, ClaudeBot, PerplexityBot) and compare the response to a rendered-browser snapshot: the server will commonly answer those user-agents with a 200 OK, a JavaScript shell, and zero <a href> tags.

Fix: SSR the pages. If you can't ship SSR in-house, an edge-prerender layer (Cloudflare Worker calling a managed Chromium) gets you there in two weeks without touching application code.
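A minimal sketch of the raw-HTML check described above, using only the Python standard library. The names `summarize_html` and `fetch_as` are hypothetical, and the short product tokens stand in for the longer User-Agent strings the real bots send, so treat the output as an approximation of what each bot receives.

```python
from html.parser import HTMLParser
from urllib.request import Request, urlopen

# Product tokens from the crawler table. Assumption: the real bots send
# longer User-Agent strings; these short tokens are placeholders.
AI_BOT_TOKENS = ["GPTBot", "OAI-SearchBot", "ClaudeBot", "PerplexityBot"]


class LinkCounter(HTMLParser):
    """Counts <a href> tags and visible text characters in raw HTML."""

    def __init__(self):
        super().__init__()
        self.anchors = 0
        self.text_chars = 0
        self._skip = 0  # nesting depth inside <script>/<style>

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip += 1
        elif tag == "a" and any(k == "href" and v for k, v in attrs):
            self.anchors += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip:
            self._skip -= 1

    def handle_data(self, data):
        if not self._skip:
            self.text_chars += len(data.strip())


def summarize_html(html: str) -> dict:
    """Return anchor and visible-text counts for un-rendered HTML."""
    parser = LinkCounter()
    parser.feed(html)
    parser.close()
    return {"anchors": parser.anchors, "text_chars": parser.text_chars}


def fetch_as(url: str, user_agent: str, timeout: float = 10.0) -> dict:
    """Fetch a URL with the given User-Agent and summarize the raw body."""
    req = Request(url, headers={"User-Agent": user_agent})
    with urlopen(req, timeout=timeout) as resp:
        raw = resp.read()
        status = resp.status
    html = raw.decode("utf-8", errors="replace")
    return {"status": status, "bytes": len(raw), **summarize_html(html)}
```

Calling `fetch_as(url, token)` for each token in `AI_BOT_TOKENS` gives one row per bot. A 200 status with zero anchors and near-zero text is the JavaScript-shell signature. Note that `urlopen` raises `HTTPError` for 4xx and 5xx responses, so a real audit would catch that and record the status instead.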

2. robots.txt accidentally blocking AI bots

Many sites added User-agent: GPTBot \n Disallow: / in 2023, then forgot about it. Years later they wonder why they're invisible in ChatGPT. The opposite happens too: a wildcard Disallow that was meant for a single noisy bot now blocks Claude and Perplexity along with it.

Fix: audit robots.txt explicitly per known AI bot. Allow by default if you want to appear on AI surfaces; Disallow explicitly per bot if not. Don't assume every bot is covered by the same rule.
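A per-bot audit can be sketched with Python's standard `urllib.robotparser`. The function name `audit_robots` and the token list are illustrative assumptions, and the standard-library parser may not match every bot's own robots.txt interpretation exactly, so treat the result as a first-pass signal.

```python
from urllib.robotparser import RobotFileParser

# Product tokens taken from the crawler table; extend as needed.
AI_BOT_TOKENS = [
    "GPTBot", "OAI-SearchBot", "ChatGPT-User",
    "ClaudeBot", "Claude-SearchBot", "Claude-User",
    "PerplexityBot", "Google-Extended", "Bytespider",
    "Meta-ExternalAgent", "CCBot",
]


def audit_robots(robots_txt: str, url: str, bots=AI_BOT_TOKENS) -> dict:
    """Map each bot token to True (allowed) or False (blocked) for url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return {bot: parser.can_fetch(bot, url) for bot in bots}


if __name__ == "__main__":
    # The forgotten 2023-style rule: GPTBot blocked, everyone else allowed.
    sample = "User-agent: GPTBot\nDisallow: /\n\nUser-agent: *\nAllow: /\n"
    for bot, allowed in audit_robots(sample, "https://example.com/category/shoes").items():
        print(f"{bot:20s} {'allowed' if allowed else 'BLOCKED'}")
```

Running this against the live robots.txt for a handful of representative URLs makes both failure modes visible: the forgotten per-bot Disallow and the overly broad wildcard.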

3. Soft 404s served as HTTP 200

An SPA that handles routing client-side will often return 200 OK + a React 404 page for non-existent URLs. AI crawlers don't run the React app, so they index the URL as a valid page with thin content. Cumulative effect: hundreds of "valid" empty pages diluting your topical authority.

Fix: detect soft 404s at the edge and rewrite the status code to 404 for crawlers. This is corrective compliance, not cloaking — Google explicitly asks for accurate status codes.
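One way to sketch the detection step: fetch a URL that is guaranteed not to exist (a random path, for example), fingerprint the body that comes back, and flag any known-bad URL that returns 200 with the same body. The names `body_fingerprint` and `classify_response` are hypothetical, and exact-match hashing is an assumption that breaks if the shell embeds per-request nonces or build hashes.

```python
import hashlib


def body_fingerprint(html: str) -> str:
    """Hash the body after collapsing runs of whitespace."""
    normalized = " ".join(html.split())
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


def classify_response(status: int, html: str, shell_fingerprint: str) -> str:
    """Classify a fetch of a URL that is known not to exist.

    shell_fingerprint is the fingerprint of the body served for a
    guaranteed-nonexistent path, i.e. the SPA shell.
    """
    if status in (404, 410):
        return "hard_404"   # correct behaviour
    if status == 200 and body_fingerprint(html) == shell_fingerprint:
        return "soft_404"   # the shell served as a valid page
    if status == 200:
        return "unexpected_200"  # a 200 with real content: check the URL list
    return "other"
```

At the edge, the same comparison can decide when to rewrite the status code to 404 before the response reaches a crawler.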

4. Cookie banners blocking content

OneTrust-style consent gates delay rendering of the actual page until consent is granted. A real human clicks "accept"; a crawler never does. Many AI bots see only the cookie banner and treat the banner as the page.

Fix: render real content server-side, and layer the consent banner client-side as an overlay that doesn't block the initial HTML. Your Core Web Vitals (CWV) benefit too.

5. Critical content in PDFs or images

Product specs in a downloadable PDF. Hero copy as a baked-in image. Pricing in a hosted SVG. AI text extractors handle PDFs better than they used to, but image OCR is still inconsistent across bots. If a fact only exists in a non-text asset, you're betting on the worst-case parser.

Fix: anything you want models to know about you should exist as plain HTML text somewhere on the indexable site. The PDF or image is then an optional duplicate.
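A quick way to test this is to list the facts you care about and check that each one appears in the visible text of the raw HTML. The sketch below assumes plain substring matching after whitespace and case normalization; `visible_text` and `missing_facts` are hypothetical helper names.

```python
from html.parser import HTMLParser


class VisibleText(HTMLParser):
    """Collects text nodes, skipping <script> and <style> contents."""

    def __init__(self):
        super().__init__()
        self.parts = []
        self._skip = 0

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip:
            self._skip -= 1

    def handle_data(self, data):
        if not self._skip:
            self.parts.append(data)


def normalize(text: str) -> str:
    """Collapse whitespace and ignore case for matching."""
    return " ".join(text.split()).casefold()


def visible_text(html: str) -> str:
    parser = VisibleText()
    parser.feed(html)
    parser.close()
    return normalize(" ".join(parser.parts))


def missing_facts(html: str, facts: list) -> list:
    """Return the facts that do not appear as plain text in the HTML."""
    text = visible_text(html)
    return [fact for fact in facts if normalize(fact) not in text]
```

Any fact returned by `missing_facts` exists only in a PDF, an image, or JavaScript-rendered markup, at least as far as a non-rendering fetcher is concerned.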

The audit

A useful nightly audit covers, per representative URL set:

  1. Fetch with each AI bot user-agent, record HTTP status, byte size, anchor count, h1, meta description, schema presence
  2. Diff the raw HTML against the post-JS DOM (a real browser render); flag URLs where the raw HTML is <50% of post-JS content
  3. Check robots.txt against each AI bot UA
  4. Detect soft-404 patterns on a curated bad-URL list
  5. Validate schema.org JSON-LD
  6. Crawler access matrix: which bots got 200 vs blocked vs errored
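Steps 2 and 5 of the list above can be sketched as follows. This assumes the rendered text has already been captured by a real browser (not shown), uses character counts as a stand-in for "content", and treats "valid" JSON-LD as "parses as JSON and declares @type or @graph", which is much weaker than full schema.org validation. All function names are hypothetical.

```python
import json
import re

# Matches <script type="application/ld+json"> blocks in raw HTML.
JSONLD_RE = re.compile(
    r'(?is)<script[^>]*type=["\']application/ld\+json["\'][^>]*>(.*?)</script>'
)


def raw_to_rendered_ratio(raw_text: str, rendered_text: str) -> float:
    """Share of the post-JS text that is already present in the raw HTML."""
    if not rendered_text:
        return 1.0  # nothing rendered, so nothing is missing
    return len(raw_text) / len(rendered_text)


def jsonld_errors(html: str) -> list:
    """Return a list of human-readable problems found in JSON-LD blocks."""
    errors = []
    for index, block in enumerate(JSONLD_RE.findall(html)):
        try:
            data = json.loads(block)
        except json.JSONDecodeError as exc:
            errors.append(f"block {index}: invalid JSON ({exc.msg})")
            continue
        items = data if isinstance(data, list) else [data]
        for item in items:
            if not isinstance(item, dict) or not ({"@type", "@graph"} & item.keys()):
                errors.append(f"block {index}: missing @type")
    return errors


def audit_row(url, raw_html, raw_text, rendered_text, threshold=0.5) -> dict:
    """One row of the nightly report for a single URL and bot."""
    ratio = raw_to_rendered_ratio(raw_text, rendered_text)
    return {
        "url": url,
        "ratio": round(ratio, 2),
        "thin_raw_html": ratio < threshold,  # the <50% flag from step 2
        "jsonld_errors": jsonld_errors(raw_html),
    }
```

Rows like these, one per URL and bot, are also the raw material for the access matrix in step 6.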

Combot ships these as a single audit_ai tool that runs nightly per client. Results land in BigQuery, with deltas flagged when crawler access changes or content extraction drops.


Content quality is a real layer of AI visibility. But a model can't synthesise an answer from content its crawler never fetched. If the technical layer is broken, every word of content you write is invisible to most of the bot population. Fix the pipes first, then optimise the flow.


Further reading: The 7 layers · Source mapping