From fetchability to trust: the technical SEO of language models

Most teams treat "AI visibility" as a content problem: better content on the website means better answers in the model. That's part of it. The bigger and more common failure mode is technical: the AI never sees your content in the first place, because its crawler can't or won't fetch it the way Googlebot can.

Three uncomfortable facts about the 2026 AI crawler landscape:

Most AI crawlers do not execute JavaScript
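Most AI crawlers do not pace their requests the way Googlebot does (Googlebot ignores the Crawl-delay directive but adapts its crawl rate to how your server responds)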
Most AI crawlers identify themselves with new user-agents that your defaults probably block

The result: many sites that rank fine in Google are functionally invisible inside ChatGPT, Claude, and Perplexity — not because their content is bad, but because the crawlers never made it past the shell.

Who's actually crawling you in 2026

The 2026 AI bot population for a typical mid-size e-commerce site:

Crawler	Operator	Renders JS?	Notes
GPTBot	OpenAI	No	Training crawler.
OAI-SearchBot	OpenAI	No	Used to surface websites in ChatGPT search features. Fetches JavaScript but does not execute it.
ChatGPT-User	OpenAI (user-triggered)	No	On-demand fetches.
ClaudeBot	Anthropic	No	Model-development crawler.
Claude-SearchBot	Anthropic	No	Search-quality and indexing.
Claude-User	Anthropic (user-triggered)	No	User-directed fetch.
PerplexityBot	Perplexity	No	Search-results crawler. `Perplexity-User` handles user-triggered fetches.
Google-Extended	Google	n/a	Robots product token (not a separate HTTP UA); controls Gemini training and grounding eligibility.
Bytespider	ByteDance	No	Very large volume.
Meta-ExternalAgent	Meta	No	Growing.
CCBot	Common Crawl	No	Goes into many training corpora.

Pattern: do not assume AI crawlers execute JavaScript; assume they don't. A Vercel/MERJ analysis of over 500 million crawler requests found that none of the major AI crawlers it examined (OpenAI's GPTBot, OAI-SearchBot and ChatGPT-User; Anthropic's ClaudeBot; PerplexityBot; Bytespider; Meta-ExternalAgent) render JavaScript. They fetch JS files but never execute them, so they only ever see your initial HTML response. Of the crawlers in the table above, only Google's pipeline renders JS: Google-Extended is a robots product token rather than a separate HTTP crawler, and Gemini grounding draws on the same Google Search index that Googlebot populates with its headless-Chrome rendering pass. Test raw HTML and rendered DOM separately for each AI UA you care about. If your category pages are CSR (client-side rendered) React SPAs, they are likely invisible to every AI fetcher except Google's.

Source: Vercel & MERJ, "The rise of the AI crawler" (≈569M GPTBot and ≈370M ClaudeBot requests/month observed; zero JavaScript execution across all major AI crawlers). vercel.com/blog/the-rise-of-the-ai-crawler

The five technical failures we see most

1. Client-side rendered category / hub pages

Many of the well-known mid-2020s e-commerce platforms shipped category pages as CSR React apps. They look fine in a browser, they rank fine in Google (Googlebot renders), and they're effectively empty to GPTBot and the other AI fetchers that do not run JavaScript. Curl them with each AI bot user-agent (GPTBot, OAI-SearchBot, ClaudeBot, PerplexityBot) and compare the response to a rendered-browser snapshot. The AI user-agents will commonly receive a 200 OK with a JavaScript shell and zero <a href> tags.

Fix: SSR the pages. If you can't ship SSR in-house, an edge-prerender layer (Cloudflare Worker calling a managed Chromium) gets you there in two weeks without touching application code.
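To find out whether a page has this problem in the first place, the user-agent comparison described above can be scripted. The following is a minimal Python sketch, not a finished tool. The names count_anchors, fetch_raw_html and shell_report are hypothetical, and the user-agent values are bare product tokens chosen for illustration (real crawlers send longer strings). A spoofed user-agent sent from your own IP also will not reproduce any IP-based bot rules your CDN applies.

```python
from html.parser import HTMLParser
from urllib.error import HTTPError
from urllib.request import Request, urlopen

# Assumption: bare product tokens stand in for the full crawler UA strings.
AI_USER_AGENTS = ["GPTBot", "OAI-SearchBot", "ClaudeBot", "PerplexityBot"]


class AnchorCounter(HTMLParser):
    """Counts <a> tags that carry an href attribute."""

    def __init__(self):
        super().__init__()
        self.count = 0

    def handle_starttag(self, tag, attrs):
        if tag == "a" and any(name == "href" for name, _ in attrs):
            self.count += 1


def count_anchors(html: str) -> int:
    """Return the number of <a href> tags in raw (un-rendered) HTML."""
    parser = AnchorCounter()
    parser.feed(html)
    return parser.count


def fetch_raw_html(url: str, user_agent: str, timeout: float = 10.0):
    """Fetch a URL without running JavaScript; returns (status, body)."""
    request = Request(url, headers={"User-Agent": user_agent})
    try:
        with urlopen(request, timeout=timeout) as response:
            return response.status, response.read().decode("utf-8", errors="replace")
    except HTTPError as error:
        # urlopen raises on 4xx/5xx; keep the status so blocks show up in the report.
        return error.code, error.read().decode("utf-8", errors="replace")


def shell_report(url: str) -> None:
    """Print status, byte size and anchor count per AI user-agent."""
    for agent in AI_USER_AGENTS:
        status, body = fetch_raw_html(url, agent)
        print(agent, status, len(body), count_anchors(body))
```

Calling shell_report on a category URL and seeing a 200 status with an anchor count of zero is the JavaScript-shell signature described above.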

2. robots.txt accidentally blocking AI bots

Many sites added User-agent: GPTBot \n Disallow: / in 2023, then forgot about it. Years later they wonder why they're invisible in ChatGPT. The opposite happens too: a wildcard Disallow that was meant for a single noisy bot now blocks Claude and Perplexity along with it.

Fix: audit robots.txt explicitly per known AI bot. Allow by default if you want AI surface; Disallow explicitly per bot if not. Don't rely on "everyone is the same."
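A per-bot audit like this can be sketched with Python's standard-library urllib.robotparser. The bot list is taken from the table earlier in this piece; the function name audit_robots and the example.com URL are placeholders. Note that the standard-library parser applies its own matching rules, which may differ in edge cases from how a given crawler interprets your file.

```python
from urllib.robotparser import RobotFileParser

# Product tokens from the crawler table above.
AI_BOTS = [
    "GPTBot", "OAI-SearchBot", "ChatGPT-User",
    "ClaudeBot", "Claude-SearchBot", "Claude-User",
    "PerplexityBot", "Google-Extended", "Bytespider",
    "Meta-ExternalAgent", "CCBot",
]


def audit_robots(robots_txt: str, url: str = "https://example.com/") -> dict[str, bool]:
    """Map each AI bot token to whether robots.txt lets it fetch `url`."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return {bot: parser.can_fetch(bot, url) for bot in AI_BOTS}


# Example: the forgotten 2023-era block, with everyone else allowed.
forgotten_block = "User-agent: GPTBot\nDisallow: /\n\nUser-agent: *\nAllow: /\n"
for bot, allowed in audit_robots(forgotten_block).items():
    print(f"{bot:20} {'allowed' if allowed else 'BLOCKED'}")
```

Run against your live robots.txt and a handful of representative URLs, the output makes both failure modes visible: the forgotten single-bot block and the wildcard rule that catches more than it was meant to.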

3. Soft 404s served as HTTP 200

An SPA that handles routing client-side will often return 200 OK + a React 404 page for non-existent URLs. AI crawlers don't run the React app, so they index the URL as a valid page with thin content. Cumulative effect: hundreds of "valid" empty pages diluting your topical authority.

Fix: detect soft 404s at the edge and rewrite the status code to 404 for crawlers. This is corrective compliance, not cloaking — Google explicitly asks for accurate status codes.
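Before rewriting anything at the edge, it helps to know which URLs are affected. Because an SPA shell usually contains no "not found" text in its raw HTML, the simplest reliable probe is to request URLs you know do not exist and flag any that come back with a success status. A minimal Python sketch, assuming you maintain such a list yourself (fetch_status and find_soft_404s are hypothetical names, and the user-agent default is illustrative):

```python
from urllib.error import HTTPError
from urllib.request import Request, urlopen


def fetch_status(url: str, user_agent: str = "GPTBot", timeout: float = 10.0) -> int:
    """Return the final HTTP status for a URL (redirects are followed)."""
    request = Request(url, headers={"User-Agent": user_agent})
    try:
        with urlopen(request, timeout=timeout) as response:
            return response.status
    except HTTPError as error:
        # 4xx/5xx responses raise; for a known-bad URL a 404 or 410 is the good case.
        return error.code


def find_soft_404s(statuses: dict[str, int]) -> list[str]:
    """Given {known_bad_url: status}, return the URLs that answered 2xx."""
    return sorted(url for url, status in statuses.items() if 200 <= status < 300)
```

Build the statuses mapping by calling fetch_status on each entry of your bad-URL list. Since redirects are followed, a missing page that bounces to the homepage also shows up as a 200 and gets flagged, which is usually what you want.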

4. Cookie banners blocking content

OneTrust-style consent gates often delay rendering of the actual page until consent is granted. A real human clicks "accept"; a crawler never does. Many AI bots see only the cookie banner and conclude the page is the cookie banner.

Fix: render the real content server-side and layer the consent banner client-side as an overlay that doesn't block the initial HTML. Your Core Web Vitals (CWV) will thank you too.

5. Critical content in PDFs or images

Product specs in a downloadable PDF. Hero copy as a baked-in image. Pricing in a hosted SVG. AI text extractors handle PDFs better than they used to, but image OCR is still inconsistent across bots. If a fact only exists in a non-text asset, you're betting on the worst-case parser.

Fix: anything you want models to know about you should exist as plain HTML text somewhere on the indexable site. The PDF or image is an optional duplicate.

The audit

A useful nightly audit covers, per representative URL set:

Fetch with each AI bot user-agent, record HTTP status, byte size, anchor count, h1, meta description, schema presence
Diff the raw HTML against the post-JS DOM (a real browser render); flag URLs where the raw HTML is <50% of post-JS content
Check robots.txt against each AI bot UA
Detect soft-404 patterns on a curated bad-URL list
Validate schema.org JSON-LD
Build a crawler access matrix: which bots got a 200, which were blocked, and which errored
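Two of the checks above, the raw-versus-rendered comparison and the JSON-LD validation, can be sketched in Python as follows. This is an illustrative sketch under stated assumptions, not the audit tool described below: the function names are hypothetical, visible text is approximated with regular expressions instead of a full HTML parser, the rendered HTML is assumed to come from a headless browser you run separately, and the JSON-LD check only confirms that each block parses as JSON, not that it uses valid schema.org vocabulary.

```python
import json
import re

_SCRIPT_OR_STYLE = re.compile(r"(?is)<(script|style)\b.*?</\1\s*>")
_TAG = re.compile(r"(?s)<[^>]+>")
_JSON_LD = re.compile(
    r"(?is)<script[^>]*type=[\"']application/ld\+json[\"'][^>]*>(.*?)</script>"
)


def visible_text(html: str) -> str:
    """Rough visible text: drop script/style bodies, strip tags, squeeze whitespace."""
    without_code = _SCRIPT_OR_STYLE.sub(" ", html)
    return " ".join(_TAG.sub(" ", without_code).split())


def raw_to_rendered_ratio(raw_html: str, rendered_html: str) -> float:
    """Share of the rendered page's visible text already present in raw HTML."""
    rendered_length = len(visible_text(rendered_html))
    if rendered_length == 0:
        return 1.0  # nothing rendered, so nothing is missing from the raw HTML
    return len(visible_text(raw_html)) / rendered_length


def json_ld_errors(html: str) -> list[str]:
    """Return one message per JSON-LD block that fails to parse as JSON."""
    errors = []
    for index, payload in enumerate(_JSON_LD.findall(html)):
        try:
            json.loads(payload)
        except json.JSONDecodeError as exc:
            errors.append(f"block {index}: {exc.msg}")
    return errors


def audit_page(raw_html: str, rendered_html: str, threshold: float = 0.5) -> dict:
    """Flag pages whose raw HTML carries under `threshold` of the rendered text."""
    ratio = raw_to_rendered_ratio(raw_html, rendered_html)
    return {
        "ratio": ratio,
        "flagged": ratio < threshold,
        "json_ld_blocks": len(_JSON_LD.findall(raw_html)),
        "json_ld_errors": json_ld_errors(raw_html),
    }
```

The JSON-LD check runs on the raw HTML on purpose: structured data injected by client-side JavaScript is as invisible to non-rendering crawlers as the rest of the page.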

Combot ships these as a single audit_ai tool that runs nightly per client. Results land in BigQuery, with deltas flagged when crawler access changes or content extraction drops.

Content quality is a real layer of AI visibility. But you can't synthesise out of nothing. If the technical layer is broken, every word of content you write is invisible to most of the bot population. Fix the pipes first, then optimise the flow.

Further reading: The 7 layers · Source mapping