
OpenAI Crawlers Are Turning ChatGPT Into a Search System

By Hendrik 16 min read May 06, 2026


    ChatGPT visibility is becoming easier to measure, and harder to fake. The most useful signal is no longer a vague screenshot from an answer engine. It is the server log. A recent Botify and Nectiv analysis looked at roughly 7 billion OpenAI-bot log events from large websites between November 2024 and March 2026. The pattern is hard to miss: OpenAI is crawling more, the search crawler has become more important, and the old distinction between technical SEO and AI visibility is starting to collapse.

    This article takes an implementation-focused angle on that research: what should a technical SEO, developer, or content owner actually change after reading the data?

    The short version is simple. If you want to appear in ChatGPT Search, you need the parts of your site that matter to be reachable by OAI-SearchBot. If you want OpenAI to be able to use your content for future model training, GPTBot is the crawler to watch. If a user explicitly asks ChatGPT or a Custom GPT to open a page, ChatGPT-User may fetch it. Those are different access paths. Treating them as one generic "ChatGPT crawler" creates bad robots.txt decisions, bad reporting, and usually bad strategy.

    7B+ OpenAI bot log events analyzed in the Botify and Nectiv study.
    3.5x Reported growth for OAI-SearchBot activity after GPT-5 launched in August 2025.
    2.9x Reported growth for GPTBot activity over the same comparison window.
    -28% Reported drop in ChatGPT-User events from December 2025 to March 2026.

    The real story is not "AI crawlers are growing"

    AI crawler growth is no longer surprising. Every infrastructure team running a serious website has seen unusual bot traffic, duplicated requests, odd user agents, and bursts of fetches from AI companies. The interesting part of the Botify and Nectiv study is more specific: OpenAI's automated crawl activity did not just rise evenly. The search-oriented crawler, OAI-SearchBot, grew faster than the training-oriented crawler, GPTBot.

    Before GPT-5, the study found OAI-SearchBot and GPTBot relatively close to each other. After GPT-5, the ratio moved in favor of search. Botify and Nectiv report that OAI-SearchBot rose by a factor of 3.5, adding about 2.2 billion events in their dataset, while GPTBot rose by a factor of 2.9, adding about 1.8 billion events. That does not prove how every ChatGPT answer is generated, and it does not mean the whole web is represented equally. The source data comes from Botify's enterprise customer base, which leans toward large retail, ecommerce, publishing, travel, software, and marketplace properties. But it is still a strong signal because the sample is large and based on server logs, not surveys.

    The strategic implication is bigger than the percentages. It suggests that ChatGPT is moving toward a search-backed architecture where current web retrieval matters more often. In other words, the model does not have to "know" every current fact at training time if it can retrieve the right page when needed. For site owners, that is a more familiar world than pure model training. Search-backed systems have crawl paths, indexes, access rules, freshness constraints, and measurable server events.

    Three OpenAI user agents, three different jobs

    OpenAI's own crawler documentation separates the roles clearly. The current documentation lists OAI-SearchBot/1.3 for ChatGPT search visibility, GPTBot/1.3 for crawling content that may be used in training foundation models, and ChatGPT-User/1.0 for user-triggered actions in ChatGPT, Custom GPTs, and GPT Actions. OpenAI also says the settings are independent. A site can allow OAI-SearchBot for search while disallowing GPTBot for training.

    | Crawler | Primary role | SEO meaning | Typical control point |
    | --- | --- | --- | --- |
    | OAI-SearchBot | Search crawler for ChatGPT's search features. | Most relevant for being surfaced as a source in search-backed ChatGPT answers. | User-agent: OAI-SearchBot |
    | GPTBot | Training crawler for foundation model improvement. | More relevant for long-term model knowledge than immediate search visibility. | User-agent: GPTBot |
    | ChatGPT-User | User-triggered fetches, Custom GPTs, and GPT Actions. | Useful for direct user requests, but not the crawler that determines ChatGPT Search inclusion. | User-agent: ChatGPT-User |

    This split matters because many robots.txt files still use broad bot rules written in a hurry during the first AI-scraping panic. Some block all OpenAI traffic. Some allow everything. Some block GPTBot and accidentally block search visibility because they copied a generic snippet without understanding the user-agent separation. The practical middle path for many commercial sites is more nuanced: allow search, decide separately on training, and allow user-triggered access if it supports real users or app workflows.

    Why the ChatGPT-User decline is not necessarily bad news

    The Botify and Nectiv study reports a 28 percent decline in ChatGPT-User events from December 2025 to March 2026. It is tempting to read that as a direct proxy for declining ChatGPT use, but that is too simple. ChatGPT-User is not an automated search crawler. It appears when a user asks ChatGPT, a Custom GPT, or a GPT Action to interact with a specific page or external application. A drop in this user agent may mean fewer direct page fetches, but it may also mean more answers are served from cached or indexed resources.

    That second interpretation is especially important for SEO. If OpenAI is building a more comprehensive HTML index from OAI-SearchBot crawls, ChatGPT may need fewer real-time user-triggered fetches. The work shifts from "open this URL now" toward "retrieve from a known index." That makes server logs for OAI-SearchBot more valuable, not less. It also means site owners should stop using ChatGPT-User as their main proxy for AI search visibility.

    Industry patterns: freshness drives crawl intensity

    The vertical breakdown in the study is one of the most useful parts because it shows that OpenAI's crawl expansion is not evenly distributed. Healthcare and media/publishing saw the largest reported OAI-SearchBot growth, with increases above 700 percent. Marketplaces, software, and retail also grew sharply, while travel increased far less in relative terms.

    That pattern makes sense if we assume that ChatGPT is trying to reduce outdated answers. News, publishing, and health information age quickly. A stale article about a developing event or outdated medical guidance can be obviously wrong and potentially harmful. Product, marketplace, and software pages also change frequently, but the level of urgency varies. Travel content can change quickly too, yet many travel answers may rely on semi-stable destination knowledge or third-party sources instead of crawling every page as aggressively.

    The important point is not that every healthcare site will receive the same crawl pattern. It will not. The dataset is enterprise-heavy and absolute crawl counts depend on URL inventory, internal linking, popularity, response health, and historical discovery. The point is that OpenAI appears to allocate crawl attention differently by content type. That makes industry-specific log analysis more useful than generic AI visibility advice.

    The JavaScript problem is still the unglamorous blocker

    Technical SEO has a habit of becoming fashionable every few years under a new label. AI visibility is doing exactly that. The fashionable words are GEO, AEO, answer optimization, and agent readiness. The unfashionable constraint is HTML. Vercel and MERJ's crawler research found that major AI crawlers, including OpenAI's OAI-SearchBot, ChatGPT-User, and GPTBot, did not render JavaScript in their tests. The crawlers may request JavaScript files, but requesting a script is not the same as executing it and seeing the hydrated page.

    For a modern frontend team, this is the uncomfortable sentence: if the article body, product facts, FAQ answers, author information, or schema markup only exist after client-side rendering, many AI crawlers may not see them. A React, Vue, Angular, or custom SPA can look perfect in a browser while sending a thin HTML shell to crawlers. Google may render enough of that page eventually. Many AI crawlers currently behave more like fast HTML fetchers.

    The fix is not mysterious. Use server-side rendering, static generation, incremental static regeneration, edge rendering, or reliable prerendering for content that must be discovered. Put critical copy, canonical links, title tags, meta descriptions, hreflang tags, schema, pagination links, and primary navigation in the initial HTML response. Test with curl, not just Chrome DevTools. If the content is not in the source response, do not assume OAI-SearchBot can use it.
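That "test with curl, not just Chrome DevTools" step can also be scripted. The sketch below checks whether critical markers appear in a raw HTML response before any JavaScript runs; the function name, marker strings, and both sample documents are illustrative, not from the study.

```python
# Sketch: verify that critical content exists in the initial HTML
# response, i.e. what a non-rendering crawler would actually see.

def content_in_initial_html(html: str, required_markers: list[str]) -> list[str]:
    """Return the required markers that are missing from the raw HTML source."""
    return [marker for marker in required_markers if marker not in html]

# A server-rendered page: body copy and schema are in the source.
ssr_page = """
<html><head><title>Crawl budget guide</title>
<script type="application/ld+json">{"@type": "Article"}</script></head>
<body><h1>Crawl budget guide</h1><p>Full article body here.</p></body></html>
"""

# A client-rendered shell: the same content only appears after hydration.
spa_shell = '<html><body><div id="root"></div><script src="/app.js"></script></body></html>'

markers = ["Crawl budget guide", "application/ld+json"]
print(content_in_initial_html(ssr_page, markers))   # [] -- everything present
print(content_in_initial_html(spa_shell, markers))  # both markers missing
```

In a real pipeline, the `html` input would come from fetching each important template with a plain HTTP client and no JavaScript execution, then diffing the results against the markers your rendered page is supposed to contain.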

    Structured data becomes a clarity layer, not a magic trick

    Structured data is often oversold. It is not a guarantee of ranking, citation, or inclusion in AI answers. But it is extremely useful as a clarity layer. JSON-LD can tell a crawler that a page is an Article, Product, Organization, Person, FAQPage, HowTo, or Event. It can connect authors to organizations, products to offers, and articles to publication dates. For a retrieval system that needs to parse millions of pages, explicit structure lowers ambiguity.

    The same caveat applies as with body content: structured data should be emitted server-side. Injecting schema with a tag manager or after hydration may work for some systems, but it is the wrong default for AI crawler visibility. The safest pattern is boring and robust: render JSON-LD in the HTML that the server returns. Then validate it with Google's Rich Results Test for syntax, inspect the raw HTML with curl, and watch logs to confirm the right crawler can access the page.

    For editorial sites, Article or BlogPosting markup should include headline, author, datePublished, dateModified, image, publisher, and mainEntityOfPage. For ecommerce, Product markup should include offers, availability, price, currency, brand, and review data only where it is actually visible and accurate. For expert-led topics, Person and Organization markup with sameAs references can help connect the entity graph. The goal is not to decorate the page with schema. The goal is to make the entity and content relationships unambiguous.
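A server-side rendering check for the Article fields above can be automated with the standard library. This is a minimal sketch, assuming JSON-LD lives in a single script tag in the raw HTML; the regex, field list, and sample document are illustrative, and a production validator should still be paired with Google's Rich Results Test.

```python
import json
import re

# Sketch: pull the first JSON-LD block out of raw server HTML and report
# which of the Article fields discussed above are missing.

JSONLD_RE = re.compile(
    r'<script[^>]*type="application/ld\+json"[^>]*>(.*?)</script>',
    re.DOTALL | re.IGNORECASE,
)

REQUIRED_ARTICLE_FIELDS = {
    "headline", "author", "datePublished", "dateModified",
    "image", "publisher", "mainEntityOfPage",
}

def missing_article_fields(html: str) -> set[str]:
    """Return required Article fields absent from the first JSON-LD block."""
    match = JSONLD_RE.search(html)
    if not match:
        return set(REQUIRED_ARTICLE_FIELDS)  # no schema in the raw HTML at all
    data = json.loads(match.group(1))
    return REQUIRED_ARTICLE_FIELDS - set(data)

html = '''<script type="application/ld+json">
{"@type": "Article", "headline": "Example", "author": {"@type": "Person"},
 "datePublished": "2026-03-01", "image": "/img.png",
 "publisher": {"@type": "Organization"}}
</script>'''

print(sorted(missing_article_fields(html)))  # ['dateModified', 'mainEntityOfPage']
```

Because the check runs against the raw server response, a page that only injects schema after hydration fails it, which is exactly the failure mode this section warns about.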

    Robots.txt strategy: separate visibility from training

    OpenAI's crawler documentation gives site owners a choice. You can allow OAI-SearchBot while disallowing GPTBot. You can allow both. You can block both. The right answer depends on the business model. A publisher negotiating licensing terms may take a different approach from a SaaS company that wants brand mentions to appear in future model knowledge. A marketplace may want product discovery in ChatGPT Search but still restrict training use. A public documentation site may decide that both search and training access create more upside than risk.

    A common commercial configuration looks like this:

    User-agent: OAI-SearchBot
    Allow: /
    
    User-agent: ChatGPT-User
    Allow: /
    
    User-agent: GPTBot
    Disallow: /
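The rules above can be sanity-checked programmatically with Python's standard-library robots.txt parser, which resolves each user agent against its own rule group, confirming that the three settings really are independent:

```python
from urllib import robotparser

# Sketch: verify the per-crawler outcomes of the robots.txt example above.
rules = """\
User-agent: OAI-SearchBot
Allow: /

User-agent: ChatGPT-User
Allow: /

User-agent: GPTBot
Disallow: /
"""

rp = robotparser.RobotFileParser()
rp.parse(rules.splitlines())

print(rp.can_fetch("OAI-SearchBot", "https://example.com/article"))  # True
print(rp.can_fetch("ChatGPT-User", "https://example.com/article"))   # True
print(rp.can_fetch("GPTBot", "https://example.com/article"))         # False
```

Running this against your live robots.txt (via `RobotFileParser.set_url` and `read`) is a cheap regression test whenever the file is edited.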

    This is not legal advice, and it is not universal SEO advice. It is a pragmatic example of separating search visibility from model training. Before adopting it, check paywall rules, licensing agreements, internal policy, and whether the content is actually intended to be discoverable through ChatGPT Search. Also verify real crawler IPs against OpenAI's published IP lists, because user-agent spoofing is trivial.
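The IP verification step can be done with the `ipaddress` module. The CIDR range below is a deliberate placeholder (a reserved documentation range), because the real ranges change; load the current list from OpenAI's published crawler IP documentation before trusting the result.

```python
import ipaddress

# Sketch: check a client IP against published crawler ranges before
# trusting its user-agent string. PLACEHOLDER range only -- substitute
# the ranges OpenAI actually publishes for each crawler.
PUBLISHED_RANGES = [ipaddress.ip_network("203.0.113.0/24")]  # TEST-NET-3, illustrative

def is_verified_crawler(ip: str) -> bool:
    """True only when the requesting IP falls inside a published range."""
    addr = ipaddress.ip_address(ip)
    return any(addr in net for net in PUBLISHED_RANGES)

print(is_verified_crawler("203.0.113.7"))   # True: inside the listed range
print(is_verified_crawler("198.51.100.9"))  # False: the UA claim alone proves nothing
```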

    Do not debug this from robots.txt alone. A clean robots file is only the policy. Logs show what happened. Check crawl frequency, status codes, canonical targets, redirects, blocked paths, timeout patterns, and whether important templates receive visits from the crawler you actually care about.

    What to measure in log files

    Most AI visibility dashboards are forced to infer. Server logs do not infer. They show requests. That is why this study matters: it uses the level of data that technical teams can reproduce on their own sites. You do not need a 7-billion-event dataset to learn something useful. You need clean log access, user-agent filtering, IP verification for high-stakes decisions, and a report that separates crawler types.

    Start with four views. First, crawl volume by user agent over time: OAI-SearchBot, GPTBot, ChatGPT-User, and any other AI crawlers you care about. Second, status-code distribution: 200, 301, 302, 304, 403, 404, 410, 429, 500, and timeout patterns. Third, template or directory coverage: which page types are being crawled and which are ignored. Fourth, freshness: how quickly important new pages receive their first visit and how often updated pages are revisited.
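The first two views can be sketched in a few lines against combined-log-format access logs. The regex, helper name, and sample lines below are illustrative; real logs would be streamed from disk rather than held in a list.

```python
import re
from collections import Counter

# Sketch: crawl volume by user agent, plus status-code distribution,
# from combined-log-format lines. Sample log lines are synthetic.
LOG_RE = re.compile(
    r'"[A-Z]+ (?P<path>\S+) [^"]*" (?P<status>\d{3}) \S+ "[^"]*" "(?P<ua>[^"]*)"'
)

AI_AGENTS = ("OAI-SearchBot", "GPTBot", "ChatGPT-User")

def crawler_report(lines):
    """Count requests per AI crawler, and per (crawler, status) pair."""
    volume, statuses = Counter(), Counter()
    for line in lines:
        m = LOG_RE.search(line)
        if not m:
            continue
        agent = next((a for a in AI_AGENTS if a in m.group("ua")), None)
        if agent:
            volume[agent] += 1
            statuses[(agent, m.group("status"))] += 1
    return volume, statuses

sample = [
    '203.0.113.7 - - [01/Mar/2026:10:00:00 +0000] "GET /guide HTTP/1.1" 200 512 "-" "Mozilla/5.0; compatible; OAI-SearchBot/1.3"',
    '203.0.113.7 - - [01/Mar/2026:10:00:01 +0000] "GET /old HTTP/1.1" 404 0 "-" "Mozilla/5.0; compatible; OAI-SearchBot/1.3"',
    '198.51.100.9 - - [01/Mar/2026:10:00:02 +0000] "GET /guide HTTP/1.1" 403 0 "-" "GPTBot/1.3"',
]

volume, statuses = crawler_report(sample)
print(volume)   # OAI-SearchBot: 2, GPTBot: 1
print(statuses) # per-crawler status breakdown, e.g. a GPTBot 403 worth investigating
```

Extending the same loop with the request path and timestamp gives the third and fourth views: template coverage and time-to-first-crawl for new URLs.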

    The most actionable finding is often mundane. A site might allow OAI-SearchBot but serve it a 403 through a bot-protection layer. It might expose article pages but hide category pagination. It might serve canonical tags that point at outdated URLs. It might deliver a 200 response with an empty JavaScript shell. It might block image assets that the page references. Each of these problems looks like an "AI visibility" problem from the outside. In logs and HTML, it is just technical SEO.

    The SEO vs. AEO debate looks less interesting after this data

    The industry has spent a lot of energy arguing whether Answer Engine Optimization is a new discipline. Some of that is useful because answer engines introduce new measurement problems. Some of it is branding. The crawler evidence pushes the practical work back toward fundamentals: crawlability, renderability, internal linking, canonicalization, structured data, page speed, clean content architecture, and authoritative entities.

    That does not mean nothing has changed. The output format has changed. Instead of ten blue links, users may see a synthesized answer, a product carousel, a cited source list, or a conversational recommendation. The measurement layer is less mature. There is no OpenAI Search Console equivalent. Citation tracking is noisy. User journeys are fragmented. But the input layer is surprisingly familiar. A crawler requests a URL. A server returns a response. The system tries to understand the page.

    That is good news for teams with strong technical SEO habits. It means they do not need to buy a mystical AI optimization playbook before fixing the basics. It is bad news for sites that have ignored those basics because they looked fine to human users in Chrome. AI systems are increasing the cost of vague HTML, late-rendered content, blocked crawlers, and unstructured pages.

    A practical checklist for ChatGPT Search readiness

    | Area | What to check | Why it matters |
    | --- | --- | --- |
    | Robots access | Separate OAI-SearchBot, GPTBot, and ChatGPT-User rules. | Search visibility, model training, and user-triggered fetches are not the same thing. |
    | Initial HTML | Confirm the main content appears without executing JavaScript. | Many AI crawlers currently do not render client-side content. |
    | Schema | Render JSON-LD server-side for the relevant page type. | Structured data makes entity and content relationships easier to parse. |
    | Discovery | Use clean internal links, XML sitemaps, canonical URLs, and indexable pagination. | Unknown or orphaned URLs are difficult for search crawlers to prioritize. |
    | Logs | Track crawler volume, status codes, templates, and freshness. | Logs are the only reliable source for what a crawler actually requested. |
    | Performance | Watch server response times, bot protection, redirects, and 5xx spikes. | AI crawlers may be less patient and less forgiving than mature search crawlers. |

    What I would change first on a real site

    If I had one sprint to improve ChatGPT Search readiness, I would not begin with prompt tracking. I would begin with crawl access and HTML. First, verify that OAI-SearchBot is allowed, not challenged, and not accidentally routed through a degraded bot experience. Second, sample the raw HTML of important templates. Third, check whether JSON-LD appears in that raw HTML. Fourth, inspect logs for status-code waste. Fifth, compare sitemap URLs with actual crawler hits.
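The fifth step, comparing sitemap URLs with actual crawler hits, reduces to set arithmetic once both lists are extracted. The URL sets below are illustrative; in practice they would come from the XML sitemap and from log lines filtered to OAI-SearchBot.

```python
# Sketch: diff sitemap URLs against URLs actually requested by the
# crawler. Both sets are illustrative stand-ins for real extractions.
sitemap_urls = {
    "/guides/crawl-budget",
    "/guides/structured-data",
    "/guides/log-analysis",
}

# Paths requested by OAI-SearchBot, extracted from server logs.
crawled_urls = {
    "/guides/crawl-budget",
    "/legacy/old-guide",  # crawled but absent from the sitemap
}

never_crawled = sitemap_urls - crawled_urls  # discovery gap to fix
off_sitemap = crawled_urls - sitemap_urls    # crawl attention spent off-map

print(sorted(never_crawled))  # ['/guides/log-analysis', '/guides/structured-data']
print(sorted(off_sitemap))    # ['/legacy/old-guide']
```

Pages in the first set have a discovery or prioritization problem; pages in the second set may be wasting crawl attention on URLs you no longer promote.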

    Only after that would I start looking at content-level optimization. The content work still matters: clear headings, concise definitions, direct answers, original examples, source references, and updated facts all help retrieval systems understand and trust a page. But content optimization is wasted if the crawler receives a blank shell or never discovers the URL.

    The biggest mindset shift is sequencing. Traditional content teams often write first and ask developers to "make it SEO-friendly" later. For AI search visibility, the rendering and access layer has to be settled earlier. Otherwise a beautiful expert article is effectively private from the crawler that might have cited it.

    Conclusion: AI visibility is becoming a crawl problem again

    The Botify and Nectiv study is valuable because it moves the conversation away from vibes. OpenAI's crawl activity has grown sharply in their enterprise log dataset. OAI-SearchBot has become more prominent relative to GPTBot. ChatGPT-User has declined, possibly because user-triggered fetching is being replaced by more systematic search indexing. The exact numbers will differ by site, but the direction is clear enough to act on.

    For SEO teams, the lesson is not to abandon classic search work. It is to apply it more rigorously. Let the right crawler in. Serve complete HTML. Make structured data visible without JavaScript. Use logs, not assumptions. Separate search access from training access. Measure by template, status code, and freshness. Treat AI visibility as an extension of crawlable, understandable, trustworthy web publishing.

    There will be new interfaces and new reporting layers. There will also be more crawlers from more AI platforms. But the foundation is stable: if the machine cannot discover, fetch, parse, and trust the page, it cannot reliably use the page. That was true for search engines. It is becoming true again for AI answers.
