How Much of the Web Does ChatGPT Actually Use?

By the AEOeye editorial team · Updated Jun 26, 2026

The short answer

ChatGPT uses two very different slices of the web. Its model was trained on a filtered crawl that boiled roughly 45TB of raw text down to about 570GB. But for live answers it reads only a handful of pages per query, fetched fresh from a search index. Neither touches most of the web.

Here's the thing almost everyone gets wrong: ChatGPT does not "use the whole internet." It uses two separate, surprisingly thin slices of it, and they work in completely different ways.

One slice is frozen training data from a few years ago. The other is live retrieval that reads maybe five to ten pages the moment you hit enter. If you want your content cited, you need to understand which slice you're fighting for. This page splits them cleanly and backs every number with a real source.

Training vs retrieval: which 'web' are we even talking about?

ChatGPT touches the web in two distinct ways, and conflating them is the single biggest mistake in AEO (answer engine optimization). Training is a one-time, frozen snapshot baked into the model's weights. Retrieval (ChatGPT Search) is live: it queries a search index in real time and reads a few pages per answer. Training shapes the model's default knowledge; retrieval is what cites your URL today.

Training data = the corpus used to build the model. Static. Has a cutoff date. You cannot rank in it after the fact.
Retrieval = live web search at query time. Dynamic. This is where fresh content gets pulled in and linked.

If your goal is getting cited in answers right now, retrieval is the game. Training is mostly out of your hands.

How much data was ChatGPT trained on?

The GPT-3 model family behind early ChatGPT was trained largely on a heavily filtered crawl: roughly 45TB of compressed plain text from Common Crawl that was cleaned down to about 570GB, equal to roughly 400 billion tokens. That 570GB is the number that matters. It is a tiny, curated fraction of what got crawled, let alone the whole web.
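As a quick check on the scale, the arithmetic can be sketched in a few lines of Python. This is only an illustration: it assumes decimal units (1 TB = 1000 GB), and the constant and function names are made up for this example.

```python
# Figures quoted above from the GPT-3 paper.
# Assumption: decimal units, so 1 TB = 1000 GB.
RAW_TB = 45         # compressed plain text before filtering
FILTERED_GB = 570   # text kept after filtering


def kept_fraction(raw_tb: float, filtered_gb: float) -> float:
    """Return the share of the raw crawl that survived filtering."""
    return filtered_gb / (raw_tb * 1000)


fraction = kept_fraction(RAW_TB, FILTERED_GB)
print(f"Kept after filtering: {fraction:.2%}")  # a little over 1%
```

In other words, under these assumptions, only a little more than one byte in every hundred made it through the filter.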

The composition of that GPT-3 dataset, per OpenAI's own paper, broke down like this:

~60% filtered Common Crawl (open web pages)
~22% WebText2 (outbound links from Reddit posts with 3+ upvotes)
~16% two internet book corpora (Books1 and Books2, ~8% each)
~3% English Wikipedia

Notice what that means: the model's web knowledge is filtered web, not raw web. Quality signals like Reddit upvotes literally decided what made the cut. Source: the GPT-3 paper, 'Language Models are Few-Shot Learners'.

How big is the crawl ChatGPT trained on, really?

Common Crawl is enormous on its own, yet it still captures only a slice of the live web. The corpus spans over 100 billion pages across multiple petabytes collected over a decade, and each monthly crawl now grabs roughly 2 to 3 billion pages. The September 2025 crawl alone held about 2.39 billion pages and 421 TiB uncompressed, per Common Crawl's own release notes.

But here's the catch that AEO folks need to internalize: Common Crawl explicitly does not crawl the deep web, password-protected pages, login-walled content, or sites that block bots. So even the raw input excludes huge swathes of the internet before any filtering happens.
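In practice, "sites that block bots" usually means a robots.txt rule. The sketch below uses Python's standard `urllib.robotparser` to check which crawlers a given robots.txt admits. The robots.txt content and the example.com URL are hypothetical. CCBot is the user agent Common Crawl's crawler identifies as, and GPTBot and OAI-SearchBot are user-agent names OpenAI publishes for its crawlers; none of these names come from this article, so verify them against each operator's current documentation.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: blocks Common Crawl's CCBot, allows everyone else.
ROBOTS_TXT = """\
User-agent: CCBot
Disallow: /

User-agent: *
Allow: /
"""


def allowed_agents(robots_txt: str, url: str, agents: list[str]) -> dict[str, bool]:
    """Map each user agent to whether robots_txt lets it fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return {agent: parser.can_fetch(agent, url) for agent in agents}


result = allowed_agents(
    ROBOTS_TXT,
    "https://example.com/guide",
    ["CCBot", "GPTBot", "OAI-SearchBot"],
)
for agent, ok in result.items():
    print(f"{agent}: {'allowed' if ok else 'blocked'}")
```

A site with a rule like this would be missing from Common Crawl's input while still being reachable by the other two agents.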

Then OpenAI filtered that down to ~570GB for GPT-3. The web is estimated in the thousands of petabytes. The training set is a rounding error against the whole thing.

How much of the web does ChatGPT read for a live answer?

For a live answer, ChatGPT reads almost nothing, just a handful of pages per query. ChatGPT Search rewrites your prompt into one or more targeted search queries, pulls back a short list of results from its search partner index, and reads the top few it can cleanly parse. We're talking single digits, not the whole web.
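The rewrite, search, and read steps described above can be sketched as a toy pipeline. Everything here is a hypothetical stand-in: OpenAI does not publish its implementation, so `rewrite_prompt`, `FAKE_INDEX`, the `parseable` flag, and the `max_pages` cap are assumptions chosen only to show the shape of the flow.

```python
from dataclasses import dataclass


@dataclass
class Page:
    url: str
    text: str
    parseable: bool  # hypothetical flag: could the page be cleanly extracted?


# Hypothetical stand-in for a search index: query -> ranked pages.
FAKE_INDEX = {
    "gpt-3 training data size": [
        Page("https://example.com/a", "GPT-3 used about 570GB of filtered text.", True),
        Page("https://example.com/b", "", False),
        Page("https://example.com/c", "The raw crawl was roughly 45TB.", True),
    ],
}


def rewrite_prompt(prompt: str) -> list[str]:
    """Toy query rewriting: lowercase and drop the trailing question mark."""
    return [prompt.lower().rstrip("?").strip()]


def retrieve(prompt: str, max_pages: int = 5) -> list[Page]:
    """Run each rewritten query and keep the top parseable pages, in rank order."""
    pages: list[Page] = []
    for query in rewrite_prompt(prompt):
        for page in FAKE_INDEX.get(query, []):
            if page.parseable and page not in pages:
                pages.append(page)
    return pages[:max_pages]


sources = retrieve("GPT-3 training data size?")
for page in sources:
    print(page.url)
```

Note how the unparseable page is skipped even though it ranked second in this toy index. That is the behavior the rest of this article is about.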

OpenAI launched ChatGPT search broadly in late 2024 and early 2025. As of 2025 there are three retrieval depths, and the page count scales hard with each:

Standard web search — a handful of sources per answer, query-driven, with inline citations.
Deep Research — synthesizes dozens to hundreds of sources for a single report.
Agent mode — actively clicks, scrolls, and scrapes across sites.

For most everyday questions, you're competing to be one of about five to ten links the model chooses to read. That scarcity is the whole ballgame for AEO.

How does ChatGPT decide which pages to use?

ChatGPT doesn't grab result #1 and stop. It scans the available sources and prioritizes pages it can cleanly parse and reuse, which rewards structure and punishes mush. Readability and machine-parseability are the deciding factors, not just raw rank.

What tips the scales, based on how the retrieval layer behaves:

Clean HTML structure — well-formed tables, clear headings, labeled sections.
Answer-first content — a direct answer near the top it can lift verbatim.
Credibility cues — visible author names, dates, consistent formatting, citations.
Licensed publishers — OpenAI has content deals with AP, Reuters, Condé Nast, Hearst and others, which feed the index.
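To make the "clean HTML structure" point concrete, here is a rough sketch that counts the structural tags a parser could lean on. It is a hypothetical heuristic built on Python's standard `html.parser`, not a description of how ChatGPT scores pages; the tag list and the two sample pages are assumptions for illustration.

```python
from html.parser import HTMLParser


class StructureCounter(HTMLParser):
    """Count structural tags that help a parser locate content."""

    TRACKED = ("h1", "h2", "h3", "table", "ul", "ol", "time")

    def __init__(self) -> None:
        super().__init__()
        self.counts = {tag: 0 for tag in self.TRACKED}

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag names before calling this method.
        if tag in self.counts:
            self.counts[tag] += 1


def structure_signals(html: str) -> dict[str, int]:
    """Summarize headings, tables, lists, and dated elements in an HTML string."""
    counter = StructureCounter()
    counter.feed(html)
    counter.close()
    c = counter.counts
    return {
        "headings": c["h1"] + c["h2"] + c["h3"],
        "tables": c["table"],
        "lists": c["ul"] + c["ol"],
        "dates": c["time"],
    }


# Two hypothetical pages carrying the same information.
STRUCTURED = (
    "<h1>Title</h1><p>Answer.</p><h2>Data</h2>"
    "<table><tr><td>1</td></tr></table>"
    "<time datetime='2026-06-26'>Jun 26</time>"
)
MUSH = "<div><div>Title<br>Answer and data all in one blob</div></div>"

print(structure_signals(STRUCTURED))
print(structure_signals(MUSH))
```

The second page gives a parser nothing to anchor on, which is the "mush" that loses out in the selection described above.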

If the model can't quickly understand and extract from your page, it moves to one it can. This is why a thin, ad-choked page loses to a tightly structured one even on the same topic. Want to see which of these queries you already show up for? AEOeye runs a free AI visibility audit across ChatGPT, Perplexity, Google AI, Claude and Gemini so you can stop guessing.

So what should you actually do about it?

Stop optimizing for training, start optimizing for retrieval. You can't retroactively get into a frozen 570GB training set, but you can make your live pages the easy pick for the five-to-ten links ChatGPT reads per query. That's where the leverage is.

Practical moves, in priority order:

Lead with the answer. Put a 40-60 word direct answer at the top of every page. Retrieval models lift these.
Make it machine-readable. Real HTML headings, tables, lists. Add schema.org structured data so parsers don't guess.
Don't block the crawlers. If you wall off content or block AI bots, you self-eliminate from both training and live retrieval.
Earn credibility signals. Bylines, dates, sources, and outbound citations to authoritative pages.
Audit your current visibility. Measure where you appear before and after changes, and don't rely on a single account, since personalized results can mislead.
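For the "make it machine-readable" step, one common option is schema.org FAQPage markup embedded as JSON-LD. The sketch below builds that markup with Python's standard `json` module. The helper name `faq_jsonld` and the sample question are illustrative assumptions, and nothing here guarantees that any particular engine will use the markup.

```python
import json


def faq_jsonld(pairs: list[tuple[str, str]]) -> dict:
    """Build a schema.org FAQPage object from (question, answer) pairs."""
    return {
        "@context": "https://schema.org",
        "@type": "FAQPage",
        "mainEntity": [
            {
                "@type": "Question",
                "name": question,
                "acceptedAnswer": {"@type": "Answer", "text": answer},
            }
            for question, answer in pairs
        ],
    }


faq = faq_jsonld([
    (
        "Does ChatGPT use the entire internet?",
        "No. Its training set is a filtered fraction of the web, and live "
        "retrieval reads only a handful of pages per query.",
    ),
])

# Wrap the JSON in the script tag that carries JSON-LD inside an HTML page.
script_tag = (
    '<script type="application/ld+json">\n'
    + json.dumps(faq, indent=2)
    + "\n</script>"
)
print(script_tag)
```

The printed tag would go in the page's HTML alongside the visible question and answer, so the structured version and the human-readable version stay in sync.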

Key terms

Common Crawl: A free, open repository of web crawl data spanning 100B+ pages over multiple petabytes, widely used to train large language models. It skips the deep web, login walls, and bot-blocked sites.
Training data: The static corpus used to build a model's weights. It has a cutoff date and cannot be changed after training, which is why it goes stale.
Retrieval (RAG / live search): Fetching fresh pages from a search index at query time and feeding them to the model so it can cite current sources beyond its training cutoff.
Token: The unit of text a language model processes, roughly a word fragment. The ~570GB of filtered Common Crawl text in GPT-3's training set equated to about 400 billion tokens.

Dimension	Training data	Live retrieval
What it is	Frozen snapshot baked into model weights	Real-time web search at query time
Scale of web used	~570GB filtered text (from ~45TB compressed)	~5-10 pages per query
Freshness	Stuck at a cutoff date	Current, fetched on the spot
Can you influence it now?	No: set in stone after training	Yes: structure and publish for it
What wins	Being in the corpus years ago	Machine-readable, answer-first pages

Key takeaways

ChatGPT uses the web in two unrelated ways: frozen training data and live retrieval. Don't confuse them.
GPT-3's training set was filtered from ~45TB of compressed text down to roughly 570GB, about 400 billion tokens. A tiny slice of the web.
Around 60% of that GPT-3 data came from filtered Common Crawl, plus ~22% Reddit-linked WebText2 and ~16% books.
Common Crawl spans 100B+ pages over multiple petabytes but skips the deep web, login walls, and bot-blocked sites entirely.
For a live answer, ChatGPT reads only a handful of pages per query, not the whole web. That scarcity is what AEO competes for.
Win retrieval by being machine-readable and answer-first; you can't retroactively enter the training set.

FAQ

How much data does ChatGPT use in total?

It depends on which mode. The underlying GPT-3 model was trained on a filtered corpus of roughly 570GB of text (boiled down from about 45TB compressed). For a live answer via ChatGPT Search, it reads only a handful of fresh pages per query, not a large dataset.

Does ChatGPT use the entire internet?

No. Its training set is a filtered fraction of the web that excludes the deep web, login-walled pages, and bot-blocked sites. Its live retrieval reads only about five to ten pages per query. Most of the internet never touches a given ChatGPT answer.

What percentage of ChatGPT's training came from Common Crawl?

For GPT-3, roughly 60% of the training mix came from a filtered version of Common Crawl. The rest was about 22% WebText2 (Reddit-linked pages), 16% book corpora, and about 3% Wikipedia, per OpenAI's GPT-3 paper.

Is ChatGPT's training data up to date?

No. Training data is a frozen snapshot with a cutoff date. That's exactly why live retrieval (ChatGPT Search) exists, to pull current pages at query time. If you want to be cited on recent topics, you're competing in retrieval, not training.

How can I tell if ChatGPT uses my website?

Run a visibility audit that tests real prompts across engines and shows whether your pages get cited. AEOeye offers a free audit across ChatGPT, Perplexity, Google AI, Claude, and Gemini so you can see exactly where you appear instead of guessing.