Where does ChatGPT get its information?

ChatGPT doesn’t draw from a single source. It combines three distinct channels, each with its own rules. I’ll show you which ones, and where to act so that you appear in the answers.

TL;DR

ChatGPT draws from three distinct channels: its training corpus (frozen at a given date), the Bing index it queries in real time via ChatGPT Search, and the session context (memory, files, instructions). Each channel follows different inclusion rules. Working on one doesn’t make you visible in the others.

Training corpus: Common Crawl plus licensed publisher partnerships (News Corp, AP, Reddit, Vox Media, Axel Springer, Le Monde, Financial Times, Guardian, and others).
ChatGPT Search: the Bing index as backbone. 87% of ChatGPT citations match the top organic Bing results.
Session context: uploaded files, user memory, custom instructions.
2026 reality: Wikipedia (13.15%) plus Reddit (11.97%) account for more than 25% of US citations. LinkedIn is climbing fast (#5). WSJ, NYT, Bloomberg, and FT are absent from the top 20 despite their partnerships.

The three information channels of ChatGPT

When you ask ChatGPT a question, the model doesn’t consult one single database. Depending on the query, it combines three sources that differ in nature and timing.

Channel	Nature	Timing	Main lever
Training corpus	Public web plus licensed data	Frozen until next model	Awareness, third-party mentions, Wikipedia
ChatGPT Search	Bing index, real-time	Near-instant	Bing indexing, technical SEO, freshness
Session context	Files, memory, instructions	Length of the conversation	Availability of structured content to load

Which channel is used depends on the question. A general definition? The training corpus is usually enough. Breaking news? The model switches to ChatGPT Search. A business document? It draws on the context the user has loaded. This three-way split explains why the same brand can dominate in one case and be invisible in another.

Channel 1: the training corpus

The model was trained on a massive volume of text from the public web, mainly through Common Crawl, complemented by licensed sources. This corpus is frozen at the cutoff date OpenAI announces. Once that window closes, nothing new gets in until the next model is trained.

Since 2024, OpenAI has multiplied paid publisher partnerships. Schibsted, Axios, Guardian, Hearst, Condé Nast, People Inc., Dotdash Meredith, The Atlantic, Prisa Media, Vox Media, News Corp, Le Monde, Financial Times, Axel Springer, Reddit, and Associated Press are among the public names. These deals grant privileged access to content, either for training, or for ChatGPT Search with explicit attribution.

The flip side: 60% of major publishers now block GPTBot via robots.txt despite the financial incentives available. The corpus is closing off even as it gets monetised. For a non-media brand, the question is no longer how to be ingested as a publisher, but how to be cited by partner publishers.

Why your brand probably isn’t in the corpus. Three recurring reasons: your site is too recent, you block GPTBot by default, or the online content about you is too thin to cross the representation threshold. The consequence: the model doesn’t know you, and can’t cite you without browsing.

Channel 2: ChatGPT Search and the Bing index

Launched in late 2024 and rolled out to everyone in February 2025, ChatGPT Search lets the model query the web in real time. The technical backbone is Microsoft’s Bing index.

The number that matters: 87% of ChatGPT citations match the top organic Bing results, versus 56% for Google. If Bing hasn’t indexed your page, ChatGPT Search can’t cite it. Bing indexing is the prerequisite that comes before any other technical consideration.

OpenAI’s three crawlers

OpenAI operates three distinct bots, with separate roles:

OAI-SearchBot: feeds the ChatGPT Search index.
ChatGPT-User: triggered when a user asks for a page to be read in real time.
GPTBot: used to train future models.

You can allow or block them independently in your robots.txt. Allowing OAI-SearchBot does not sign you up for training.
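To make the independence of the three bots concrete, here is a minimal sketch that checks a robots.txt policy with Python’s standard library. The policy shown (search and user-triggered reads allowed, training blocked) and the example.com URL are illustrative assumptions, not a recommendation; adapt the rules to your own choices.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: open to ChatGPT Search and user-triggered reads,
# closed to training. The user-agent tokens are the three bots named above.
ROBOTS_TXT = """\
User-agent: OAI-SearchBot
Allow: /

User-agent: ChatGPT-User
Allow: /

User-agent: GPTBot
Disallow: /
"""

OPENAI_AGENTS = ["OAI-SearchBot", "ChatGPT-User", "GPTBot"]


def crawler_access(robots_txt: str, url: str, agents: list[str]) -> dict[str, bool]:
    """Return, for each user agent, whether robots_txt lets it fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return {agent: parser.can_fetch(agent, url) for agent in agents}


# example.com/pricing is a placeholder page used only for illustration.
access = crawler_access(ROBOTS_TXT, "https://example.com/pricing", OPENAI_AGENTS)
for agent, allowed in access.items():
    print(f"{agent}: {'allowed' if allowed else 'blocked'}")
```

Running the same check against your live robots.txt is a quick way to confirm that a blanket rule is not blocking OAI-SearchBot by accident.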

Channel 3: session context

This is the source marketing teams most often ignore. When a user uploads a PDF, activates memory, or pastes text into the conversation, that content becomes a primary source for the answer, sometimes outweighing the corpus and Bing combined.

For a brand, this means your prospects build the very context in which ChatGPT evaluates you. The downloaded white paper, the shared analyst report, the copied product page: all of it enters the equation at the exact moment the decision is being prepared.

Who ChatGPT actually cites: 2026 data

Three recent studies converge on a finding that is uncomfortable for traditional media brands. ChatGPT cites few flagship outlets and many community platforms. Across 600,000 US citation events analysed by Similarweb over January and February 2026, Wikipedia accounts for 13.15% and Reddit for 11.97%. Reuters ranks 7th at 2.27%. Forbes closes the top 20 at 1.38%. The Wall Street Journal, The New York Times, Bloomberg, and the Financial Times do not appear, despite their OpenAI partnerships.
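For readers who want to reproduce this kind of figure on their own data, here is a rough sketch of how a citation share can be computed from a list of citation events. The sample events are made up for illustration and are not the Similarweb dataset; only the two percentages added at the end come from the figures quoted above.

```python
from collections import Counter


def citation_shares(cited_domains: list[str]) -> dict[str, float]:
    """Percentage of citation events attributed to each domain."""
    counts = Counter(cited_domains)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {domain: round(100 * n / total, 2) for domain, n in counts.most_common()}


# Toy data: one entry per citation event (illustrative only).
sample_events = (
    ["wikipedia.org"] * 3
    + ["reddit.com"] * 2
    + ["reuters.com"]
    + ["linkedin.com"] * 2
)
shares = citation_shares(sample_events)
print(shares)

# Combined share of Wikipedia and Reddit, using the percentages quoted above.
combined_reported = round(13.15 + 11.97, 2)
print(combined_reported)
```

The sum of the two quoted shares is what puts Wikipedia plus Reddit just above the 25% mark.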

Two notable shifts since late 2025. LinkedIn moved from 11th to 5th place in three months and now appears in 14.3% of ChatGPT Search responses. Reddit saw its citation share collapse from around 60% to 10% in two weeks in September 2025, before partially recovering. Volatility is now weekly, no longer annual.

The annual audit is no longer enough. A ChatGPT visibility map ages in weeks. Without continuous tracking, you make decisions on an outdated snapshot. This is precisely the problem I address at Cockpyt AI.

What this changes for your strategy

Three channels, three timeframes, three levers. Conflating the three is the most frequent mistake I see in client work.

For marketing teams

Your brand has an exposure score per channel, not a global score. You can be over-represented on Bing and invisible in the training corpus. You can dominate Wikipedia and appear nowhere on LinkedIn. Diagnosis must be done channel by channel; otherwise budget decisions are made in the dark.

For SEO and GEO consultants

Six signals to work on, ordered by increasing entry cost:

Complete Bing indexing (absolute prerequisite).
Presence on G2, Capterra, Trustpilot, and Yelp: being listed on at least three of these platforms roughly triples the probability of citation (5W, May 2026).
A neutral, sourced, stable Wikipedia entry.
Editorial activity on LinkedIn with long, structured content.
Relevant, durable, non-spammy Reddit presence.
Earned media in the outlets ChatGPT cites or partners with (Reuters, Forbes, Guardian, Le Monde).

None of these levers acts immediately. All require measurable follow-through. The logical starting point remains a complete GEO audit that ranks the six according to your sector and starting position.

FAQ

Does ChatGPT read my site live on every query?

Only if the user triggers ChatGPT Search or pastes a URL into the conversation. The model does not crawl the web on every request. When it searches, it relies on the Bing index, so your page must be indexed there.

Does my site need to appear on Wikipedia?

Not your site, your brand. A neutral, sourced, stable Wikipedia entry remains the strongest authority signal for ChatGPT. Wikipedia represents about 13% of ChatGPT citations in 2026, according to Similarweb data consolidated by 5W.

Do ChatGPT publisher partnerships guarantee citation?

No. Q1 2026 studies show that WSJ, NYT, Bloomberg, and FT are absent from the top 20 cited sources despite active partnerships. The partnership grants the right to be ingested, not the right to be cited. Editorial quality and content structure weigh more than the commercial label.

How do I know if ChatGPT cites me?

You need to probe the model with a panel of prompts that represent your target intents, aggregate the responses, and track the evolution over time. I detail the method in measuring ChatGPT visibility, and more broadly in AI KPIs to track.
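As a rough illustration of that loop, the sketch below computes the share of prompts in a panel whose answer cites a given domain. `fetch_citations` is a hypothetical callable you would supply yourself (it should return the URLs cited for one prompt, however you collect them); the prompts, URLs, and example.com domain are placeholders.

```python
from typing import Callable, Iterable
from urllib.parse import urlparse


def cited_domains(urls: Iterable[str]) -> set[str]:
    """Normalise cited URLs to bare hostnames (lowercase, no leading 'www.')."""
    domains = set()
    for url in urls:
        host = urlparse(url).netloc.lower()
        if host.startswith("www."):
            host = host[4:]
        if host:
            domains.add(host)
    return domains


def citation_rate(
    prompts: list[str],
    fetch_citations: Callable[[str], list[str]],  # hypothetical, user-supplied
    brand_domain: str,
) -> float:
    """Share of prompts whose answer cites brand_domain at least once.

    Matching is on the exact hostname, so subdomains count as separate domains.
    """
    if not prompts:
        return 0.0
    hits = sum(
        1 for prompt in prompts
        if brand_domain in cited_domains(fetch_citations(prompt))
    )
    return hits / len(prompts)


# Stand-in for real answers: prompt -> URLs cited (illustrative only).
FAKE_ANSWERS = {
    "best crm for small teams": [
        "https://www.example.com/crm",
        "https://en.wikipedia.org/wiki/CRM",
    ],
    "crm pricing comparison": ["https://www.reddit.com/r/sales/"],
}

rate = citation_rate(list(FAKE_ANSWERS), FAKE_ANSWERS.__getitem__, "example.com")
print(f"Cited in {rate:.0%} of prompts")
```

Storing the rate with the run date each time you replay the panel gives you the trend line; a single run is only a snapshot.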

Wrap-up

ChatGPT builds its answers from three sources: a frozen training corpus, the live Bing index, and the context the user loads into the session. Each channel has its own rules, winners, and levers. Working blind means firing into the wrong channel. To identify yours, start with a GEO audit.

Sources

5W PR Group, Wikipedia and Reddit Now Drive Over 25% of ChatGPT Citations in the U.S., May 2026. prnewswire.com
Profound, How ChatGPT sources the web, February 2026. tryprofound.com
Profound, AI Platform Citation Patterns, August 2025. tryprofound.com
OpenAI, Introducing ChatGPT search, October 2024 (updated February 2025). openai.com
OpenAI, Partnering with Axios expands OpenAI’s work with the news industry, January 2025. openai.com
Will Scott, How AI Licensing Deals Determine Search Visibility in 2025, October 2025. willscott.me
Search Engine Journal, ChatGPT Search Indexing: Essential Steps For Websites, November 2024. searchenginejournal.com

Florian Zorgnotti