How do AI engines pick which sites to cite?

Updated · July 5, 2026 — Joffrey Bonifay

When ChatGPT, Perplexity, Gemini or Claude answers a question, it cites a handful of sources, rarely more than three or four. How do those sources get chosen? Here is what we know, mechanism by mechanism, with numbers and studies to back it up.

How do AI engines decide which sites to cite?

In two stages. First, retrieval: the engine gathers candidate pages, either from a search index (ChatGPT leans heavily on Bing, Gemini on Google) or by fetching pages live at answer time. At this stage, classic visibility matters: your site must be crawlable, indexed and readable.

Second, selection: the model reads the retrieved pages and picks the passages that best answer the question, then cites their sources. This stage is passage-level, not site-level — the model quotes the paragraph that answers, not the best-known domain. Being accessible gets you into the candidate pool; having self-contained passages that answer directly converts candidacy into citations. Miss the first stage and you are invisible; miss the second and you are read but never quoted.
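
The two-stage funnel can be sketched as a toy program. This is an illustration only, not how any real engine works: the `Page` type, the word-overlap score and the example pages are all assumptions made up for the sketch, and real engines use far richer retrieval and ranking.

```python
import re
from dataclasses import dataclass
from typing import Optional


@dataclass
class Page:
    url: str
    crawlable: bool        # stage 1 gate: can the engine fetch and index it?
    passages: list[str]


def tokens(text: str) -> set[str]:
    """Lowercased word set, used as a crude stand-in for relevance."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def retrieve(pages: list[Page]) -> list[Page]:
    """Stage 1: only crawlable pages enter the candidate pool."""
    return [page for page in pages if page.crawlable]


def select(question: str, candidates: list[Page]) -> Optional[tuple[str, str]]:
    """Stage 2: pick the (url, passage) sharing the most words with the question."""
    wanted = tokens(question)
    scored = [
        (len(wanted & tokens(passage)), page.url, passage)
        for page in candidates
        for passage in page.passages
    ]
    if not scored:
        return None  # empty pool: nothing can be cited
    _, url, passage = max(scored, key=lambda item: item[0])
    return url, passage


# Hypothetical pages: a big site that only mentions the topic, a small site
# that answers directly, and a blocked site nobody can retrieve.
PAGES = [
    Page("https://big.example/shop", True,
         ["We sell shoes, bags and accessories, and you can return an order."]),
    Page("https://small.example/returns", True,
         ["You have 30 days to return an order, counted from the delivery date."]),
    Page("https://blocked.example/faq", False,
         ["How many days do I have to return an order? You have 30 days."]),
]

QUESTION = "how many days do I have to return an order"
print(select(QUESTION, retrieve(PAGES)))
```

In this sketch the small site's passage is selected: the blocked page never reaches stage 2 despite matching the question best, and the big site's passage only mentions the topic.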

Why does being cited matter so much now?

Because clicks are moving from links to answers, and the numbers are steep. Seer Interactive measured organic click-through rates across 3,000+ informational queries: when a Google AI Overview is present, organic CTR drops by 61%. Ahrefs measured the same effect on the top-ranking result: a 58% reduction in clicks. Pew Research Center found users click a traditional link roughly half as often when an AI summary appears.

The flip side is the opportunity: the same Seer data shows brands cited inside AI answers earn about 35% more organic clicks. The traffic is not disappearing uniformly — it is being rerouted to the handful of sources the engines quote. The game is no longer to rank among ten blue links; it is to be the source the answer cites.

What kind of content do AI engines prefer to cite?

The best public evidence is the GEO study (Aggarwal et al., KDD 2024), which tested nine optimization strategies across 10,000 queries sent to generative engines. The winners were concrete: adding statistics improved a source's visibility by around 41%, adding quotations by around 28%, and citing credible external sources also produced clear gains — up to 30-40% combined, with the biggest lift for sites that do not already rank first. Keyword stuffing did nothing or hurt.

The pattern behind those numbers: answer engines favor passages that are specific, verifiable and self-contained — a claim with a number and a source is easier to quote confidently than a vague marketing sentence. Write paragraphs that each answer one question, directly, with evidence, and you match what the selection stage is looking for. (This very guide is written in that format.)

Do classic SEO signals still matter?

Yes, at the retrieval stage, with one important twist. AI engines mostly discover content through search indexes and their own crawlers, so the fundamentals still gate everything: crawlable pages, indexation, decent titles, server-rendered HTML (several AI crawlers execute little or no JavaScript), and reasonable load times.

The twist: selection is passage-level, so domain authority buys less than it does in classic SEO. A small site whose paragraph answers the question directly can be cited ahead of a big site whose page merely mentions the topic, and that is the opportunity for independents and small businesses. Watch the reverse, though: blocking an AI crawler in robots.txt removes you from that engine's pool entirely. GPTBot, ClaudeBot, PerplexityBot and Google-Extended are each matched by their own user-agent rules (Google-Extended is a control token that Google's existing crawlers honor, not a separate bot), and many sites block them without realizing it.
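
One way to find out whether a site blocks these bots without realizing it is to run its robots.txt through Python's standard `urllib.robotparser`. The sketch below is a minimal check under stated assumptions: the sample robots.txt and the `example.com` URL are hypothetical, and the standard-library parser may not match every engine's own interpretation of the rules.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: the wildcard group only hides /private/,
# but a leftover group blocks GPTBot from the whole site.
SAMPLE_ROBOTS = """\
User-agent: GPTBot
Disallow: /

User-agent: *
Disallow: /private/
"""

# The user-agent tokens named in this guide.
AI_AGENTS = ["GPTBot", "ClaudeBot", "PerplexityBot", "Google-Extended"]


def blocked_agents(robots_txt: str, url: str, agents=AI_AGENTS) -> list[str]:
    """Return the agents that this robots.txt forbids from fetching url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return [agent for agent in agents if not parser.can_fetch(agent, url)]


print(blocked_agents(SAMPLE_ROBOTS, "https://example.com/guide"))
# prints ['GPTBot']
```

To check a live site, `RobotFileParser` can also load the file itself with `set_url(...)` followed by `read()`, in place of `parse(...)`.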

What should you change on your site to get cited more?

Five things, in increasing order of effort. One: explicitly allow AI crawlers in your robots.txt, with one User-agent group per bot. Two: publish an llms.txt at your root (a proposed convention, not yet a standard), so engines get a clean summary of who you are and what you offer. Three: add structured data, meaning Q&A markup in schema.org format (JSON-LD) that exposes machine-readable question-and-answer pairs on your key pages.

Four: reshape those pages as questions with direct, self-contained answers of roughly 40 to 160 words, backed by real facts, numbers and dates. Five: show a visible updated date and an accurate dateModified — freshness is a signal answer engines use. The first three are mechanical and fast; the last two are editorial and make the durable difference.
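
Steps three to five can be sketched in a few lines of Python. This is a minimal sketch, not a definitive implementation: the choice of the schema.org `FAQPage` type is an assumption (the guide only says "Q&A markup"), the function names and the sample pair are hypothetical, and the 40 to 160 word range is the rule of thumb given above.

```python
import json
from datetime import date


def faq_jsonld(pairs: list[tuple[str, str]], modified: date) -> str:
    """Serialize question/answer pairs as schema.org FAQPage JSON-LD."""
    data = {
        "@context": "https://schema.org",
        "@type": "FAQPage",              # assumed type for site-authored Q&A
        "dateModified": modified.isoformat(),
        "mainEntity": [
            {
                "@type": "Question",
                "name": question,
                "acceptedAnswer": {"@type": "Answer", "text": answer},
            }
            for question, answer in pairs
        ],
    }
    return json.dumps(data, indent=2, ensure_ascii=False)


def answers_outside_range(pairs, low: int = 40, high: int = 160) -> list[str]:
    """Return the questions whose answers fall outside low..high words."""
    return [q for q, a in pairs if not low <= len(a.split()) <= high]


# Hypothetical content for illustration only.
PAIRS = [
    ("Do classic SEO signals still matter?",
     "Yes, at the retrieval stage."),
]

print(faq_jsonld(PAIRS, date(2026, 7, 5)))
print("Needs editing:", answers_outside_range(PAIRS))
```

The JSON string would go inside a `<script type="application/ld+json">` element on the page, and the date passed in should match the updated date shown to readers.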

Can anyone guarantee you'll be cited by AI?

No — and be wary of anyone who promises it. Nobody controls ChatGPT, Perplexity, Gemini or Claude: answers vary by question, by day and by engine, and the criteria keep evolving. What you do control is being in the candidate pool and being the easiest source to quote — the two stages described above.

Mechanical changes (robots.txt, llms.txt, schema.org markup) take effect as soon as engines re-crawl your site, typically within days to a few weeks. To measure progress, keep it simple: ask the engines the questions your customers actually ask, see who they cite, then re-ask regularly. That is Citeable's logic: a best-efforts obligation, not a guarantee of results. We make your site as readable and citable as possible, and the studies above are why it is worth doing.
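
The measurement routine described above can be kept in a small hand-filled log and tallied with a few lines of Python. The sketch below assumes you record each answer manually; the log format, the function names and every domain in it are hypothetical placeholders.

```python
from collections import Counter

# Hypothetical log, filled in by hand after asking each engine a question.
OBSERVATIONS = [
    {"engine": "ChatGPT", "question": "q1", "cited": ["mysite.example", "rival.example"]},
    {"engine": "Perplexity", "question": "q1", "cited": ["rival.example"]},
    {"engine": "Gemini", "question": "q2", "cited": ["mysite.example"]},
    {"engine": "Claude", "question": "q2", "cited": ["other.example", "rival.example"]},
]


def citation_share(observations, domain: str) -> float:
    """Fraction of recorded answers that cited domain at least once."""
    if not observations:
        return 0.0
    hits = sum(1 for obs in observations if domain in obs["cited"])
    return hits / len(observations)


def most_cited(observations, n: int = 3):
    """The n domains cited in the most answers, counted once per answer."""
    counts = Counter(d for obs in observations for d in set(obs["cited"]))
    return counts.most_common(n)


print(citation_share(OBSERVATIONS, "mysite.example"))  # prints 0.5
print(most_cited(OBSERVATIONS, 1))  # prints [('rival.example', 3)]
```

Re-running the same questions on a fixed schedule and comparing the share over time gives a rough trend; because answers vary by day and by engine, a single run says little.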