- If your URL is in the AI Overview source list but a competitor’s passage gets quoted, you have a selection problem. If your URL isn’t in the source list at all, you have a retrieval problem. These need opposite fixes.
- In our audits, cited passages usually run 40–80 words, name the subject early, and land the claim before any qualifier. Passages over ~110 words tend to get truncated or skipped.
- FAQPage schema only helps when the answer text matches a sentence in the body word-for-word. Per Google’s FAQPage docs, the on-page content must match the structured data.
- Most pages don’t need a rewrite. They need the first 60 words of each section refactored so the answer comes before the setup.
- Run the four passes in order: citation diff, entity coverage, atomic-answer extraction, schema reinforcement. Skip the order and you’ll spend rewrite budget on the wrong layer.
Questions this article answers:
- Why does Google AI Overview scrape my page but cite a competitor or Reddit thread?
- How long should an atomic answer be, and where on the page does it need to live?
- What schema actually reinforces AI citations vs. just being SEO theater?
- Should I rewrite the page or just refactor the first 60 words of each section?
- How do I measure GEO success when AIO impressions don’t show in GSC?
If AI Overviews are scraping your page but citing a Reddit thread, you don’t have a content problem. You have a selection problem. The page got retrieved. The page got read. The model picked a cleaner sentence somewhere else. A GEO content audit exists to diagnose which layer failed, before anyone touches the copy.
GEO stands for generative engine optimization. It is the work of getting your pages cited by AI answer engines like Google AI Overviews, Perplexity, ChatGPT search, and Gemini. It is not the same as ranking. A page can sit at position 3 in Google for six months and never appear in a single AI Overview citation. The fix isn’t writing more. It’s restructuring what’s already there so the model can lift a clean answer.
This guide walks through a four-pass audit: citation diff, entity coverage, atomic-answer extraction, and schema-claim reinforcement. Run them in order. The first pass tells you which of the other three you actually need.
A Page Can Rank in the Top 10 and Still Lose Every AI Overview Citation
Ranking and citation are two different jobs. Ranking decides which blue links appear. Citation decides which sentences get quoted inside the AI answer box. The systems share inputs (crawlability, authority, relevance) but score on different signals.
Per Google Search Central’s documentation on AI features in Search, AI Overviews use generative responses grounded in web results, with a separate process selecting which passages get surfaced. Perplexity runs its own retrieval-augmented stack and crawls with PerplexityBot. In our experience, both systems tend to favor short, self-contained passages where the subject is named explicitly and the claim lands before any qualifier.
A top-10 ranking proves the page is on-topic. It doesn’t prove any single passage is extractable. That is what the four-pass audit is built to find.
Retrieval Failure and Selection Failure Need Opposite Fixes
Most GEO guides treat citation gaps as one problem with one fix. They are two problems with opposite fixes.
- Retrieval failure: your URL never appears in the AI Overview source list. The model didn’t consider you. Fix: crawlability, internal linking, and entity coverage so the model knows the page is on-topic.
- Selection failure: your URL appears in the source list, but the cited sentence comes from someone else. The model considered you and picked a cleaner passage elsewhere. Fix: refactor the atomic answer so yours is the extractable one.
Rewriting the page when you have a retrieval failure is wasted work. Adding schema when you have a selection failure is decoration. The first pass tells you which one you’re solving for.
Pass 1: Run a Citation Diff Before You Change a Single Word
A citation diff compares the passage the AI Overview actually quoted against the equivalent passage on your page. It is the only diagnostic that separates retrieval failure from selection failure. It takes about 15 minutes per query.
Here’s the workflow:
- Pull the live AI Overview answer block for your target query. Use a clean browser session (no personalization) or a tool like Profound or Otterly.ai to capture the answer and source list at scale.
- List every source URL the AIO cites. Note which one provided the quoted passage (usually the first or most-linked source).
- Check if your URL is in the source list. If it is, but the quoted passage came from another source, you have a selection failure. If it isn’t, you have a retrieval failure.
- For selection failures: copy the quoted passage. Pull the equivalent passage from your page. Diff them sentence by sentence. Look at subject placement, claim placement, qualifier placement, and length.
- For retrieval failures: skip to Pass 2. The diff has nothing to compare yet.
Run this across 10–20 target queries before drawing conclusions. One missed citation is noise. A pattern across 20 queries is a diagnosis.
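The triage step in the workflow above can be roughed out in a few lines of Python. This is a minimal sketch, not a tool: `AioCapture`, `classify`, and `passage_shape` are hypothetical names, the capture fields are assumptions about what you record by hand or export from a tracking tool, and URLs are compared as exact strings with no normalization.

```python
import re
from dataclasses import dataclass


@dataclass
class AioCapture:
    """One captured AI Overview for one query (hypothetical structure)."""
    query: str
    source_urls: list[str]   # every URL in the AIO source list
    quoted_url: str          # the source that supplied the quoted passage
    quoted_passage: str


def classify(capture: AioCapture, my_url: str) -> str:
    """Label the outcome for my_url on this query."""
    if my_url not in capture.source_urls:
        return "retrieval_failure"   # never in the source list: go to Pass 2
    if capture.quoted_url != my_url:
        return "selection_failure"   # in the list, but another source was quoted
    return "cited"


def passage_shape(passage: str) -> dict:
    """Rough shape metrics to put side by side when diffing two passages."""
    words = passage.split()
    # Naive sentence split: break after ., ! or ? followed by whitespace.
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", passage.strip()) if s]
    return {
        "word_count": len(words),
        "sentence_count": len(sentences),
        "avg_sentence_words": round(len(words) / max(len(sentences), 1), 1),
    }
```

Run `passage_shape` on both the quoted passage and your own equivalent passage, then compare the numbers. Subject, claim, and qualifier placement still need a human read.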
Two Ratios That Tell You Which Pass to Run Next
Two simple ratios turn citation data into a triage signal:
- Citation retrieval rate = queries where your URL appears in the AIO source set ÷ queries where you rank in the top 10 organic. A low number means retrieval failure. The model isn’t pulling you in.
- Citation pickup rate = queries where your URL is cited in the AIO answer ÷ queries where your URL is in the source set. A low number means selection failure. The model pulled you in and picked someone else.
Healthy retrieval but low pickup means clean passages buried under bad structure. Low retrieval means the model doesn’t see the page as topically relevant in the first place.
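As a worked example, both ratios can be computed from a simple per-query tracking sheet. The sketch below assumes each row carries three booleans (`top10`, `in_sources`, `cited`), which are illustrative field names rather than an export format from any particular tool. It also assumes the retrieval-rate numerator is restricted to queries where you rank in the top 10, so the ratio stays between 0 and 1.

```python
def citation_rates(rows):
    """Return (retrieval_rate, pickup_rate) for a list of per-query rows.

    Each row is a dict with boolean keys 'top10', 'in_sources', 'cited'.
    A rate is None when its denominator is zero.
    """
    top10 = [r for r in rows if r["top10"]]
    retrieved = [r for r in top10 if r["in_sources"]]
    in_sources = [r for r in rows if r["in_sources"]]
    cited = [r for r in in_sources if r["cited"]]

    retrieval_rate = len(retrieved) / len(top10) if top10 else None
    pickup_rate = len(cited) / len(in_sources) if in_sources else None
    return retrieval_rate, pickup_rate


rows = [
    {"top10": True, "in_sources": True, "cited": True},
    {"top10": True, "in_sources": True, "cited": False},
    {"top10": True, "in_sources": False, "cited": False},
    {"top10": True, "in_sources": False, "cited": False},
]
retrieval_rate, pickup_rate = citation_rates(rows)
print(retrieval_rate, pickup_rate)  # 0.5 0.5
```

In this made-up sample, the URL is in the source set for two of four top-10 queries and cited for one of those two.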
Read Server Logs to Confirm Selection Failure
Server logs are the cleanest signal nobody checks. Filter your access logs for GPTBot, PerplexityBot, and ClaudeBot user agents over the last 30 days. Leave Google-Extended out of the filter: it is a robots.txt control token, not a user-agent string, so it never shows up in logs. Google’s fetches appear under its regular Googlebot user agents. If those bots are hitting a URL but it never appears in answers for related queries, you have confirmed selection failure. The model saw the page. The model didn’t pick it.
That is the “scraped but not cited” signature, and it points straight at Pass 3.
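A log filter along these lines is enough for a first look. It is a sketch under stated assumptions: the logs are in the common "combined" access log format, bots are matched by a plain substring in the user-agent field, and `bot_hits` is a hypothetical helper name. Google-Extended is not in the list because, per Google’s crawler documentation, it is a robots.txt control token rather than a user-agent string that appears in requests. Date filtering for the 30-day window is left out for brevity.

```python
import re
from collections import Counter

AI_BOTS = ("GPTBot", "PerplexityBot", "ClaudeBot")

# Combined log format: ... "METHOD /path HTTP/x" status size "referer" "user-agent"
LINE_RE = re.compile(
    r'"(?:GET|HEAD|POST) (?P<path>\S+) [^"]*" \d{3} \S+ "[^"]*" "(?P<ua>[^"]*)"'
)


def bot_hits(lines):
    """Count hits per (path, bot) for the AI crawlers listed in AI_BOTS."""
    hits = Counter()
    for line in lines:
        match = LINE_RE.search(line)
        if not match:
            continue  # skip lines that are not in the expected format
        for bot in AI_BOTS:
            if bot in match.group("ua"):
                hits[(match.group("path"), bot)] += 1
                break
    return hits
```

Feed it an open log file, for example `bot_hits(open("access.log"))`, then cross-reference the paths with your citation tracking to find URLs that are crawled but never cited.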

Pass 2: Audit Entity Coverage When the LLM Doesn’t Know Your Page Is On-Topic
Retrieval failures usually trace back to one structural gap: the page assumes the reader already knows what the topic is. The subject is never stated as a subject-predicate sentence in the first paragraph. The model has to infer it from context, and it often infers wrong.
Implicit entity definition is the most common structural failure we see in pages that rank but don’t get cited. Adding a clean “X is a Y that does Z” sentence in the first paragraph closes a meaningful share of citation gaps before any schema work happens.
The First-Paragraph Subject-Predicate Test
Open the page. Read the first 100 words. Ask: does a one-sentence definition of the main entity appear, in subject-predicate form, before anything else?
If the page is about journey-aware bidding, the first paragraph needs a sentence like: “Journey-aware bidding is a Google Ads feature that values conversions based on where they sit in the customer path.” Not “Bidding has changed a lot lately.” Not “If you’re running Google Ads, you’ve probably noticed…” The definition has to be there, named, in the open.
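The first-paragraph test can be approximated with a pattern match before a human reads the page. The sketch below is a rough heuristic, not a parser: `has_definition` is a hypothetical helper, and it only looks for the entity name followed by “is” or “are” and an article within the first 100 words, so it will miss definitions phrased any other way.

```python
import re


def has_definition(text: str, entity: str, window: int = 100) -> bool:
    """True if '<entity> is/are a|an|the ...' appears in the first `window` words."""
    opening = " ".join(text.split()[:window])
    pattern = re.compile(
        r"\b" + re.escape(entity) + r"\s+(?:is|are)\s+(?:a|an|the)\b",
        re.IGNORECASE,
    )
    return bool(pattern.search(opening))


good = ("Journey-aware bidding is a Google Ads feature that values "
        "conversions based on where they sit in the customer path.")
bad = "Bidding has changed a lot lately. If you're running Google Ads, you've probably noticed."

print(has_definition(good, "journey-aware bidding"))  # True
print(has_definition(bad, "journey-aware bidding"))   # False
```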
Entity disambiguation matters here too. The acronym GEO collides with geographic. The term attribution collides across marketing, psychology, and law. If your topic shares a name with a more famous entity, define it explicitly on first use or the model will route the query to the dominant meaning.
Anchor Text Is Entity Disambiguation, Not Just Link Equity
Internal links do two jobs. They pass authority, and they tell the model what the destination page is about. Most teams only think about the first job.
Link to a page about atomic answers with the anchor “atomic answers in GEO,” not “read more here.” Link to your AI Overview citation guide with “AI Overview citation optimization,” not “this post.” Generic anchor text wastes the disambiguation signal. Our companion piece on why Perplexity scrapes lead-gen pages but cites Reddit instead covers the retrieval-side fix in more depth.
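Generic anchors are easy to flag mechanically. Here is a minimal sketch using Python’s standard-library `html.parser`. The `GENERIC_ANCHORS` list is an assumption to tune for your own site, and `AnchorCollector` and `generic_anchors` are hypothetical names.

```python
from html.parser import HTMLParser

# Illustrative list of anchors that carry no topical signal.
GENERIC_ANCHORS = {"read more", "read more here", "click here", "here",
                   "this post", "learn more"}


class AnchorCollector(HTMLParser):
    """Collect (href, anchor text) pairs from <a href="..."> elements."""

    def __init__(self):
        super().__init__()
        self.links = []
        self._href = None
        self._text = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._href = dict(attrs).get("href")
            self._text = []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            text = " ".join("".join(self._text).split())  # collapse whitespace
            self.links.append((self._href, text))
            self._href = None


def generic_anchors(html: str):
    """Return the links whose anchor text is in GENERIC_ANCHORS."""
    parser = AnchorCollector()
    parser.feed(html)
    return [(href, text) for href, text in parser.links
            if text.lower() in GENERIC_ANCHORS]
```

Anything this returns is a candidate for a descriptive anchor that names the destination page’s main entity.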
Auditing Entity Coverage Without an Enterprise Tool
You don’t need a six-figure platform. Three free steps:
- Paste your page into ChatGPT or Claude and ask: “What is the main entity of this page? List every supporting entity mentioned, and flag any that are referenced but never defined.”
- Pull the AIO source list for your target query. Note every entity the cited passage names. Compare against your page.
- For any entity the AIO answer names that your page assumes, add a subject-predicate definition sentence in the relevant section.
This is the single highest-leverage fix in the four-pass audit.
Pass 3: Extract Atomic Answers Into the First 60 Words of Every Section
Selection failures live here. The page was retrieved. The model read it. The cited sentence came from somewhere else because somewhere else had a cleaner one. An atomic answer is a self-contained sentence (or short paragraph) that answers the section’s implied question with the subject named, the claim stated, and no qualifier in front.
What the Cited Passage Actually Looks Like
Pull 20 AIO citations across your space and study them. In our experience, the same shape tends to repeat:
- The subject is a noun phrase near the start of the passage. Not “It is important to note that…” Not “When considering…” The thing itself, named.
- A numeric or definitional claim lands before any qualifier. “Cited passages run 40–80 words” comes before “depending on query type.”
- Total length sits in a tight range, most often 40 to 80 words in the diffs we’ve run.
- No more than one dependent clause per sentence.
- The passage works lifted out of context. A reader who never saw the rest of the page still understands the answer.
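Two of the traits above, length and a subject-first opening, can be checked automatically across many sections. The sketch below is a heuristic under stated assumptions: the 40 to 80 word range comes from the diffs described in this article, `FILLER_OPENERS` is an illustrative list rather than an exhaustive one, and `check_atomic_answer` is a hypothetical helper. Clause counting and the lifted-out-of-context test still need a human.

```python
# Illustrative setup phrases that delay the subject.
FILLER_OPENERS = ("it is important", "when considering", "when you",
                  "if you", "in today")


def check_atomic_answer(passage: str, min_words: int = 40, max_words: int = 80):
    """Return a list of issues; an empty list means the passage passes."""
    words = passage.split()
    lowered = passage.strip().lower().replace("\u2019", "'")
    issues = []
    if len(words) < min_words:
        issues.append(f"too short: {len(words)} words")
    elif len(words) > max_words:
        issues.append(f"too long: {len(words)} words")
    if lowered.startswith(FILLER_OPENERS):
        issues.append("opens with setup, not the subject")
    return issues
```

Run it on the first paragraph of each section and review anything that returns a non-empty list.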
Narrative Intros Get Scraped and Ignored
Narrative intros, the ones that warm up with a story before getting to the point, are exactly what AI engines skip. The bot reads the section. The bot scores each candidate passage. The passage that scores highest is the one where the noun phrase comes first and the claim isn’t gated behind a setup clause.
Compare:
- Bad: “When you’re trying to figure out why your page isn’t getting cited, one of the most important things to consider is the structure of your opening sentence.” (The real subject, “structure,” doesn’t arrive until word 23.)
- Good: “Atomic answers lead with the subject up front. The claim follows before any qualifier. Total length stays between 40 and 80 words.” (Subject at word 1. Claim early.)
Same idea. Only one gets cited.
Refactor, Don’t Rewrite
Most pages don’t need a rewrite. They need the first 60 words of each section reordered. Move the atomic answer to the top. Push the narrative warm-up down or cut it. Keep the brand voice in the supporting paragraphs.
This is the fastest measurable lift in the audit. Our guide on agentic content QA pipelines covers how to automate the atomic-answer check across hundreds of URLs once you’ve validated the pattern manually.
Pass 4: Use Schema to Reinforce Atomic Claims, Not Create Them
Schema is the most over-prescribed fix in GEO. Schema doesn’t create citability. It reinforces a claim that already exists in the body. If the atomic answer isn’t in the body, adding FAQPage schema on top of weak prose is unlikely to move citations.
That’s why schema is Pass 4, not Pass 1. There’s nothing to reinforce until Passes 2 and 3 are done.
FAQPage, HowTo, and Speakable Are the Three That Matter
Three types of structured data carry weight in AI citation contexts:
- FAQPage: each question’s answer text matches a sentence in the body verbatim. Per Google’s FAQPage guidelines, the questions and answers must be visible on the page in the same form.
- HowTo: numbered steps in the schema align with numbered steps in the body. Useful for procedural content where each step is its own atomic answer.
- Speakable: marks the specific sentence on the page that should be spoken aloud by voice assistants. In practice, it doubles as a “this is the atomic answer” signal for any extraction layer.
In August 2023, Google limited FAQ rich results to well-known, authoritative government and health sites, and the FAQ rich results deprecation finished out in 2026. The markup may still help as a citation reinforcement signal even where the visual rich result no longer appears.
Schema and Body Copy Need to Match
A FAQPage entry whose answer text doesn’t appear in the body is, in our experience, unlikely to do much work. Per Google’s FAQPage guidelines, the full text of each question and answer must be present on the page in the same form as the structured data. When they mismatch, the structured data risks being discounted or ignored.
The fix: write the atomic answer in the body first. Copy that exact sentence into the FAQPage schema. Don’t paraphrase. Don’t write a friendlier version for the schema. The schema-body match is the reinforcement.
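The schema-body match is simple to verify in code. The following sketch assumes the page’s FAQPage JSON-LD has already been extracted as a string and the visible body copy as plain text. `mismatched_answers` is a hypothetical helper, and the comparison is a verbatim substring check after collapsing whitespace.

```python
import json


def mismatched_answers(jsonld: str, body_text: str):
    """Return the questions whose answer text does not appear verbatim in the body."""
    data = json.loads(jsonld)
    normalized_body = " ".join(body_text.split())
    missing = []
    for item in data.get("mainEntity", []):
        answer = item.get("acceptedAnswer", {}).get("text", "")
        if " ".join(answer.split()) not in normalized_body:
            missing.append(item.get("name", ""))
    return missing


faq = json.dumps({
    "@context": "https://schema.org",
    "@type": "FAQPage",
    "mainEntity": [
        {"@type": "Question", "name": "What is GEO?",
         "acceptedAnswer": {"@type": "Answer",
                            "text": "GEO stands for generative engine optimization."}},
        {"@type": "Question", "name": "Is GEO the same as ranking?",
         "acceptedAnswer": {"@type": "Answer",
                            "text": "No, GEO and ranking are different things."}},
    ],
})
body = "GEO stands for generative engine optimization. It is not the same as ranking."

print(mismatched_answers(faq, body))  # ['Is GEO the same as ranking?']
```

The second entry is flagged because its answer was paraphrased for the schema instead of copied from the body.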
Schema-First Audits Don’t Move Citations
Teams who lead with schema fixes (adding FAQPage to every page, marking up everything with Article schema, generating HowTo blocks from templates) usually see no movement in citations. The schema confirms claims that aren’t strong enough to be cited on their own. You can’t reinforce a sentence that doesn’t exist yet.
The order matters: once the citation diff has told you what’s broken, fix entity coverage first, atomic answers second, and schema last. Reverse it and you’re decorating a structural problem.
Triage 500 URLs With a Citation Gap × Commercial Intent × Traffic Score
Most teams don’t have time to run four passes on every URL. The triage rule is simple: prioritize where the gap is biggest, the commercial value is highest, and the demand signal already exists.
A workable scoring framework:
| Factor | Weight | What to measure |
|---|---|---|
| Citation gap size | 40% | Queries where you rank top 10 but aren’t cited (from Pass 1) |
| Commercial intent | 35% | Transactional > comparison > informational > navigational |
| Existing traffic | 25% | Pages with organic clicks already (don’t waste passes on zero-demand URLs) |
Score each URL 1–10 on each factor. Multiply each score by its weight, then sum the three weighted scores into a single triage score. Sort descending. The top 20 URLs get the full four-pass audit this quarter. The next 50 get Pass 1 (citation diff) only. Half of those usually just need an atomic-answer refactor.
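For a worked example, the weighted score reduces to a few lines. The weights below come from the table above. The factor scores and URLs are made up for illustration, and `triage_score` and `rank_urls` are hypothetical helper names.

```python
WEIGHTS = {"citation_gap": 0.40, "commercial_intent": 0.35, "traffic": 0.25}


def triage_score(scores: dict) -> float:
    """Weighted sum of 1-10 factor scores, using the weights from the table."""
    return round(sum(scores[factor] * weight for factor, weight in WEIGHTS.items()), 2)


def rank_urls(url_scores: dict) -> list:
    """Return URLs sorted from highest to lowest triage score."""
    return sorted(url_scores, key=lambda url: triage_score(url_scores[url]), reverse=True)


url_scores = {
    "/pricing-guide": {"citation_gap": 8, "commercial_intent": 9, "traffic": 4},
    "/glossary": {"citation_gap": 2, "commercial_intent": 5, "traffic": 5},
}
for url in rank_urls(url_scores):
    print(url, triage_score(url_scores[url]))
```

Here `/pricing-guide` scores 8 × 0.40 + 9 × 0.35 + 4 × 0.25 = 7.35 and `/glossary` scores 3.8, so the pricing guide goes to the front of the audit queue.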
Audit Cadence: A Light Calendar Plus Triggers
Running the full audit on a fixed calendar isn’t useful. Keep the recurring checks light and rerun the deeper passes when a claim changes:
- Monthly: citation diffs on top 10 commercial URLs. Cheap to run, fast to act on.
- Quarterly: full entity coverage sweep on the top 50 URLs. Update internal anchors, fix implicit entity definitions, refresh schema-body matches.
- Claim-change-triggered: any time a factual claim changes (a price, a regulation, a benchmark, a product feature), rerun Pass 3 and Pass 4 on the affected URLs the same week.
Measuring Success When GSC Clicks Haven’t Moved Yet
Google Search Console doesn’t break out AI Overview impressions cleanly from standard search impressions. Clicks are a lagging indicator. You need leading indicators:
- Citation count by query class: how many of your tracked queries cite your URL in the AIO answer. Tools like Profound, Otterly.ai, and BrightEdge’s Generative Parser track this at scale.
- AIO impression share: percentage of tracked queries where your URL appears in the AIO source set (cited or not).
- Branded prompt mentions: how often ChatGPT, Perplexity, and Gemini reference your brand by name when asked about your category.
- Bot-hit-to-citation ratio: GPTBot/PerplexityBot hits on a URL divided by citations of that URL. A rising ratio means selection failure is getting worse.
In our experience, clicks tend to move well after these leading signals shift — often a quarter or more later. Track the leading indicators or you’ll quit the audit before it pays.
Frequently Asked Questions
Why does Google AI Overview scrape my page but cite a competitor or Reddit thread?
Your page got retrieved but lost the selection step. The model read it and picked a cleaner passage somewhere else. Reddit threads often win because forum answers are written as direct subject-predicate sentences with the claim up front and no marketing setup. Run a citation diff (Pass 1) to confirm your URL is in the source list, then refactor the first 60 words of the relevant section into an atomic answer that leads with the subject and states the claim before any qualifier.
How long should an atomic answer be, and where on the page does it need to live?
In our audits, atomic answers that get cited usually run 40–80 words and start within the first 60 words of each H2 section. The subject should appear in a noun phrase near the start of the passage, and a definitional or numeric claim should land before any qualifier. Passages longer than ~110 words tend to get truncated or skipped. Passages under 25 words rarely carry enough claim density to be selected. Treat these as operating ranges from our own diffs, not universal rules.
What schema actually reinforces AI citations vs. just being SEO theater?
FAQPage, HowTo, and Speakable schema reinforce citations when the marked-up text matches the body copy. Per Google’s structured data guidelines, the questions and answers in FAQPage markup must appear visibly on the page in the same form — when they don’t, the structured data can be discounted or ignored. Schema confirms a claim that already exists in the page. It doesn’t create one. Write the atomic answer in the body first, then copy that exact sentence into the structured data.
Should I rewrite the page or just refactor the first 60 words of each section?
In most cases, refactor. The page already has the right information. It just has the warm-up before the answer. Move the atomic answer to the top of each section, push the narrative context below it, and leave the rest of the body alone. Full rewrites are only justified when the entity coverage is broken (Pass 2 failure) or the page is factually outdated. Refactoring preserves brand voice and existing internal links while fixing the extractability problem.
How do I measure GEO success when AIO impressions don’t show in GSC?
Track citation count by query class, AIO impression share, and bot-hit-to-citation ratio as leading indicators. Tools like Profound, Otterly.ai, and BrightEdge’s Generative Parser monitor citations across Google AI Overviews, Perplexity, and ChatGPT search at scale. Clicks in GSC tend to move well after these leading signals shift. Don’t wait for click data to validate the audit. You’ll quit before the lift shows up.
What’s the difference between a retrieval failure and a selection failure?
Retrieval failure means your URL never appears in the AI Overview’s source list at all. The model didn’t consider you. Selection failure means your URL is in the source list, but a competitor’s passage is the one quoted in the answer. The two require opposite fixes. Retrieval failures need entity coverage and crawlability work. Selection failures need atomic-answer refactoring. The citation diff in Pass 1 is the only diagnostic that tells you which one you’re solving.
Talk to Elevarus Before You Rewrite Anything
Most teams burn rewrite budget on pages that needed a 60-word refactor. The four-pass audit exists to prevent that, and the first pass, the citation diff, usually takes 15 minutes per query.
If you want a second set of eyes on which of your pages are retrieval failures, which are selection failures, and which just need atomic-answer refactoring before any schema work, book a free consultation. We’ll walk a citation diff on your highest-intent URLs with you and show which pass to run first.