Content chunking shows up in almost every GEO and AEO checklist. It means breaking long content into clear, self-contained sections. Each section covers one point, uses a specific heading, and is formatted so a reader can scan it fast.
AI search experiences use passage-level retrieval. They don’t need the full page. They need the one section that answers a question, and well-structured pages are said to make that section easier to find and reuse, which would make sense for both a reader and a crawler.
Now, marketers connect content chunking to citations in ChatGPT, Perplexity, and Google AI Overviews. The working theory is that clearer sections increase the chances the right passage gets pulled, and that increases the chances your page shows up as a source.
What I wanted to know was how big that effect is when you control the setup. I wasn’t trying to make a big claim about AI overall. I wanted to measure one specific thing. Does chunking change whether the right passage gets retrieved, and where it ranks?
I ran three experiments on real long-form articles to see whether chunking and structure changed how often the correct passage was retrieved, and how high it ranked once it showed up.
The goal of the experiment was not to prove that chunking always works or doesn’t work. It was to understand when chunking helps, what kind of chunking helps, and where the popular advice breaks down when tested against real, long-form content.
Why I Ran This Experiment
GEO/AEO and AI in general have changed how people write. Marketers are reformatting long pages into chunks, even when that isn’t their natural style, because they think it will help AI systems pull the right passage and cite them.
The problem is that there’s minimal data out there showing chunking improves retrieval. There’s plenty of advice, but not many controlled tests that quantify the impact.
So I ran an experiment to test it. If chunking really improves citation odds, it should show up first in retrieval performance.
What I Tested (And What I Didn’t)
What I tested was simple. I took the same articles and changed how they were formatted, then measured whether a retrieval system could still find the right passage when asked a question.
What I didn’t test was ‘ranking in ChatGPT,’ or whether ChatGPT would show a citation link. How these products retrieve and select sources isn’t public, so it can only be simulated, which is what I’ve done in this experiment.
This experiment focuses on retrieval. Retrieval is the step where an LLM searches its index and pulls back the text it might use to answer. If the right passage doesn’t get retrieved, it can’t be used, and it can’t be cited.
Data and Setup
For these tests, I used 165 long-form articles pulled from about 29 domains within marketing, SaaS, ecommerce, and tech. The median article length was 1,792 words, long enough that key points can easily be missed when scanning.
Most chunking advice is aimed at this exact situation. It’s said that when an article is long, the ‘answer’ is often buried somewhere in the middle or near the end, and AI will struggle to retrieve it unless the page is broken into clearer sections.
To run the retrieval tests, I used a common baseline setup called MiniLM embeddings; embeddings turn text into a numeric representation of meaning, so the system can match a question to the most relevant passage even if the wording isn’t identical.
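If you want to see roughly what that looks like in practice, here’s a minimal sketch using the sentence-transformers library and the widely used all-MiniLM-L6-v2 checkpoint. The exact model variant in my setup may differ, and the passages and query below are just illustrative:

```python
from sentence_transformers import SentenceTransformer, util

# MiniLM-style sentence embedding model (assumed checkpoint; small and fast)
model = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")

passages = [
    "Content chunking means breaking long content into clear, self-contained sections.",
    "Each section covers one point and sits under a specific heading.",
]
query = "What does content chunking mean?"

# Embeddings turn text into vectors; cosine similarity scores how well a
# question matches each passage, even when the wording differs.
passage_vecs = model.encode(passages, normalize_embeddings=True)
query_vec = model.encode(query, normalize_embeddings=True)

scores = util.cos_sim(query_vec, passage_vecs)[0]
best = int(scores.argmax())
print(passages[best], float(scores[best]))
```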
To simulate citation likelihood, I did three things:
- Generated questions using sentences from the articles themselves, so I already knew what the correct supporting passage should be.
- Ran each question through the retrieval system and checked whether the correct passage appeared in the top five results.
- If the correct passage was in that top five, I treated it as ‘retrievable.’ In a real LLM scenario, that’s the minimum condition needed for the content to be available to cite.
The Metrics Used
I used two standard retrieval metrics:
Passage-hit@5
This answers one question: Did the system find the correct part of the article in its top five results?
If the correct passage doesn’t show up in the top five, it’s unlikely an LLM would use it. So this works as a practical proxy for ‘could this be cited at all.’
MRR@5
This answers a slightly different question: If the correct passage is in the top five, how high did it rank?
Higher rank usually means the system has more confidence that the passage is the best match, and it’s more likely to be selected when an answer is generated.
I use both because they capture different failure points. One tells you whether the content is findable. The other tells you whether it’s findable quickly and reliably compared to competing passages.
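For readers who want the mechanics, here’s a small, self-contained sketch of how both metrics can be computed from ranked retrieval results. The data and names are illustrative, not the exact evaluation code I ran:

```python
# Toy data: each query maps to a ranked list of retrieved passage ids,
# and `gold` holds the id of the passage that actually contains the answer.
results = {"q1": ["p3", "p7", "p1"], "q2": ["p9", "p2", "p5"]}
gold = {"q1": "p7", "q2": "p4"}

def passage_hit_at_5(ranked_ids, gold_id):
    # 1 if the correct passage appears anywhere in the top five, else 0
    return 1.0 if gold_id in ranked_ids[:5] else 0.0

def mrr_at_5(ranked_ids, gold_id):
    # reciprocal of the rank where the correct passage first appears (top five only)
    for rank, pid in enumerate(ranked_ids[:5], start=1):
        if pid == gold_id:
            return 1.0 / rank
    return 0.0

n = len(results)
hit = sum(passage_hit_at_5(results[q], gold[q]) for q in results) / n
mrr = sum(mrr_at_5(results[q], gold[q]) for q in results) / n
print(f"Passage-hit@5: {hit:.2f}  MRR@5: {mrr:.2f}")
```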
Experiment 1: Chunking vs Full Articles
In this first experiment, I tested whether chunking an article improves retrieval at a baseline level.
Each article was indexed in four different ways:
- Full article (naive): the page treated as a single retrieval unit, exactly as-is
- Full article (proper full): still treated as a single unit, but represented so the entire article is actually included, not just the part that fits in the model’s input window
- Fixed-size chunks: fixed-length chunks with no awareness of paragraph boundaries
- Structure-aware chunks: chunks aligned to paragraphs to preserve more context
The content itself did not change. Only the way it was split and indexed.
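To make the chunked conditions concrete, here’s a simplified sketch of what the two splitting strategies do. The sizes, overlap, and splitting rules are illustrative rather than the exact parameters I used:

```python
def fixed_size_chunks(text, size=200, overlap=40):
    # split on a fixed word count, ignoring paragraph boundaries entirely
    words = text.split()
    step = size - overlap
    return [" ".join(words[i:i + size]) for i in range(0, len(words), step)]

def structure_aware_chunks(text, max_words=250):
    # keep paragraphs intact, packing consecutive paragraphs into one chunk
    paragraphs = [p.strip() for p in text.split("\n\n") if p.strip()]
    chunks, current = [], []
    for para in paragraphs:
        current_len = sum(len(c.split()) for c in current)
        if current and current_len + len(para.split()) > max_words:
            chunks.append("\n\n".join(current))
            current = []
        current.append(para)
    if current:
        chunks.append("\n\n".join(current))
    return chunks
```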
The reason I added the second full-article condition is simple: MiniLM can’t see an entire long article in one go. If you embed a long page as a single block, it effectively uses only the beginning and ignores the rest. That makes the full-article baseline look worse than it should.
So the real question here was:
Does chunking improve retrieval compared to a full-article baseline that actually includes the whole page?
What the data shows

Note: The chart uses a compressed y-axis range to make differences visible. All metrics range from 0 to 1.
On the primary metric, Passage-hit@5, the results were:
- Full article (naive): 0.654
- Full article (proper full): 0.882
- Fixed-size chunks: 0.930
- Structure-aware chunks: 0.654
The naive full article result underperformed because it was effectively only indexing the start of each page.
Once the full article was represented properly, performance jumped.
But chunking still came out on top.
Fixed-size chunks found the correct passage in the top five results 93% of the time, compared to 88% for properly handled full articles.
The same pattern showed up in MRR@5, which measures how highly the correct passage ranked once it was retrieved.
- Full article (naive): 0.506
- Full article (proper full): 0.724
- Fixed-size chunks: 0.862
- Structure-aware chunks: 0.506
So chunking didn’t just help the system find something from the right article. It helped the right passage show up higher.
What it means for content chunking
When full articles are handled properly, they perform well. Long-form content is absolutely retrievable without being split. But chunking did provide a measurable advantage in this setup.
Fixed-size chunks (a technical method of splitting text) beat properly represented full articles on both metrics. In this setup, smaller, distinct segments were easier for the retrieval system to match than one large block.
While writers can’t write in ‘fixed tokens,’ this validates the core concept: breaking content down helps the machine. The gap isn’t massive, but it’s consistent.
The real takeaway is that a lot of chunking advice is based on shaky comparisons where the ‘full article’ isn’t actually the full article. If you don’t account for that, you can end up drawing the wrong conclusion.
So in this test:
- Full articles work fine. Chunking isn’t required just to be retrievable.
- But chunking can make retrieval slightly more accurate, especially for pulling the right passage near the top.
Experiment 2: Same Content, Different Formatting
In the second experiment, I kept the content itself the same and changed only how it was formatted.
Each article was recreated in three versions:
- Dense prose: headings removed and paragraphs merged into long blocks of text
- Structured content: headings and paragraphs preserved as they appeared in the original article
- Q&A format: headings converted into question and answer pairs, with short answers under each question
This experiment was designed to separate chunking from structure. All three formats contained the same information. The only difference was how that information was organized.
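To give a sense of what those transformations look like when automated, here’s a rough sketch that assumes a markdown-style source with '#' headings. The structured version is simply the original text, and my actual transformation rules may have differed:

```python
def dense_prose(article_md):
    # strip headings and merge all remaining text into one block
    lines = [l for l in article_md.splitlines() if not l.lstrip().startswith("#")]
    return " ".join(" ".join(lines).split())

def _short_answer(lines, limit=300):
    # normalize whitespace and keep only the first ~300 characters as the "answer"
    return " ".join(" ".join(lines).split())[:limit]

def qa_format(article_md):
    # turn each heading into a question and keep only a short answer beneath it
    blocks, heading, body = [], None, []
    for line in article_md.splitlines():
        if line.lstrip().startswith("#"):
            if heading is not None:
                blocks.append(f"Q: {heading.lstrip('# ')}?\nA: {_short_answer(body)}")
            heading, body = line, []
        else:
            body.append(line)
    if heading is not None:
        blocks.append(f"Q: {heading.lstrip('# ')}?\nA: {_short_answer(body)}")
    return "\n\n".join(blocks)
```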
What the data shows

Note: The chart uses a 0–0.90 y-axis range to make differences visible. The metrics themselves range from 0 to 1.
This experiment used needle queries, meaning the questions were tightly matched to specific sentences in the articles.
On Passage-hit@5, structured formatting performed best by a clear margin.
- Structured content: 0.80
- Dense prose: 0.68
- Q&A format: 0.12
Structured articles were retrieved correctly 80% of the time, compared to 68% for dense prose. The Q&A format collapsed to 12%, meaning the correct passage was rarely found.
The ranking metric MRR@5 showed the same pattern.
- Structured content: 0.704
- Dense prose: 0.531
- Q&A format: 0.098
Here the gap is larger. When structured content was retrieved, it ranked much higher than dense prose. The Q&A format not only failed to retrieve most passages; when it did retrieve them, they ranked poorly.
What it means for content chunking
This experiment shows that structure has a large impact on retrieval, even when the content itself doesn’t change.
Preserving headings and paragraph boundaries made it easier for the system to understand where ideas began and ended. Dense prose removed those signals and performed worse as a result.
The Q&A format performed much worse than both. Breaking content into very short answers stripped away the surrounding context that helps a retrieval system understand meaning. Short answers looked clean, but they carried less semantic information.
This is where the common ‘Q&A works best for AI’ idea breaks down. In this dataset, over-chunking reduced retrieval instead of improving it.
Experiment 3: Needle-in-a-Haystack – Buried Answers
The third experiment targeted what chunking is meant to solve.
Instead of drawing questions from anywhere in the article, I generated all queries from the last 20% of each article. This is the part of a long page where important details are often buried and assumed to be hardest to pull.
These were intended to simulate ‘needle-in-a-haystack’ queries. The information exists, but it sits deep enough in the article that a retrieval system might struggle to find it.
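Mechanically, drawing queries from the tail of an article can be as simple as something like this. The 20% cutoff matches the experiment; the sentence splitting is simplified:

```python
import re

def tail_sentences(article_text, tail_fraction=0.20):
    # naive sentence split, then keep only the last 20% as candidate "needle" sources
    sentences = re.split(r"(?<=[.!?])\s+", article_text.strip())
    cutoff = int(len(sentences) * (1 - tail_fraction))
    return sentences[cutoff:]
```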
What the data shows

Note: The chart uses a 0–0.90 y-axis range to make differences visible. The metrics themselves range from 0 to 1.
In this scenario, structured content maintained its advantage, but the gap narrowed compared to the general-query experiment.
On Passage-hit@5, structured content again performed best.
- Structured content: 0.843
- Dense prose: 0.764
- Q&A format: 0.079
However, the difference between structured and dense prose was smaller than before. In the general query test, the gap was 12 percentage points. Here, it dropped to roughly 8 percentage points.
Dense prose performed better than expected.
Position alone did not make these queries harder. In many articles, the final sections contain clearer summary language and more distinctive phrasing. That kind of wording is often easier for retrieval systems to match, even if it appears near the bottom of the page.
The ranking results (MRR@5) showed the same pattern.
- Structured content: 0.735
- Dense prose: 0.590
- Q&A format: 0.058
Structured content still ranked correct passages higher once found. But again, dense prose closed some of the gap compared to the earlier experiment.
When answers were drawn from the final 20% of the article, structure retained its lead, but it didn’t become more dominant. The advantage remained, rather than expanding.
What it means for content chunking
This experiment shows where chunking and structure help, and also where the assumptions break.
When information is easy to match, long articles can perform well without being split apart. In this test, deep queries were not automatically harder just because they came from the last 20% of the page.
In many articles, that section is written as a conclusion or summary, which often uses clearer, more distinctive phrasing. That can actually make retrieval easier, even if the content is buried.
Structure still helped. It stayed ahead.
But the gap did not widen in the deep-query scenario. Dense prose improved more than expected, which narrowed the difference compared to the general-query experiment.
So the takeaway isn’t ‘chunking only matters when content is deep.’ It’s more specific than that:
Chunking and structure help most when they make the target passage more distinct and easier to isolate, not simply because the passage appears later on the page.
What the Experiments Show About Content Chunking
Chunking is not a general AI visibility or ranking shortcut.
In the baseline test, full articles performed well once the entire page was actually indexed. However, fixed-size chunking still produced higher retrieval and ranking scores. Simply breaking a page into smaller pieces wasn’t magic, but it did provide a noticeable lift in this setup.
Formatting and structure consistently improved retrieval.
Across both needle queries and deep queries, content with clear headings and paragraph boundaries outperformed dense blocks of text. Structured formatting increased both how often passages were retrieved and how highly they ranked once retrieved. The gap narrowed in the deep-query test, but structure maintained its lead.
Over-chunking reduced retrieval performance in this dataset.
The Q&A format performed poorly in every experiment. The most likely explanation is loss of context. Very short answers carry less semantic information, which makes it harder for a retrieval system to match them reliably to a question. This doesn’t mean Q&A never works, but it didn’t work well under these conditions.
The benefit of chunking didn’t depend only on how deep the answer was in the article.
Queries drawn from the final 20% of content were not necessarily harder to retrieve. In many cases, conclusion sections contain clearer, more distinctive phrasing, which can actually make retrieval easier. Structure continued to outperform dense prose in that scenario, but the advantage did not widen. Deep position alone wasn’t the deciding factor.
These findings don’t suggest that chunking is required or that more chunking is better.
They suggest that structure improves retrieval reliability, and that chunking provides a consistent but incremental advantage over full-article representations in this type of retrieval setup.
Limits and What This Doesn’t Prove
These experiments answer a specific question about retrieval, but they don’t cover every factor that affects whether an LLM will cite a source.
One embedding model. I used a single embedding model (MiniLM) as a common baseline. Other models may behave differently, especially larger or newer ones. The direction of the results may hold, but the exact scores might not.
A retrieval simulation, not product behavior. This work tests the retrieval layer that sits underneath many RAG systems. It does not test ChatGPT, Perplexity, or Google AI Overviews directly, and it does not measure how those products choose to display citations. Retrieval is a prerequisite for citation, but it is not the only step.
Q&A formatting was automated. The Q&A versions were generated by transformation rules, not written by an editor. Poor Q&A structure could exaggerate the downside. The results still show that aggressive Q&A chunking can fail under realistic automation, which is how many teams would implement it at scale.
Authority signals were not included. I did not test factors like date, brand strength, backlink profiles, or topical authority. Those signals can influence whether a system indexes content, trusts it, or selects it when multiple sources say similar things. This experiment isolates formatting and chunking, not the full set of ranking and selection inputs.
Retrieval vs. Active Browsing. This experiment measures how a system finds a passage within a large index (Retrieval-Augmented Generation). It doesn’t specifically test ‘browsing’ mode, where an AI tool like ChatGPT or Perplexity fetches a single URL and reads the entire page.
However, the two are linked. Even when a model has a full page in its ‘context window,’ it still faces the ‘lost in the middle’ problem: the tendency to overlook information buried in dense, unstructured text. The structure that helps a system find the content in an index likely helps a browsing model focus on the right section of a page.
What This Means for Content Marketing
In this dataset, the biggest difference wasn’t simply ‘chunked vs not chunked.’ It was clear structure vs messy structure.
When I kept headings and normal paragraph breaks, the retrieval system did a better job finding the right passage. When I removed that structure and turned the same content into a wall of text, results dropped. When I forced everything into short Q&A blocks, results dropped.
So for content chunking:
Chunking provided a consistent lift over full-article representations in this experiment. But the bigger driver was clarity and structure. Deep placement alone didn’t determine performance, and buried answers were not automatically harder to retrieve.
The data suggests a hybrid approach is best: write with the structure of a chunked article (clear headings, distinct sections), but you don’t need to aggressively break content into tiny pieces. In these tests, structure was more important than cutting the page into smaller pieces.
The point is to make information easier to locate on the page. In these experiments, clear headings and well-defined sections did more work than aggressive reformatting.


