# What Makes Good Markdown for LLMs? A Guide to Document Chunking
Every LLM application that ingests documents faces the same bottleneck: the quality of the text you put in determines the quality of the answers you get out. Most teams focus on embedding models, vector databases, and prompt engineering. They should be focusing on the input format.
Markdown is the best format for feeding documents to LLMs. But not all markdown is created equal. The difference between good markdown and bad markdown is the difference between a RAG pipeline that works and one that hallucinates.
## Why markdown beats plain text and HTML
Three formats dominate LLM input pipelines. Here is how they compare:
| Format | Structure | Token efficiency | LLM comprehension |
|---|---|---|---|
| Plain text | None | High | Poor — no context for what’s a heading, table, or list |
| HTML | Full | Low — tags consume 30-50% of tokens | Good, but wasteful |
| Markdown | Semantic | High | Best — models are trained on massive amounts of markdown |
Plain text strips all structure. A heading looks the same as a paragraph. A table becomes whitespace-separated numbers. The LLM has to guess what’s what, and it guesses wrong often enough to matter.
HTML preserves structure but wastes tokens. A simple heading like `<h2 class="section-title" id="revenue">Revenue</h2>` costs 15+ tokens. The same heading in markdown, `## Revenue`, costs 2. When you're working within context windows of 8k-128k tokens, that overhead adds up fast. HTML also drags along CSS classes, data attributes, and nested divs that carry zero semantic value for the model.
Markdown hits the sweet spot. It preserves the structural cues that LLMs need — heading hierarchy, table relationships, list ordering, emphasis — while staying token-efficient and noise-free. Large language models have seen enormous volumes of markdown during training (GitHub READMEs, documentation sites, wikis), so they parse it natively.
## The four properties of LLM-ready markdown
Not all markdown conversion is equal. Here is what separates markdown that works well in LLM pipelines from markdown that doesn’t.
### 1. Heading hierarchy
Clean heading hierarchy (`#`, `##`, `###`) is the single most important property for downstream chunking. Headings serve as semantic boundaries — they tell a chunker where one topic ends and another begins.
Bad markdown flattens everything to a single level or omits headings entirely:
```markdown
Revenue Summary

The company generated $10M in Q3...

Regional Breakdown

North America accounted for 62%...
```
Good markdown preserves the document’s heading structure:
```markdown
## Revenue Summary

The company generated $10M in Q3...

### Regional Breakdown

North America accounted for 62%...
```
The difference matters because heading-based chunking (covered below) relies on these markers to produce semantically coherent chunks. Without them, you fall back to token-count splitting, which cuts through the middle of paragraphs and tables.
### 2. Table formatting
Tables are where most extractors fail catastrophically. A financial table rendered as flat text is nearly useless:
```text
Segment Revenue Pct Enterprise 4200000 42 Cloud 3100000 31
```
The same data in a proper markdown table preserves the column-row relationships that an LLM needs to answer questions like “What percentage of revenue came from Cloud?”:
```markdown
| Segment    | Revenue    | %  |
|------------|------------|----|
| Enterprise | $4,200,000 | 42 |
| Cloud      | $3,100,000 | 31 |
```
When this table lands in an LLM context window, the model can read it as structured data. When the flat-text version lands, the model has to guess which numbers belong to which labels.
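The same property that helps the model helps your own tooling: a pipe table can be parsed mechanically into records. A minimal sketch (the parser below is illustrative, not a full markdown table parser — it assumes a well-formed table with one separator row):

```python
def parse_markdown_table(table: str) -> list[dict]:
    """Parse a pipe-delimited markdown table into a list of row dicts."""
    lines = [l.strip() for l in table.strip().splitlines()]
    # Split each row on pipes, dropping the empty edge cells
    rows = [[c.strip() for c in l.strip('|').split('|')] for l in lines]
    header = rows[0]
    # rows[1] is the |---|---| separator line; data starts at index 2
    return [dict(zip(header, r)) for r in rows[2:]]

table = """
| Segment    | Revenue    | %  |
|------------|------------|----|
| Enterprise | $4,200,000 | 42 |
| Cloud      | $3,100,000 | 31 |
"""
records = parse_markdown_table(table)
print(records[1])  # → {'Segment': 'Cloud', 'Revenue': '$3,100,000', '%': '31'}
```

Flat-text extraction destroys exactly the column-row mapping this function relies on.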
### 3. List structure
Ordered and unordered lists carry implicit relationships — sequence, hierarchy, grouping. Markdown preserves these:
```markdown
The application requires:

1. Python 3.10 or higher
2. A valid API key
3. At least 4GB of RAM

Optional dependencies:

- Redis (for caching)
- PostgreSQL (for persistent storage)
```
Flat text extraction would merge these into a single paragraph, losing the distinction between required and optional items, and between ordered steps and unordered options.
### 4. Clean section separation
Good markdown has consistent blank lines between sections, no trailing whitespace noise, and no artifacts from PDF rendering (repeated headers, page numbers inline, watermark text). Every piece of non-content text that leaks into the markdown becomes noise in your embeddings.
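If your extractor leaks such artifacts, a cleanup pass before chunking helps. A minimal sketch — the patterns below are illustrative assumptions, and you should tune them to the noise your own documents actually contain:

```python
import re

def strip_pdf_artifacts(markdown: str) -> str:
    """Remove common PDF-extraction noise before chunking."""
    cleaned = []
    for line in markdown.splitlines():
        # Drop standalone page numbers ("12", "Page 12 of 40")
        if re.fullmatch(r'(Page\s+)?\d+(\s+of\s+\d+)?', line.strip()):
            continue
        # Drop trailing whitespace noise
        cleaned.append(line.rstrip())
    text = '\n'.join(cleaned)
    # Collapse runs of blank lines into a single section break
    return re.sub(r'\n{3,}', '\n\n', text)
```

Repeated running headers and watermark text need document-specific rules, but the same line-by-line filter structure applies.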
## Chunking strategies for markdown
Once you have clean markdown, you need to split it into chunks for embedding or context injection. There are four common strategies, each with different trade-offs.
### Heading-based chunking
Split the document at heading boundaries. Each chunk starts with a heading and contains everything until the next heading of equal or higher level.
Best for: Structured documents with consistent heading hierarchy — reports, documentation, manuals, academic papers.
Why it works: Each chunk is a semantically coherent unit. A chunk titled `## Q3 Revenue` will embed close to revenue-related queries. A chunk titled `## Risk Factors` will embed close to risk-related queries. The heading itself acts as a natural label for the chunk's content.
Limitation: Requires the markdown to actually have headings. If your PDF extractor flattened the heading hierarchy, this strategy produces a single giant chunk.
Here is a Python implementation:
```python
import re
from dataclasses import dataclass


@dataclass
class MarkdownChunk:
    heading: str
    level: int
    content: str


def chunk_by_headings(
    markdown: str,
    max_heading_level: int = 2,
) -> list[MarkdownChunk]:
    """Split markdown into chunks at heading boundaries.

    Args:
        markdown: The markdown string to chunk.
        max_heading_level: Only split on headings up to this level.
            1 = split on # only, 2 = split on # and ##, etc.
    """
    pattern = r'^(#{1,' + str(max_heading_level) + r'})\s+(.+)$'
    lines = markdown.split('\n')
    chunks: list[MarkdownChunk] = []
    current_heading = ""
    current_level = 0
    current_lines: list[str] = []

    for line in lines:
        match = re.match(pattern, line)
        if match:
            # Save the previous chunk
            if current_lines:
                content = '\n'.join(current_lines).strip()
                if content:
                    chunks.append(MarkdownChunk(
                        heading=current_heading,
                        level=current_level,
                        content=content,
                    ))
            current_heading = match.group(2)
            current_level = len(match.group(1))
            current_lines = [line]
        else:
            current_lines.append(line)

    # Don't forget the last chunk
    if current_lines:
        content = '\n'.join(current_lines).strip()
        if content:
            chunks.append(MarkdownChunk(
                heading=current_heading,
                level=current_level,
                content=content,
            ))
    return chunks
```
Usage:
```python
from pdftomarkdown import convert

result = convert("annual_report.pdf", api_key="your-key")
chunks = chunk_by_headings(result.markdown, max_heading_level=2)

for chunk in chunks:
    print(f"[H{chunk.level}] {chunk.heading}")
    print(f"  {len(chunk.content)} chars")
```
Each chunk carries its heading as metadata, which you can store alongside the embedding vector for better retrieval filtering.
### Page-based chunking
Split the document at page boundaries. One page = one chunk.
Best for: Quick prototyping, documents where pages are self-contained (slide decks, forms), or when you need to reference page numbers in citations.
Why it works: It’s simple and deterministic. No parsing logic required — just split on page markers.
Limitation: Pages are arbitrary boundaries. A section that spans pages 3-4 gets split into two chunks, losing coherence. Tables that break across pages get severed.
```python
# pdfToMarkdown returns per-page results
from pdftomarkdown import convert

result = convert("report.pdf", api_key="your-key")

# Each page is already a separate chunk;
# embed_and_store is your own embedding + vector-store write
for i, page in enumerate(result.pages):
    embed_and_store(
        text=page.markdown,
        metadata={"page": i + 1},
    )
```
### Token-count splitting with overlap
Split the text into chunks of a fixed token count, with an overlap window between consecutive chunks to preserve context at boundaries.
Best for: Unstructured or inconsistently formatted documents where heading-based chunking isn’t viable.
Why it works: Guarantees every chunk fits within your embedding model’s token limit. The overlap ensures that sentences split at a boundary still appear in full in at least one chunk.
Limitation: Chunks are semantically arbitrary. A 512-token chunk might contain the tail of one section and the start of another. Overlap increases total storage and embedding cost.
```python
import tiktoken


def chunk_by_tokens(
    text: str,
    chunk_size: int = 512,
    overlap: int = 64,
    model: str = "text-embedding-3-small",
) -> list[str]:
    """Split text into overlapping chunks by token count."""
    encoder = tiktoken.encoding_for_model(model)
    tokens = encoder.encode(text)
    chunks = []
    start = 0
    while start < len(tokens):
        end = start + chunk_size
        chunk_tokens = tokens[start:end]
        chunks.append(encoder.decode(chunk_tokens))
        # overlap must stay smaller than chunk_size,
        # or the window never advances
        start += chunk_size - overlap
    return chunks
```
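The extra cost from overlap is easy to quantify: each new chunk advances only `chunk_size - overlap` tokens through the document, but a full `chunk_size` tokens get embedded and stored.

```python
# With a 512-token chunk and 64-token overlap, the effective stride
# is 448 tokens, so about 14% more tokens are embedded and stored
# than the document actually contains.
chunk_size, overlap = 512, 64
stride = chunk_size - overlap
overhead = chunk_size / stride - 1
print(f"stride: {stride} tokens, overhead: {overhead:.0%}")
```

Larger overlaps buy more boundary context at a linear increase in embedding and storage cost.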
### Recursive splitting
Start with the largest structural separators (headings), then fall back to smaller ones (paragraphs, sentences) if any chunk exceeds a size limit.
Best for: Production RAG systems that need to handle diverse document types with a single pipeline.
Why it works: It tries heading-based chunking first (best quality), but has fallback logic for documents where a single section is too large. You get semantic coherence when possible, size guarantees always.
This is what LangChain’s RecursiveCharacterTextSplitter does internally, and it works well with markdown input:
```python
from langchain.text_splitter import RecursiveCharacterTextSplitter

splitter = RecursiveCharacterTextSplitter(
    chunk_size=1000,
    chunk_overlap=100,
    separators=[
        "\n## ",   # Try heading level 2 first
        "\n### ",  # Then heading level 3
        "\n\n",    # Then paragraphs
        "\n",      # Then lines
        " ",       # Then words
    ],
)
chunks = splitter.split_text(markdown)
```
The separator list is ordered from most to least semantically meaningful. The splitter tries the first separator and falls back to the next one only when a chunk would still exceed `chunk_size`.
## How chunking strategy affects retrieval
The choice of chunking strategy directly impacts retrieval quality. Here is a concrete example.
Given a 40-page annual report, a user asks: “What were the key risk factors?”
| Strategy | What gets retrieved | Quality |
|---|---|---|
| Heading-based | The chunk starting with `## Risk Factors` — exactly the right section | High |
| Page-based | Pages 12-13, which happen to contain the risk section plus unrelated footnotes | Medium |
| Token-count (512) | 3-4 chunks that partially overlap with the risk section, mixed with adjacent content | Low-Medium |
| Recursive | The risk section as a coherent chunk, split into sub-sections if it exceeds the size limit | High |
Heading-based and recursive chunking outperform the others because they produce chunks aligned with the document’s semantic structure. But they require markdown with intact heading hierarchy to work. If your PDF extractor strips headings, you’re forced into token-count splitting.
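Because of this, a production pipeline should check whether headings actually survived extraction before committing to heading-based chunking. A pragmatic sketch (the threshold of three headings is an arbitrary assumption, not a standard):

```python
import re

def has_usable_headings(markdown: str, min_count: int = 3) -> bool:
    """Heuristic: does this markdown contain enough headings
    to make heading-based chunking worthwhile?"""
    headings = re.findall(r'^#{1,3}\s+\S', markdown, flags=re.MULTILINE)
    return len(headings) >= min_count

doc = "## Intro\ntext\n## Methods\ntext\n### Details\ntext"
print(has_usable_headings(doc))  # → True
```

When the check fails, fall back to token-count splitting rather than producing one giant chunk.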
## Why extraction quality determines chunking quality
This is the core point: your chunking strategy is only as good as the markdown it operates on.
If your PDF-to-text pipeline flattens headings, heading-based chunking is impossible. If it destroys tables, your table data embeds as noise. If it merges multi-column layouts, your chunks contain interleaved text from unrelated sections.
pdfToMarkdown is built specifically to produce markdown that works well with these chunking strategies. The vision-language model pipeline preserves:
- Heading hierarchy (`#` through `######`) — enabling heading-based and recursive chunking
- Table structure (pipe-delimited markdown tables) — keeping relational data intact within chunks
- List formatting (ordered and unordered) — preserving sequence and grouping
- Clean section breaks — giving chunkers unambiguous split points
The output is designed to be chunked, embedded, and retrieved. Not just read.
## Putting it together: from PDF to chunks
Here is the full pipeline — PDF to structured markdown to heading-based chunks, ready for embedding:
```python
import os
import re

from pdftomarkdown import convert


def pdf_to_chunks(pdf_path: str) -> list[dict]:
    """Convert a PDF to semantically chunked markdown."""
    result = convert(pdf_path, api_key=os.environ["PDFTOMARKDOWN_API_KEY"])
    chunks = []
    current = {"heading": "", "content": ""}

    for line in result.markdown.split("\n"):
        if re.match(r'^#{1,2}\s+', line):
            if current["content"].strip():
                chunks.append(current)
            current = {
                "heading": line.lstrip("#").strip(),
                "content": line + "\n",
            }
        else:
            current["content"] += line + "\n"

    if current["content"].strip():
        chunks.append(current)
    return chunks


# Usage
chunks = pdf_to_chunks("research_paper.pdf")
for chunk in chunks:
    print(f"{chunk['heading']}: {len(chunk['content'])} chars")
```
For a deeper walkthrough of building a complete RAG pipeline with pdfToMarkdown — including embedding, vector storage, and retrieval — see the PDF to Markdown for RAG Pipelines guide.
## Get started
The pdfToMarkdown API returns markdown with clean heading hierarchy, preserved tables, and consistent formatting out of the box. No post-processing, no custom parsing rules.
Test it on your own documents — the Hacker tier is free with no signup:
```bash
curl -X POST https://pdftomarkdown.dev/v1/convert \
  -H "Authorization: Bearer demo_public_key" \
  -H "Content-Type: application/json" \
  -d '{"input":{"pdf_url":"https://www.w3.org/WAI/ER/tests/xhtml/testfiles/resources/pdf/dummy.pdf"}}'
```
Sign in with GitHub for 100 pages/month, no credit card required.