pdfToMarkdown team

What Makes Good Markdown for LLMs? A Guide to Document Chunking


Every LLM application that ingests documents faces the same bottleneck: the quality of the text you put in determines the quality of the answers you get out. Most teams focus on embedding models, vector databases, and prompt engineering. They should be focusing on the input format.

Markdown is the best format for feeding documents to LLMs. But not all markdown is created equal. The difference between good markdown and bad markdown is the difference between a RAG pipeline that works and one that hallucinates.

Why markdown beats plain text and HTML

Three formats dominate LLM input pipelines. Here is how they compare:

| Format     | Structure | Token efficiency                    | LLM comprehension                                        |
|------------|-----------|-------------------------------------|----------------------------------------------------------|
| Plain text | None      | High                                | Poor — no context for what’s a heading, table, or list   |
| HTML       | Full      | Low — tags consume 30-50% of tokens | Good, but wasteful                                       |
| Markdown   | Semantic  | High                                | Best — models are trained on massive amounts of markdown |

Plain text strips all structure. A heading looks the same as a paragraph. A table becomes whitespace-separated numbers. The LLM has to guess what’s what, and it guesses wrong often enough to matter.

HTML preserves structure but wastes tokens. A simple heading like <h2 class="section-title" id="revenue">Revenue</h2> costs 15+ tokens. The same heading in markdown — ## Revenue — costs 2. When you’re working within context windows of 8k-128k tokens, that overhead adds up fast. HTML also drags along CSS classes, data attributes, and nested divs that add zero semantic value for the model.
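The gap is easy to demonstrate. Here is a crude sketch that uses a regex word-and-punctuation count as a stand-in for a real BPE tokenizer (actual token counts vary by model, but the relative overhead is similar):

```python
import re


def rough_token_count(s: str) -> int:
    """Crude proxy for a BPE tokenizer: count word runs and
    individual punctuation characters."""
    return len(re.findall(r"\w+|[^\w\s]", s))


html_heading = '<h2 class="section-title" id="revenue">Revenue</h2>'
md_heading = "## Revenue"

print(rough_token_count(html_heading))  # 20 rough tokens
print(rough_token_count(md_heading))    # 3 rough tokens
```

Roughly a 7x difference for a single heading, and every tag, attribute, and wrapper div in an HTML page pays the same tax.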

Markdown hits the sweet spot. It preserves the structural cues that LLMs need — heading hierarchy, table relationships, list ordering, emphasis — while staying token-efficient and noise-free. Large language models have seen enormous volumes of markdown during training (GitHub READMEs, documentation sites, wikis), so they parse it natively.

The four properties of LLM-ready markdown

Not all markdown conversion is equal. Here is what separates markdown that works well in LLM pipelines from markdown that doesn’t.

1. Heading hierarchy

Clean heading hierarchy (#, ##, ###) is the single most important property for downstream chunking. Headings serve as semantic boundaries — they tell a chunker where one topic ends and another begins.

Bad markdown flattens everything to a single level or omits headings entirely:

Revenue Summary

The company generated $10M in Q3...

Regional Breakdown

North America accounted for 62%...

Good markdown preserves the document’s heading structure:

## Revenue Summary

The company generated $10M in Q3...

### Regional Breakdown

North America accounted for 62%...

The difference matters because heading-based chunking (covered below) relies on these markers to produce semantically coherent chunks. Without them, you fall back to token-count splitting, which cuts through the middle of paragraphs and tables.

2. Table formatting

Tables are where most extractors fail catastrophically. A financial table rendered as flat text is nearly useless:

Segment Revenue Pct Enterprise 4200000 42 Cloud 3100000 31

The same data in a proper markdown table preserves the column-row relationships that an LLM needs to answer questions like “What percentage of revenue came from Cloud?”:

| Segment    | Revenue     | % |
|------------|-------------|---|
| Enterprise | $4,200,000  | 42 |
| Cloud      | $3,100,000  | 31 |

When this table lands in an LLM context window, the model can read it as structured data. When the flat-text version lands, the model has to guess which numbers belong to which labels.

3. List structure

Ordered and unordered lists carry implicit relationships — sequence, hierarchy, grouping. Markdown preserves these:

The application requires:

1. Python 3.10 or higher
2. A valid API key
3. At least 4GB of RAM

Optional dependencies:

- Redis (for caching)
- PostgreSQL (for persistent storage)

Flat text extraction would merge these into a single paragraph, losing the distinction between required and optional items, and between ordered steps and unordered options.

4. Clean section separation

Good markdown has consistent blank lines between sections, no trailing whitespace noise, and no artifacts from PDF rendering (repeated headers, page numbers inline, watermark text). Every piece of non-content text that leaks into the markdown becomes noise in your embeddings.
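A pre-chunking cleanup pass can catch the most common leaks. Here is a minimal sketch, assuming bare page numbers and a known repeated header line are the artifacts to strip (the patterns are illustrative, not exhaustive):

```python
import re


def strip_pdf_artifacts(markdown: str, repeated_header: str = "") -> str:
    """Drop bare page numbers, 'Page N of M' lines, and a known
    repeated header/footer line; trim trailing whitespace."""
    kept = []
    for line in markdown.split("\n"):
        stripped = line.strip()
        if re.fullmatch(r"\d{1,4}", stripped):  # bare page number
            continue
        if re.fullmatch(r"(?i)page\s+\d+(\s+of\s+\d+)?", stripped):
            continue
        if repeated_header and stripped == repeated_header:
            continue
        kept.append(line.rstrip())  # remove trailing whitespace noise
    return "\n".join(kept)
```

The point is not these specific patterns; it is that any cleanup you skip here ends up embedded in your vector store.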

Chunking strategies for markdown

Once you have clean markdown, you need to split it into chunks for embedding or context injection. There are four common strategies, each with different trade-offs.

Heading-based chunking

Split the document at heading boundaries. Each chunk starts with a heading and contains everything until the next heading of equal or higher level.

Best for: Structured documents with consistent heading hierarchy — reports, documentation, manuals, academic papers.

Why it works: Each chunk is a semantically coherent unit. A chunk titled "## Q3 Revenue" will embed close to revenue-related queries. A chunk titled "## Risk Factors" will embed close to risk-related queries. The heading itself acts as a natural label for the chunk’s content.

Limitation: Requires the markdown to actually have headings. If your PDF extractor flattened the heading hierarchy, this strategy produces a single giant chunk.
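It is worth checking for that failure mode before committing to the strategy. A small heuristic sketch (the threshold of three headings is an arbitrary choice; tune it for your corpus):

```python
import re


def has_usable_headings(markdown: str, min_headings: int = 3) -> bool:
    """Return True if the markdown has enough heading markers to
    make heading-based chunking worthwhile."""
    headings = re.findall(r"^#{1,6}\s+\S", markdown, flags=re.MULTILINE)
    return len(headings) >= min_headings
```

If the check fails, fall back to token-count splitting rather than embedding one giant chunk.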

Here is a Python implementation:

import re
from dataclasses import dataclass


@dataclass
class MarkdownChunk:
    heading: str
    level: int
    content: str


def chunk_by_headings(
    markdown: str,
    max_heading_level: int = 2
) -> list[MarkdownChunk]:
    """Split markdown into chunks at heading boundaries.

    Args:
        markdown: The markdown string to chunk.
        max_heading_level: Only split on headings up to this level.
            1 = split on # only, 2 = split on # and ##, etc.
    """
    pattern = r'^(#{1,' + str(max_heading_level) + r'})\s+(.+)$'
    lines = markdown.split('\n')
    chunks: list[MarkdownChunk] = []
    current_heading = ""
    current_level = 0
    current_lines: list[str] = []

    for line in lines:
        match = re.match(pattern, line)
        if match:
            # Save the previous chunk
            if current_lines:
                content = '\n'.join(current_lines).strip()
                if content:
                    chunks.append(MarkdownChunk(
                        heading=current_heading,
                        level=current_level,
                        content=content,
                    ))
            current_heading = match.group(2)
            current_level = len(match.group(1))
            current_lines = [line]
        else:
            current_lines.append(line)

    # Don't forget the last chunk
    if current_lines:
        content = '\n'.join(current_lines).strip()
        if content:
            chunks.append(MarkdownChunk(
                heading=current_heading,
                level=current_level,
                content=content,
            ))

    return chunks

Usage:

from pdftomarkdown import convert

result = convert("annual_report.pdf", api_key="your-key")
chunks = chunk_by_headings(result.markdown, max_heading_level=2)

for chunk in chunks:
    print(f"[H{chunk.level}] {chunk.heading}")
    print(f"  {len(chunk.content)} chars")

Each chunk carries its heading as metadata, which you can store alongside the embedding vector for better retrieval filtering.
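One way to package that metadata, sketched with an illustrative record shape (`chunks_to_records` is a hypothetical helper, not part of any library; adapt the keys to whatever your vector store expects):

```python
def chunks_to_records(chunks, source: str) -> list[dict]:
    """Pair each chunk's text with its heading metadata so the
    heading can be used for retrieval filtering later."""
    return [
        {
            "text": chunk.content,
            "metadata": {
                "heading": chunk.heading,
                "heading_level": chunk.level,
                "source": source,
            },
        }
        for chunk in chunks
    ]
```

At query time, the stored heading lets you filter or boost results (for example, restricting a "risk" query to chunks whose heading mentions risk) without re-reading the chunk text.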

Page-based chunking

Split the document at page boundaries. One page = one chunk.

Best for: Quick prototyping, documents where pages are self-contained (slide decks, forms), or when you need to reference page numbers in citations.

Why it works: It’s simple and deterministic. No parsing logic required — just split on page markers.

Limitation: Pages are arbitrary boundaries. A section that spans pages 3-4 gets split into two chunks, losing coherence. Tables that break across pages get severed.

# pdfToMarkdown returns per-page results
from pdftomarkdown import convert

result = convert("report.pdf", api_key="your-key")

# Each page is already a separate chunk
for i, page in enumerate(result.pages):
    embed_and_store(
        text=page.markdown,
        metadata={"page": i + 1}
    )

Token-count splitting with overlap

Split the text into chunks of a fixed token count, with an overlap window between consecutive chunks to preserve context at boundaries.

Best for: Unstructured or inconsistently formatted documents where heading-based chunking isn’t viable.

Why it works: Guarantees every chunk fits within your embedding model’s token limit. The overlap ensures that sentences split at a boundary still appear in full in at least one chunk.

Limitation: Chunks are semantically arbitrary. A 512-token chunk might contain the tail of one section and the start of another. Overlap increases total storage and embedding cost.

import tiktoken


def chunk_by_tokens(
    text: str,
    chunk_size: int = 512,
    overlap: int = 64,
    model: str = "text-embedding-3-small"
) -> list[str]:
    """Split text into overlapping chunks by token count."""
    encoder = tiktoken.encoding_for_model(model)
    tokens = encoder.encode(text)
    chunks = []

    start = 0
    while start < len(tokens):
        end = start + chunk_size
        chunk_tokens = tokens[start:end]
        chunks.append(encoder.decode(chunk_tokens))
        start += chunk_size - overlap

    return chunks

Recursive splitting

Start with the largest structural separators (headings), then fall back to smaller ones (paragraphs, sentences) if any chunk exceeds a size limit.

Best for: Production RAG systems that need to handle diverse document types with a single pipeline.

Why it works: It tries heading-based chunking first (best quality), but has fallback logic for documents where a single section is too large. You get semantic coherence when possible, size guarantees always.

This is what LangChain’s RecursiveCharacterTextSplitter does internally, and it works well with markdown input:

from langchain.text_splitter import RecursiveCharacterTextSplitter

splitter = RecursiveCharacterTextSplitter(
    chunk_size=1000,
    chunk_overlap=100,
    separators=[
        "\n## ",    # Try heading level 2 first
        "\n### ",   # Then heading level 3
        "\n\n",     # Then paragraphs
        "\n",       # Then lines
        " ",        # Then words
    ]
)

chunks = splitter.split_text(markdown)

The separator list is ordered from most to least semantically meaningful. The splitter tries the first separator, and only falls back to the next one when a chunk would exceed chunk_size.

How chunking strategy affects retrieval

The choice of chunking strategy directly impacts retrieval quality. Here is a concrete example.

Given a 40-page annual report, a user asks: “What were the key risk factors?”

| Strategy          | What gets retrieved                                                                        | Quality    |
|-------------------|--------------------------------------------------------------------------------------------|------------|
| Heading-based     | The chunk starting with ## Risk Factors — exactly the right section                        | High       |
| Page-based        | Pages 12-13, which happen to contain the risk section plus unrelated footnotes             | Medium     |
| Token-count (512) | 3-4 chunks that partially overlap with the risk section, mixed with adjacent content       | Low-Medium |
| Recursive         | The risk section as a coherent chunk, split into sub-sections if it exceeds the size limit | High       |

Heading-based and recursive chunking outperform the others because they produce chunks aligned with the document’s semantic structure. But they require markdown with intact heading hierarchy to work. If your PDF extractor strips headings, you’re forced into token-count splitting.

Why extraction quality determines chunking quality

This is the core point: your chunking strategy is only as good as the markdown it operates on.

If your PDF-to-text pipeline flattens headings, heading-based chunking is impossible. If it destroys tables, your table data embeds as noise. If it merges multi-column layouts, your chunks contain interleaved text from unrelated sections.

pdfToMarkdown is built specifically to produce markdown that works well with these chunking strategies. The vision-language model pipeline preserves:

- Heading hierarchy (# through ######) — enabling heading-based and recursive chunking
- Table structure (pipe-delimited markdown tables) — keeping relational data intact within chunks
- List formatting (ordered and unordered) — preserving sequence and grouping
- Clean section breaks — giving chunkers unambiguous split points

The output is designed to be chunked, embedded, and retrieved. Not just read.

Putting it together: from PDF to chunks

Here is the full pipeline — PDF to structured markdown to heading-based chunks, ready for embedding:

import os
import re
from pdftomarkdown import convert


def pdf_to_chunks(pdf_path: str) -> list[dict]:
    """Convert a PDF to semantically chunked markdown."""
    result = convert(pdf_path, api_key=os.environ["PDFTOMARKDOWN_API_KEY"])

    chunks = []
    current = {"heading": "", "content": ""}

    for line in result.markdown.split("\n"):
        if re.match(r'^#{1,2}\s+', line):
            if current["content"].strip():
                chunks.append(current)
            current = {
                "heading": line.lstrip("#").strip(),
                "content": line + "\n",
            }
        else:
            current["content"] += line + "\n"

    if current["content"].strip():
        chunks.append(current)

    return chunks


# Usage
chunks = pdf_to_chunks("research_paper.pdf")
for chunk in chunks:
    print(f"{chunk['heading']}: {len(chunk['content'])} chars")

For a deeper walkthrough of building a complete RAG pipeline with pdfToMarkdown — including embedding, vector storage, and retrieval — see the PDF to Markdown for RAG Pipelines guide.

Get started

The pdfToMarkdown API returns markdown with clean heading hierarchy, preserved tables, and consistent formatting out of the box. No post-processing, no custom parsing rules.

Test it on your own documents — the Hacker tier is free with no signup:

curl -X POST https://pdftomarkdown.dev/v1/convert \
  -H "Authorization: Bearer demo_public_key" \
  -H "Content-Type: application/json" \
  -d '{"input":{"pdf_url":"https://www.w3.org/WAI/ER/tests/xhtml/testfiles/resources/pdf/dummy.pdf"}}'

Sign in with GitHub for 100 pages/month, no credit card required.