pdfToMarkdown team

The Hidden Cost of Bad PDF Parsing in RAG Systems

rag · pdf-parsing · embeddings · llm

Most teams debugging an underperforming RAG system start at the wrong end. They swap embedding models, tune chunk sizes, experiment with rerankers, add prompt engineering guardrails. These are all downstream fixes for an upstream problem: the text that went into the pipeline was garbage from the start.

The root cause is almost always PDF parsing.

The garbage-in chain

Bad PDF parsing triggers a cascade that compounds at every stage:

  1. Poor parsing — tables lose structure, headings disappear, multi-column text interleaves
  2. Bad chunks — without structural markers, your chunker splits on arbitrary character counts, producing fragments that mix unrelated content
  3. Bad embeddings — garbled text produces vectors that represent noise, not meaning
  4. Irrelevant retrieval — cosine similarity matches on noise return the wrong chunks
  5. LLM hallucinations — the model receives context that doesn’t answer the question, so it invents an answer that sounds right

Each stage amplifies the error. A table that loses its structure in step 1 doesn’t just degrade one chunk — it produces a chunk that actively competes with correct chunks during retrieval, pushing them out of the top-k results.

Example 1: Tables become random numbers

Consider a financial document with a revenue breakdown table. A traditional PDF extractor (PyMuPDF, pdfplumber, PDFMiner) produces this:

Q3 Revenue Breakdown Enterprise 4200000 42 Cloud 3100000 31
On-Premise 1800000 18 Services 900000 9 Total 10000000
Gross Margin 72.3 Operating Margin 28.1

Now a user asks: “What was the enterprise revenue in Q3?”

The embedding for that garbled chunk is a blend of every number and label mashed together. It might match — or it might not. Even if it does match, the LLM sees Enterprise 4200000 42 Cloud 3100000 31 and has to guess what “42” means. Is it 42%? $42M? Row 42? The model will pick one interpretation and present it confidently. If it guesses wrong, you get a hallucination that looks authoritative.

The same table through a structure-preserving parser:

## Q3 Revenue Breakdown

| Segment    | Revenue     |  % |
|------------|-------------|-----|
| Enterprise | $4,200,000  | 42 |
| Cloud      | $3,100,000  | 31 |
| On-Premise | $1,800,000  | 18 |
| Services   | $900,000    |  9 |
| **Total**  | **$10,000,000** |   |

The heading anchors the chunk semantically. The table preserves column relationships. The LLM reads the markdown table and extracts $4,200,000 without guessing.
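A structured table is also machine-readable: even plain string processing can recover exact cell values, with no model in the loop. A minimal sketch (not part of the pdfToMarkdown API) that parses a simple pipe-delimited markdown table into row dicts:

```python
def parse_markdown_table(md: str) -> list[dict]:
    """Parse a simple pipe-delimited markdown table into row dicts.
    Assumes one header row followed by a |---| separator row."""
    rows = [line.strip().strip("|").split("|")
            for line in md.strip().splitlines()
            if line.strip().startswith("|")]
    header = [cell.strip() for cell in rows[0]]
    # rows[1] is the separator row; data starts at rows[2]
    return [dict(zip(header, (cell.strip() for cell in row)))
            for row in rows[2:]]
```

Point this at the Q3 table above and the Enterprise row comes back as `{"Segment": "Enterprise", "Revenue": "$4,200,000", "%": "42"}` — the column relationships the flat-text extractor destroyed.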

The cost: every table in your corpus that loses its structure is a potential wrong answer. If your documents are financial reports, medical records, or legal contracts — domains where precision matters — a single wrong number can be worse than no answer at all.

Example 2: Lost headings destroy retrieval context

Headings are the single most important structural element for RAG. They tell the chunker where to split, and they tell the embedding model what the chunk is about.

When a PDF extractor strips headings — which most text-based extractors do, since the PDF format has no native heading concept — every chunk becomes a context-free block of text. Consider a 40-page technical manual with sections like “Installation”, “Configuration”, “Troubleshooting”, and “API Reference”. Without headings, a chunk from the Troubleshooting section looks like this:

If the connection times out after 30 seconds, verify that port 8443 is
open in your firewall rules. The default timeout can be increased by
setting CONNECTION_TIMEOUT_MS in the environment configuration.

And a chunk from the Configuration section looks like this:

Set CONNECTION_TIMEOUT_MS to the desired value in milliseconds. The
default is 30000. Values below 5000 are not recommended for production
deployments.

Both chunks mention CONNECTION_TIMEOUT_MS and 30 seconds/30000. Without a heading to disambiguate, the embedding model produces similar vectors for both. A user asking “How do I fix connection timeout errors?” might retrieve the Configuration chunk instead of the Troubleshooting chunk — or worse, retrieve both and force the LLM to reconcile two chunks that look similar but serve different purposes.

With headings preserved:

## Troubleshooting

If the connection times out after 30 seconds, verify that port 8443 is open...

## Configuration

Set CONNECTION_TIMEOUT_MS to the desired value in milliseconds...

The heading text is embedded alongside the content. The Troubleshooting chunk now clusters closer to “fix”, “error”, “problem” queries. The Configuration chunk clusters closer to “set up”, “configure”, “change” queries. Retrieval precision improves because the chunks are semantically distinct.
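One common way to get the heading into the vector is to build the embedding input explicitly from the document title, section heading, and body. A sketch of that technique — the function name and format are assumptions for illustration, not a pdfToMarkdown API:

```python
def embedding_input(doc_title: str, heading: str, body: str) -> str:
    """Build the text sent to the embedding model: the document title
    and section heading are prepended so section context becomes part
    of the vector, not just the body text."""
    return f"{doc_title} > {heading}\n\n{body.strip()}"
```

A Troubleshooting chunk then embeds as `"Network Manual > Troubleshooting\n\nIf the connection times out..."`, which separates it from the near-identical Configuration text.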

The cost: in a corpus with 10 major sections per document, losing headings means every chunk competes with chunks from unrelated sections. Your effective retrieval precision can drop by 30-50% on topic-specific queries, because the embedding space can’t distinguish chunks that differ only in context, not content.

Example 3: Multi-column text produces nonsense embeddings

Academic papers, newsletters, reports with sidebars — any document with multi-column layout is a minefield for traditional extractors. The extractor reads left-to-right across the full page width, interleaving columns:

Introduction The experiment Results show
was conducted over a that the treatment
six-week period at group exhibited
three clinical sites. significantly higher
Participants were response rates
randomly assigned to (p < 0.001) compared
treatment (n=120) and to the control group
control (n=118) groups. across all endpoints.

This is not text. This is two columns shuffled together into a sequence that is semantically meaningless. An embedding model trained on coherent English will produce a vector that lands in a no-man’s-land of the embedding space — not close to anything useful.

If 20% of your chunks contain this kind of interleaved garbage, your retrieval precision drops proportionally. But the real damage is worse than 20%, because these garbled chunks don’t just fail to match — they actively pollute the vector space, making nearby valid chunks less distinguishable from noise.

A structure-aware parser linearizes the columns in reading order:

## Introduction

The experiment was conducted over a six-week period at three clinical
sites. Participants were randomly assigned to treatment (n=120) and
control (n=118) groups.

## Results

Results show that the treatment group exhibited significantly higher
response rates (p < 0.001) compared to the control group across all
endpoints.

Each section becomes a coherent, embeddable unit with a clear topic heading.

The cost: multi-column interleaving doesn’t just reduce match quality. It creates chunks that look like content to your pipeline — they pass length filters, they aren’t empty, they contain real words — but they carry zero retrievable meaning. These are phantom chunks that consume storage, slow down search, and never return useful results.

Quantifying the impact

The damage from bad parsing is measurable, but most teams don’t measure it because they assume parsing is a solved problem.

Here’s a rough framework for estimating the impact on your system:

  • Audit 50 random chunks from your vector store. Read them. Count how many are garbled, lack context, or contain interleaved text. If more than 5% are affected, you have a parsing problem.
  • Measure retrieval precision on a set of test queries with known correct answers. If your top-3 results contain the correct chunk less than 70% of the time, bad input is likely a factor.
  • Check your table-heavy documents. If you’re ingesting financial reports, datasheets, or any document with tabular data, pull a sample of table chunks. If the tables aren’t in markdown table format, they’re not being parsed correctly.
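The first two checks above are easy to automate. A rough sketch, with heuristics and thresholds that are assumptions to tune against your own hand-labelled chunks, not established constants:

```python
import re

def looks_garbled(chunk: str, digit_ratio: float = 0.25) -> bool:
    """Flag chunks that are likely flattened tables: a high share of
    bare numeric tokens is rare in coherent prose."""
    tokens = chunk.split()
    if not tokens:
        return True
    numeric = sum(1 for t in tokens if re.fullmatch(r"[\d.,%$]+", t))
    return numeric / len(tokens) >= digit_ratio

def top_k_hit_rate(retrieved: dict[str, list[str]],
                   expected: dict[str, str], k: int = 3) -> float:
    """Share of test queries whose known-correct chunk id appears
    among the top-k retrieved chunk ids."""
    hits = sum(1 for query, gold in expected.items()
               if gold in retrieved.get(query, [])[:k])
    return hits / len(expected)
```

Run `looks_garbled` over the 50-chunk sample to estimate your noise rate, and `top_k_hit_rate` over a query set with known answers to get the retrieval precision number from the second bullet.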

A typical corpus of business documents — reports, contracts, manuals — has 15-30% of its content in tables, sidebars, or multi-column layouts. If your parser mishandles those, you’re starting with 15-30% noise in your index before you’ve even considered chunking strategy or embedding model choice.

No amount of retrieval tuning fixes bad input.

The fix

This is not a prompting problem or an embedding model problem. It’s a parsing problem, and the solution is to fix parsing before anything enters your pipeline.

pdfToMarkdown converts PDFs to structured markdown that preserves headings, tables, lists, and reading order. The output is designed for exactly this use case: clean text that chunks well, embeds well, and retrieves well.

from pdftomarkdown import convert

result = convert("quarterly_report.pdf", api_key="your-key")

# Chunk on headings — natural topic boundaries
chunks = []
current = ""
for line in result.markdown.split("\n"):
    if line.startswith("## ") and current:
        chunks.append(current.strip())
        current = ""
    current += line + "\n"
if current.strip():
    chunks.append(current.strip())

# Each chunk has a heading, preserves tables, reads coherently

The heading-based chunking here is simple on purpose. When your input has proper structure, you don’t need sophisticated chunking strategies. The document’s own headings are better split points than any recursive character splitter can find.
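A quick sanity check of that loop, wrapped as a function and run on a toy markdown string (the sample text is invented for illustration):

```python
def split_on_headings(markdown: str) -> list[str]:
    """Split markdown into chunks at '## ' headings, keeping each
    heading attached to the body that follows it."""
    chunks, current = [], ""
    for line in markdown.split("\n"):
        if line.startswith("## ") and current:
            chunks.append(current.strip())
            current = ""
        current += line + "\n"
    if current.strip():
        chunks.append(current.strip())
    return chunks

sample = "## Intro\n\nFirst section.\n\n## Details\n\nSecond section.\n"
chunks = split_on_headings(sample)
# -> two chunks, each led by its own heading
```

Each chunk carries its own heading, so the semantic anchor survives all the way into the embedding.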

Start with better input

If your RAG system is underperforming, stop tuning the downstream components and look at what’s going into the pipeline. Pull 20 chunks from your vector store. Read them. If they don’t make sense to you, they don’t make sense to the embedding model either.

Try the API with your own documents — the free tier needs no signup:

curl -X POST https://pdftomarkdown.dev/v1/convert \
  -H "Authorization: Bearer demo_public_key" \
  -H "Content-Type: application/json" \
  -d '{"input":{"pdf_url":"https://your-document-url.pdf"}}'

Compare the output against what your current parser produces. The difference is usually obvious within the first page.

See how pdfToMarkdown fits into a RAG pipeline or read the API docs to get started.