PDF to Markdown for RAG Pipelines
One endpoint. POST a PDF, get clean markdown back — tables, headings, and lists preserved exactly as they appear on the page.
$ curl -X POST https://pdftomarkdown.dev/v1/convert \
-H "Authorization: Bearer demo_public_key" \
-H "Content-Type: application/json" \
-d '{"input":{"pdf_url":"https://www.w3.org/WAI/ER/tests/xhtml/testfiles/resources/pdf/dummy.pdf"}}'

{
"markdown": "# Document\n\n| Column A | Column B |\n|---|---|\n| Value 1 | Value 2 |\n\n**Summary text here**",
"pages": 3,
"request_id": "req_abc123"
}

The RAG pipeline garbage-in problem
Your retrieval-augmented generation pipeline is only as good as the documents you feed into it. Most PDF extractors — PyMuPDF, pdfplumber, PDFMiner — give you a flat stream of characters. Headings disappear. Tables collapse into whitespace-separated gibberish. Multi-column layouts interleave into nonsense.
When you embed that garbled text, you get garbled vectors. When you retrieve those vectors, you get irrelevant context. When the LLM reads that context, it hallucinates.
The fix is not a better embedding model. The fix is cleaner input.
What bad extraction does to your chunks
Here is what a typical PDF text extractor produces from a document with tables and headings:
Q3 2024 Financial Summary Revenue by Segment
Enterprise 4200000 42 Cloud Services 3100000 31
On-Premise 1800000 18 Professional Services 900000 9
Total Revenue 10000000 Key Metrics Gross Margin 72.3
Operating Margin 28.1 Net Income 1840000 Customer
Acquisition Cost 1250 Lifetime Value 48000
No structure. No way to tell where the table ends and the body text begins. An embedding model will compress this into a vector that represents nothing useful.
The same page through pdfToMarkdown:
## Q3 2024 Financial Summary
### Revenue by Segment
| Segment | Revenue | % |
|-----------------------|-------------|-----|
| Enterprise | $4,200,000 | 42 |
| Cloud Services | $3,100,000 | 31 |
| On-Premise | $1,800,000 | 18 |
| Professional Services | $900,000 | 9 |
| **Total Revenue** | **$10,000,000** | |
### Key Metrics
| Metric | Value |
|---------------------------|-------------|
| Gross Margin | 72.3% |
| Operating Margin | 28.1% |
| Net Income | $1,840,000 |
| Customer Acquisition Cost | $1,250 |
| Lifetime Value | $48,000 |
Headings give your chunker natural split points. Tables remain tables. When this chunk lands in an LLM context window, the model can actually read it.
Why structure matters for embeddings
Embedding models encode semantic meaning. But semantic meaning depends on structure:
- Headings establish topic scope. A chunk that starts with ## Revenue by Segment will embed closer to revenue-related queries than the same numbers without a heading.
- Tables preserve relationships. “Enterprise: $4,200,000” carries meaning that “Enterprise 4200000 42 Cloud Services 3100000” does not.
- Markdown formatting reduces noise. Clean separators between sections mean your chunks don’t bleed across topics.
The result: better cosine similarity scores on relevant queries, fewer irrelevant chunks in your top-k, and fewer hallucinations in the final answer.
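Those similarity scores are just normalized dot products. A minimal illustration with plain Python and toy three-dimensional vectors standing in for real embeddings (the vector values here are made up for demonstration):

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors: 1.0 means identical direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# Toy stand-ins for embeddings of a query and two candidate chunks
query = [0.9, 0.1, 0.0]
structured_chunk = [0.8, 0.2, 0.1]  # topically aligned with the query
garbled_chunk = [0.1, 0.3, 0.9]     # semantically diffuse

print(cosine_similarity(query, structured_chunk))  # high: retrieved
print(cosine_similarity(query, garbled_chunk))     # low: skipped
```

A cleaner chunk shifts its embedding toward the queries it should answer, which is exactly what pushes it into the top-k.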
Build a RAG pipeline in 20 lines
PDF to markdown to chunks to embeddings to retrieval — end to end:
import hashlib
import os
from pdftomarkdown import convert
from openai import OpenAI

# Step 1: Convert PDF to structured markdown
result = convert(
    "quarterly_report.pdf",
    api_key=os.environ["PDFTOMARKDOWN_API_KEY"]
)

# Step 2: Chunk on markdown headings
chunks = []
current_chunk = ""
for line in result.markdown.split("\n"):
    if line.startswith("## ") and current_chunk:
        chunks.append(current_chunk.strip())
        current_chunk = ""
    current_chunk += line + "\n"
if current_chunk.strip():
    chunks.append(current_chunk.strip())

# Step 3: Embed chunks
client = OpenAI()
embeddings = client.embeddings.create(
    model="text-embedding-3-small",
    input=chunks
)

# Step 4: Store in your vector DB (pseudocode)
for chunk, embedding in zip(chunks, embeddings.data):
    vector_db.upsert(
        # Content-derived ID; sha256 is stable across runs, unlike hash()
        id=hashlib.sha256(chunk.encode()).hexdigest(),
        vector=embedding.embedding,
        metadata={"text": chunk}
    )

# Step 5: Query
query = "What was the gross margin in Q3?"
query_embedding = client.embeddings.create(
    model="text-embedding-3-small",
    input=[query]
).data[0].embedding
results = vector_db.query(vector=query_embedding, top_k=3)
Because the markdown preserves headings, the chunker splits on section boundaries instead of arbitrary character counts. Each chunk is a coherent unit of information.
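From there, answering the query is a matter of pasting the retrieved chunks into the LLM prompt. A sketch of that final step, with mocked retrieval results in place of the vector DB response (assuming each match carries the chunk text in its metadata, as in the upsert above):

```python
# Mocked retrieval results; in the pipeline above these come from vector_db.query
results = [
    {"metadata": {"text": "### Key Metrics\n\n| Metric | Value |\n|---|---|\n| Gross Margin | 72.3% |"}},
    {"metadata": {"text": "## Q3 2024 Financial Summary\n\nRevenue grew across all segments."}},
]

query = "What was the gross margin in Q3?"

# Join the retrieved chunks into a single context block, then wrap in a prompt
context = "\n\n---\n\n".join(r["metadata"]["text"] for r in results)
prompt = (
    "Answer the question using only the context below.\n\n"
    f"Context:\n{context}\n\n"
    f"Question: {query}"
)
# prompt is then sent to the model, e.g. via client.chat.completions.create
```

Because the chunks are still markdown, the table arrives in the context window intact and the model can read the gross margin straight out of it.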
Try it with curl
No signup required. Send a PDF, get markdown back:
curl -X POST https://pdftomarkdown.dev/v1/convert \
-H "Authorization: Bearer demo_public_key" \
-H "Content-Type: application/json" \
-d '{"input":{"pdf_url":"https://www.w3.org/WAI/ER/tests/xhtml/testfiles/resources/pdf/dummy.pdf"}}'
Or with a local file:
from pdftomarkdown import convert
result = convert("document.pdf")
print(result.markdown)
The free tier processes page one of any PDF, no signup or API key required. Get a Developer key for multi-page documents and higher rate limits.
What breaks in traditional PDF extraction
| Problem | Traditional extractors | pdfToMarkdown |
|---|---|---|
| Tables | Columns collapse into space-separated text | Preserved as markdown tables |
| Headings | Indistinguishable from body text | Mapped to #, ##, ### |
| Multi-column layouts | Left and right columns interleave | Linearized in reading order |
| Scanned documents | Require separate OCR step | Handled natively by the vision model |
| Footnotes and captions | Mixed into body text | Separated and labeled |
| Bullet and numbered lists | Flattened into paragraphs | Preserved as markdown lists |
Every one of these failures degrades your RAG retrieval quality. Tables are the worst offender — a financial table rendered as flat text is nearly useless for question answering.
Works with any RAG framework
The API returns standard markdown. Feed it into whatever stack you use:
- LangChain — use the markdown string directly with RecursiveCharacterTextSplitter or a markdown-aware splitter
- LlamaIndex — pass the output as a Document node
- Haystack — pipe into a PreProcessor for chunking and indexing
- Custom pipelines — split on \n## for heading-based chunks, or use any markdown parser
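The last option is small enough to sketch inline. A minimal heading-based splitter using only the standard library, where a lookahead in re.split keeps each ## heading attached to its own chunk:

```python
import re

def split_on_headings(markdown: str) -> list[str]:
    """Split markdown into chunks at every '## ' heading, keeping the heading."""
    # Lookahead splits before '## ' without consuming it; '### ' does not match
    parts = re.split(r"\n(?=## )", markdown)
    return [p.strip() for p in parts if p.strip()]

doc = "Intro text.\n## Revenue\n| A | B |\n## Metrics\nMargin: 72.3%"
for chunk in split_on_headings(doc):
    print(chunk.splitlines()[0])
```

Swap the pattern for `\n(?=#{1,3} )` if you want to split on every heading level rather than only second-level ones.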
No vendor lock-in, no proprietary format. It is just markdown.
Related pages
- PDF Parsing API — general-purpose PDF text extraction
- OCR API for Developers — overview of the OCR capabilities
- API documentation — endpoint reference, response schema, error codes
Pricing
Both tiers are free. No credit card required.
Hacker
Free, no signup
- Public demo key — copy & paste
- Only page 1 is processed
- 1 request/min per IP
- Watermark in output
Developer
Free, GitHub login
- Personal API key
- 100 pages/month
- Multi-page PDFs
- No watermark
Try the API free
Free tier — no account needed. It converts page 1 only and adds a watermark. Upgrade to Developer to remove the watermark and unlock multi-page PDFs.