
PDF to Markdown for RAG Pipelines

One endpoint. POST a PDF, get clean markdown back — tables, headings, and lists preserved exactly as they appear on the page.

$ curl -X POST https://pdftomarkdown.dev/v1/convert \
  -H "Authorization: Bearer demo_public_key" \
  -H "Content-Type: application/json" \
  -d '{"input":{"pdf_url":"https://www.w3.org/WAI/ER/tests/xhtml/testfiles/resources/pdf/dummy.pdf"}}'
{
  "markdown": "# Document\n\n| Column A | Column B |\n|---|---|\n| Value 1 | Value 2 |\n\n**Summary text here**",
  "pages": 3,
  "request_id": "req_abc123"
}

The RAG pipeline garbage-in problem

Your retrieval-augmented generation pipeline is only as good as the documents you feed into it. Most PDF extractors — PyMuPDF, pdfplumber, PDFMiner — give you a flat stream of characters. Headings disappear. Tables collapse into whitespace-separated gibberish. Multi-column layouts interleave into nonsense.

When you embed that garbled text, you get garbled vectors. When you retrieve those vectors, you get irrelevant context. When the LLM reads that context, it hallucinates.

The fix is not a better embedding model. The fix is cleaner input.

What bad extraction does to your chunks

Here is what a typical PDF text extractor produces from a document with tables and headings:

Q3 2024 Financial Summary Revenue by Segment
Enterprise 4200000 42 Cloud Services 3100000 31
On-Premise 1800000 18 Professional Services 900000 9
Total Revenue 10000000 Key Metrics Gross Margin 72.3
Operating Margin 28.1 Net Income 1840000 Customer
Acquisition Cost 1250 Lifetime Value 48000

No structure. No way to tell where the table ends and the body text begins. An embedding model will compress this into a vector that represents nothing useful.

The same page through pdfToMarkdown:

## Q3 2024 Financial Summary

### Revenue by Segment

| Segment               | Revenue      |   % |
|-----------------------|-------------|-----|
| Enterprise            | $4,200,000  |  42 |
| Cloud Services        | $3,100,000  |  31 |
| On-Premise            | $1,800,000  |  18 |
| Professional Services |   $900,000  |   9 |
| **Total Revenue**     | **$10,000,000** |     |

### Key Metrics

| Metric                    | Value       |
|---------------------------|-------------|
| Gross Margin              | 72.3%       |
| Operating Margin          | 28.1%       |
| Net Income                | $1,840,000  |
| Customer Acquisition Cost | $1,250      |
| Lifetime Value            | $48,000     |

Headings give your chunker natural split points. Tables remain tables. When this chunk lands in an LLM context window, the model can actually read it.

Why structure matters for embeddings

Embedding models encode semantic meaning. But semantic meaning depends on structure:

  • Headings establish topic scope. A chunk that starts with ### Revenue by Segment will embed closer to revenue-related queries than the same numbers without a heading.
  • Tables preserve relationships. “Enterprise: $4,200,000” carries meaning that “Enterprise 4200000 42 Cloud Services 3100000” does not.
  • Markdown formatting reduces noise. Clean separators between sections mean your chunks don’t bleed across topics.

The result: better cosine similarity scores on relevant queries, fewer irrelevant chunks in your top-k, and fewer hallucinations in the final answer.
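
Cosine similarity itself is just the angle between two vectors. A toy sketch in plain Python (the three-dimensional "embeddings" here are illustrative, not real model output):

```python
import math

def cosine_similarity(a: list[float], b: list[float]) -> float:
    # dot(a, b) / (|a| * |b|)
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# Toy vectors: the closer the direction, the higher the score
query_vec = [0.9, 0.1, 0.0]
clean_chunk_vec = [0.8, 0.2, 0.1]
garbled_chunk_vec = [0.1, 0.3, 0.9]

print(cosine_similarity(query_vec, clean_chunk_vec))    # high (~0.98)
print(cosine_similarity(query_vec, garbled_chunk_vec))  # low (~0.14)
```

A garbled chunk that mashes several topics together ends up pointing somewhere between all of them, which is why it scores poorly against every specific query.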

Build a RAG pipeline in 30 lines

PDF to markdown to chunks to embeddings to retrieval — end to end:

import hashlib
import os
from pdftomarkdown import convert
from openai import OpenAI

# Step 1: Convert PDF to structured markdown
result = convert(
    "quarterly_report.pdf",
    api_key=os.environ["PDFTOMARKDOWN_API_KEY"]
)

# Step 2: Chunk on top-level markdown headings
chunks = []
current_chunk = ""
for line in result.markdown.split("\n"):
    if line.startswith("## ") and current_chunk:
        chunks.append(current_chunk.strip())
        current_chunk = ""
    current_chunk += line + "\n"
if current_chunk.strip():
    chunks.append(current_chunk.strip())

# Step 3: Embed chunks
client = OpenAI()
embeddings = client.embeddings.create(
    model="text-embedding-3-small",
    input=chunks
)

# Step 4: Store in your vector DB (pseudocode)
for chunk, embedding in zip(chunks, embeddings.data):
    vector_db.upsert(
        # sha256 gives a stable ID; Python's hash() is salted per process
        id=hashlib.sha256(chunk.encode()).hexdigest(),
        vector=embedding.embedding,
        metadata={"text": chunk}
    )

# Step 5: Query
query = "What was the gross margin in Q3?"
query_embedding = client.embeddings.create(
    model="text-embedding-3-small",
    input=[query]
).data[0].embedding

results = vector_db.query(vector=query_embedding, top_k=3)

Because the markdown preserves headings, the chunker splits on section boundaries instead of arbitrary character counts. Each chunk is a coherent unit of information.
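
The last mile is assembling the retrieved chunks into a prompt. A minimal sketch (the retrieved chunk below is hypothetical; adapt the extraction of text from `results` to your vector DB's return format):

```python
def build_prompt(query: str, chunks: list[str]) -> str:
    # Separate chunks clearly so sections don't bleed into each other
    context = "\n\n---\n\n".join(chunks)
    return (
        "Answer the question using only the context below.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {query}"
    )

# Hypothetical top-k chunk text pulled from the query results
retrieved = [
    "### Key Metrics\n\n| Metric | Value |\n|---|---|\n| Gross Margin | 72.3% |",
]
prompt = build_prompt("What was the gross margin in Q3?", retrieved)
```

Because the chunk carries its heading and an intact markdown table, the model sees "Gross Margin | 72.3%" in context rather than a bare number.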

Try it with curl

No signup required. Send a PDF, get markdown back:

curl -X POST https://pdftomarkdown.dev/v1/convert \
  -H "Authorization: Bearer demo_public_key" \
  -H "Content-Type: application/json" \
  -d '{"input":{"pdf_url":"https://www.w3.org/WAI/ER/tests/xhtml/testfiles/resources/pdf/dummy.pdf"}}'
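
The same request from Python using only the standard library — endpoint, headers, and payload shape are copied from the curl example, and the returned JSON matches the sample response above:

```python
import json
import urllib.request

API_URL = "https://pdftomarkdown.dev/v1/convert"

def convert_url(pdf_url: str, api_key: str = "demo_public_key") -> dict:
    # Build the same POST body as the curl example
    payload = json.dumps({"input": {"pdf_url": pdf_url}}).encode()
    req = urllib.request.Request(
        API_URL,
        data=payload,
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
    )
    with urllib.request.urlopen(req, timeout=60) as resp:
        # {"markdown": ..., "pages": ..., "request_id": ...}
        return json.load(resp)
```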

Or with a local file:

from pdftomarkdown import convert

result = convert("document.pdf")
print(result.markdown)

The free tier processes page one of any PDF with no API key signup. Get a Developer key for multi-page documents and higher rate limits.

What breaks in traditional PDF extraction

| Problem                   | Traditional extractors                     | pdfToMarkdown                       |
|---------------------------|--------------------------------------------|-------------------------------------|
| Tables                    | Columns collapse into space-separated text | Preserved as markdown tables        |
| Headings                  | Indistinguishable from body text           | Mapped to #, ##, ###                |
| Multi-column layouts      | Left and right columns interleave          | Linearized in reading order         |
| Scanned documents         | Require separate OCR step                  | Handled natively by the vision model |
| Footnotes and captions    | Mixed into body text                       | Separated and labeled               |
| Bullet and numbered lists | Flattened into paragraphs                  | Preserved as markdown lists         |

Every one of these failures degrades your RAG retrieval quality. Tables are the worst offender — a financial table rendered as flat text is nearly useless for question answering.

Works with any RAG framework

The API returns standard markdown. Feed it into whatever stack you use:

  • LangChain — use the markdown string directly with RecursiveCharacterTextSplitter or a markdown-aware splitter
  • LlamaIndex — pass the output as a Document node
  • Haystack — pipe into a PreProcessor for chunking and indexing
  • Custom pipelines — split on \n## for heading-based chunks, or use any markdown parser

No vendor lock-in, no proprietary format. It is just markdown.
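
The split-on-\n## approach from the last bullet fits in a few lines (a framework-free sketch):

```python
def split_on_h2(markdown: str) -> list[str]:
    # Split at every level-2 heading, keeping each heading with its section
    sections = markdown.split("\n## ")
    head, rest = sections[0], sections[1:]
    chunks = ["## " + s for s in rest]
    if head.strip():
        # Preamble before the first heading (or a doc that opens with "## ")
        chunks.insert(0, head)
    return chunks

doc = "Intro text\n## Section A\nBody A\n## Section B\nBody B"
for chunk in split_on_h2(doc):
    print(chunk)
```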

Pricing

Both tiers are free. No credit card required.

Hacker

Free, no signup

  • Public demo key — copy & paste
  • Only page 1 is processed
  • 1 request/min per IP
  • Watermark in output
View docs →

Developer

Free, GitHub login

  • Personal API key
  • 100 pages/month
  • Multi-page PDFs
  • No watermark
Get API key →

Try the API free

Free tier — no account needed. It converts page 1 only and adds a watermark. Upgrade to a Developer key to remove the watermark and unlock full multi-page PDFs.