PDF to Markdown for RAG Pipelines
One endpoint. POST a PDF, get clean markdown back — tables, headings, and lists preserved exactly as they appear on the page.
$ curl -X POST https://pdftomarkdown.dev/v1/convert \
-H "Authorization: Bearer demo_public_key" \
-H "Content-Type: application/json" \
-d '{"input":{"pdf_url":"https://www.w3.org/WAI/ER/tests/xhtml/testfiles/resources/pdf/dummy.pdf"}}'

{
"markdown": "# Document\n\n| Column A | Column B |\n|---|---|\n| Value 1 | Value 2 |\n\n**Summary text here**",
"pages": 3,
"request_id": "req_abc123"
}

The RAG pipeline garbage-in problem
Your retrieval-augmented generation pipeline is only as good as the documents you feed into it. Most PDF extractors — PyMuPDF, pdfplumber, PDFMiner — give you a flat stream of characters. Headings disappear. Tables collapse into whitespace-separated gibberish. Multi-column layouts interleave into nonsense.
When you embed that garbled text, you get garbled vectors. When you retrieve those vectors, you get irrelevant context. When the LLM reads that context, it hallucinates.
The fix is not a better embedding model. The fix is cleaner input.
What bad extraction does to your chunks
Here is what a typical PDF text extractor produces from a document with tables and headings:
Q3 2024 Financial Summary Revenue by Segment
Enterprise 4200000 42 Cloud Services 3100000 31
On-Premise 1800000 18 Professional Services 900000 9
Total Revenue 10000000 Key Metrics Gross Margin 72.3
Operating Margin 28.1 Net Income 1840000 Customer
Acquisition Cost 1250 Lifetime Value 48000
No structure. No way to tell where the table ends and the body text begins. An embedding model will compress this into a vector that represents nothing useful.
The same page through pdfToMarkdown:
## Q3 2024 Financial Summary
### Revenue by Segment
| Segment | Revenue | % |
|-----------------------|-------------|-----|
| Enterprise | $4,200,000 | 42 |
| Cloud Services | $3,100,000 | 31 |
| On-Premise | $1,800,000 | 18 |
| Professional Services | $900,000 | 9 |
| **Total Revenue** | **$10,000,000** | |
### Key Metrics
| Metric | Value |
|---------------------------|-------------|
| Gross Margin | 72.3% |
| Operating Margin | 28.1% |
| Net Income | $1,840,000 |
| Customer Acquisition Cost | $1,250 |
| Lifetime Value | $48,000 |
Headings give your chunker natural split points. Tables remain tables. When this chunk lands in an LLM context window, the model can actually read it.
Why structure matters for embeddings
Embedding models encode semantic meaning. But semantic meaning depends on structure:
- Headings establish topic scope. A chunk that starts with ## Revenue by Segment will embed closer to revenue-related queries than the same numbers without a heading.
- Tables preserve relationships. “Enterprise: $4,200,000” carries meaning that “Enterprise 4200000 42 Cloud Services 3100000” does not.
- Markdown formatting reduces noise. Clean separators between sections mean your chunks don’t bleed across topics.
The result: better cosine similarity scores on relevant queries, fewer irrelevant chunks in your top-k, and fewer hallucinations in the final answer.
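Those similarity scores are just normalized dot products. A minimal illustration with plain Python and toy three-dimensional vectors standing in for real embeddings (the vector values here are made up for demonstration):

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors: 1.0 means identical direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# Toy stand-ins for embeddings of a query and two candidate chunks
query = [0.9, 0.1, 0.0]
structured_chunk = [0.8, 0.2, 0.1]  # topically aligned with the query
garbled_chunk = [0.1, 0.3, 0.9]     # semantically diffuse

print(cosine_similarity(query, structured_chunk))  # high: retrieved
print(cosine_similarity(query, garbled_chunk))     # low: skipped
```

A cleaner chunk shifts its embedding toward the queries it should answer, which is exactly what pushes it into the top-k.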
Build a RAG pipeline in 20 lines
PDF to markdown to chunks to embeddings to retrieval — end to end:
import hashlib
import os
from pdftomarkdown import convert
from openai import OpenAI

# Step 1: Convert PDF to structured markdown
result = convert(
    "quarterly_report.pdf",
    api_key=os.environ["PDFTOMARKDOWN_API_KEY"]
)

# Step 2: Chunk on markdown headings
chunks = []
current_chunk = ""
for line in result.markdown.split("\n"):
    if line.startswith("## ") and current_chunk:
        chunks.append(current_chunk.strip())
        current_chunk = ""
    current_chunk += line + "\n"
if current_chunk.strip():
    chunks.append(current_chunk.strip())

# Step 3: Embed chunks
client = OpenAI()
embeddings = client.embeddings.create(
    model="text-embedding-3-small",
    input=chunks
)

# Step 4: Store in your vector DB (pseudocode)
for chunk, embedding in zip(chunks, embeddings.data):
    vector_db.upsert(
        # Content-derived ID; sha256 is stable across runs, unlike hash()
        id=hashlib.sha256(chunk.encode()).hexdigest(),
        vector=embedding.embedding,
        metadata={"text": chunk}
    )

# Step 5: Query
query = "What was the gross margin in Q3?"
query_embedding = client.embeddings.create(
    model="text-embedding-3-small",
    input=[query]
).data[0].embedding
results = vector_db.query(vector=query_embedding, top_k=3)
Because the markdown preserves headings, the chunker splits on section boundaries instead of arbitrary character counts. Each chunk is a coherent unit of information.
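From there, answering the query is a matter of pasting the retrieved chunks into the LLM prompt. A sketch of that final step, with mocked retrieval results in place of the vector DB response (assuming each match carries the chunk text in its metadata, as in the upsert above):

```python
# Mocked retrieval results; in the pipeline above these come from vector_db.query
results = [
    {"metadata": {"text": "### Key Metrics\n\n| Metric | Value |\n|---|---|\n| Gross Margin | 72.3% |"}},
    {"metadata": {"text": "## Q3 2024 Financial Summary\n\nRevenue grew across all segments."}},
]

query = "What was the gross margin in Q3?"

# Join the retrieved chunks into a single context block, then wrap in a prompt
context = "\n\n---\n\n".join(r["metadata"]["text"] for r in results)
prompt = (
    "Answer the question using only the context below.\n\n"
    f"Context:\n{context}\n\n"
    f"Question: {query}"
)
# prompt is then sent to the model, e.g. via client.chat.completions.create
```

Because the chunks are still markdown, the table arrives in the context window intact and the model can read the gross margin straight out of it.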
Try it with curl
No signup required. Send a PDF, get markdown back:
curl -X POST https://pdftomarkdown.dev/v1/convert \
-H "Authorization: Bearer demo_public_key" \
-H "Content-Type: application/json" \
-d '{"input":{"pdf_url":"https://www.w3.org/WAI/ER/tests/xhtml/testfiles/resources/pdf/dummy.pdf"}}'
Or with a local file:
from pdftomarkdown import convert
result = convert("document.pdf")
print(result.markdown)
The free tier processes page one of any PDF, no signup or API key required. Get a Developer key for multi-page documents and higher rate limits.
What breaks in traditional PDF extraction
| Problem | Traditional extractors | pdfToMarkdown |
|---|---|---|
| Tables | Columns collapse into space-separated text | Preserved as markdown tables |
| Headings | Indistinguishable from body text | Mapped to #, ##, ### |
| Multi-column layouts | Left and right columns interleave | Linearized in reading order |
| Scanned documents | Require separate OCR step | Handled natively by the vision model |
| Footnotes and captions | Mixed into body text | Separated and labeled |
| Bullet and numbered lists | Flattened into paragraphs | Preserved as markdown lists |
Every one of these failures degrades your RAG retrieval quality. Tables are the worst offender — a financial table rendered as flat text is nearly useless for question answering.
Works with any RAG framework
The API returns standard markdown. Feed it into whatever stack you use:
- LangChain — use the markdown string directly with RecursiveCharacterTextSplitter or a markdown-aware splitter
- LlamaIndex — pass the output as a Document node
- Haystack — pipe into a PreProcessor for chunking and indexing
- Custom pipelines — split on \n## for heading-based chunks, or use any markdown parser
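The last option is small enough to sketch inline. A minimal heading-based splitter using only the standard library, where a lookahead in re.split keeps each ## heading attached to its own chunk:

```python
import re

def split_on_headings(markdown: str) -> list[str]:
    """Split markdown into chunks at every '## ' heading, keeping the heading."""
    # Lookahead splits before '## ' without consuming it; '### ' does not match
    parts = re.split(r"\n(?=## )", markdown)
    return [p.strip() for p in parts if p.strip()]

doc = "Intro text.\n## Revenue\n| A | B |\n## Metrics\nMargin: 72.3%"
for chunk in split_on_headings(doc):
    print(chunk.splitlines()[0])
```

Swap the pattern for `\n(?=#{1,3} )` if you want to split on every heading level rather than only second-level ones.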
No vendor lock-in, no proprietary format. It is just markdown.
Related pages
- PDF Parsing API — general-purpose PDF text extraction
- OCR API for Developers — overview of the OCR capabilities
- API documentation — endpoint reference, response schema, error codes
Pricing
Both tiers are free. No credit card required.
Hacker
Free, no signup
- Public demo key — copy & paste
- Only page 1 is processed
- 1 request/min per IP
- Watermark in output
Developer
Free, GitHub login
- Personal API key
- 100 pages/month
- Multi-page PDFs
- No watermark
Try the API free
Free tier — no account needed. It converts page 1 only and adds a watermark. Upgrade to Developer to remove the watermark and unlock multi-page PDFs.