Parse PDFs without writing a parser

PDF is a print format, not a data format. Text-extraction libraries such as PyMuPDF, pdfplumber, and PDFMiner work from what the file actually stores: characters positioned at x/y coordinates on a page. Turning that into structured data means writing layout heuristics that tend to break on the next document format you encounter.

pdfToMarkdown skips that entirely. Send a PDF, get back markdown that preserves the document’s structure: headings, tables, bullet lists, numbered sections. The markdown is immediately usable in LLM pipelines, document search, or data extraction workflows.

What “structured” actually means

A raw PDF text extraction of a table looks like this:

Item Qty Price Total
Widget A 10 $2.50 $25.00
Widget B 5 $4.00 $20.00

The same table from pdfToMarkdown:

| Item     | Qty | Price  | Total  |
|----------|-----|--------|--------|
| Widget A |  10 | $2.50  | $25.00 |
| Widget B |   5 | $4.00  | $20.00 |

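The difference matters when you’re feeding output into an LLM, a database, or a downstream parser. In the flat version, column boundaries have to be guessed from spacing (is “Widget A” one cell or two?), while the markdown version marks every cell explicitly.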
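To make the point concrete, here is a minimal sketch of how little code a downstream parser needs once cell boundaries are explicit. The function name `parse_markdown_table` is made up for this example and is not part of the pdfToMarkdown SDK. It assumes a simple pipe table like the one above, with no escaped `|` characters inside cells.

```python
def parse_markdown_table(text: str) -> list[dict[str, str]]:
    """Parse a simple pipe-delimited markdown table into a list of row dicts."""
    lines = [ln.strip() for ln in text.strip().splitlines() if ln.strip()]
    if len(lines) < 2:
        return []

    def cells(line: str) -> list[str]:
        # Drop the outer pipes, then split on the inner ones.
        return [c.strip() for c in line.strip("|").split("|")]

    header = cells(lines[0])
    # lines[1] is the |---|---| separator row, so data starts at index 2.
    return [dict(zip(header, cells(line))) for line in lines[2:]]


table = """
| Item     | Qty | Price  | Total  |
|----------|-----|--------|--------|
| Widget A |  10 | $2.50  | $25.00 |
| Widget B |   5 | $4.00  | $20.00 |
"""

rows = parse_markdown_table(table)
print(rows[0])
# {'Item': 'Widget A', 'Qty': '10', 'Price': '$2.50', 'Total': '$25.00'}
```

Doing the same with the flat text would require guessing that “Widget A” is a single cell, which is exactly the kind of heuristic that fails on the next layout.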

Common use cases

RAG and document Q&A

Chunk the markdown output into passages and index them in a vector database. Because headings and table cells are preserved, each passage can keep its section context and tables stay intact, which typically gives better retrieval than chunking flat extracted text.
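One way to chunk is by heading, so that every passage carries the title of the section it came from. The sketch below assumes the markdown uses `#`-style headings. The helper name `chunk_by_heading` and the sample document are illustrative only, and a `#` line inside a fenced code block would be mistaken for a heading.

```python
import re

# Matches markdown headings such as "# Title" or "### Subsection".
HEADING = re.compile(r"^(#{1,6})\s+(.*)$")


def chunk_by_heading(markdown: str) -> list[dict[str, str]]:
    """Split markdown into chunks, one per heading, keeping the heading text."""
    sections = [{"heading": "", "lines": []}]
    for line in markdown.splitlines():
        match = HEADING.match(line)
        if match:
            # A new heading starts a new chunk.
            sections.append({"heading": match.group(2).strip(), "lines": []})
        else:
            sections[-1]["lines"].append(line)

    chunks = []
    for section in sections:
        text = "\n".join(section["lines"]).strip()
        if section["heading"] or text:  # skip an empty preamble
            chunks.append({"heading": section["heading"], "text": text})
    return chunks


doc = (
    "# Report\n\nIntro text.\n\n"
    "## Costs\n\n| Item | Total |\n|------|-------|\n| Widget A | $25.00 |\n"
)

chunks = chunk_by_heading(doc)
for chunk in chunks:
    print(chunk["heading"], "->", len(chunk["text"]), "characters")
```

Each chunk can then be embedded together with its heading, so a query about costs can match the table even though the table itself contains no descriptive prose.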

Accounts payable automation

Parse invoices to extract vendor name, line items, totals, and due dates. The markdown table format makes downstream extraction straightforward, with either a regex or an LLM.
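As an illustration of the regex route, the sketch below pulls the amounts out of a table’s Total column and sums them. It assumes a simple pipe table with a column literally named `Total` and US-style dollar amounts. The function name `line_item_totals` and the sample invoice are made up for the example.

```python
import re
from decimal import Decimal

# Dollar amounts such as $25.00 or $1,250.00 (assumed US-style formatting).
AMOUNT = re.compile(r"\$\s*([0-9][0-9,]*(?:\.[0-9]{2})?)")


def line_item_totals(markdown_table: str, column: str = "Total") -> list[Decimal]:
    """Return the dollar amounts found in one column of a markdown table."""
    lines = [ln.strip() for ln in markdown_table.strip().splitlines() if ln.strip()]
    header = [c.strip() for c in lines[0].strip("|").split("|")]
    index = header.index(column)  # raises ValueError if the column is missing

    totals = []
    for line in lines[2:]:  # skip the header and the separator row
        cell = line.strip("|").split("|")[index]
        match = AMOUNT.search(cell)
        if match:
            # Decimal avoids floating-point rounding on currency values.
            totals.append(Decimal(match.group(1).replace(",", "")))
    return totals


invoice = """
| Item     | Qty | Price  | Total  |
|----------|-----|--------|--------|
| Widget A |  10 | $2.50  | $25.00 |
| Widget B |   5 | $4.00  | $20.00 |
"""

totals = line_item_totals(invoice)
print(sum(totals))  # 45.00
```

Comparing this sum against the invoice’s stated grand total is a cheap consistency check before anything is posted to an accounting system.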

Contract review

Extract clause text, definitions, and structured schedules from legal PDFs. Section headings in the markdown map directly to document sections, making it easy to locate specific provisions.
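Locating a provision can be as simple as scanning for a heading and collecting everything up to the next heading of the same or a higher level. This is a sketch under the assumption that sections come back as `#`-style headings. `find_section` is a hypothetical helper, and the contract text is invented sample data.

```python
import re
from typing import Optional

HEADING = re.compile(r"^(#{1,6})\s+(.*)$")


def find_section(markdown: str, title: str) -> Optional[str]:
    """Return the body of the first section whose heading contains `title`."""
    lines = markdown.splitlines()
    start = None
    level = 0
    for i, line in enumerate(lines):
        match = HEADING.match(line)
        if not match:
            continue
        if start is None:
            if title.lower() in match.group(2).lower():
                start, level = i + 1, len(match.group(1))
        elif len(match.group(1)) <= level:
            # A heading at the same or a higher level ends the section.
            return "\n".join(lines[start:i]).strip()
    if start is None:
        return None
    return "\n".join(lines[start:]).strip()


contract = """# Services Agreement

## 1. Definitions

"Services" means the work described in Schedule A.

## 2. Termination

Either party may terminate on 30 days' written notice.

### 2.1 Effect of termination

Accrued fees remain payable.

## 3. Governing Law

This agreement is governed by the laws of England.
"""

section = find_section(contract, "Termination")
print(section)
```

Subsections such as 2.1 stay inside the returned text because their heading level is deeper than that of the matched section.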

Data pipelines

Drop PDF parsing into any Python script. A basic SDK call is one import and one function call, with the API key read from an environment variable.
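For a pipeline that processes a whole folder, a small wrapper keeps the conversion step separate from file handling. The sketch below is not part of the SDK: `convert_folder` is a hypothetical helper, and the converter is passed in as a function so the loop can be tested without network access. The commented-out usage mirrors the `convert` call shown in the integration example on this page.

```python
from pathlib import Path
from typing import Callable


def convert_folder(
    src: Path, dst: Path, to_markdown: Callable[[Path], str]
) -> list[Path]:
    """Convert every PDF in `src` to a .md file in `dst`; return files written."""
    dst.mkdir(parents=True, exist_ok=True)
    written = []
    for pdf in sorted(src.glob("*.pdf")):
        out = dst / (pdf.stem + ".md")
        if out.exists():
            continue  # already converted on an earlier run
        out.write_text(to_markdown(pdf), encoding="utf-8")
        written.append(out)
    return written


# Example wiring (requires the pdftomarkdown package and an API key):
#
#   import os
#   from pdftomarkdown import convert
#
#   convert_folder(
#       Path("invoices"),
#       Path("invoices_md"),
#       lambda pdf: convert(
#           str(pdf), api_key=os.environ["PDFTOMARKDOWN_API_KEY"]
#       ).markdown,
#   )
```

Skipping files that already have output makes reruns cheap, which matters when a monthly page quota applies.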

How the parsing works

The API uses a vision-language model — not a text extraction library. The model “sees” the page the way a human reader does, understanding layout cues like column alignment, font weight, and whitespace to reconstruct document structure.

This means it works on:

Native PDFs — documents with selectable text
Scanned PDFs — image-only documents, including low-resolution scans
Mixed PDFs — documents with both text and scanned pages

Drop-in Python integration

import os
from pdftomarkdown import convert

result = convert(
    "report.pdf",
    api_key=os.environ["PDFTOMARKDOWN_API_KEY"]
)

# Use with LangChain, LlamaIndex, or any RAG framework
chunks = result.markdown.split("\n\n")

OCR API for Developers — general overview of the OCR capabilities
Research Paper to Markdown — parsing academic PDFs with equations
API documentation — complete endpoint reference

Pricing

Both tiers are free. No credit card required.

Hacker

Free, no signup

Public demo key — copy & paste
Only page 1 is processed
1 request/min per IP
Watermark in output

Developer

Free, GitHub login

Personal API key
100 pages/month
Multi-page PDFs
No watermark

Start parsing PDFs free

The free Hacker tier needs no account: it converts page 1 only and adds a watermark. Sign in with GitHub for the Developer tier to remove the watermark and convert full multi-page PDFs.
