PDF Parsing API

One endpoint. POST a PDF, get clean markdown back — tables, headings, and lists preserved exactly as they appear on the page.

$ curl -X POST https://pdftomarkdown.dev/v1/convert \
  -H "Authorization: Bearer demo_public_key" \
  -H "Content-Type: application/json" \
  -d '{"input":{"pdf_url":"https://www.w3.org/WAI/ER/tests/xhtml/testfiles/resources/pdf/dummy.pdf"}}'
{
  "markdown": "# Document\n\n| Column A | Column B |\n|---|---|\n| Value 1 | Value 2 |\n\n**Summary text here**",
  "pages": 3,
  "request_id": "req_abc123"
}
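
The same request can be sent from Python using only the standard library. This is a minimal sketch: the endpoint, headers, and JSON body are copied from the curl call above, while `build_request` and `convert_url` are hypothetical helper names (not part of any official SDK), and the reply is assumed to be JSON shaped like the sample response.

```python
import json
import urllib.request

API_URL = "https://pdftomarkdown.dev/v1/convert"  # endpoint from the curl example


def build_request(pdf_url: str, api_key: str) -> urllib.request.Request:
    """Build the POST request (hypothetical helper mirroring the curl call)."""
    body = json.dumps({"input": {"pdf_url": pdf_url}}).encode("utf-8")
    return urllib.request.Request(
        API_URL,
        data=body,
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


def convert_url(pdf_url: str, api_key: str, timeout: float = 60.0) -> dict:
    """Send the request and decode the JSON reply (shape assumed from the sample)."""
    request = build_request(pdf_url, api_key)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))
```

Calling `convert_url(some_pdf_url, "demo_public_key")["markdown"]` would then return the markdown string, provided the response carries the `markdown` field shown in the sample.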

Parse PDFs without writing a parser

PDF is a print format, not a data format. Text-extraction libraries such as PyMuPDF, pdfplumber, and PDFMiner work from characters and their x/y positions on the page. Turning that into structured data usually means writing layout heuristics that break on the next document format you encounter.

pdfToMarkdown skips that entirely. Send a PDF, get back markdown that preserves the document’s structure: headings, tables, bullet lists, numbered sections. The markdown is immediately usable in LLM pipelines, document search, or data extraction workflows.

What “structured” actually means

A raw PDF text extraction of a table looks like this:

Item Qty Price Total
Widget A 10 $2.50 $25.00
Widget B 5 $4.00 $20.00

The same table from pdfToMarkdown:

| Item     | Qty | Price  | Total  |
|----------|-----|--------|--------|
| Widget A |  10 | $2.50  | $25.00 |
| Widget B |   5 | $4.00  | $20.00 |

The difference matters when you’re feeding output into an LLM, a database, or a downstream parser.
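
A short sketch of why the delimiters matter to a downstream parser, using one row from each of the two versions above. The variable names are illustrative only.

```python
flat_row = "Widget A 10 $2.50 $25.00"
md_row = "| Widget A |  10 | $2.50  | $25.00 |"
headers = ["Item", "Qty", "Price", "Total"]

# Splitting the flat text on whitespace yields five tokens for four columns,
# because "Widget A" itself contains a space.
flat_cells = flat_row.split()

# The pipe-delimited row splits cleanly once the outer pipes and padding are removed.
md_cells = [cell.strip() for cell in md_row.strip().strip("|").split("|")]

# With one cell per column, the row maps directly onto the header names.
record = dict(zip(headers, md_cells))
```

With the flat row, the parser has to guess where the item name ends. With the markdown row, the column boundaries are explicit.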

Common use cases

RAG and document Q&A

Chunk the markdown output into passages and index them in a vector database. Because headings and table cells are preserved, each passage keeps its context, which tends to improve retrieval compared with flat text extraction.
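
One possible way to chunk is to split on headings so that each passage carries its heading as metadata. This is a minimal sketch, assuming the output uses `#`-style markdown headings; `chunk_by_heading` is a hypothetical helper, not part of the SDK.

```python
import re

HEADING = re.compile(r"^(#{1,6})\s+(.*)$")


def chunk_by_heading(markdown: str) -> list[dict]:
    """Split markdown into passages, one per heading, keeping the heading as metadata."""
    chunks: list[dict] = []
    heading = None
    lines: list[str] = []

    def flush() -> None:
        # Emit the passage collected so far, unless it is completely empty.
        text = "\n".join(lines).strip()
        if heading is not None or text:
            chunks.append({"heading": heading, "text": text})

    for line in markdown.splitlines():
        match = HEADING.match(line)
        if match:
            flush()  # a new heading closes the previous passage
            heading, lines = match.group(2).strip(), []
        else:
            lines.append(line)
    flush()
    return chunks
```

Each dictionary can then be embedded and indexed as one passage, with the heading stored alongside it for filtering or display.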

Accounts payable automation

Parse invoices to extract vendor name, line items, totals, and due dates. The markdown table format makes downstream extraction straightforward, whether with regex or an LLM.
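
As an illustration, a line-item table in markdown can be turned into row dictionaries with plain string handling. The sketch below assumes the text contains a single pipe table with no escaped pipes inside cells; `parse_table` and `money` are hypothetical helpers, and the invoice data is the sample table from earlier on this page.

```python
from decimal import Decimal


def parse_table(markdown: str) -> list[dict]:
    """Turn a single pipe table into a list of row dicts keyed by the header cells."""
    rows = []
    for line in markdown.splitlines():
        line = line.strip()
        if not line.startswith("|"):
            continue
        cells = [cell.strip() for cell in line.strip("|").split("|")]
        # Skip the separator row, whose cells contain only dashes and colons.
        if all(cell and set(cell) <= set("-:") for cell in cells):
            continue
        rows.append(cells)
    if not rows:
        return []
    header, body = rows[0], rows[1:]
    return [dict(zip(header, row)) for row in body]


def money(text: str) -> Decimal:
    """Parse a dollar amount such as "$25.00" into a Decimal."""
    return Decimal(text.replace("$", "").replace(",", ""))


invoice_md = """
| Item     | Qty | Price  | Total  |
|----------|-----|--------|--------|
| Widget A |  10 | $2.50  | $25.00 |
| Widget B |   5 | $4.00  | $20.00 |
"""

items = parse_table(invoice_md)
invoice_total = sum(money(row["Total"]) for row in items)
# Cross-check each line: quantity times unit price should equal the line total.
lines_consistent = all(
    Decimal(row["Qty"]) * money(row["Price"]) == money(row["Total"]) for row in items
)
```

`Decimal` is used instead of `float` so that currency amounts compare exactly.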

Contract review

Extract clause text, definitions, and structured schedules from legal PDFs. Section headings in the markdown map directly to document sections, making it easy to locate specific provisions.
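
A sketch of locating a provision by its heading, assuming sections are marked with `#`-style headings. `find_section` is a hypothetical helper, and the contract text in the example is invented for illustration.

```python
import re
from typing import Optional

HEADING = re.compile(r"^(#{1,6})\s+(.*\S)\s*$")


def find_section(markdown: str, title: str) -> Optional[str]:
    """Return the text under heading `title`, up to the next heading of the same or a higher level."""
    lines = markdown.splitlines()
    start, level = None, 0
    for index, line in enumerate(lines):
        match = HEADING.match(line)
        if not match:
            continue
        if start is None:
            # Still searching: compare heading text case-insensitively.
            if match.group(2).lower() == title.lower():
                start, level = index + 1, len(match.group(1))
        elif len(match.group(1)) <= level:
            # A sibling or parent heading ends the section; subsections stay included.
            return "\n".join(lines[start:index]).strip()
    if start is None:
        return None
    return "\n".join(lines[start:]).strip()


# Invented sample text, for illustration only.
contract = (
    "# Agreement\n\n"
    "## 1. Definitions\n\n"
    '"Services" means the work described in Schedule A.\n\n'
    "### 1.1 Interpretation\n\n"
    "Headings are for convenience only.\n\n"
    "## 2. Termination\n\n"
    "Either party may terminate with 30 days' notice.\n"
)
termination = find_section(contract, "2. Termination")
```

Because nested subsections use deeper heading levels, looking up "1. Definitions" returns clause 1.1 as part of the same section.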

Data pipelines

Drop PDF parsing into any Python script. A basic SDK call takes only a few lines, with the API key read from an environment variable.
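
A batch step in a pipeline might look like the sketch below. The conversion itself is passed in as a function, so the example makes no assumptions about the SDK beyond "path in, markdown string out"; `convert_folder` and its parameters are hypothetical names.

```python
from pathlib import Path
from typing import Callable


def convert_folder(src: Path, dst: Path, to_markdown: Callable[[Path], str]) -> list[Path]:
    """Convert every *.pdf in `src` and write one .md file per input into `dst`."""
    dst.mkdir(parents=True, exist_ok=True)
    written = []
    for pdf in sorted(src.glob("*.pdf")):
        out = dst / (pdf.stem + ".md")
        # `to_markdown` is supplied by the caller, e.g. a thin wrapper around the SDK.
        out.write_text(to_markdown(pdf), encoding="utf-8")
        written.append(out)
    return written
```

Assuming the SDK interface shown under "Drop-in Python integration", `to_markdown` could be a small wrapper that calls `convert(str(path), api_key=...)` and returns its `.markdown` attribute.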

How the parsing works

The API uses a vision-language model — not a text extraction library. The model “sees” the page the way a human reader does, understanding layout cues like column alignment, font weight, and whitespace to reconstruct document structure.

This means it works on:

  • Native PDFs — documents with selectable text
  • Scanned PDFs — image-only documents, including low-resolution scans
  • Mixed PDFs — documents with both text and scanned pages

Drop-in Python integration

import os
from pdftomarkdown import convert

result = convert(
    "report.pdf",
    api_key=os.environ["PDFTOMARKDOWN_API_KEY"]
)

# Use with LangChain, LlamaIndex, or any RAG framework
chunks = result.markdown.split("\n\n")

Pricing

Both tiers are free. No credit card required.

Hacker

Free, no signup

  • Public demo key — copy & paste
  • Only page 1 is processed
  • 1 request/min per IP
  • Watermark in output

Developer

Free, GitHub login

  • Personal API key
  • 100 pages/month
  • Multi-page PDFs
  • No watermark

Start parsing PDFs free

The Hacker tier needs no account: it converts page 1 only and adds a watermark. Sign in with GitHub for the Developer tier, also free, to remove the watermark and process multi-page PDFs, up to 100 pages per month.