PDFs are the universal document format. Invoices, research papers, legal contracts, technical manuals — billions of documents locked inside a format designed for printing, not for processing.

If you’ve ever tried to extract text from a PDF programmatically, you know the pain:

Copy-paste gives you garbage. Columns merge, headers repeat inline, tables lose their structure.
Traditional OCR tools output flat text with no formatting, no headings, no structure.
PDF parsing libraries (like PyMuPDF or pdfplumber) work on text-native PDFs but have nothing to read in a scanned document, which is only page images with no text layer.

Why markdown?

Markdown is the sweet spot between raw text and full HTML. It preserves the structure of a document — headings, lists, tables, code blocks, emphasis — while staying lightweight and easy to process.

For developers building on top of document data, markdown is ideal because:

LLMs read it well. Feed markdown to GPT-4, Claude, or another language model and the headings, lists, and tables stay explicit in the input. Feed it raw OCR text and that context is gone.
It’s pipeline-friendly. Markdown parses cleanly into ASTs, converts to HTML, splits into chunks for embedding in a vector database, and renders in most UIs.
It’s human-readable. Unlike JSON or XML extraction schemas, you can open a markdown file and immediately see if the conversion worked.
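
As a rough sketch of the pipeline-friendly point, here is one way to split converted markdown into heading-delimited chunks before embedding them. `split_by_heading` is a hypothetical helper, not part of pdfToMarkdown, and it assumes the markdown uses `#`-style headings.

```python
import re

# Matches ATX headings such as "# Title" or "### Subsection".
HEADING = re.compile(r"^(#{1,6})\s+(.*)$")
FENCE = "`" * 3  # code-fence marker, built this way to keep the example readable


def split_by_heading(markdown: str) -> list[dict]:
    """Split markdown into one chunk per heading (illustrative helper)."""
    chunks = []
    current = {"level": 0, "title": "", "lines": []}
    in_fence = False

    def flush(chunk: dict) -> None:
        # Keep a chunk only if it has a heading or some non-blank text.
        if chunk["title"] or any(line.strip() for line in chunk["lines"]):
            chunks.append(
                {
                    "level": chunk["level"],
                    "title": chunk["title"],
                    "body": "\n".join(chunk["lines"]).strip(),
                }
            )

    for line in markdown.splitlines():
        # Lines starting with "#" inside a fenced code block are not headings.
        if line.lstrip().startswith(FENCE):
            in_fence = not in_fence
        match = None if in_fence else HEADING.match(line)
        if match:
            flush(current)
            current = {
                "level": len(match.group(1)),
                "title": match.group(2).strip(),
                "lines": [],
            }
        else:
            current["lines"].append(line)

    flush(current)
    return chunks


if __name__ == "__main__":
    sample = "# Invoice 42\n\nDue in 30 days.\n\n## Line items\n\n| Item | Price |\n| --- | --- |\n| Widget | 9.00 |\n"
    for chunk in split_by_heading(sample):
        print(chunk["level"], chunk["title"], "->", len(chunk["body"]), "chars")
```

Each chunk keeps its heading as metadata, so a retrieval step can show where in the document a passage came from. Doing the same with flat OCR text would mean guessing where sections begin.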

The gap in the market

Most PDF extraction tools fall into two camps:

Camp 1: OCR-only tools (Tesseract, AWS Textract, Google Vision)

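Out of the box, these give you text and coordinates, not a document. Some can detect tables or form fields (Textract, for example), but the result is still positional data with no heading hierarchy. You have to write custom post-processing to reconstruct the document layout, and that code breaks every time you encounter a new document format.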

Camp 2: Expensive enterprise platforms (Mathpix, ABBYY, Adobe Extract)

These handle structure well but come with enterprise pricing, complex SDKs, and usage-based billing that makes them impractical for side projects or early-stage products.

pdfToMarkdown sits in between: structured output, simple API, free to start.

How it works

Under the hood, pdfToMarkdown uses a vision-language model pipeline that reads the rendered page, the way a human does. Instead of trying to parse PDF internals (which vary widely between the tools that produce them), it:

Renders each page as an image
Runs a specialized OCR model that understands document layout
Outputs clean markdown with proper heading hierarchy, table formatting, and list structure

The result is a single API call:

curl -X POST https://pdftomarkdown.dev/v1/convert \
  -H "Authorization: Bearer demo_public_key" \
  -H "Content-Type: application/json" \
  -d '{"input":{"pdf_url":"https://www.w3.org/WAI/ER/tests/xhtml/testfiles/resources/pdf/dummy.pdf"}}'

And you get back structured markdown — ready to feed into your LLM pipeline, render in your app, or store in your database.
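
The same call from Python might look like the sketch below. It mirrors the curl example using only the standard library. `build_convert_request` and `convert` are illustrative names, not an official client, and because the response format isn't described here, the body is returned as raw text rather than parsed.

```python
import json
import urllib.request

# Endpoint and demo key are taken from the curl example above.
API_URL = "https://pdftomarkdown.dev/v1/convert"
DEMO_KEY = "demo_public_key"


def build_convert_request(pdf_url: str, api_key: str = DEMO_KEY) -> urllib.request.Request:
    """Build the POST request without sending it (illustrative helper)."""
    payload = json.dumps({"input": {"pdf_url": pdf_url}}).encode("utf-8")
    return urllib.request.Request(
        API_URL,
        data=payload,
        method="POST",
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
    )


def convert(pdf_url: str, api_key: str = DEMO_KEY, timeout: float = 60.0) -> str:
    """Send the request and return the raw response body as text.

    The response schema is an unknown here, so nothing is parsed;
    inspect the returned text (or the docs) before relying on any field.
    """
    request = build_convert_request(pdf_url, api_key)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.read().decode("utf-8")


if __name__ == "__main__":
    print(convert("https://www.w3.org/WAI/ER/tests/xhtml/testfiles/resources/pdf/dummy.pdf"))
```

Keeping request construction separate from sending makes it easy to check the headers and payload in a unit test without touching the network.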

Get started

The Hacker tier is free, no signup required. You get 1 page per PDF with a watermark — enough to test the quality on your documents.

Need more? Sign in with GitHub to get 100 pages/month, no watermark, no credit card.

Read the docs to get started in under a minute.