Text extraction libraries don’t work on scanned PDFs

If your PDF is a scan — a photo or image embedded in a PDF wrapper — libraries like PyMuPDF, pdfplumber, and PDFMiner return an empty string. They extract text from the PDF’s text layer, and scanned documents don’t have one. There’s nothing to extract.
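
To confirm that a PDF really has no text layer, it can help to check which pages come back empty. The sketch below is a minimal illustration, assuming PyMuPDF is installed (`pip install pymupdf`); the helper names `extract_page_texts` and `pages_without_text` are hypothetical and not part of any library.

```python
def extract_page_texts(pdf_path):
    """Return the text-layer content of each page as a list of strings."""
    import fitz  # PyMuPDF, imported lazily so the helper below has no dependencies

    with fitz.open(pdf_path) as doc:
        return [page.get_text() for page in doc]


def pages_without_text(page_texts):
    """Return 1-based numbers of pages whose extracted text is empty or whitespace."""
    return [
        number
        for number, text in enumerate(page_texts, start=1)
        if not text.strip()
    ]
```

Calling `pages_without_text(extract_page_texts("scanned-document.pdf"))` lists every page for a fully scanned document, and only the image-only pages for a mixed one.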

The usual workaround is Tesseract: install Poppler to rasterize pages, install Tesseract with language packs, wire them together with pdf2image and pytesseract, and hope the output is usable. By default, the result is a flat string with no structure (no tables, no headings, no formatting) that requires significant post-processing.
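
For reference, the wiring described above looks roughly like this. It is a sketch, assuming Poppler and Tesseract are installed as system binaries and that `pdf2image` and `pytesseract` are available in the environment; `ocr_pages` and `ocr_scanned_pdf` are illustrative names.

```python
def ocr_pages(images, ocr, lang="eng"):
    """Run `ocr` on each page image and join the results into one flat string."""
    return "\n\n".join(ocr(image, lang=lang).strip() for image in images)


def ocr_scanned_pdf(pdf_path, lang="eng", dpi=300):
    """Rasterize a PDF with Poppler, then OCR every page with Tesseract."""
    # Deferred imports: both packages depend on system binaries being present.
    from pdf2image import convert_from_path
    import pytesseract

    images = convert_from_path(pdf_path, dpi=dpi)  # one PIL image per page
    return ocr_pages(images, pytesseract.image_to_string, lang=lang)
```

The return value is plain text only; any table or heading structure has to be reconstructed afterwards.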

pdfToMarkdown handles scanned PDFs natively. The API uses a vision-language model that reads the page as an image, the same way a human would. It returns structured markdown with tables, headings, and lists preserved.

What the API produces from a scanned PDF

A Tesseract pipeline gives you something like this:

INVOICE 1042
Acme Corp
2024 01 15
API Pro Plan 1 299.00 299.00
Setup fee 1 49.00 49.00
Subtotal 348.00
Tax 8% 27.84
Total due 375.84

The same scanned page through pdfToMarkdown:

# Invoice #1042

**Vendor:** Acme Corp
**Date:** 2024-01-15

| Description  | Qty | Unit Price | Total   |
|--------------|-----|------------|---------|
| API Pro Plan |   1 |    $299.00 | $299.00 |
| Setup fee    |   1 |     $49.00 |  $49.00 |

**Subtotal:** $348.00
**Tax (8%):** $27.84
**Total due:** $375.84

The difference is layout understanding. Tesseract sees characters. The vision model sees a document.
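
Structured output matters because downstream code can consume it directly. As an illustration, the sketch below turns a pipe table like the one above into a list of records using only the standard library. `parse_markdown_table` is a hypothetical helper, and it assumes a single table whose cells contain no escaped `|` characters.

```python
def parse_markdown_table(markdown: str) -> list[dict[str, str]]:
    """Parse the first pipe table in `markdown` into one dict per body row."""
    rows = [
        [cell.strip() for cell in line.strip().strip("|").split("|")]
        for line in markdown.splitlines()
        if line.strip().startswith("|")
    ]
    if len(rows) < 2:
        return []
    header, body = rows[0], rows[2:]  # rows[1] is the |---| separator line
    return [dict(zip(header, row)) for row in body]


invoice_table = """\
| Description  | Qty | Unit Price | Total   |
|--------------|-----|------------|---------|
| API Pro Plan |   1 |    $299.00 | $299.00 |
| Setup fee    |   1 |     $49.00 |  $49.00 |
"""

rows = parse_markdown_table(invoice_table)
line_total = sum(float(row["Total"].lstrip("$")) for row in rows)
print(rows[0]["Description"], line_total)  # API Pro Plan 348.0
```

The same check against the flat Tesseract output would first require guessing where each column begins and ends.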

Works on real-world scan quality

Production scans are rarely clean 300 DPI images. The vision model handles what Tesseract struggles with:

Low-resolution scans — 150 DPI or lower, common from older multifunction printers
Rotated and skewed pages — slightly crooked scans from flatbed scanners or phone cameras
Faded and low-contrast text — old documents, thermal paper receipts, carbon copies
Mixed native + scanned PDFs — documents where some pages have selectable text and others are images
Handwritten annotations — printed forms with handwritten fill-ins
Non-English text — scanned documents in German, Japanese, Arabic, and other languages

No pre-processing pipeline needed. No deskewing. No binarization. Send the PDF as-is.

One API for both native and scanned PDFs

You don’t need to detect whether a PDF is scanned or native, and you don’t need separate pipelines. The same endpoint handles both:

# Scanned PDFs are usually local files, so upload them as base64
pdf_base64=$(base64 < scanned-document.pdf | tr -d '\n')

# Stream the JSON body over stdin (-d @-): the base64 of a scan can
# exceed the maximum length of a single command-line argument
printf '{"input":{"pdf_base64":"%s"}}' "$pdf_base64" |
  curl -X POST https://pdftomarkdown.dev/v1/convert \
    -H "Authorization: Bearer demo_public_key" \
    -H "Content-Type: application/json" \
    -d @-

The response format is the same whether the source is a native PDF, an image-only scan, or a mix of both. Your downstream code doesn’t need to know or care.
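
For callers who prefer Python without extra dependencies, the same request can be sketched with the standard library. The endpoint, headers, and JSON body mirror the curl example; the function names are hypothetical, and the response is returned as parsed JSON without assuming any particular fields, since the schema is defined in the API documentation.

```python
import base64
import json
import urllib.request

API_URL = "https://pdftomarkdown.dev/v1/convert"  # endpoint from the curl example


def build_request(pdf_bytes: bytes, api_key: str = "demo_public_key") -> urllib.request.Request:
    """Build the POST request carrying the PDF as base64 inside a JSON body."""
    payload = {"input": {"pdf_base64": base64.b64encode(pdf_bytes).decode("ascii")}}
    return urllib.request.Request(
        API_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


def convert_pdf(path: str, api_key: str = "demo_public_key") -> dict:
    """Send a local PDF to the API and return the decoded response."""
    with open(path, "rb") as pdf_file:
        request = build_request(pdf_file.read(), api_key)
    # Assumption: the API answers with a JSON document.
    with urllib.request.urlopen(request, timeout=60) as response:
        return json.loads(response.read().decode("utf-8"))
```

A call such as `convert_pdf("scanned-document.pdf")` works unchanged for native, scanned, and mixed files, because the request never states which kind of PDF it contains.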

No Tesseract dependency chain

A typical Tesseract-based OCR setup requires:

Poppler or MuPDF — to rasterize PDF pages into images
Tesseract — the OCR engine, installed as a system binary
Language packs — tesseract-ocr-eng, tesseract-ocr-deu, etc., one per language
Python bindings — pytesseract, pdf2image, and their transitive dependencies
Post-processing — custom code to reconstruct tables, headings, and structure from raw text

This setup can break in Docker builds, fail silently when a language pack is missing, and produce different results across OS versions.

pdfToMarkdown is a single HTTP call. No system dependencies. No binaries to install. No language packs to manage.

from pdftomarkdown import convert

# Works on scanned PDFs, native PDFs, and mixed PDFs
result = convert("scanned-document.pdf")
print(result.markdown)

Compared to Tesseract-based OCR

Feature	pdfToMarkdown	Tesseract pipeline
Table extraction	Markdown tables with alignment	Raw text, columns lost
Heading detection	H1–H6 hierarchy preserved	No structure
Rotated pages	Handled automatically	Requires deskew preprocessing
Mixed native + scanned	Single pipeline	Separate detection + routing
System dependencies	None (HTTP API)	Poppler, Tesseract, language packs
Setup time	One curl command	Hours of configuration

Pricing

Both tiers are free. No credit card required.

Hacker

Free, no signup

Public demo key — copy & paste
Only page 1 is processed
1 request/min per IP
Watermark in output

Developer

Free, GitHub login

Personal API key
100 pages/month
Multi-page PDFs
No watermark

Try with a scanned document

The free Hacker tier needs no account. It converts page 1 only and adds a watermark. Upgrade to the Developer tier to remove the watermark and unlock full multi-page PDFs.