Scanned PDF OCR API

One endpoint. POST a PDF, get clean markdown back — tables, headings, and lists preserved exactly as they appear on the page.

$ curl -X POST https://pdftomarkdown.dev/v1/convert \
  -H "Authorization: Bearer demo_public_key" \
  -H "Content-Type: application/json" \
  -d '{"input":{"pdf_url":"https://www.w3.org/WAI/ER/tests/xhtml/testfiles/resources/pdf/dummy.pdf"}}'
{
  "markdown": "# Document\n\n| Column A | Column B |\n|---|---|\n| Value 1 | Value 2 |\n\n**Summary text here**",
  "pages": 3,
  "request_id": "req_abc123"
}
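
For callers working in Python rather than curl, the same request can be sketched with only the standard library. The URL, headers, and JSON body mirror the curl example above. The helper names `build_convert_request` and `convert_pdf_url` are illustrative and not part of any official SDK, and retries and error handling are left out.

```python
import json
import urllib.request

API_URL = "https://pdftomarkdown.dev/v1/convert"  # endpoint from the curl example


def build_convert_request(pdf_url: str, api_key: str = "demo_public_key") -> urllib.request.Request:
    """Build the same POST request the curl example sends (hypothetical helper)."""
    body = json.dumps({"input": {"pdf_url": pdf_url}}).encode("utf-8")
    return urllib.request.Request(
        API_URL,
        data=body,
        method="POST",
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
    )


def convert_pdf_url(pdf_url: str, api_key: str = "demo_public_key") -> dict:
    """Send the request and decode the JSON response (no retries or error handling)."""
    request = build_convert_request(pdf_url, api_key)
    with urllib.request.urlopen(request, timeout=60) as response:
        return json.load(response)


# Usage (performs a network call; field names follow the sample response above):
# result = convert_pdf_url(
#     "https://www.w3.org/WAI/ER/tests/xhtml/testfiles/resources/pdf/dummy.pdf"
# )
# print(result["markdown"])
```

Splitting request construction from sending keeps the first half easy to inspect or unit test without touching the network.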

Text extraction libraries don’t work on scanned PDFs

If your PDF is a scan (a photo or image embedded in a PDF wrapper), libraries like PyMuPDF, pdfplumber, and PDFMiner return an empty string. They extract text from the PDF’s text layer, and image-only scans don’t have one. There’s nothing to extract.

The usual workaround is Tesseract: install Poppler to rasterize pages, install Tesseract with language packs, wire them together with pdf2image and pytesseract, and hope the output is usable. The result is a flat string with no structure — no tables, no headings, no formatting — that requires significant post-processing.

pdfToMarkdown handles scanned PDFs natively. The API uses a vision-language model that reads the page as an image, the same way a human would. It returns structured markdown with tables, headings, and lists preserved.

What the API produces from a scanned PDF

A Tesseract pipeline gives you something like this:

INVOICE 1042
Acme Corp
2024 01 15
API Pro Plan 1 299.00 299.00
Setup fee 1 49.00 49.00
Subtotal 348.00
Tax 8% 27.84
Total due 375.84

The same scanned page through pdfToMarkdown:

# Invoice #1042

**Vendor:** Acme Corp
**Date:** 2024-01-15

| Description  | Qty | Unit Price | Total   |
|--------------|-----|------------|---------|
| API Pro Plan |   1 |    $299.00 | $299.00 |
| Setup fee    |   1 |     $49.00 |  $49.00 |

**Subtotal:** $348.00
**Tax (8%):** $27.84
**Total due:** $375.84

The difference is layout understanding. Tesseract sees characters. The vision model sees a document.
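
Because the table arrives as a markdown pipe table, downstream code can turn it into records with a few lines of parsing. The sketch below is one possible approach, not something the API ships: `parse_markdown_table` is a hypothetical helper, it reads only the first table in the text, and it does not handle escaped pipes inside cells.

```python
from decimal import Decimal


def parse_markdown_table(markdown: str) -> list[dict[str, str]]:
    """Return the rows of the first pipe table in `markdown` as dicts keyed by header."""
    table_lines = []
    for line in markdown.splitlines():
        stripped = line.strip()
        if stripped.startswith("|"):
            table_lines.append(stripped)
        elif table_lines:
            break  # the first table has ended
    if len(table_lines) < 2:
        return []

    def cells(row: str) -> list[str]:
        return [cell.strip() for cell in row.strip("|").split("|")]

    header = cells(table_lines[0])
    # table_lines[1] is the |---|---| separator row, so data starts at index 2.
    return [dict(zip(header, cells(row))) for row in table_lines[2:]]


# The invoice output shown above, used as sample input.
SAMPLE = """# Invoice #1042

| Description  | Qty | Unit Price | Total   |
|--------------|-----|------------|---------|
| API Pro Plan |   1 |    $299.00 | $299.00 |
| Setup fee    |   1 |     $49.00 |  $49.00 |

**Subtotal:** $348.00
"""

rows = parse_markdown_table(SAMPLE)
subtotal = sum(Decimal(row["Total"].lstrip("$")) for row in rows)
print(rows[0]["Description"], subtotal)  # prints: API Pro Plan 348.00
```

Summing the line items and comparing against the printed subtotal is a cheap sanity check on any extracted invoice.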

Works on real-world scan quality

Production scans are rarely clean 300 DPI images. The vision model handles what Tesseract struggles with:

  • Low-resolution scans — 150 DPI or lower, common from older multifunction printers
  • Rotated and skewed pages — slightly crooked scans from flatbed scanners or phone cameras
  • Faded and low-contrast text — old documents, thermal paper receipts, carbon copies
  • Mixed native + scanned PDFs — documents where some pages have selectable text and others are images
  • Handwritten annotations — printed forms with handwritten fill-ins
  • Non-English text — scanned documents in German, Japanese, Arabic, and other languages

No pre-processing pipeline needed. No deskewing. No binarization. Send the PDF as-is.

One API for both native and scanned PDFs

You don’t need to detect whether a PDF is scanned or native, and you don’t need separate pipelines. The same endpoint handles both:

# Scanned PDFs are usually local files — use base64 upload
pdf_base64=$(base64 < scanned-document.pdf | tr -d '\n')

curl -X POST https://pdftomarkdown.dev/v1/convert \
  -H "Authorization: Bearer demo_public_key" \
  -H "Content-Type: application/json" \
  -d "{\"input\":{\"pdf_base64\":\"$pdf_base64\"}}"

The response is identical regardless of whether the source is a native PDF, an image-only scan, or a mix of both. Your downstream code doesn’t need to know or care.
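
The base64 upload can also be prepared in Python, which avoids passing a large encoded string through a shell argument. This is a minimal sketch: the `pdf_base64` field name is taken from the curl example above, and `build_base64_payload` is a hypothetical helper name.

```python
import base64
import json


def build_base64_payload(pdf_bytes: bytes) -> bytes:
    """Wrap raw PDF bytes in the JSON body used by the base64 curl example."""
    # b64encode emits a single line, so no newline stripping is needed.
    encoded = base64.b64encode(pdf_bytes).decode("ascii")
    return json.dumps({"input": {"pdf_base64": encoded}}).encode("utf-8")


# Usage: read a local scan and build the request body.
# from pathlib import Path
# body = build_base64_payload(Path("scanned-document.pdf").read_bytes())
```

The returned bytes can be sent as the POST body with the same `Authorization` and `Content-Type` headers as the curl command.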

No Tesseract dependency chain

A typical Tesseract-based OCR setup requires:

  1. Poppler or MuPDF — to rasterize PDF pages into images
  2. Tesseract — the OCR engine, installed as a system binary
  3. Language packs: tesseract-ocr-eng, tesseract-ocr-deu, etc., one per language
  4. Python bindings: pytesseract, pdf2image, and their transitive dependencies
  5. Post-processing — custom code to reconstruct tables, headings, and structure from raw text

This breaks in Docker builds, fails silently when a language pack is missing, and produces different results across OS versions.
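
For teams that do keep a Tesseract pipeline, a preflight check at startup can surface missing pieces early. The sketch below assumes a pdf2image plus pytesseract setup, which shells out to `pdftoppm` (from Poppler) and `tesseract`. The helper names are hypothetical, and the language check expects the text printed by `tesseract --list-langs`, whose header line may vary between versions.

```python
import shutil
from typing import Callable, Optional

# Binaries a pdf2image + pytesseract pipeline shells out to:
# pdftoppm ships with Poppler, tesseract is the OCR engine itself.
REQUIRED_BINARIES = ("pdftoppm", "tesseract")


def missing_binaries(which: Callable[[str], Optional[str]] = shutil.which) -> list[str]:
    """Return the required binaries that are not found on PATH."""
    return [name for name in REQUIRED_BINARIES if which(name) is None]


def missing_languages(list_langs_output: str, wanted: list[str]) -> list[str]:
    """Compare wanted language codes against `tesseract --list-langs` output."""
    # The first line is a header such as 'List of available languages ... (2):'.
    lines = list_langs_output.splitlines()[1:]
    installed = {line.strip() for line in lines if line.strip()}
    return [lang for lang in wanted if lang not in installed]


# Usage: fail fast at startup instead of at the first OCR call.
# problems = missing_binaries()
# if problems:
#     raise RuntimeError(f"Missing OCR binaries: {problems}")
```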

pdfToMarkdown is a single HTTP call. No system dependencies. No binaries to install. No language packs to manage.

from pdftomarkdown import convert

# Works on scanned PDFs, native PDFs, and mixed PDFs
result = convert("scanned-document.pdf")
print(result.markdown)

Compared to Tesseract-based OCR

|                        | pdfToMarkdown                  | Tesseract pipeline                 |
|------------------------|--------------------------------|------------------------------------|
| Table extraction       | Markdown tables with alignment | Raw text, columns lost             |
| Heading detection      | H1–H6 hierarchy preserved      | No structure                       |
| Rotated pages          | Handled automatically          | Requires deskew preprocessing      |
| Mixed native + scanned | Single pipeline                | Separate detection + routing       |
| System dependencies    | None (HTTP API)                | Poppler, Tesseract, language packs |
| Setup time             | One curl command               | Hours of configuration             |

Pricing

Both tiers are free. No credit card required.

Hacker

Free, no signup

  • Public demo key — copy & paste
  • Only page 1 is processed
  • 1 request/min per IP
  • Watermark in output

Developer

Free, GitHub login

  • Personal API key
  • 100 pages/month
  • Multi-page PDFs
  • No watermark

Try with a scanned document

The free Hacker tier needs no account. It converts page 1 only and adds a watermark. Sign in with GitHub to move to the Developer tier, which removes the watermark and unlocks full multi-page PDFs.
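
Scripts that loop over several files with the public demo key will hit the Hacker tier's limit of 1 request/min per IP. A client-side throttle is one way to stay under it. The class below is a hypothetical sketch, not part of any SDK, and how the API responds to excess requests is not documented here, so it simply spaces calls out.

```python
import time
from typing import Callable, Optional


class MinIntervalThrottle:
    """Block so that successive calls are at least `interval` seconds apart."""

    def __init__(
        self,
        interval: float = 60.0,
        clock: Callable[[], float] = time.monotonic,
        sleep: Callable[[float], None] = time.sleep,
    ) -> None:
        self.interval = interval
        self._clock = clock
        self._sleep = sleep
        self._last: Optional[float] = None  # time of the previous call, if any

    def wait(self) -> float:
        """Sleep if needed, record the call time, and return the seconds slept."""
        delay = 0.0
        if self._last is not None:
            elapsed = self._clock() - self._last
            delay = max(0.0, self.interval - elapsed)
            if delay:
                self._sleep(delay)
        self._last = self._clock()
        return delay


# Usage: call throttle.wait() immediately before each API request.
# throttle = MinIntervalThrottle(60.0)
```

The clock and sleep functions are injectable so the spacing logic can be tested without waiting a real minute.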