Scanned PDF OCR API
One endpoint. POST a PDF, get clean markdown back — tables, headings, and lists preserved exactly as they appear on the page.
$ curl -X POST https://pdftomarkdown.dev/v1/convert \
-H "Authorization: Bearer demo_public_key" \
-H "Content-Type: application/json" \
-d '{"input":{"pdf_url":"https://www.w3.org/WAI/ER/tests/xhtml/testfiles/resources/pdf/dummy.pdf"}}'
{
  "markdown": "# Document\n\n| Column A | Column B |\n|---|---|\n| Value 1 | Value 2 |\n\n**Summary text here**",
  "pages": 3,
  "request_id": "req_abc123"
}
Text extraction libraries don’t work on scanned PDFs
If your PDF is a scan — a photo or image embedded in a PDF wrapper — libraries like PyMuPDF, pdfplumber, and PDFMiner return an empty string. They extract text from the PDF’s text layer, and scanned documents don’t have one. There’s nothing to extract.
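You can confirm the missing text layer yourself. The sketch below assumes you have already collected the per-page text from one of those libraries (for example PyMuPDF's `page.get_text()`); the helper name `pages_without_text` and the `min_chars` threshold are illustrative, not part of any library.

```python
def pages_without_text(page_texts, min_chars=1):
    """Return 1-based page numbers whose extracted text is effectively empty.

    `page_texts` is a list of strings, one per page, as returned by a
    text-layer extraction library. Whitespace-only pages count as empty.
    """
    return [
        number
        for number, text in enumerate(page_texts, start=1)
        if len(text.strip()) < min_chars
    ]


# Hypothetical extraction results: page 2 is an image-only scan.
extracted = ["Quarterly report", "", "Appendix A"]
print(pages_without_text(extracted))  # [2]
```

If every page shows up in that list, the file is a pure scan and text-layer extraction has nothing to work with.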
The usual workaround is Tesseract: install Poppler to rasterize pages, install Tesseract with language packs, wire them together with pdf2image and pytesseract, and hope the output is usable. The result is a flat string with no structure — no tables, no headings, no formatting — that requires significant post-processing.
pdfToMarkdown handles scanned PDFs natively. The API uses a vision-language model that reads the page as an image, the same way a human would. It returns structured markdown with tables, headings, and lists preserved.
What the API produces from a scanned PDF
A Tesseract pipeline gives you something like this:
INVOICE 1042
Acme Corp
2024 01 15
API Pro Plan 1 299.00 299.00
Setup fee 1 49.00 49.00
Subtotal 348.00
Tax 8% 27.84
Total due 375.84
The same scanned page through pdfToMarkdown:
# Invoice #1042
**Vendor:** Acme Corp
**Date:** 2024-01-15
| Description | Qty | Unit Price | Total |
|--------------|-----|------------|---------|
| API Pro Plan | 1 | $299.00 | $299.00 |
| Setup fee | 1 | $49.00 | $49.00 |
**Subtotal:** $348.00
**Tax (8%):** $27.84
**Total due:** $375.84
The difference is layout understanding. Tesseract sees characters. The vision model sees a document.
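Because the line items come back as a markdown table, downstream code can read them with a few lines of parsing. The following is a minimal sketch, assuming a simple pipe table like the one above with no escaped `|` characters inside cells; `parse_markdown_table` is a hypothetical helper, not part of the pdfToMarkdown SDK.

```python
def parse_markdown_table(markdown):
    """Parse the first pipe table found in `markdown` into a list of dicts."""
    rows = []
    for line in markdown.splitlines():
        line = line.strip()
        if not line.startswith("|"):
            if rows:
                break  # the table has ended
            continue
        rows.append([cell.strip() for cell in line.strip("|").split("|")])
    if len(rows) < 2:
        return []
    header, body = rows[0], rows[2:]  # rows[1] is the |---| separator row
    return [dict(zip(header, row)) for row in body]


invoice_markdown = """\
# Invoice #1042
| Description | Qty | Unit Price | Total |
|--------------|-----|------------|---------|
| API Pro Plan | 1 | $299.00 | $299.00 |
| Setup fee | 1 | $49.00 | $49.00 |
**Subtotal:** $348.00
"""

items = parse_markdown_table(invoice_markdown)
subtotal = sum(float(item["Total"].lstrip("$")) for item in items)
print(f"{subtotal:.2f}")  # 348.00
```

The same arithmetic on the flat Tesseract output would first require guessing which numbers belong to which column.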
Works on real-world scan quality
Production scans are rarely clean 300 DPI images. The vision model handles what Tesseract struggles with:
- Low-resolution scans — 150 DPI or lower, common from older multifunction printers
- Rotated and skewed pages — slightly crooked scans from flatbed scanners or phone cameras
- Faded and low-contrast text — old documents, thermal paper receipts, carbon copies
- Mixed native + scanned PDFs — documents where some pages have selectable text and others are images
- Handwritten annotations — printed forms with handwritten fill-ins
- Non-English text — scanned documents in German, Japanese, Arabic, and other languages
No pre-processing pipeline needed. No deskewing. No binarization. Send the PDF as-is.
One API for both native and scanned PDFs
You don’t need to detect whether a PDF is scanned or native, and you don’t need separate pipelines. The same endpoint handles both:
# Scanned PDFs are usually local files — use base64 upload
pdf_base64=$(base64 < scanned-document.pdf | tr -d '\n')
curl -X POST https://pdftomarkdown.dev/v1/convert \
-H "Authorization: Bearer demo_public_key" \
-H "Content-Type: application/json" \
-d "{\"input\":{\"pdf_base64\":\"$pdf_base64\"}}"
The response is identical regardless of whether the source is a native PDF, an image-only scan, or a mix of both. Your downstream code doesn’t need to know or care.
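The same base64 upload can be sketched in Python using only the standard library. The endpoint, the `input.pdf_base64` field, the `markdown` response field, and the demo key are taken from the examples on this page; the function names and the 60-second timeout are assumptions made for illustration.

```python
import base64
import json
import urllib.request

API_URL = "https://pdftomarkdown.dev/v1/convert"


def build_convert_request(pdf_bytes, api_key="demo_public_key"):
    """Build the POST request that uploads a PDF as base64-encoded JSON."""
    payload = {
        "input": {"pdf_base64": base64.b64encode(pdf_bytes).decode("ascii")}
    }
    return urllib.request.Request(
        API_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


def convert_pdf(path, api_key="demo_public_key"):
    """Send a local PDF to the API and return the markdown from the response."""
    with open(path, "rb") as handle:
        request = build_convert_request(handle.read(), api_key)
    # The timeout is an assumed value; tune it for large documents.
    with urllib.request.urlopen(request, timeout=60) as response:
        return json.load(response)["markdown"]
```

Calling `convert_pdf("scanned-document.pdf")` would then return the markdown string, whether the file is native, scanned, or mixed. Building the request in a separate function keeps the payload easy to inspect without touching the network.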
No Tesseract dependency chain
A typical Tesseract-based OCR setup requires:
- Poppler or MuPDF — to rasterize PDF pages into images
- Tesseract — the OCR engine, installed as a system binary
- Language packs — tesseract-ocr-eng, tesseract-ocr-deu, etc., one per language
- Python bindings — pytesseract, pdf2image, and their transitive dependencies
- Post-processing — custom code to reconstruct tables, headings, and structure from raw text
This breaks in Docker builds, fails silently when a language pack is missing, and produces different results across OS versions.
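To get a feel for that maintenance burden, here is the kind of preflight check a Tesseract pipeline tends to need before it can be trusted. This is a sketch under stated assumptions: it looks for the `pdftoppm` binary (shipped with Poppler) and `tesseract` on the PATH, and it parses the output of `tesseract --list-langs` as a header line followed by one language code per line. All function names are hypothetical.

```python
import shutil
import subprocess


def missing_binaries(names=("pdftoppm", "tesseract"), which=shutil.which):
    """Return the required binaries that are not found on the PATH."""
    return [name for name in names if which(name) is None]


def parse_language_list(output):
    """Parse `tesseract --list-langs` output into a set of language codes."""
    lines = [line.strip() for line in output.splitlines()]
    # Skip blank lines and the "List of available languages ..." header.
    return {line for line in lines if line and not line.startswith("List of")}


def missing_languages(wanted, list_langs_output):
    """Return the wanted language codes that Tesseract does not report."""
    return sorted(set(wanted) - parse_language_list(list_langs_output))


def installed_languages_output():
    """Run `tesseract --list-langs` and return everything it printed."""
    result = subprocess.run(
        ["tesseract", "--list-langs"],
        capture_output=True,
        text=True,
        check=True,
    )
    # Combine both streams, since some builds may write the list to stderr.
    return result.stdout + result.stderr
```

A Docker entrypoint could call `missing_binaries()` and `missing_languages(["eng", "deu"], installed_languages_output())` at startup and fail loudly instead of silently producing bad text.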
pdfToMarkdown is a single HTTP call. No system dependencies. No binaries to install. No language packs to manage.
from pdftomarkdown import convert
# Works on scanned PDFs, native PDFs, and mixed PDFs
result = convert("scanned-document.pdf")
print(result.markdown)
Compared to Tesseract-based OCR
| | pdfToMarkdown | Tesseract pipeline |
|---|---|---|
| Table extraction | Markdown tables with alignment | Raw text, columns lost |
| Heading detection | H1–H6 hierarchy preserved | No structure |
| Rotated pages | Handled automatically | Requires deskew preprocessing |
| Mixed native + scanned | Single pipeline | Separate detection + routing |
| System dependencies | None (HTTP API) | Poppler, Tesseract, language packs |
| Setup time | One curl command | Hours of configuration |
Related pages
- OCR API for Developers — general overview of the OCR API
- PDF Parsing API — extracting structured data from native PDFs
- Invoice OCR API — vertical guide for invoice extraction
- API documentation — endpoint reference, response schema, error codes
Pricing
Both tiers are free. No credit card required.
Hacker
Free, no signup
- Public demo key — copy & paste
- Only page 1 is processed
- 1 request/min per IP
- Watermark in output
Developer
Free, GitHub login
- Personal API key
- 100 pages/month
- Multi-page PDFs
- No watermark
Try with a scanned document
Free tier, no account needed. It converts page 1 only and adds a watermark. Upgrade to the Developer tier to remove the watermark and unlock full multi-page PDFs.