PDF Parsing in 2026: Tesseract vs PyMuPDF vs Vision Models
The PDF parsing landscape has fragmented. In 2024, most developers chose between “use Tesseract” and “pay for a cloud API.” In 2026, there are at least a dozen viable approaches, and the right choice depends entirely on your documents, your budget, and how much infrastructure you want to maintain.
This post covers every major category: text-extraction libraries, traditional OCR engines, cloud OCR services, and vision-language models. For each tool: what it does, where it shines, where it breaks, what it costs, and when you should reach for it.
Category 1: Text-extraction libraries
These tools parse the PDF file format directly. They read the internal structure of a PDF — the text streams, font definitions, and positioning data — and extract text without any image processing or OCR. They’re fast, free, and work well on text-native PDFs. They fail completely on scanned documents.
PyMuPDF (fitz)
PyMuPDF is the Python binding for MuPDF, a lightweight PDF rendering engine. It’s the fastest Python PDF library by a wide margin.
import pymupdf
doc = pymupdf.open("document.pdf")
for page in doc:
    text = page.get_text()
    # Returns raw text with basic positioning
Strengths:
- Extremely fast — processes hundreds of pages per second
- Reliable text extraction from text-native PDFs
- Good table detection via page.find_tables()
- Can extract images, links, annotations, and metadata
- Actively maintained, solid documentation
Weaknesses:
- Returns flat text with no semantic structure (no headings, no lists)
- Table extraction is heuristic-based and breaks on complex layouts
- Zero OCR capability — scanned PDFs return empty strings
- Multi-column layouts often merge into garbled text
Pricing: Free, open source (AGPL or commercial license).
When to use it: You have text-native PDFs (born-digital), you need speed over structure, and you don’t need OCR. Good for bulk metadata extraction, page counting, or quick text dumps.
pdfplumber
pdfplumber is built on top of pdfminer.six and focuses on extracting tables and precise character-level positioning data.
import pdfplumber
with pdfplumber.open("document.pdf") as pdf:
    for page in pdf.pages:
        table = page.extract_table()
        text = page.extract_text()
Strengths:
- Best-in-class table extraction for text-native PDFs
- Fine-grained control over text extraction (character-level bounding boxes)
- Good at handling multi-column layouts with manual tuning
- Pure Python — no compiled dependencies beyond pdfminer
Weaknesses:
- Significantly slower than PyMuPDF (5-10x)
- No OCR — scanned PDFs are a dead end
- Table extraction requires tuning per document type
- No semantic structure in output
Pricing: Free, open source (MIT).
When to use it: You need accurate table extraction from text-native PDFs and you’re willing to tune extraction parameters per document layout. Common in financial data pipelines and government data extraction.
PDFMiner / pdfminer.six
pdfminer.six is the maintained fork of PDFMiner, a pure-Python PDF parser. It provides the lowest-level access to PDF internals of any Python library.
from pdfminer.high_level import extract_text
text = extract_text("document.pdf")
Strengths:
- Deepest access to PDF internals (CMap handling, font metrics, layout analysis)
- Pure Python with no binary dependencies
- Fine-grained layout analysis engine
- Good foundation library (pdfplumber and others build on it)
Weaknesses:
- Slow — the slowest option in this category
- High-level API produces mediocre output for complex layouts
- Verbose API for anything beyond basic text extraction
- No OCR support
- Less actively maintained than PyMuPDF
Pricing: Free, open source (MIT).
When to use it: You need low-level PDF parsing (font analysis, CMap extraction, precise glyph positioning) and are comfortable writing custom extraction logic. Most developers should use PyMuPDF or pdfplumber instead.
Text-extraction libraries: the bottom line
These libraries are the right tool when your PDFs are text-native and you need raw speed or fine-grained positional data. They cannot handle scanned documents, and they don’t produce structured output (headings, semantic lists, properly formatted tables). If you need structure, you need something higher up the stack.
Category 2: Traditional OCR engines
OCR (optical character recognition) engines convert images of text into machine-readable strings. They handle scanned PDFs by rendering pages to images first, then recognizing characters. They’ve been around for decades and are battle-tested, but they produce flat text with no document structure.
Tesseract
Tesseract is the most widely used open-source OCR engine. Originally developed at HP and later sponsored by Google, it is now community-maintained, and it has been the default choice for OCR for over a decade.
# CLI usage
tesseract scanned_page.png output_text
# Python via pytesseract
import pytesseract
from PIL import Image
text = pytesseract.image_to_string(Image.open("page.png"))
Strengths:
- Free and open source
- Supports 100+ languages
- Well-documented, massive community
- Good accuracy on clean, high-resolution scans
- Runs entirely locally — no data leaves your machine
Weaknesses:
- Output is flat text — no tables, no headings, no structure
- Accuracy degrades sharply on low-quality scans, skewed text, or complex layouts
- Multi-column documents produce garbled output without preprocessing
- Requires system-level installation (not a pip install)
- Slow on CPU without tuning; no native GPU acceleration
- LSTM model (v4+) is better but still struggles with mixed fonts and handwriting
Pricing: Free, open source (Apache 2.0).
When to use it: Budget is zero, documents are clean single-column scans, and you don’t need structured output. Also useful as a baseline to compare other tools against.
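Much of the accuracy loss on poor scans can be clawed back with image preprocessing before Tesseract ever sees the page. A minimal Pillow sketch (the scale factor and threshold are illustrative defaults, not tuned values):

```python
from PIL import Image, ImageOps

def preprocess_for_ocr(img: Image.Image, scale: int = 2,
                       threshold: int = 160) -> Image.Image:
    """Grayscale, upscale, and binarize — cheap fixes that often
    help Tesseract on low-resolution or noisy scans."""
    gray = ImageOps.grayscale(img)
    gray = gray.resize((gray.width * scale, gray.height * scale), Image.LANCZOS)
    # Hard binarization: everything above the threshold becomes white
    return gray.point(lambda p: 255 if p > threshold else 0)
```

Feed the result to pytesseract.image_to_string() as before. For multi-column scans, also experiment with Tesseract's page segmentation modes (e.g. --psm 1 for automatic segmentation with orientation detection).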
EasyOCR
EasyOCR is a Python OCR library built on PyTorch. It supports 80+ languages and is easier to install than Tesseract.
import easyocr
reader = easyocr.Reader(["en"])
results = reader.readtext("page.png")
# Returns list of (bbox, text, confidence) tuples
Strengths:
- Simple Python installation (pip install easyocr)
- Good accuracy on scene text (photos, signage) — better than Tesseract in some cases
- Returns bounding boxes with confidence scores
- GPU-accelerated via PyTorch
- Handles curved and rotated text better than Tesseract
Weaknesses:
- Slower than Tesseract on CPU
- Large model downloads on first use (~1GB+)
- Still flat text — no document structure understanding
- Less accurate than Tesseract on clean document scans
- Higher memory footprint
Pricing: Free, open source (Apache 2.0).
When to use it: You need OCR on varied image types (not just clean document scans), want a pure-Python install, or need bounding box coordinates for downstream processing. For standard document OCR, Tesseract is typically more accurate.
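Because readtext returns coordinates, you can impose a rough reading order yourself — something neither engine does for you. A sketch assuming the (bbox, text, confidence) tuple format shown above, where bbox is a list of four corner points:

```python
def reading_order(results, line_tolerance: int = 10):
    """Sort EasyOCR results top-to-bottom, then left-to-right,
    bucketing boxes whose tops fall within `line_tolerance` pixels
    into the same visual line."""
    def key(item):
        bbox, text, conf = item
        top = min(pt[1] for pt in bbox)
        left = min(pt[0] for pt in bbox)
        return (round(top / line_tolerance), left)
    return [text for _, text, _ in sorted(results, key=key)]
```

This only works for single-column layouts; multi-column pages need column detection first, which is where "post-processing harder than the OCR itself" starts.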
Traditional OCR: the bottom line
Tesseract and EasyOCR solve the character recognition problem, but that’s it. They give you a bag of text strings. Reconstructing a document’s structure — figuring out which text is a heading, which is a table cell, which is a footnote — is left entirely to you. For simple single-column documents, this is fine. For anything with tables, multi-column layouts, or mixed content, you need post-processing that’s often harder to build than the OCR itself.
Category 3: Cloud OCR services
Cloud OCR services add layout analysis on top of character recognition. They identify paragraphs, tables, key-value pairs, and form fields. They’re more capable than traditional OCR but come with per-page pricing and data-residency considerations.
Google Document AI
Google Document AI is Google Cloud’s document processing platform. It offers both general OCR and specialized “processors” for invoices, receipts, contracts, and other document types.
Strengths:
- Excellent accuracy — arguably the best cloud OCR for general documents
- Specialized processors for common document types (invoices, W-2s, bank statements)
- Returns structured data: paragraphs, tables with row/column spans, form fields
- Handles poor-quality scans well
- Batch processing API for high volumes
Weaknesses:
- Requires a Google Cloud account and project setup
- Pricing is complex (per-page, varies by processor type)
- Output is a complex protobuf/JSON schema — not markdown, not plain text
- You need to write serialization code to get usable output
- Cold start latency on first requests
- Data is processed on Google’s infrastructure
Pricing: ~$0.001–$0.01 per page depending on processor type. 1,000 pages/month free for the general OCR processor.
When to use it: You’re already on Google Cloud, you need high-accuracy OCR with layout data, and you have the engineering time to parse Google’s output schema. The specialized processors (invoice parsing, form extraction) are genuinely best-in-class for their specific document types.
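A taste of that serialization code: Document AI returns one full text string per document, and every paragraph, line, and table cell points into it via a textAnchor of index ranges. A sketch of resolving an anchor (field names follow the REST JSON schema, where startIndex is omitted when zero and indices arrive as strings — verify against your API version):

```python
def anchor_text(document: dict, layout: dict) -> str:
    """Resolve a Document AI layout's textAnchor into the slice of
    the full document text it references."""
    full_text = document.get("text", "")
    segments = layout.get("textAnchor", {}).get("textSegments", [])
    return "".join(
        full_text[int(seg.get("startIndex", 0)):int(seg["endIndex"])]
        for seg in segments
    )
```

You call this for every element you care about — which is why "parse Google's output schema" is a real line item in the engineering budget.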
AWS Textract
Amazon Textract is AWS’s document extraction service. It focuses on forms, tables, and structured data extraction.
Strengths:
- Strong table extraction — one of the best for structured tabular data
- Form field extraction (key-value pairs) works well on standardized forms
- Good integration with other AWS services (S3, Lambda, Step Functions)
- Queries API lets you ask natural-language questions about a document
- HIPAA-eligible for healthcare use cases
Weaknesses:
- Requires an AWS account
- Per-page pricing adds up quickly at scale
- Raw text extraction is average — not as accurate as Google Document AI on complex layouts
- Output schema is verbose and requires significant post-processing
- No markdown output — you get JSON blocks that need serialization
- Region-limited for some features
Pricing: $0.0015 per page (detect text), $0.015 per page (tables/forms), $0.01 per query. Free tier: 1,000 pages/month for first 3 months.
When to use it: You’re on AWS, your documents are forms or tables (tax forms, applications, structured reports), and you need to extract specific fields. The Queries feature is useful for targeted extraction from known document types.
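The post-processing here is assembling Textract's flat Blocks list. A minimal sketch for plain text, matching the detect_document_text response shape (the boto3 call is commented out so the helper stays self-contained):

```python
def textract_text(response: dict) -> str:
    """Join Textract LINE blocks into newline-separated text.
    Reconstructing tables and forms means following Relationships
    between blocks — considerably more code."""
    return "\n".join(
        block["Text"]
        for block in response.get("Blocks", [])
        if block.get("BlockType") == "LINE"
    )

# Producing the response (requires AWS credentials):
# import boto3
# client = boto3.client("textract")
# with open("page.png", "rb") as f:
#     response = client.detect_document_text(Document={"Bytes": f.read()})
```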
Azure AI Document Intelligence
Azure AI Document Intelligence (formerly Form Recognizer) is Microsoft’s document extraction service. It bridges OCR and structured extraction.
Strengths:
- Prebuilt models for invoices, receipts, IDs, tax forms
- Custom model training — you can train on your own document types
- Good markdown output option (added in recent API versions)
- Studio UI for testing and labeling documents
- Strong enterprise compliance (SOC, HIPAA, FedRAMP)
Weaknesses:
- Requires Azure account and resource provisioning
- Pricing is per-page and varies by model
- Custom model training requires labeled data (minimum 5 documents)
- API versioning can be confusing (frequent breaking changes)
- Markdown output, while available, is still less clean than purpose-built tools
Pricing: $0.001–$0.01 per page depending on model. Free tier: 500 pages/month.
When to use it: You’re on Azure, need enterprise compliance, or want to train custom extraction models on your specific document types. The prebuilt invoice and receipt models are solid.
Cloud OCR: the bottom line
These services are accurate and feature-rich, but they come with cloud vendor lock-in, per-page costs, and complex output schemas. None of them return clean markdown out of the box — you’re writing serialization code no matter which one you choose. For teams already on a specific cloud platform with compliance requirements, they’re often the path of least resistance. For everyone else, the setup overhead and ongoing costs are hard to justify for simple “give me the text” use cases.
Category 4: Vision-language models
This is the newest category and the one changing fastest. Vision-language models (VLMs) look at a rendered image of each page and output structured text directly. They understand document layout the way a human does — visually — rather than trying to parse PDF internals or run character-by-character OCR. The result is structured output (headings, tables, lists) without post-processing.
PaddleOCR / PP-StructureV2
PaddleOCR is Baidu’s open-source OCR toolkit. PP-StructureV2, its layout analysis module, combines text detection, recognition, and document structure recovery.
Strengths:
- Open source and self-hostable
- Good accuracy on multilingual documents (especially CJK)
- Layout analysis identifies headings, paragraphs, tables, and figures
- Table structure recognition is competitive with cloud services
- Actively maintained with regular model updates
- GPU-accelerated
Weaknesses:
- Complex installation — PaddlePaddle framework is heavy and less common than PyTorch
- Documentation is primarily in Chinese; English docs are improving but still patchy
- Output requires post-processing to get clean markdown
- Model zoo is large but confusing to navigate
- Higher memory requirements than Tesseract
Pricing: Free, open source (Apache 2.0).
When to use it: You need open-source document structure recovery, especially for multilingual or CJK documents, and you’re comfortable with the PaddlePaddle ecosystem. A strong choice for teams building self-hosted pipelines who need more than what Tesseract offers.
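PaddleOCR's raw output is nested: one list per page, each entry a [box, (text, confidence)] pair. A small flattening helper (this is the format returned by ocr.ocr() in PaddleOCR 2.x — an assumption worth checking against your installed version):

```python
def paddle_lines(result, min_conf: float = 0.5):
    """Flatten PaddleOCR output into text lines, dropping
    low-confidence detections."""
    lines = []
    for page in result or []:
        for box, (text, conf) in page or []:
            if conf >= min_conf:
                lines.append(text)
    return lines
```

The PP-StructureV2 pipeline returns richer region-typed output (table, figure, title), but the same flatten-and-filter step applies before you can emit clean text.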
pdfToMarkdown
pdfToMarkdown is an API that converts PDFs to clean markdown using a vision-language model pipeline. It renders each page, runs a VLM that understands document structure, and returns markdown with proper headings, tables, lists, and formatting.
curl -X POST https://pdftomarkdown.dev/v1/convert \
-H "Authorization: Bearer demo_public_key" \
-H "Content-Type: application/json" \
-d '{"input":{"pdf_url":"https://www.w3.org/WAI/ER/tests/xhtml/testfiles/resources/pdf/dummy.pdf"}}'
Strengths:
- Output is clean markdown — ready for LLM pipelines, rendering, or storage
- Handles scanned PDFs, text-native PDFs, and mixed documents
- Preserves document structure: headings, tables, lists, emphasis
- Zero setup — no dependencies, no model downloads, no GPU
- Works from any language or framework via HTTP
- Free tier with no signup (public demo key)
Weaknesses:
- Cloud-only — documents are processed on our servers
- Not ideal for documents requiring LaTeX equation rendering (Mathpix is better for that)
- Free tier is limited to 100 pages/month (with GitHub login)
- Latency is higher than local text-extraction libraries (seconds, not milliseconds)
- Single-format — PDFs only, not DOCX or PPTX (Unstructured handles multi-format)
Pricing: Free demo key (1 page per PDF, watermarked). GitHub login: 100 pages/month, no watermark, no credit card.
When to use it: You want structured markdown output from PDFs without maintaining OCR infrastructure. Ideal for developers building RAG pipelines, document processing features, or any application where you need to go from PDF to LLM-ready text quickly. If you need a simple OCR API that just works, this is the modern choice.
Summary comparison table
| Tool | Type | Handles scanned PDFs | Structured output | Self-hosted | Pricing | Best for |
|---|---|---|---|---|---|---|
| PyMuPDF | Text extraction | No | No | Yes | Free (AGPL) | Fast text dumps from digital PDFs |
| pdfplumber | Text extraction | No | Tables only | Yes | Free (MIT) | Table extraction from digital PDFs |
| PDFMiner | Text extraction | No | No | Yes | Free (MIT) | Low-level PDF internals |
| Tesseract | Traditional OCR | Yes | No | Yes | Free (Apache 2.0) | Budget OCR on clean scans |
| EasyOCR | Traditional OCR | Yes | No | Yes | Free (Apache 2.0) | Scene text and multi-script OCR |
| Google Document AI | Cloud OCR | Yes | Partial (JSON) | No | ~$0.001-0.01/page | High-accuracy OCR on Google Cloud |
| AWS Textract | Cloud OCR | Yes | Partial (JSON) | No | ~$0.0015-0.015/page | Form and table extraction on AWS |
| Azure Doc Intelligence | Cloud OCR | Yes | Partial (JSON/MD) | No | ~$0.001-0.01/page | Custom models and enterprise compliance |
| PaddleOCR | Vision model | Yes | Partial | Yes | Free (Apache 2.0) | Self-hosted multilingual OCR |
| pdfToMarkdown | Vision model | Yes | Yes (markdown) | No | Free tier, then paid | Developers who need clean markdown |
How to choose
The decision tree is simpler than the table suggests:
Are your PDFs text-native (born-digital)? If yes and you just need raw text, use PyMuPDF. It’s the fastest option. If you need tables, use pdfplumber.
Do you need to self-host for data privacy? If yes and you need structure, PaddleOCR with PP-StructureV2 is the most capable open-source option. If you just need text, Tesseract works. If you need element-level control across multiple formats, Unstructured is worth evaluating.
Are you already on a cloud platform? If you’re on GCP, AWS, or Azure and need to stay there for compliance, use the respective cloud OCR service. Budget for the post-processing code to turn their JSON into something usable.
Do you want structured markdown without infrastructure work? Use pdfToMarkdown. One API call, clean markdown back. No models to host, no output schemas to parse, no post-processing pipeline to build. This is the approach we’d recommend for most developers building LLM applications, document parsing features, or any workflow where the goal is to get structured text from a PDF and move on.
The direction things are moving
The trend is clear: vision-language models are replacing the traditional OCR pipeline. Instead of detecting characters, recognizing them, and reconstructing the layout in separate stages, VLMs do it in one pass — the same way a human reads a page. The accuracy gap between VLMs and traditional OCR widens with every model generation, especially on documents with complex layouts, mixed content, or degraded scans.
Text-extraction libraries like PyMuPDF will remain relevant for text-native PDFs where speed matters. Tesseract will continue to serve as a free baseline. But for anything that requires understanding document structure — which is most real-world use cases — vision-language models are the present, not just the future.
Ready to try the modern approach? pdfToMarkdown’s demo key works right now — no signup, no credit card. Send a PDF, get back markdown.