· pdfToMarkdown team

Why We Chose PaddleOCR Over Tesseract (And You Should Too)

ocr · paddleocr · tesseract · vlm · technical

Tesseract is the most widely deployed open-source OCR engine in the world. It’s been around since the 1980s (originally developed at HP, later maintained by Google), and for most developers, it’s the default answer to “how do I extract text from an image?”

We used to think that way too. Then we ran real-world documents through PaddleOCR-VL-1.5 and saw the difference.

This post explains why pdfToMarkdown is built on a vision-language model instead of a traditional OCR pipeline, what that actually means technically, and where Tesseract still has a role.

How traditional OCR works

Tesseract and similar engines follow a pipeline that hasn’t changed fundamentally in decades:

  1. Binarization — convert the image to black and white, separating text from background
  2. Layout segmentation — detect text blocks, columns, and reading order
  3. Line segmentation — split text blocks into individual lines
  4. Character recognition — classify each character using a trained model (an LSTM-based recognizer since Tesseract 4)
  5. Post-processing — apply dictionaries and language models to fix errors

Each step is independent. The character recognizer doesn’t know it’s reading a table. The layout segmenter doesn’t know what the text says. Information flows in one direction, and errors compound at each stage.
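The compounding is easy to quantify with back-of-the-envelope arithmetic. The 98% figure below is illustrative, not a measured Tesseract benchmark — the point is that per-stage reliabilities multiply:

```python
# Illustrative per-stage reliabilities for a five-stage OCR pipeline.
# Because each stage consumes the previous stage's output, end-to-end
# reliability is the product of the stages: errors compound.
stages = {
    "binarization": 0.98,
    "layout segmentation": 0.98,
    "line segmentation": 0.98,
    "character recognition": 0.98,
    "post-processing": 0.98,
}

end_to_end = 1.0
for name, accuracy in stages.items():
    end_to_end *= accuracy

print(f"end-to-end: {end_to_end:.3f}")  # ~0.904: five "98%" stages lose ~10%
```

Five stages that each sound fine in isolation still drop roughly one page in ten, and the failures cluster on exactly the documents where each stage is weakest.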

This architecture works well when documents are clean, single-column, high-resolution, and in a language you’ve installed the pack for. It breaks down everywhere else.

How vision-language models work

PaddleOCR-VL-1.5 is not an OCR engine in the traditional sense. It’s a vision-language model (VLM) — a neural network that takes an image as input and produces structured text as output, end to end.

There’s no separate binarization step. No hand-coded layout segmentation. The model sees the entire page at once — text, tables, figures, whitespace, headers, footers — and understands the document as a coherent unit, not as a sequence of isolated characters.

The difference is analogous to how early machine translation worked (word-by-word substitution with grammar rules) versus modern neural translation (understanding the full sentence, then generating the translation). Both produce text. One understands meaning.

Where the difference shows up

Layout understanding

Tesseract’s layout analysis is rule-based. It looks for rectangular text blocks, assumes reading order is left-to-right then top-to-bottom, and gets confused by multi-column layouts, sidebars, footnotes, and captions.

A VLM sees the page the way you do. Two-column academic papers, invoices with header blocks and line-item tables, legal contracts with margin annotations — the model understands the spatial relationships because it processes the visual layout directly.

Table handling

This is where the gap is widest. Tesseract extracts characters from a table and gives you a wall of text where column alignment is lost:

Item Qty Price Total
Widget A 10 5.00 50.00
Widget B 3 12.50 37.50

Which column does “10” belong to? Is “5.00” the price or the total? Without the visual grid lines and spatial positioning, the structure is ambiguous.

PaddleOCR-VL-1.5 sees the table as a table. It outputs structured data that maps directly to markdown tables with correct column alignment. That’s because it learned what tables look like from millions of document images, not from heuristic rules about character spacing.
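To make “maps directly to markdown tables” concrete, here is a minimal sketch of that final rendering step, assuming the model has already returned the table as structured rows of cells. The function name and row format are ours for illustration, not part of any SDK:

```python
def rows_to_markdown(header: list[str], rows: list[list[str]]) -> str:
    """Render structured table rows as a markdown table."""
    lines = [
        "| " + " | ".join(header) + " |",
        "| " + " | ".join("---" for _ in header) + " |",
    ]
    for row in rows:
        lines.append("| " + " | ".join(row) + " |")
    return "\n".join(lines)

# The invoice example from above, as structured cells instead of a wall of text.
table = rows_to_markdown(
    ["Item", "Qty", "Price", "Total"],
    [["Widget A", "10", "5.00", "50.00"],
     ["Widget B", "3", "12.50", "37.50"]],
)
print(table)
```

Once the cells are structured, the ambiguity disappears: “10” is unambiguously in the Qty column, “5.00” in Price. The hard part is producing the structure, and that is what the model does.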

Multi-language documents

Tesseract requires language packs — separate model files for each language. Processing a document with English headings, German body text, and Japanese annotations requires you to specify all three languages upfront:

tesseract input.png output -l eng+deu+jpn

Miss a language and that text comes back as garbage. Have a document in a script you didn’t anticipate? You need to install the pack, retrain, or accept the errors.

PaddleOCR-VL-1.5 handles multilingual content natively. The model was trained on documents in dozens of languages and scripts. No language packs, no upfront specification, no separate installation steps. It reads what’s on the page.

Degraded scans

Old photocopies, faded thermal paper receipts, low-resolution faxes, slightly rotated pages from a phone camera — this is what real-world OCR input looks like.

Tesseract’s binarization step is the bottleneck. If the algorithm can’t cleanly separate foreground text from background noise, everything downstream fails. You end up writing preprocessing pipelines — deskewing, contrast enhancement, adaptive thresholding — before Tesseract even starts.

A VLM handles degraded inputs more gracefully because it was trained on degraded inputs. Low contrast, noise, rotation, bleed-through from the other side of the page — these are part of the training distribution, not edge cases that break a fragile preprocessing pipeline.

The paradigm shift

This isn’t about PaddleOCR being a “better Tesseract.” It’s a different approach to the problem entirely.

Traditional OCR treats document understanding as a pipeline of independent signal processing steps. Vision-language models treat it as a single learned task: look at an image, understand what it says and how it’s structured, produce the output.

The same shift happened in speech recognition (hand-crafted acoustic models and language models gave way to end-to-end neural models), in machine translation (statistical phrase-based systems gave way to transformers), and in image classification (feature engineering gave way to convolutional networks). Every time, the end-to-end learned system eventually surpassed the hand-engineered pipeline.

Document OCR is going through that transition right now. PaddleOCR-VL-1.5 is one of the models leading it.

Being honest about Tesseract

Tesseract isn’t obsolete. It has real advantages:

  • Fully open source and self-hostable. No API calls, no cloud dependency, no usage limits. You can run it on an air-gapped server with zero network access.
  • Mature and well-documented. 30+ years of development, extensive community, wrappers in every language.
  • CPU-only inference. No GPU required. Runs on a $5/month VPS.
  • Deterministic. Same input, same output, every time. Useful for compliance and audit trails.
  • Small footprint. The English model is ~15 MB. VLMs are measured in gigabytes.

If you need offline OCR on a resource-constrained device, or you’re processing clean, single-language, well-formatted documents at scale and accuracy doesn’t need to be perfect, Tesseract is a solid choice.

But if you’re processing real-world documents — scans from different decades, mixed languages, complex layouts, tables you actually need to parse — a vision-language model is a generational improvement.

What this means for pdfToMarkdown

pdfToMarkdown uses PaddleOCR-VL-1.5 as part of its conversion pipeline. When you send a PDF to the API, the model sees each page as an image and produces structured markdown output — headings, tables, lists, emphasis — not a flat string of characters.

This is why the API handles scanned PDFs without a separate pipeline, why tables come back as actual markdown tables, and why you don’t need to specify what language your document is in.

The model does the heavy lifting. The API gives you the result as clean markdown over HTTP.

Try it

If you’re currently using Tesseract and spending time on preprocessing, post-processing, or table reconstruction, run the same document through pdfToMarkdown and compare the output.

The OCR API is a single endpoint. The free tier requires no signup — send a request with the demo key and see what comes back:

curl -X POST https://pdftomarkdown.dev/v1/convert \
  -H "Authorization: Bearer demo_public_key" \
  -H "Content-Type: application/json" \
  -d '{"input":{"pdf_url":"https://your-scanned-document-url.pdf"}}'

Or use the Python SDK:

from pdftomarkdown import convert

result = convert("scanned-document.pdf")
print(result.markdown)

No Tesseract. No Poppler. No language packs. Just structured markdown from any PDF.