pdfToMarkdown team

Document AI Without Fine-Tuning: How Vision-Language Models Changed OCR

ocr · vlm · document-ai · paddleocr

For a decade, “document AI” meant one of two things: hand-coded templates that told a system exactly where to look on a page, or fine-tuned models that needed thousands of labeled examples before they could extract a single field. Both approaches worked — for the specific documents they were trained on. Anything new required starting over.

Vision-language models (VLMs) broke that loop. A single model, with zero fine-tuning, can now read invoices, research papers, legal contracts, and handwritten forms — documents it has never seen before. This post covers how that shift happened and what it means for document processing pipelines.

The old paradigm: templates and fine-tuning

Template-based extraction

The earliest approach to structured document extraction was zone-based templates. You’d define rectangular regions on a page — “the invoice number is at coordinates (420, 85, 580, 105)” — and the system would OCR just those zones.

This works if every document looks identical. It breaks the moment someone moves a field, changes a font size, or sends a document from a different vendor. In practice, teams maintained hundreds of templates, one per document layout, with a classification step to pick the right template before extraction even started.
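The zone-based approach is simple enough to sketch in a few lines. This is an illustrative toy, not any real product's code: the page is modeled as a list of OCR'd words with bounding boxes (as an OCR engine would return them), and the template, field names, coordinates, and example words are all invented.

```python
# A minimal sketch of zone-based template extraction (illustrative only).
# A "page" is a list of (word, (x0, y0, x1, y1)) pairs; a template maps
# field names to the rectangular zones those fields are expected to occupy.

def extract_fields(words, template):
    """Collect the words whose boxes fall entirely inside each named zone."""
    fields = {}
    for name, (zx0, zy0, zx1, zy1) in template.items():
        hits = [w for w, (x0, y0, x1, y1) in words
                if zx0 <= x0 and zy0 <= y0 and x1 <= zx1 and y1 <= zy1]
        fields[name] = " ".join(hits)
    return fields

# Invented example data: one vendor-specific invoice layout.
invoice_template = {
    "invoice_number": (420, 85, 580, 105),
    "total": (420, 700, 580, 720),
}
page_words = [
    ("INV-1042", (425, 88, 510, 102)),
    ("$1,499.00", (430, 702, 540, 718)),
    ("Acme", (40, 40, 100, 60)),  # falls outside every zone, so ignored
]

print(extract_fields(page_words, invoice_template))
# {'invoice_number': 'INV-1042', 'total': '$1,499.00'}
```

The brittleness is visible in the containment check: shift the invoice number by twenty pixels and extraction silently returns an empty string.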

The maintenance burden scaled linearly with document variety. Every new vendor, every form revision, every edge case meant another template.

Fine-tuned document models

The next generation replaced templates with learned representations. Models like LayoutLM (Microsoft, 2020) and Donut (Naver, 2022) combined text embeddings with spatial position information, learning to associate field labels with locations on a page.

LayoutLM took tokenized OCR output and added 2D position embeddings — x, y, width, height for each token — then fine-tuned a BERT-like architecture on downstream tasks (form understanding, receipt parsing, document classification). Donut went further by eliminating the separate OCR step entirely, reading document images end-to-end with a Swin Transformer encoder and a BART decoder.
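The "text plus 2D position" input can be made concrete with a toy sketch. The real LayoutLM normalizes page coordinates into [0, 1000] and learns its embedding tables during training at hidden width 768; here the tables are random stand-ins, the width is tiny, and the width/height terms are omitted for brevity.

```python
import random

random.seed(0)
DIM = 8           # toy embedding width (LayoutLM uses 768)
MAX_COORD = 1000  # LayoutLM normalizes coordinates into [0, 1000]

def make_table(size):
    """Stand-in for a learned embedding table: one random vector per index."""
    return [[random.uniform(-1, 1) for _ in range(DIM)] for _ in range(size)]

tok_table = make_table(100)          # toy vocabulary of 100 token ids
x_table = make_table(MAX_COORD + 1)  # shared by left/right coordinates
y_table = make_table(MAX_COORD + 1)  # shared by top/bottom coordinates

def embed(token_id, x0, y0, x1, y1):
    """LayoutLM-style input embedding: token embedding plus 2D position
    embeddings, summed elementwise."""
    parts = [tok_table[token_id], x_table[x0], y_table[y0],
             x_table[x1], y_table[y1]]
    return [sum(vals) for vals in zip(*parts)]

vec = embed(token_id=42, x0=420, y0=85, x1=580, y1=105)
assert len(vec) == DIM
```

The point of the sum is that the same word at two different page positions gets two different representations, which is what lets the downstream head associate labels with locations.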

These models were genuinely impressive. But they shared a fundamental constraint: you needed labeled training data for every document type you wanted to handle. Fine-tuning LayoutLM on invoices didn’t help with medical records. Fine-tuning Donut on receipts didn’t transfer to legal contracts.

The data collection problem was the real bottleneck. Getting 500-1,000 annotated examples per document type — with bounding boxes and field labels — is expensive and slow. For many organizations, this made document AI a project that never left the pilot phase.

What changed: vision-language models

A VLM is a model that jointly processes images and text. Unlike LayoutLM (which operated on already-extracted OCR tokens plus their coordinates), a VLM takes the raw image as input and reasons about it visually — the same way a human reads a document by looking at it.

The key insight is that document understanding is, fundamentally, a vision task. When you read a table, you don’t parse the underlying PDF byte stream. You see grid lines, aligned columns, header rows in bold. You see that an indented block is a sub-item. You see that text in a smaller font at the bottom of the page is a footnote. Layout understanding is visual understanding.

What a VLM “sees” in a document

When a VLM processes a document image, it operates on visual features at multiple scales:

  • Typography and emphasis. Bold text, italic text, font size changes, underlines — all visible as pixel-level features. The model learns that large bold text at the top of a section is a heading.
  • Spatial relationships. Text that’s horizontally aligned belongs to the same row. Text that’s vertically aligned belongs to the same column. Indentation signals hierarchy.
  • Visual separators. Horizontal rules, box borders, alternating row shading, whitespace gaps — these all signal structure without needing explicit markup.
  • Reading order. Multi-column layouts, sidebars, footnotes, captions — the model infers the correct reading sequence from visual cues, not from the order bytes appear in a PDF.

None of this requires document-specific training. These visual patterns are consistent across virtually all documents because they follow universal typographic conventions that humans have used for centuries.

Why layout understanding emerges from vision pretraining

This is the non-obvious part. VLMs aren’t trained on document understanding tasks during pretraining. They’re trained on massive datasets of images paired with text descriptions — web pages, photographs, diagrams, screenshots. Yet document understanding comes almost for free.

The reason: documents are a subset of visual media. A model that can read text in a photograph, interpret a chart, or understand a screenshot of a web page already has most of the skills needed to parse a PDF. Table structure is just a special case of grid layout. Heading hierarchy is just a special case of visual salience. Reading order is just a special case of spatial reasoning.

This is why zero-shot document parsing works at all. The model isn’t generalizing from “invoices” to “contracts.” It’s applying general visual reasoning — learned from billions of image-text pairs — to a specific domain.

PaddleOCR-VL: architecture of a document-native VLM

PaddleOCR-VL-1.5 is a concrete example of this architecture applied specifically to document understanding. It’s built on the PaddlePaddle framework and designed for high-throughput OCR with layout awareness.

The architecture follows the standard VLM pattern with document-specific optimizations:

  1. Vision encoder. A vision transformer (ViT-based) processes the document image and produces a sequence of visual tokens. For documents, the model uses higher-resolution inputs than typical VLMs — document text requires fine-grained detail that a 224x224 image can’t provide. PaddleOCR-VL-1.5 processes images at resolutions that preserve the legibility of body text.

  2. Visual-text alignment. A projection layer maps visual tokens into the same embedding space as the language model. This is where the model bridges “what it sees” and “what it outputs.” For document images, this alignment must handle the density problem — a single document page contains far more text tokens than a typical photograph caption.

  3. Language model decoder. An autoregressive language model generates structured text output from the visual token sequence. The decoder handles both the content (what text appears on the page) and the structure (how that text is organized — headings, tables, lists, paragraphs).

The critical design choice in PaddleOCR-VL-1.5 is treating OCR and layout understanding as a single task rather than a pipeline. Traditional systems ran OCR first (extract text), then layout analysis second (figure out structure). PaddleOCR-VL produces structured output directly from the image — the model “reads” the page and emits markdown (or other structured formats) in one pass. This avoids the error propagation problem where OCR mistakes corrupt downstream layout analysis.
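The three stages can be walked through with toy shapes. To be clear, this is a sketch of the generic pattern, not PaddleOCR-VL's actual code: the patch size, hidden widths, and zero-filled tensors are invented; the point is how a page image becomes visual tokens, gets projected into the language model's embedding space, and why resolution matters.

```python
# Toy shape-walkthrough of the encoder -> projector -> decoder pipeline.
# All dimensions are invented for illustration.

PATCH = 16    # ViT-style square patches
VIS_DIM = 32  # vision encoder output width
LM_DIM = 48   # language model embedding width

def n_visual_tokens(height, width, patch=PATCH):
    """A ViT cuts the image into a grid of patches; each patch -> one token."""
    return (height // patch) * (width // patch)

def encode(height, width):
    """Stage 1: vision encoder emits one VIS_DIM vector per patch
    (zeros stand in for real features)."""
    return [[0.0] * VIS_DIM for _ in range(n_visual_tokens(height, width))]

def project(visual_tokens):
    """Stage 2: projection into the LM embedding space
    (zero-padding stands in for a learned linear layer)."""
    return [v + [0.0] * (LM_DIM - VIS_DIM) for v in visual_tokens]

# Resolution matters: a 224x224 input yields far fewer patches than a
# page-sized input, which is why document VLMs run at higher resolution.
print(n_visual_tokens(224, 224))   # 196 patches: too coarse for body text
print(n_visual_tokens(1024, 768))  # 3072 patches: fine-grained detail

tokens = project(encode(1024, 768))
# Stage 3 (not sketched): an autoregressive decoder attends over `tokens`
# and emits structured markdown directly, in a single pass.
assert len(tokens) == 3072 and len(tokens[0]) == LM_DIM
```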

Zero-shot document parsing in practice

Zero-shot means the model handles documents it was never explicitly trained on. In practice, this looks like:

  • No template definition. You don’t tell the model where fields are. It finds them.
  • No training data collection. You don’t need labeled examples of your specific document type.
  • No per-document-type configuration. The same model, with the same parameters, processes invoices, contracts, research papers, and handwritten notes.

The practical impact is that document processing becomes a generic API call rather than a project. Instead of spending weeks collecting training data, annotating documents, fine-tuning a model, and deploying it — you send an image to an endpoint and get structured text back.

This doesn’t mean VLMs are perfect. They struggle with:

  • Extremely degraded scans — heavy noise, skew, or low resolution can defeat any vision model.
  • Dense mathematical notation — though this is improving rapidly.
  • Unusual scripts or languages with limited representation in pretraining data.

But for the vast majority of business documents — the invoices, reports, forms, and contracts that dominate enterprise document processing — zero-shot VLMs work out of the box.

Implications for document processing pipelines

The shift from fine-tuned models to general-purpose VLMs has several downstream effects:

No templates to maintain. The entire category of “template management” disappears. You don’t need a template library, a template editor, or a classification step to pick the right template. One model handles everything.

No training data to collect. The most expensive part of traditional document AI — annotating thousands of examples per document type — is eliminated. This alone makes document processing accessible to teams that couldn’t justify the upfront investment.

Any document, immediately. A new vendor sends invoices in a format you’ve never seen? It works. A client uploads a scanned contract from 1998? It works. Someone submits a handwritten form? It probably works. The barrier to handling new document types drops to zero.

Simpler architecture. Instead of a pipeline with separate OCR, layout analysis, table detection, and field extraction stages — each with its own model and failure modes — you have a single model that does everything. Fewer moving parts means fewer bugs, easier debugging, and lower operational overhead.

Quality scales with the model, not with your effort. When the next version of PaddleOCR-VL ships with better table handling or improved handwriting recognition, you get those improvements by updating one dependency. You don’t retrain anything.

What this means for pdfToMarkdown

pdfToMarkdown is built on this architecture. When you send a PDF to the OCR API, here’s what happens:

  1. Each page is rendered as a high-resolution image.
  2. PaddleOCR-VL processes the image and produces structured markdown — headings, tables, lists, paragraphs — in a single pass.
  3. The markdown is returned via the API.

No templates. No fine-tuning. No per-document configuration. It works the same way on a scanned PDF from a 1990s fax machine as it does on a born-digital report from last week.

If you’re building a document processing pipeline and you’ve been putting it off because of the training data requirements, that blocker no longer exists. Try the API with a demo_public_key — no signup required:

curl -X POST https://pdftomarkdown.dev/v1/convert \
  -H "Authorization: Bearer demo_public_key" \
  -H "Content-Type: application/json" \
  -d '{"input":{"pdf_url":"https://www.w3.org/WAI/ER/tests/xhtml/testfiles/resources/pdf/dummy.pdf"}}'
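The same call from Python, using only the standard library. The endpoint, header, and payload mirror the curl command above; the response fields are not shown here, so consult the docs for the exact shape.

```python
import json
import urllib.request

API_URL = "https://pdftomarkdown.dev/v1/convert"
PDF_URL = ("https://www.w3.org/WAI/ER/tests/xhtml/"
           "testfiles/resources/pdf/dummy.pdf")

def build_request(pdf_url, api_key="demo_public_key"):
    """Build the POST request that mirrors the curl example above."""
    payload = json.dumps({"input": {"pdf_url": pdf_url}}).encode()
    return urllib.request.Request(
        API_URL,
        data=payload,
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
        method="POST",
    )

req = build_request(PDF_URL)
# Sending it requires network access:
#   with urllib.request.urlopen(req) as resp:
#       print(resp.read().decode())
```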

Read the docs to get started.