Invoice extraction that actually works

Invoice PDFs come in hundreds of different layouts. Template-based extraction systems — the kind that look for fixed field positions — break the moment a vendor changes their invoice format or uses a slightly different table structure.

pdfToMarkdown uses a vision-language model that reads invoices the way an accountant does. It recognizes that a column of numbers under a “Total” header belongs to a line-item table, whether the vendor's layout is portrait or landscape, one column or three.

What you get back

Send an invoice PDF, receive structured markdown:

# Invoice #INV-2024-0842

**Vendor:** Acme Software Ltd
**Bill to:** Widgets Inc, 123 Main St
**Invoice date:** 2024-01-15
**Due date:** 2024-02-14

| Description              | Qty | Unit Price |    Total |
|--------------------------|-----|------------|----------|
| Enterprise License Q1    |   1 | $1,200.00  | $1,200.00|
| Additional seats (×5)    |   5 |   $49.00   |   $245.00|
| Professional services    |   8 |   $150.00  | $1,200.00|

**Subtotal:** $2,645.00
**Discount (10%):** -$264.50
**Tax (VAT 20%):** $476.10
**Total due:** $2,856.60

**Payment terms:** Net 30
**Bank:** Barclays — Sort: 20-00-00 — Acc: 12345678

Feed this directly into an LLM for field extraction, or parse the markdown table with a simple regex.
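If you take the parsing route, a minimal sketch could look like the following. It assumes the output keeps the pipe-table shape shown in the sample above, with no escaped pipes inside cells and no empty first or last cell. The helper names `parse_first_table` and `to_amount` are illustrative and not part of pdfToMarkdown.

```python
import re
from decimal import Decimal

# Matches one cell of a separator row, e.g. "-----" or ":---:"
SEPARATOR_CELL = re.compile(r"^:?-+:?$")


def to_amount(text: str) -> Decimal:
    """Convert a string such as '$1,200.00' to Decimal('1200.00')."""
    return Decimal(re.sub(r"[^\d.\-]", "", text))


def parse_first_table(markdown: str) -> list[dict[str, str]]:
    """Return the rows of the first pipe table, keyed by header text."""
    table: list[list[str]] = []
    for line in markdown.splitlines():
        stripped = line.strip()
        if stripped.startswith("|") and stripped.endswith("|"):
            cells = [cell.strip() for cell in stripped.strip("|").split("|")]
            table.append(cells)
        elif table:
            break  # the first table has ended
    if not table:
        return []
    header, *body = table
    rows = []
    for cells in body:
        if all(SEPARATOR_CELL.match(cell) for cell in cells):
            continue  # skip the |---|---| separator row
        rows.append(dict(zip(header, cells)))
    return rows


sample = """\
| Description              | Qty | Unit Price |    Total |
|--------------------------|-----|------------|----------|
| Enterprise License Q1    |   1 | $1,200.00  | $1,200.00|
| Additional seats (×5)    |   5 |   $49.00   |   $245.00|
| Professional services    |   8 |   $150.00  | $1,200.00|
"""

items = parse_first_table(sample)
line_total = sum(to_amount(row["Total"]) for row in items)
print(len(items), line_total)  # 3 2645.00
```

Amounts are kept as `Decimal` rather than `float` so that currency sums stay exact.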

Downstream extraction example

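import openai
from pdftomarkdown import convert

# Convert the PDF to markdown
result = convert("invoice.pdf")

# Ask an LLM to extract structured fields
client = openai.OpenAI()  # reads OPENAI_API_KEY from the environment
response = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[
        {"role": "system", "content": "Extract invoice fields as JSON."},
        {"role": "user", "content": result.markdown},
    ],
    response_format={"type": "json_object"},  # ask for valid JSON output
)

# The extracted fields come back as a JSON string
print(response.choices[0].message.content)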

The markdown format is LLM-friendly: labels sit next to their values and table rows stay intact, which gives a model enough structure to pull the right values.
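Model output is still worth checking before it reaches a ledger. The sketch below cross-checks the amounts against each other, using the figures from the sample invoice above. The JSON keys (`line_items`, `subtotal`, `discount`, `tax`, `total_due`) are hypothetical. They depend entirely on the prompt you give the model and are not a schema defined by pdfToMarkdown.

```python
import json
from decimal import Decimal


def check_totals(fields: dict, tolerance: Decimal = Decimal("0.01")) -> list[str]:
    """Return a list of problems; an empty list means the amounts agree."""
    problems = []
    line_sum = sum(
        (Decimal(item["total"]) for item in fields["line_items"]), Decimal("0")
    )
    subtotal = Decimal(fields["subtotal"])
    expected_total = subtotal - Decimal(fields["discount"]) + Decimal(fields["tax"])
    total_due = Decimal(fields["total_due"])

    if abs(line_sum - subtotal) > tolerance:
        problems.append(f"line items sum to {line_sum}, subtotal says {subtotal}")
    if abs(expected_total - total_due) > tolerance:
        problems.append(f"expected total {expected_total}, invoice says {total_due}")
    return problems


# Hypothetical model output for the sample invoice (key names are assumptions)
llm_json = """
{
  "line_items": [{"total": "1200.00"}, {"total": "245.00"}, {"total": "1200.00"}],
  "subtotal": "2645.00",
  "discount": "264.50",
  "tax": "476.10",
  "total_due": "2856.60"
}
"""

fields = json.loads(llm_json)
print(check_totals(fields))  # []
```

An invoice that fails the check can be routed to manual review instead of being posted automatically.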

Handles real-world invoice variability

Multi-page invoices with line items spanning pages
Invoices with VAT breakdowns, discount rows, and subtotals
Non-English invoices (German, French, Spanish, Japanese)
Scanned invoices — image-only, no embedded text
Invoices with logos, letterheads, and complex headers
PO-based invoices referencing purchase order numbers

AP automation workflow

# Process a batch of invoices
for f in invoices/*.pdf; do
  # Stream the JSON body on stdin: a base64-encoded PDF passed inline to -d
  # can exceed the system's argument-length limit
  {
    printf '{"input":{"pdf_base64":"'
    base64 < "$f" | tr -d '\n'
    printf '"}}'
  } | curl -s -X POST https://pdftomarkdown.dev/v1/convert \
    -H "Authorization: Bearer $PDFTOMARKDOWN_API_KEY" \
    -H "Content-Type: application/json" \
    --data-binary @- >> extracted.ndjson
done
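For pipelines that already run in Python, the same request can be assembled with the standard library. This is a sketch rather than an official client. It reuses only the endpoint, headers, and `input.pdf_base64` payload shape from the curl call above, and `build_request` is an illustrative name. The response schema is not shown on this page, so the sketch stops at building the request.

```python
import base64
import json
import urllib.request

API_URL = "https://pdftomarkdown.dev/v1/convert"


def build_request(pdf_bytes: bytes, api_key: str) -> urllib.request.Request:
    """Build the POST request for one PDF, mirroring the curl call."""
    payload = {"input": {"pdf_base64": base64.b64encode(pdf_bytes).decode("ascii")}}
    return urllib.request.Request(
        API_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


# Placeholder bytes and key, used here only to show the request shape
request = build_request(b"%PDF-1.4 placeholder", "demo-key")

# Sending is left commented out so the sketch runs without network access:
# with urllib.request.urlopen(request) as response:
#     print(response.read().decode("utf-8"))
```

To process a folder, read each file with `pathlib.Path("invoices").glob("*.pdf")` and pass its bytes to `build_request`.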

Pricing

Both tiers are free. No credit card required.

Hacker

Free, no signup

Public demo key — copy & paste
Only page 1 is processed
1 request/min per IP
Watermark in output

Developer

Free, GitHub login

Personal API key
100 pages/month
Multi-page PDFs
No watermark

Extract invoices free

The Hacker tier needs no account, but it converts page 1 only and adds a watermark. Switch to the Developer tier (also free, with a GitHub login) to remove the watermark and process full multi-page PDFs, up to 100 pages/month.
