
Invoice OCR API

One endpoint. POST a PDF, get clean markdown back — tables, headings, and lists preserved exactly as they appear on the page.

$ curl -X POST https://pdftomarkdown.dev/v1/convert \
  -H "Authorization: Bearer demo_public_key" \
  -H "Content-Type: application/json" \
  -d '{"input":{"pdf_url":"https://www.w3.org/WAI/ER/tests/xhtml/testfiles/resources/pdf/dummy.pdf"}}'
{
  "markdown": "# Document\n\n| Column A | Column B |\n|---|---|\n| Value 1 | Value 2 |\n\n**Summary text here**",
  "pages": 3,
  "request_id": "req_abc123"
}
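
The same call can be made from Python. The following is a minimal sketch using only the standard library, assuming the endpoint, headers, and JSON body are exactly those in the curl example above. The helper names `build_convert_request` and `convert_url` are illustrative and not part of any official client.

```python
import json
import urllib.request

# Endpoint taken from the curl example above.
API_URL = "https://pdftomarkdown.dev/v1/convert"


def build_convert_request(pdf_url: str, api_key: str) -> urllib.request.Request:
    """Build the POST request for /v1/convert without sending it."""
    body = json.dumps({"input": {"pdf_url": pdf_url}}).encode("utf-8")
    return urllib.request.Request(
        API_URL,
        data=body,
        method="POST",
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
    )


def convert_url(pdf_url: str, api_key: str, timeout: float = 60.0) -> dict:
    """Send the request and return the decoded JSON response."""
    request = build_convert_request(pdf_url, api_key)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))
```

Calling `convert_url(pdf_url, "demo_public_key")["markdown"]` would then return the markdown string, assuming the response has the shape shown above.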

Invoice extraction that actually works

Invoice PDFs come in hundreds of different layouts. Template-based extraction systems — the kind that look for fixed field positions — break the moment a vendor changes their invoice format or uses a slightly different table structure.

pdfToMarkdown uses a vision-language model that reads invoices the way an accountant does. It recognizes that a column of numbers under a “Total” header belongs to a line-item table, regardless of whether the vendor uses portrait or landscape, one column or three.

What you get back

Send an invoice PDF, receive structured markdown:

# Invoice #INV-2024-0842

**Vendor:** Acme Software Ltd
**Bill to:** Widgets Inc, 123 Main St
**Invoice date:** 2024-01-15
**Due date:** 2024-02-14

| Description              | Qty | Unit Price |    Total |
|--------------------------|-----|------------|----------|
| Enterprise License Q1    |   1 | $1,200.00  | $1,200.00|
| Additional seats (×5)    |   5 |   $49.00   |   $245.00|
| Professional services    |   8 |   $150.00  | $1,200.00|

**Subtotal:** $2,645.00
**Discount (10%):** -$264.50
**Tax (VAT 20%):** $476.10
**Total due:** $2,856.60

**Payment terms:** Net 30
**Bank:** Barclays — Sort: 20-00-00 — Acc: 12345678

Feed this directly into an LLM for field extraction, or parse the markdown table with a simple regex.
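
As a rough illustration of the regex route, the sketch below parses the first pipe-delimited table in the markdown into a list of dicts keyed by the header row. The function names `parse_markdown_table` and `to_decimal` are illustrative, and the parser assumes cells contain no escaped `|` characters and that amounts are formatted like `$1,200.00`.

```python
import re
from decimal import Decimal

ROW_RE = re.compile(r"^\|(.+)\|\s*$")             # any pipe-delimited table row
SEPARATOR_RE = re.compile(r"^\|[\s:\-|]+\|\s*$")  # the |---|---| divider row


def parse_markdown_table(markdown: str) -> list[dict[str, str]]:
    """Return the rows of the first markdown table, keyed by header cell."""
    header = None
    rows = []
    for line in markdown.splitlines():
        line = line.strip()
        match = ROW_RE.match(line)
        if not match:
            if header is not None:
                break  # the first table has ended
            continue
        if SEPARATOR_RE.match(line):
            continue  # skip the divider between header and body
        cells = [cell.strip() for cell in match.group(1).split("|")]
        if header is None:
            header = cells
        else:
            rows.append(dict(zip(header, cells)))
    return rows


def to_decimal(amount: str) -> Decimal:
    """Convert a string such as '$1,200.00' to a Decimal."""
    return Decimal(amount.replace("$", "").replace(",", "").strip())
```

For the sample invoice above, this yields three rows, and summing `to_decimal(row["Total"])` over them gives the subtotal of 2645.00.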

Downstream extraction example

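from pdftomarkdown import convert
import openai

# Convert the invoice PDF to markdown
result = convert("invoice.pdf")

# Ask an LLM to extract structured fields
client = openai.OpenAI()  # reads OPENAI_API_KEY from the environment
response = client.chat.completions.create(
    model="gpt-4o-mini",
    response_format={"type": "json_object"},  # ask for a JSON object in the reply
    messages=[
        {"role": "system", "content": "Extract invoice fields as JSON."},
        {"role": "user", "content": result.markdown},
    ],
)

# The extracted fields arrive as a JSON string
print(response.choices[0].message.content)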

The markdown format is LLM-friendly: labels stay next to their values and line items stay in table rows, so a model has the structure it needs to pull the right fields.
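
Model output is still worth a sanity check before it reaches a ledger. The sketch below checks that the extracted totals are arithmetically consistent. The field names `subtotal`, `discount`, `tax`, and `total` are assumptions about what your prompt asks the model to return, not a schema defined by the API.

```python
import json
from decimal import Decimal


def check_invoice_totals(fields_json: str, tolerance: Decimal = Decimal("0.01")) -> bool:
    """Return True if subtotal - discount + tax matches total within tolerance."""
    # parse_float=Decimal avoids binary floating-point rounding on money values.
    fields = json.loads(fields_json, parse_float=Decimal)

    def amount(key: str) -> Decimal:
        # Accept numbers or numeric strings; a missing key counts as zero.
        return Decimal(str(fields.get(key, 0)))

    # The discount may come back as 264.50 or -264.50, so use its magnitude.
    expected = amount("subtotal") - abs(amount("discount")) + amount("tax")
    return abs(expected - amount("total")) <= tolerance
```

With the sample invoice above, 2645.00 - 264.50 + 476.10 equals 2856.60, so the check passes. A mismatch is a signal to route the invoice to manual review.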

Handles real-world invoice variability

  • Multi-page invoices with line items spanning pages
  • Invoices with VAT breakdowns, discount rows, and subtotals
  • Non-English invoices (German, French, Spanish, Japanese)
  • Scanned invoices — image-only, no embedded text
  • Invoices with logos, letterheads, and complex headers
  • PO-based invoices referencing purchase order numbers

AP automation workflow

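# Process a batch of invoices (assumes jq is installed)
for f in invoices/*.pdf; do
  # Stream the JSON body over stdin: a base64-encoded PDF can exceed
  # the shell's argument-length limit when passed inline with -d.
  { printf '{"input":{"pdf_base64":"'; base64 < "$f" | tr -d '\n'; printf '"}}'; } |
    curl -s -X POST https://pdftomarkdown.dev/v1/convert \
      -H "Authorization: Bearer $PDFTOMARKDOWN_API_KEY" \
      -H "Content-Type: application/json" \
      --data-binary @- |
    jq -c . >> extracted.ndjson  # one compact JSON object per line
done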
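
To consume the batch output afterwards, something like the following sketch could read `extracted.ndjson` back into Python. It assumes each record has the `markdown`, `pages`, and `request_id` keys shown in the response example. It decodes consecutive JSON values rather than splitting on newlines, so it also tolerates responses that were appended without a trailing newline. The function names are illustrative.

```python
import json
from pathlib import Path
from typing import Iterator


def iter_json_values(text: str) -> Iterator[dict]:
    """Yield each JSON value from text holding one or more concatenated values."""
    decoder = json.JSONDecoder()
    pos = 0
    while True:
        # raw_decode does not skip leading whitespace, so do it here.
        while pos < len(text) and text[pos].isspace():
            pos += 1
        if pos >= len(text):
            return
        value, pos = decoder.raw_decode(text, pos)
        yield value


def load_results(path: str) -> list[dict]:
    """Load every response record written by the batch loop."""
    return list(iter_json_values(Path(path).read_text(encoding="utf-8")))
```

For example, `sum(r["pages"] for r in load_results("extracted.ndjson"))` gives the number of pages processed in the batch, which is useful to track against a monthly page quota.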

Pricing

Both tiers are free. No credit card required.

Hacker

Free, no signup

  • Public demo key — copy & paste
  • Only page 1 is processed
  • 1 request/min per IP
  • Watermark in output
View docs →

Developer

Free, GitHub login

  • Personal API key
  • 100 pages/month
  • Multi-page PDFs
  • No watermark
Get API key →

Extract invoices free

The Hacker tier needs no account. It converts page 1 only and adds a watermark. Upgrade to the Developer tier to remove the watermark and unlock full multi-page PDFs.