Invoice OCR API
One endpoint. POST a PDF, get clean markdown back — tables, headings, and lists preserved exactly as they appear on the page.
$ curl -X POST https://pdftomarkdown.dev/v1/convert \
    -H "Authorization: Bearer demo_public_key" \
    -H "Content-Type: application/json" \
    -d '{"input":{"pdf_url":"https://www.w3.org/WAI/ER/tests/xhtml/testfiles/resources/pdf/dummy.pdf"}}'
{
  "markdown": "# Document\n\n| Column A | Column B |\n|---|---|\n| Value 1 | Value 2 |\n\n**Summary text here**",
  "pages": 3,
  "request_id": "req_abc123"
}
Invoice extraction that actually works
Invoice PDFs come in hundreds of different layouts. Template-based extraction systems — the kind that look for fixed field positions — break the moment a vendor changes their invoice format or uses a slightly different table structure.
pdfToMarkdown uses a vision-language model that reads invoices the way an accountant does. It recognizes that a column of numbers with a “Total” header is a line-item table, regardless of whether the vendor uses portrait or landscape, one column or three.
What you get back
Send an invoice PDF, receive structured markdown:
# Invoice #INV-2024-0842

**Vendor:** Acme Software Ltd
**Bill to:** Widgets Inc, 123 Main St
**Invoice date:** 2024-01-15
**Due date:** 2024-02-14

| Description           | Qty | Unit Price | Total     |
|-----------------------|-----|------------|-----------|
| Enterprise License Q1 | 1   | $1,200.00  | $1,200.00 |
| Additional seats (×5) | 5   | $49.00     | $245.00   |
| Professional services | 8   | $150.00    | $1,200.00 |

**Subtotal:** $2,645.00
**Discount (10%):** -$264.50
**Tax (VAT 20%):** $476.10
**Total due:** $2,856.60

**Payment terms:** Net 30
**Bank:** Barclays — Sort: 20-00-00 — Acc: 12345678
Feed this directly into an LLM for field extraction, or parse the markdown table with a simple regex.
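As a rough illustration of the regex route, the sketch below pulls the line-item table out of the sample invoice. The function name `parse_markdown_table` is made up for this example and is not part of pdfToMarkdown. It assumes simple pipe tables with no escaped `|` inside cells.

```python
import re

# Table portion of the sample invoice output shown above.
markdown = """\
| Description           | Qty | Unit Price | Total     |
|-----------------------|-----|------------|-----------|
| Enterprise License Q1 | 1   | $1,200.00  | $1,200.00 |
| Additional seats (×5) | 5   | $49.00     | $245.00   |
| Professional services | 8   | $150.00    | $1,200.00 |
"""

ROW = re.compile(r"^\|(.+)\|\s*$")            # a line that starts and ends with a pipe
SEPARATOR = re.compile(r"^\|[\s:|-]+\|\s*$")  # the |---|---| line under the header


def parse_markdown_table(text):
    """Return the first pipe table in `text` as a list of dicts keyed by header."""
    rows = []
    for line in text.splitlines():
        line = line.strip()
        if SEPARATOR.match(line):
            continue  # skip the header/body divider
        match = ROW.match(line)
        if match:
            rows.append([cell.strip() for cell in match.group(1).split("|")])
        elif rows:
            break  # first non-table line after the table ends it
    if not rows:
        return []
    header, *body = rows
    return [dict(zip(header, cells)) for cells in body]


line_items = parse_markdown_table(markdown)
for item in line_items:
    print(item["Description"], item["Qty"], item["Total"])
```

Cell values stay as strings (for example `"$1,200.00"`), so amounts still need converting before any arithmetic.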
Downstream extraction example
from pdftomarkdown import convert
import openai

result = convert("invoice.pdf")

# Ask an LLM to extract structured fields
client = openai.OpenAI()
response = client.chat.completions.create(
    model="gpt-4o-mini",
    response_format={"type": "json_object"},  # ask for a JSON object back
    messages=[
        {"role": "system", "content": "Extract invoice fields as JSON."},
        {"role": "user", "content": result.markdown},
    ],
)
print(response.choices[0].message.content)  # JSON string with the extracted fields
The markdown format is LLM-friendly — structured enough that models reliably pull the right values.
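Even with well-structured input, a model can occasionally misread a digit, so a cheap arithmetic check on the extracted values may be worth adding. The sketch below is one possible approach. The field names in `extracted` are a hypothetical shape for the LLM's JSON output, filled with the sample invoice's numbers, and `money` and `totals_reconcile` are illustrative helpers, not part of any library.

```python
from decimal import Decimal


def money(text):
    """Convert a string such as "$1,200.00" or "-$264.50" to a Decimal."""
    return Decimal(text.replace("$", "").replace(",", "").strip())


def totals_reconcile(fields, tolerance=Decimal("0.01")):
    """Check that line items sum to the subtotal and that
    subtotal + discount + tax equals the total due."""
    line_sum = sum((money(t) for t in fields["line_totals"]), Decimal("0"))
    subtotal = money(fields["subtotal"])
    expected_total = subtotal + money(fields["discount"]) + money(fields["tax"])
    return (
        abs(line_sum - subtotal) <= tolerance
        and abs(expected_total - money(fields["total_due"])) <= tolerance
    )


# Hypothetical extraction result, using the values from the sample invoice.
extracted = {
    "line_totals": ["$1,200.00", "$245.00", "$1,200.00"],
    "subtotal": "$2,645.00",
    "discount": "-$264.50",  # already negative, so it is added, not subtracted
    "tax": "$476.10",
    "total_due": "$2,856.60",
}

print(totals_reconcile(extracted))
```

Invoices that fail the check can be routed to manual review instead of being posted automatically.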
Handles real-world invoice variability
- Multi-page invoices with line items spanning pages
- Invoices with VAT breakdowns, discount rows, and subtotals
- Non-English invoices (German, French, Spanish, Japanese)
- Scanned invoices — image-only, no embedded text
- Invoices with logos, letterheads, and complex headers
- PO-based invoices referencing purchase order numbers
AP automation workflow
# Process a batch of invoices
for f in invoices/*.pdf; do
  pdf_base64=$(base64 < "$f" | tr -d '\n')
  # Send the body on stdin so large PDFs don't exceed the argument-length limit
  printf '{"input":{"pdf_base64":"%s"}}' "$pdf_base64" |
    curl -s -X POST https://pdftomarkdown.dev/v1/convert \
      -H "Authorization: Bearer $PDFTOMARKDOWN_API_KEY" \
      -H "Content-Type: application/json" \
      -d @- >> extracted.ndjson
done
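For pipelines that already run in Python, the same batch loop could look roughly like the sketch below. It uses only the standard library and takes the endpoint, headers, and `input.pdf_base64` body from the curl examples on this page. The function names (`build_request`, `convert_file`, `run_batch`) are hypothetical, and the response is assumed to be a JSON object like the sample shown earlier.

```python
import base64
import json
import pathlib
import urllib.request

API_URL = "https://pdftomarkdown.dev/v1/convert"


def build_request(pdf_bytes, api_key):
    """Build (but do not send) a convert request with the PDF inlined as base64."""
    payload = {"input": {"pdf_base64": base64.b64encode(pdf_bytes).decode("ascii")}}
    return urllib.request.Request(
        API_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


def convert_file(path, api_key):
    """Send one PDF and return the decoded JSON response."""
    request = build_request(pathlib.Path(path).read_bytes(), api_key)
    with urllib.request.urlopen(request, timeout=60) as response:
        return json.load(response)


def run_batch(folder, out_path, api_key):
    """Convert every PDF in `folder`, writing one JSON object per line."""
    with open(out_path, "a", encoding="utf-8") as out:
        for pdf in sorted(pathlib.Path(folder).glob("*.pdf")):
            result = convert_file(pdf, api_key)
            out.write(json.dumps(result) + "\n")
```

Calling `run_batch("invoices", "extracted.ndjson", os.environ["PDFTOMARKDOWN_API_KEY"])` (with `import os`) would mirror the shell loop. Re-serializing each response with `json.dumps` keeps every record on a single line, whatever whitespace the API returns.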
Related pages
- OCR API for Developers — general OCR API overview
- Legal Document OCR — extract structured data from contracts
- API documentation — curl examples, response schema, error codes
Pricing
Both tiers are free. No credit card required.
Hacker
Free, no signup
- Public demo key — copy & paste
- Only page 1 is processed
- 1 request/min per IP
- Watermark in output
Developer
Free, GitHub login
- Personal API key
- 100 pages/month
- Multi-page PDFs
- No watermark
Extract invoices free
The Hacker tier needs no account. It converts page 1 only and adds a watermark. Sign in with GitHub for the Developer tier to remove the watermark and unlock full multi-page PDFs.