The invoice data extraction problem

Accounts payable teams process thousands of invoices per month. Each vendor sends a different layout — different table positions, different date formats, different line item structures. Manual data entry costs time and introduces errors. The usual automation approach is a template-based extractor that maps field coordinates for each vendor. This works until the vendor updates their invoice template, or you onboard a new supplier. Then it breaks.

Tesseract-based OCR makes this worse. Its default plain-text output flattens the page into a character stream and discards table structure. A line-items table becomes interleaved text where a row such as “Widget A 10 $25.00” is indistinguishable from three separate lines. Reconstructing the table means writing fragile heuristics per vendor.

pdfToMarkdown takes a different approach. A vision-language model reads the invoice the way a human AP clerk does — recognizing tables, headers, totals, and payment details by visual layout cues, not fixed coordinates. The output is structured markdown that preserves the full document hierarchy. No templates. No per-vendor configuration.

The full extraction pipeline

The pipeline is four steps:

1. PDF in: send the invoice PDF to the API (URL, base64, or file upload)
2. Markdown out: receive structured markdown with tables, headers, and key-value pairs intact
3. Parse fields: extract vendor, invoice number, date, line items, and totals from the markdown using regex or an LLM
4. Push to accounting: send structured data to QuickBooks, Xero, NetSuite, or your ERP via their API

Steps 1-2 are one API call. Step 3 is a few lines of Python. Step 4 depends on your system.
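One way to keep the steps decoupled is to agree on the record that step 3 produces and step 4 consumes. The sketch below is only an assumption about what that record might look like: the `Invoice` and `LineItem` classes and their field names are illustrative and are not part of the pdfToMarkdown API.

```python
from dataclasses import dataclass, field, asdict
from decimal import Decimal


@dataclass
class LineItem:
    description: str
    quantity: int
    unit_price: Decimal
    total: Decimal


@dataclass
class Invoice:
    invoice_number: str
    vendor: str
    invoice_date: str  # ISO 8601 date string, e.g. "2024-06-12"
    due_date: str
    total_due: Decimal
    line_items: list[LineItem] = field(default_factory=list)


invoice = Invoice(
    invoice_number="INV-2024-1337",
    vendor="Northwind Electronics GmbH",
    invoice_date="2024-06-12",
    due_date="2024-07-12",
    total_due=Decimal("9682.73"),
    line_items=[
        LineItem("Overnight shipping", 1, Decimal("340.00"), Decimal("340.00")),
    ],
)

# asdict() converts nested dataclasses too, giving a plain dict for step 4.
record = asdict(invoice)
```

Amounts are held as `Decimal` rather than `float` so that later comparisons against invoice totals are exact.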

Steps 1-2: Convert invoice PDF to markdown

curl

curl -X POST https://pdftomarkdown.dev/v1/convert \
  -H "Authorization: Bearer demo_public_key" \
  -H "Content-Type: application/json" \
  -d '{"input":{"pdf_url":"https://example.com/invoices/INV-2024-1337.pdf"}}'
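If you are not using the SDK, the request body can be built by hand. The following is a minimal sketch that only constructs the JSON payload and does not send it. The `pdf_url` and `pdf_base64` field names are the ones used in the curl examples in this guide; the helper name `build_request_body` is made up for illustration.

```python
import base64
import json


def build_request_body(pdf_url=None, pdf_bytes=None):
    """Return the JSON body for /v1/convert from a URL or from raw PDF bytes."""
    if (pdf_url is None) == (pdf_bytes is None):
        raise ValueError("pass exactly one of pdf_url or pdf_bytes")
    if pdf_url is not None:
        return json.dumps({"input": {"pdf_url": pdf_url}})
    # Base64 output is plain ASCII, so it is safe inside a JSON string.
    encoded = base64.b64encode(pdf_bytes).decode("ascii")
    return json.dumps({"input": {"pdf_base64": encoded}})


# Placeholder bytes stand in for the contents of a real PDF file.
body = build_request_body(pdf_bytes=b"%PDF-1.7 minimal example bytes")
```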

Python SDK

from pdftomarkdown import convert

result = convert("INV-2024-1337.pdf")
print(result.markdown)

Example output

For a typical vendor invoice, the API returns:

# Invoice #INV-2024-1337

**Vendor:** Northwind Electronics GmbH
**Vendor address:** Berliner Str. 42, 10115 Berlin, Germany
**VAT ID:** DE123456789

**Bill to:** Acme Corp, 500 Startup Lane, San Francisco, CA 94107
**Invoice date:** 2024-06-12
**Due date:** 2024-07-12
**PO Number:** PO-8821

| Item                           | Qty |  Unit Price |      Total |
|--------------------------------|-----|-------------|------------|
| Industrial sensor module v3    |  50 |     $124.00 |  $6,200.00 |
| Calibration service (per unit) |  50 |      $18.50 |    $925.00 |
| Overnight shipping             |   1 |     $340.00 |    $340.00 |
| Extended warranty (12 mo)      |  50 |      $22.00 |  $1,100.00 |

**Subtotal:** $8,565.00
**Discount (5% volume):** -$428.25
**Tax (VAT 19%):** $1,545.98
**Total due:** $9,682.73

**Payment terms:** Net 30
**Bank:** Deutsche Bank — IBAN: DE89 3704 0044 0532 0130 00
**SWIFT:** COBADEFFXXX

Tables stay as tables. Key-value pairs stay on labeled lines. Multi-page invoices are concatenated into a single markdown document with page breaks preserved.
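Because the table arrives as a regular pipe table, it can be read by header name rather than by column position. The function below is a rough sketch of that idea, not part of the SDK: `parse_markdown_table` is a hypothetical helper, and it assumes cells contain no escaped pipe characters.

```python
def parse_markdown_table(md):
    """Return the first pipe table in `md` as a list of {header: cell} dicts."""
    rows = []
    for line in md.splitlines():
        line = line.strip()
        if line.startswith("|") and line.endswith("|"):
            rows.append([cell.strip() for cell in line[1:-1].split("|")])
        elif rows:
            break  # the first table has ended
    if len(rows) < 2:
        return []
    header, body = rows[0], rows[2:]  # rows[1] is the |---|---| separator
    return [dict(zip(header, row)) for row in body]


sample = """\
| Item               | Qty | Unit Price |     Total |
|--------------------|-----|------------|-----------|
| Overnight shipping |   1 |    $340.00 |   $340.00 |
| Extended warranty  |  50 |     $22.00 | $1,100.00 |
"""
items = parse_markdown_table(sample)
```

Keying cells by header means a vendor that adds or reorders a column does not silently shift values into the wrong field.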

Step 3: Extract structured fields from markdown

Once you have the markdown, extracting fields is straightforward. Two approaches work well: regex for deterministic parsing, or an LLM for handling edge cases.

Approach A: Regex extraction

Good for pipelines where you want deterministic, auditable output with no LLM dependency.

import re
import json
from pdftomarkdown import convert

result = convert("INV-2024-1337.pdf")
md = result.markdown

# Extract header fields
def extract_field(pattern, text):
    match = re.search(pattern, text)
    return match.group(1).strip() if match else None

invoice = {
    "invoice_number": extract_field(r"#\s*(INV-[\w-]+)", md),
    "vendor": extract_field(r"\*\*Vendor:\*\*\s*(.+)", md),
    "invoice_date": extract_field(r"\*\*Invoice date:\*\*\s*(.+)", md),
    "due_date": extract_field(r"\*\*Due date:\*\*\s*(.+)", md),
    "po_number": extract_field(r"\*\*PO Number:\*\*\s*(.+)", md),
    "total_due": extract_field(r"\*\*Total due:\*\*\s*(.+)", md),
}

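# Extract line items from the markdown table
line_items = []
# Anchor each match to the start of a line and keep pipes out of the
# description, so a match cannot start on the previous row's closing pipe.
# Header and separator rows are skipped because Qty must be all digits.
rows = re.findall(
    r"^\|\s*([^|\n]+?)\s*\|\s*(\d+)\s*\|\s*\$([\d,.]+)\s*\|\s*\$([\d,.]+)\s*\|",
    md,
    flags=re.MULTILINE,
)
for desc, qty, unit_price, total in rows: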
    line_items.append({
        "description": desc.strip(),
        "quantity": int(qty),
        "unit_price": float(unit_price.replace(",", "")),
        "total": float(total.replace(",", "")),
    })

invoice["line_items"] = line_items
print(json.dumps(invoice, indent=2))

Output:

{
  "invoice_number": "INV-2024-1337",
  "vendor": "Northwind Electronics GmbH",
  "invoice_date": "2024-06-12",
  "due_date": "2024-07-12",
  "po_number": "PO-8821",
  "total_due": "$9,682.73",
  "line_items": [
    {
      "description": "Industrial sensor module v3",
      "quantity": 50,
      "unit_price": 124.0,
      "total": 6200.0
    },
    {
      "description": "Calibration service (per unit)",
      "quantity": 50,
      "unit_price": 18.5,
      "total": 925.0
    },
    {
      "description": "Overnight shipping",
      "quantity": 1,
      "unit_price": 340.0,
      "total": 340.0
    },
    {
      "description": "Extended warranty (12 mo)",
      "quantity": 50,
      "unit_price": 22.0,
      "total": 1100.0
    }
  ]
}
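The regex example keeps `total_due` as a string and parses line amounts as `float`. For accounting data it is usually safer to parse amounts as `Decimal` and check that the figures reconcile before posting anything. The sketch below assumes US-style formatting (`$` prefix, comma thousands separators) as in the sample invoice; `parse_money` and `check_totals` are illustrative names.

```python
from decimal import Decimal


def parse_money(text):
    """Convert a string such as '$9,682.73' or '-$428.25' to a Decimal."""
    cleaned = text.replace("$", "").replace(",", "").strip()
    return Decimal(cleaned)


def check_totals(line_totals, subtotal, discount, tax, total_due):
    """Return a list of mismatches; an empty list means the figures agree."""
    problems = []
    if sum(line_totals, Decimal("0")) != subtotal:
        problems.append("line items do not add up to subtotal")
    if subtotal + discount + tax != total_due:
        problems.append("subtotal + discount + tax does not equal total due")
    return problems


line_strings = ["$6,200.00", "$925.00", "$340.00", "$1,100.00"]
problems = check_totals(
    line_totals=[parse_money(t) for t in line_strings],
    subtotal=parse_money("$8,565.00"),
    discount=parse_money("-$428.25"),  # the discount arrives as a negative amount
    tax=parse_money("$1,545.98"),
    total_due=parse_money("$9,682.73"),
)
```

An invoice that fails this check is a good candidate for manual review rather than automatic posting.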

Approach B: LLM-based extraction

Better when invoice formats vary widely and you want to handle edge cases (multiple tax lines, credit notes, partial payments) without writing regex for every variant.

import json
from pdftomarkdown import convert
import openai

result = convert("INV-2024-1337.pdf")

client = openai.OpenAI()
response = client.chat.completions.create(
    model="gpt-4o-mini",
    response_format={"type": "json_object"},
    messages=[
        {
            "role": "system",
            "content": (
                "Extract invoice data as JSON with these fields: "
                "invoice_number, vendor, vendor_address, bill_to, "
                "invoice_date (ISO 8601), due_date (ISO 8601), "
                "po_number, currency, line_items (array of "
                "{description, quantity, unit_price, total}), "
                "subtotal, discount, tax, total_due, payment_terms."
            ),
        },
        {"role": "user", "content": result.markdown},
    ],
)

invoice_data = json.loads(response.choices[0].message.content)

The markdown is compact enough to fit in a single prompt and structured enough that the model rarely hallucinates field values. Because labels and table rows are already explicit, this is typically more accurate than passing raw OCR text or a base64 image directly to the LLM.
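Even with JSON mode, the model's output is worth checking before it reaches an accounting system. The following is a minimal validation sketch under stated assumptions: the list of required fields is a choice made for this example, and `validate_invoice` is a hypothetical helper, not something the SDK or the OpenAI client provides.

```python
from datetime import date

# Fields this sketch treats as mandatory; adjust to your own requirements.
REQUIRED_FIELDS = (
    "invoice_number", "vendor", "invoice_date",
    "due_date", "line_items", "total_due",
)


def validate_invoice(data):
    """Return a list of problems found in an extracted invoice dict."""
    problems = [
        f"missing field: {name}"
        for name in REQUIRED_FIELDS
        if data.get(name) in (None, "", [])
    ]
    for name in ("invoice_date", "due_date"):
        value = data.get(name)
        if value:
            try:
                date.fromisoformat(value)
            except (TypeError, ValueError):
                problems.append(f"{name} is not an ISO 8601 date: {value!r}")
    return problems


extracted = {
    "invoice_number": "INV-2024-1337",
    "vendor": "Northwind Electronics GmbH",
    "invoice_date": "12/06/2024",  # wrong format on purpose
    "due_date": "2024-07-12",
    "line_items": [],              # empty on purpose
    "total_due": 9682.73,
}
problems = validate_invoice(extracted)
```

Records with a non-empty problem list can be retried with a stricter prompt or routed to a human.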

Step 4: Push to your accounting system

With structured JSON, posting to your ERP is standard API work:

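import requests

def push_to_quickbooks(invoice_data, access_token, realm_id, vendor_id, expense_account_id):
    """Example: create a bill in QuickBooks Online.

    vendor_id and expense_account_id are the QuickBooks IDs of an existing
    vendor and expense account; look them up before calling this function.
    """
    line_items_qb = [
        {
            "Amount": item["total"],
            "DetailType": "AccountBasedExpenseLineDetail",
            "Description": item["description"],
            # QuickBooks expects the detail object named by DetailType.
            "AccountBasedExpenseLineDetail": {
                "AccountRef": {"value": expense_account_id},
            },
        }
        for item in invoice_data["line_items"]
    ]

    bill = {
        # The vendor is resolved by ID ("value"); "name" is informational.
        "VendorRef": {"value": vendor_id, "name": invoice_data["vendor"]},
        "TxnDate": invoice_data["invoice_date"],
        "DueDate": invoice_data["due_date"],
        "Line": line_items_qb,
    }

    response = requests.post(
        f"https://quickbooks.api.intuit.com/v3/company/{realm_id}/bill",
        headers={
            "Authorization": f"Bearer {access_token}",
            "Accept": "application/json",
        },
        json=bill,
        timeout=30,
    )
    response.raise_for_status()  # surface 4xx/5xx errors instead of ignoring them
    return response.json()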

Replace with the equivalent for Xero, NetSuite, SAP, or your internal system. The shape of the data is the same regardless of the invoice source.

Batch processing

Process an entire inbox of invoices with a shell loop:

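for f in invoices/*.pdf; do
  # Build the request body with jq and stream it to curl on stdin, so a large
  # base64 payload is never passed as a command-line argument. The filename
  # goes in via --arg so quotes in a filename cannot break the jq program.
  base64 < "$f" | tr -d '\n' \
    | jq -Rsc '{input: {pdf_base64: .}}' \
    | curl -s -X POST https://pdftomarkdown.dev/v1/convert \
        -H "Authorization: Bearer demo_public_key" \
        -H "Content-Type: application/json" \
        -d @- \
    | jq -c --arg file "$f" '{file: $file, markdown: .output.markdown}' >> results.ndjson
done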

Or in Python with concurrency:

import os
from concurrent.futures import ThreadPoolExecutor
from pdftomarkdown import convert

pdf_files = [f for f in os.listdir("invoices") if f.endswith(".pdf")]

def process(filename):
    result = convert(f"invoices/{filename}")
    return {"file": filename, "markdown": result.markdown}

with ThreadPoolExecutor(max_workers=5) as pool:
    results = list(pool.map(process, pdf_files))
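With `pool.map`, the first file that raises an exception aborts the collection of results. For a large batch it may be preferable to record failures and keep going. The sketch below shows one way to do that with `as_completed`; `run_batch` is an illustrative helper, and `fake_convert` is a stand-in for the real conversion call so the example runs without network access.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed


def run_batch(filenames, convert_one, max_workers=5):
    """Convert files concurrently; return (results, failures) instead of raising."""
    results, failures = [], []
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = {pool.submit(convert_one, name): name for name in filenames}
        for future in as_completed(futures):
            name = futures[future]
            try:
                results.append({"file": name, "markdown": future.result()})
            except Exception as exc:  # record the failure and keep going
                failures.append({"file": name, "error": str(exc)})
    return results, failures


# Stand-in for the real conversion call (an assumption for this sketch).
def fake_convert(name):
    if name == "corrupt.pdf":
        raise ValueError("not a valid PDF")
    return f"# Invoice from {name}"


results, failures = run_batch(["a.pdf", "corrupt.pdf", "b.pdf"], fake_convert)
```

Results arrive in completion order, not input order, so sort by filename if order matters. The `failures` list can be fed back into `run_batch` for a retry pass.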

Why markdown beats other intermediate formats

Template-based extractors output key-value pairs tied to one vendor’s layout. If the vendor changes their template, the extractor breaks. Raw OCR text loses all structure — you cannot distinguish a line item row from an address line.

Markdown sits in between: structured enough to parse programmatically, flexible enough to represent any invoice layout. LLMs also read markdown well, so you can chain extraction with validation, classification, or anomaly detection without reformatting.

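| Approach               | Handles new vendors | Preserves tables | LLM-compatible | Setup per vendor |
|------------------------|---------------------|------------------|----------------|------------------|
| Template-based         | No                  | Partial          | No             | Hours            |
| Tesseract + heuristics | Partially           | No               | No             | Hours            |
| pdfToMarkdown + regex  | Yes                 | Yes              | N/A            | None             |
| pdfToMarkdown + LLM    | Yes                 | Yes              | Yes            | None             |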
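The "pdfToMarkdown + regex" row depends on parsing that is not tied to one vendor's set of labels. One way to approach that, sketched here as an assumption rather than a documented feature, is to collect every `**Label:** value` line into a dictionary and pick out the fields you need afterwards. `parse_labeled_fields` is a hypothetical helper.

```python
import re

# One labeled line: **Label:** value (the value must be on the same line).
FIELD_LINE = re.compile(
    r"^\*\*(?P<label>[^*:]+):\*\*[ \t]*(?P<value>.+)$", re.MULTILINE
)


def parse_labeled_fields(md):
    """Map every '**Label:** value' line to a snake_case key."""
    fields = {}
    for match in FIELD_LINE.finditer(md):
        label = match.group("label").lower()
        key = re.sub(r"[^a-z0-9]+", "_", label).strip("_")
        fields[key] = match.group("value").strip()
    return fields


sample = """\
**Vendor:** Northwind Electronics GmbH
**Invoice date:** 2024-06-12
**PO Number:** PO-8821
**Tax (VAT 19%):** $1,545.98
"""
fields = parse_labeled_fields(sample)
```

A new vendor that adds a label simply produces an extra key. Labels that vary in wording between vendors (for example "Total due" and "Amount due") still need a small alias table on top of this.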

What it handles

Invoices from any vendor, any layout, any country
Multi-page invoices with line items spanning pages
Scanned PDFs — image-only, no selectable text
VAT invoices with multiple tax rates and breakdowns
Credit notes and adjustment memos
Invoices in English, German, French, Spanish, Japanese, and other languages
Complex headers with logos, PO references, and multiple addresses

Related pages

Invoice OCR API: overview of invoice OCR capabilities
OCR API for Developers: general-purpose OCR API reference
PDF Parsing API: extract structured text from any PDF type
API documentation: full endpoint reference, response schema, error codes
