invoice data extraction api

Automate Invoice Data Extraction

One endpoint. POST a PDF, get clean markdown back — tables, headings, and lists preserved exactly as they appear on the page.

$ curl -X POST https://pdftomarkdown.dev/v1/convert \
  -H "Authorization: Bearer demo_public_key" \
  -H "Content-Type: application/json" \
  -d '{"input":{"pdf_url":"https://www.w3.org/WAI/ER/tests/xhtml/testfiles/resources/pdf/dummy.pdf"}}'
{
  "markdown": "# Document\n\n| Column A | Column B |\n|---|---|\n| Value 1 | Value 2 |\n\n**Summary text here**",
  "pages": 3,
  "request_id": "req_abc123"
}

The invoice data extraction problem

Accounts payable teams process thousands of invoices per month. Each vendor sends a different layout — different table positions, different date formats, different line item structures. Manual data entry costs time and introduces errors. The usual automation approach is a template-based extractor that maps field coordinates for each vendor. This works until the vendor updates their invoice template, or you onboard a new supplier. Then it breaks.

Tesseract-based OCR makes this worse. It flattens the page into a character stream, losing table structure entirely. A line-items table becomes interleaved text where “Widget A 10 $25.00” is indistinguishable from three separate lines. Reconstructing the table means writing fragile heuristics per vendor.

pdfToMarkdown takes a different approach. A vision-language model reads the invoice the way a human AP clerk does — recognizing tables, headers, totals, and payment details by visual layout cues, not fixed coordinates. The output is structured markdown that preserves the full document hierarchy. No templates. No per-vendor configuration.

The full extraction pipeline

The pipeline is four steps:

  1. PDF in — send the invoice PDF to the API (URL, base64, or file upload)
  2. Markdown out — receive structured markdown with tables, headers, and key-value pairs intact
  3. Parse fields — extract vendor, invoice number, date, line items, and totals from the markdown using regex or an LLM
  4. Push to accounting — send structured data to QuickBooks, Xero, NetSuite, or your ERP via their API

Steps 1-2 are one API call. Step 3 is a few lines of Python. Step 4 depends on your system.

Steps 1-2: Convert invoice PDF to markdown

curl

curl -X POST https://pdftomarkdown.dev/v1/convert \
  -H "Authorization: Bearer demo_public_key" \
  -H "Content-Type: application/json" \
  -d '{"input":{"pdf_url":"https://example.com/invoices/INV-2024-1337.pdf"}}'

Python SDK

from pdftomarkdown import convert

result = convert("INV-2024-1337.pdf")
print(result.markdown)
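
If you would rather not install the SDK, the same request can be sketched with the Python standard library. This is a minimal sketch, not the SDK's implementation: the helper names `build_payload` and `convert_pdf` are hypothetical, the `pdf_base64` input field is taken from the batch example on this page, and reading `markdown` from the top level of the response is an assumption based on the sample response at the top of the page.

```python
import base64
import json
import urllib.request

API_URL = "https://pdftomarkdown.dev/v1/convert"  # endpoint from the curl example


def build_payload(pdf_bytes: bytes) -> dict:
    """Wrap raw PDF bytes in the {"input": {"pdf_base64": ...}} request body."""
    encoded = base64.b64encode(pdf_bytes).decode("ascii")
    return {"input": {"pdf_base64": encoded}}


def convert_pdf(pdf_bytes: bytes, api_key: str = "demo_public_key") -> str:
    """POST the PDF and return the markdown string from the JSON response."""
    request = urllib.request.Request(
        API_URL,
        data=json.dumps(build_payload(pdf_bytes)).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=60) as response:
        body = json.load(response)
    # Assumption: "markdown" is a top-level key, as in the sample response
    return body["markdown"]
```

Usage would look like `convert_pdf(open("INV-2024-1337.pdf", "rb").read())`. Keeping the payload builder separate from the network call makes the request shape easy to unit-test offline.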

Example output

For a typical vendor invoice, the API returns:

# Invoice #INV-2024-1337

**Vendor:** Northwind Electronics GmbH
**Vendor address:** Berliner Str. 42, 10115 Berlin, Germany
**VAT ID:** DE123456789

**Bill to:** Acme Corp, 500 Startup Lane, San Francisco, CA 94107
**Invoice date:** 2024-06-12
**Due date:** 2024-07-12
**PO Number:** PO-8821

| Item                           | Qty |  Unit Price |      Total |
|--------------------------------|-----|-------------|------------|
| Industrial sensor module v3    |  50 |     $124.00 |  $6,200.00 |
| Calibration service (per unit) |  50 |      $18.50 |    $925.00 |
| Overnight shipping             |   1 |     $340.00 |    $340.00 |
| Extended warranty (12 mo)      |  50 |      $22.00 |  $1,100.00 |

**Subtotal:** $8,565.00
**Discount (5% volume):** -$428.25
**Tax (VAT 19%):** $1,545.98
**Total due:** $9,682.73

**Payment terms:** Net 30
**Bank:** Commerzbank, IBAN: DE89 3704 0044 0532 0130 00
**SWIFT:** COBADEFFXXX

Tables stay as tables. Key-value pairs stay on labeled lines. Multi-page invoices are concatenated into a single markdown document with page breaks preserved.

Step 3: Extract structured fields from markdown

Once you have the markdown, extracting fields is straightforward. Two approaches work well: regex for deterministic parsing, or an LLM for handling edge cases.

Approach A: Regex extraction

Good for pipelines where you want deterministic, auditable output with no LLM dependency.

import re
import json
from pdftomarkdown import convert

result = convert("INV-2024-1337.pdf")
md = result.markdown

# Extract header fields
def extract_field(pattern, text):
    match = re.search(pattern, text)
    return match.group(1).strip() if match else None

invoice = {
    "invoice_number": extract_field(r"#\s*(INV-[\w-]+)", md),
    "vendor": extract_field(r"\*\*Vendor:\*\*\s*(.+)", md),
    "invoice_date": extract_field(r"\*\*Invoice date:\*\*\s*(.+)", md),
    "due_date": extract_field(r"\*\*Due date:\*\*\s*(.+)", md),
    "po_number": extract_field(r"\*\*PO Number:\*\*\s*(.+)", md),
    "total_due": extract_field(r"\*\*Total due:\*\*\s*(.+)", md),
}

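# Extract line items from the markdown table.
# Anchoring at the start of each line (re.MULTILINE) and excluding "|" from the
# description keeps a match from starting on the separator row above and
# pulling a stray "| " into the first description.
line_items = []
row_pattern = re.compile(
    r"^\|\s*([^|\n]+?)\s*\|\s*(\d+)\s*\|\s*\$([\d,.]+)\s*\|\s*\$([\d,.]+)\s*\|",
    re.MULTILINE,
)
for desc, qty, unit_price, total in row_pattern.findall(md):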
    line_items.append({
        "description": desc.strip(),
        "quantity": int(qty),
        "unit_price": float(unit_price.replace(",", "")),
        "total": float(total.replace(",", "")),
    })

invoice["line_items"] = line_items
print(json.dumps(invoice, indent=2))

Output:

{
  "invoice_number": "INV-2024-1337",
  "vendor": "Northwind Electronics GmbH",
  "invoice_date": "2024-06-12",
  "due_date": "2024-07-12",
  "po_number": "PO-8821",
  "total_due": "$9,682.73",
  "line_items": [
    {
      "description": "Industrial sensor module v3",
      "quantity": 50,
      "unit_price": 124.0,
      "total": 6200.0
    },
    {
      "description": "Calibration service (per unit)",
      "quantity": 50,
      "unit_price": 18.5,
      "total": 925.0
    },
    {
      "description": "Overnight shipping",
      "quantity": 1,
      "unit_price": 340.0,
      "total": 340.0
    },
    {
      "description": "Extended warranty (12 mo)",
      "quantity": 50,
      "unit_price": 22.0,
      "total": 1100.0
    }
  ]
}
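
Because the amounts survive as plain text, the extracted figures can be cross-checked before anything reaches the ledger. The sketch below is one possible approach rather than part of the API: it reconciles the example invoice's line totals, subtotal, discount, tax, and total using `Decimal` to avoid float rounding. The helper names (`money`, `labeled_amount`, `reconcile`) are hypothetical, and the label names assume the bold `**Label:**` lines shown in the example output.

```python
import re
from decimal import Decimal

# Trimmed copy of the line items and totals from the example output
SAMPLE_MD = """\
| Industrial sensor module v3    |  50 |     $124.00 |  $6,200.00 |
| Calibration service (per unit) |  50 |      $18.50 |    $925.00 |
| Overnight shipping             |   1 |     $340.00 |    $340.00 |
| Extended warranty (12 mo)      |  50 |      $22.00 |  $1,100.00 |

**Subtotal:** $8,565.00
**Discount (5% volume):** -$428.25
**Tax (VAT 19%):** $1,545.98
**Total due:** $9,682.73
"""

# Qty | Unit Price | Total cells of one table row
ROW = re.compile(
    r"\|[ \t]*(\d+)[ \t]*\|[ \t]*\$([\d,.]+)[ \t]*\|[ \t]*\$([\d,.]+)[ \t]*\|"
)


def money(text):
    """Parse '$1,234.50', '1,234.50', or '-$428.25' into a Decimal."""
    return Decimal(text.replace("$", "").replace(",", "").strip())


def labeled_amount(label, md):
    """Return the amount on the first bold line whose label starts with `label`."""
    match = re.search(rf"\*\*{re.escape(label)}[^*\n]*\*\*[ \t]*(-?\$[\d,.]+)", md)
    if match is None:
        raise ValueError(f"no amount found for label {label!r}")
    return money(match.group(1))


def reconcile(md):
    """Check that rows, subtotal, and total due agree with each other."""
    rows = [(Decimal(q), money(u), money(t)) for q, u, t in ROW.findall(md)]
    subtotal = labeled_amount("Subtotal", md)
    discount = labeled_amount("Discount", md)  # already negative in the markdown
    tax = labeled_amount("Tax", md)
    total_due = labeled_amount("Total due", md)
    return {
        "line_totals": all(q * u == t for q, u, t in rows),
        "subtotal": sum((t for _, _, t in rows), Decimal("0")) == subtotal,
        "total_due": subtotal + discount + tax == total_due,
    }


print(reconcile(SAMPLE_MD))
```

Any `False` in the result is a signal to route the invoice to manual review instead of posting it automatically.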

Approach B: LLM-based extraction

Better when invoice formats vary widely and you want to handle edge cases (multiple tax lines, credit notes, partial payments) without writing regex for every variant.

import json
from pdftomarkdown import convert
import openai

result = convert("INV-2024-1337.pdf")

client = openai.OpenAI()
response = client.chat.completions.create(
    model="gpt-4o-mini",
    response_format={"type": "json_object"},
    messages=[
        {
            "role": "system",
            "content": (
                "Extract invoice data as JSON with these fields: "
                "invoice_number, vendor, vendor_address, bill_to, "
                "invoice_date (ISO 8601), due_date (ISO 8601), "
                "po_number, currency, line_items (array of "
                "{description, quantity, unit_price, total}), "
                "subtotal, discount, tax, total_due, payment_terms."
            ),
        },
        {"role": "user", "content": result.markdown},
    ],
)

invoice_data = json.loads(response.choices[0].message.content)

The markdown is compact, and its labeled lines and tables give the LLM explicit field boundaries to anchor on, so hallucinated field values are rare. This is typically more accurate than passing raw OCR text or a base64 image directly to the LLM.
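
LLM output is still worth checking before it is trusted downstream. The following is a minimal validation sketch, assuming the field names requested in the system prompt above; the function name `validate_invoice` and the choice of which fields count as required are illustrative assumptions, not part of any API.

```python
from datetime import date

# Assumption: these are the fields your downstream system cannot do without
REQUIRED_FIELDS = (
    "invoice_number", "vendor", "invoice_date",
    "due_date", "line_items", "total_due",
)
LINE_ITEM_KEYS = {"description", "quantity", "unit_price", "total"}


def validate_invoice(data: dict) -> list[str]:
    """Return a list of problems; an empty list means the record looks usable."""
    problems = [
        f"missing field: {name}"
        for name in REQUIRED_FIELDS
        if data.get(name) in (None, "", [])
    ]
    # The prompt asks for ISO 8601 dates, so anything else is flagged
    for name in ("invoice_date", "due_date"):
        value = data.get(name)
        if value:
            try:
                date.fromisoformat(value)
            except (TypeError, ValueError):
                problems.append(f"{name} is not an ISO 8601 date: {value!r}")
    # Every line item should be an object carrying all four keys
    for index, item in enumerate(data.get("line_items") or []):
        if not isinstance(item, dict):
            problems.append(f"line item {index} is not an object")
            continue
        missing = LINE_ITEM_KEYS - set(item)
        if missing:
            problems.append(f"line item {index} lacks {sorted(missing)}")
    return problems
```

Calling `validate_invoice(invoice_data)` right after `json.loads` gives a cheap gate: an empty list lets the record through, anything else can be retried or sent for review.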

Step 4: Push to your accounting system

With structured JSON, posting to your ERP is standard API work:

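import requests

def push_to_quickbooks(invoice_data, access_token, realm_id, vendor_id, expense_account_id):
    """Example: create a bill in QuickBooks Online.

    vendor_id and expense_account_id are the QuickBooks IDs of an existing
    vendor and expense account; look them up in your own company first.
    """
    line_items_qb = [
        {
            "Amount": item["total"],
            "DetailType": "AccountBasedExpenseLineDetail",
            "Description": item["description"],
            # Each expense line must name the account it posts to
            "AccountBasedExpenseLineDetail": {
                "AccountRef": {"value": expense_account_id},
            },
        }
        for item in invoice_data["line_items"]
    ]

    bill = {
        # QuickBooks resolves the vendor by ID; the name is informational
        "VendorRef": {"value": vendor_id, "name": invoice_data["vendor"]},
        "TxnDate": invoice_data["invoice_date"],
        "DueDate": invoice_data["due_date"],
        "Line": line_items_qb,
    }

    response = requests.post(
        f"https://quickbooks.api.intuit.com/v3/company/{realm_id}/bill",
        headers={
            "Authorization": f"Bearer {access_token}",
            "Accept": "application/json",  # the API returns XML by default
        },
        json=bill,
        timeout=30,
    )
    response.raise_for_status()  # surface auth and validation errors
    return response.json()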

Replace with the equivalent for Xero, NetSuite, SAP, or your internal system. The shape of the data is the same regardless of the invoice source.

Batch processing

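Process an entire inbox of invoices with a shell loop. Swap in a personal API key first, since the public demo key is limited to 1 request/min per IP:

# --arg passes the filename to jq safely, even if it contains quotes.
# The sample response at the top of this page shows "markdown" at the top
# level, so the filter accepts either .output.markdown or .markdown.
for f in invoices/*.pdf; do
  pdf_base64=$(base64 < "$f" | tr -d '\n')
  curl -s -X POST https://pdftomarkdown.dev/v1/convert \
    -H "Authorization: Bearer demo_public_key" \
    -H "Content-Type: application/json" \
    -d "{\"input\":{\"pdf_base64\":\"$pdf_base64\"}}" \
    | jq -c --arg file "$f" '{file: $file, markdown: (.output.markdown // .markdown)}' >> results.ndjson
done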

Or in Python with concurrency:

import os
from concurrent.futures import ThreadPoolExecutor
from pdftomarkdown import convert

pdf_files = [f for f in os.listdir("invoices") if f.endswith(".pdf")]

def process(filename):
    result = convert(f"invoices/{filename}")
    return {"file": filename, "markdown": result.markdown}

with ThreadPoolExecutor(max_workers=5) as pool:
    results = list(pool.map(process, pdf_files))
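
One caveat with `pool.map`: the first exception raised by any file surfaces when the results are collected, so a single unreadable PDF can discard the whole batch. A possible variant, sketched here under the assumption that you want per-file error reporting, collects successes and failures separately. `convert_all` is a hypothetical helper, and `convert_fn` stands in for whatever conversion call you use.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed


def convert_all(paths, convert_fn, max_workers=5):
    """Run convert_fn over every path; return (results, failures) dicts."""
    results, failures = {}, {}
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # Map each future back to the path it was submitted for
        futures = {pool.submit(convert_fn, path): path for path in paths}
        for future in as_completed(futures):
            path = futures[future]
            try:
                results[path] = future.result()
            except Exception as exc:  # keep going; report failures at the end
                failures[path] = str(exc)
    return results, failures
```

With the SDK this could be called as `convert_all(paths, lambda p: convert(p).markdown)`, after which `failures` lists the files that need a retry or a manual look.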

Why markdown beats other intermediate formats

Template-based extractors output key-value pairs tied to one vendor’s layout. If the vendor changes their template, the extractor breaks. Raw OCR text loses all structure — you cannot distinguish a line item row from an address line.

Markdown sits in between: structured enough to parse programmatically, flexible enough to represent any invoice layout. LLMs also read markdown well, so you can chain extraction with validation, classification, or anomaly detection without reformatting.

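| Approach               | Handles new vendors | Preserves tables | LLM-compatible | Setup per vendor |
|------------------------|---------------------|------------------|----------------|------------------|
| Template-based         | No                  | Partial          | No             | Hours            |
| Tesseract + heuristics | Partially           | No               | No             | Hours            |
| pdfToMarkdown + regex  | Yes                 | Yes              | N/A            | None             |
| pdfToMarkdown + LLM    | Yes                 | Yes              | Yes            | None             |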

What it handles

  • Invoices from any vendor, any layout, any country
  • Multi-page invoices with line items spanning pages
  • Scanned PDFs — image-only, no selectable text
  • VAT invoices with multiple tax rates and breakdowns
  • Credit notes and adjustment memos
  • Invoices in English, German, French, Spanish, Japanese, and other languages
  • Complex headers with logos, PO references, and multiple addresses

Pricing

Both tiers are free. No credit card required.

Hacker

Free, no signup

  • Public demo key — copy & paste
  • Only page 1 is processed
  • 1 request/min per IP
  • Watermark in output
View docs →

Developer

Free, GitHub login

  • Personal API key
  • 100 pages/month
  • Multi-page PDFs
  • No watermark
Get API key →

Try with an invoice PDF

Free tier, no account needed. It converts page 1 only and adds a watermark. Upgrade to the Developer tier to remove the watermark and unlock full multi-page PDFs.