# Automate Invoice Processing with Python: A Step-by-Step Guide
Accounts payable teams still process invoices by hand because existing automation tools are too expensive, too rigid, or both. Template-based extractors break when a vendor updates their layout. OCR tools flatten tables into useless character streams. Enterprise platforms cost six figures.
This tutorial builds a complete invoice processing pipeline in Python. It watches a folder for new PDF invoices, extracts structured data via the pdftomarkdown API, parses key fields, and pushes the results to your accounting system. Every code block runs. No toy examples.
## The pipeline
Folder watch (watchdog) → PDF detected → pdftomarkdown API → markdown → regex + LLM parsing → structured JSON → CSV / accounting API
By the end, you’ll have a script that runs in the background and automatically processes any invoice PDF dropped into a directory.
## Prerequisites

Install the dependencies:

```bash
pip install pdftomarkdown watchdog openai
```

Set your OpenAI API key (used for LLM fallback parsing):

```bash
export OPENAI_API_KEY="sk-..."
```
You don’t need a pdftomarkdown account to follow along. The demo key works for single-page invoices.
## Step 1: Watch a folder for new PDF invoices

The watchdog library monitors a directory for filesystem events. We'll watch for new `.pdf` files and trigger processing when one appears.

```python
import time
from pathlib import Path

from watchdog.observers import Observer
from watchdog.events import FileSystemEventHandler


class InvoiceHandler(FileSystemEventHandler):
    def on_created(self, event):
        if event.is_directory:
            return
        if event.src_path.lower().endswith(".pdf"):
            print(f"New invoice detected: {event.src_path}")
            process_invoice(event.src_path)


def start_watching(folder: str):
    """Watch a folder for new PDF files."""
    Path(folder).mkdir(parents=True, exist_ok=True)
    observer = Observer()
    observer.schedule(InvoiceHandler(), folder, recursive=False)
    observer.start()
    print(f"Watching {folder} for new invoices...")
    try:
        while True:
            time.sleep(1)
    except KeyboardInterrupt:
        observer.stop()
    observer.join()
```
Drop a PDF into the watched folder, and process_invoice() fires automatically. We’ll build that function next.
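One caveat: watchdog fires on_created the moment the file appears, which can be before the sender (an email client, scanner, or network copy) has finished writing it. A small guard helps. The helper below is a sketch rather than part of the pipeline above; the name wait_until_stable and its parameters are chosen here for illustration:

```python
import time
from pathlib import Path


def wait_until_stable(path: str, checks: int = 3, interval: float = 0.5) -> bool:
    """Return True once the file size stops changing; False if the file vanishes."""
    last_size = -1
    stable = 0
    while stable < checks:
        p = Path(path)
        if not p.exists():
            return False
        size = p.stat().st_size
        if size == last_size:
            stable += 1  # size unchanged since last check
        else:
            stable = 0  # still growing; reset the counter
            last_size = size
        time.sleep(interval)
    return True
```

Call it at the top of process_invoice, or inside on_created, before sending the PDF anywhere.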
## Step 2: Extract invoice content with pdftomarkdown

Send the PDF to the pdftomarkdown API and get back clean markdown with tables, headers, and key-value pairs intact.

```python
from pdftomarkdown import convert


def extract_markdown(pdf_path: str, api_key: str = "demo_public_key") -> str:
    """Convert an invoice PDF to structured markdown."""
    result = convert(pdf_path, api_key=api_key)
    return result.markdown
```
If you prefer raw HTTP (useful for debugging or non-Python integrations):
```python
import base64
from pathlib import Path

import requests


def extract_markdown_http(pdf_path: str, api_key: str = "demo_public_key") -> str:
    """Convert an invoice PDF to markdown via the REST API."""
    pdf_bytes = Path(pdf_path).read_bytes()
    pdf_base64 = base64.b64encode(pdf_bytes).decode("utf-8")
    response = requests.post(
        "https://pdftomarkdown.dev/v1/convert",
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
        json={
            "input": {
                "pdf_base64": pdf_base64,
            }
        },
    )
    response.raise_for_status()
    return response.json()["output"]["markdown"]
```
### Example markdown output

A typical invoice comes back looking like this:

```markdown
# Invoice #INV-2024-0842

**Vendor:** Acme Software Ltd
**Bill to:** Widgets Inc, 123 Main St, Suite 400, San Francisco, CA 94102
**Invoice date:** 2024-01-15
**Due date:** 2024-02-14
**Payment terms:** Net 30

## Line Items

| Description           | Qty | Unit Price | Total     |
|-----------------------|-----|------------|-----------|
| Enterprise License Q1 | 1   | $1,200.00  | $1,200.00 |
| Additional seats (x5) | 5   | $49.00     | $245.00   |
| Premium support       | 1   | $299.00    | $299.00   |

**Subtotal:** $1,744.00
**Tax (8.5%):** $148.24
**Total:** $1,892.24
```
The table structure is preserved. The key-value pairs are on separate lines. This is what makes regex parsing reliable — you’re working with structured text, not a character stream.
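To make that concrete, here is the same Total pattern used in Step 3, applied to a single line of the sample output above:

```python
import re

# One key-value line from the markdown output above
line = "**Total:** $1,892.24"

# Capture the amount after the bold label, tolerating "$" and thousands separators
match = re.search(r"\*\*Total:\*\*\s*\$?([\d,]+\.?\d*)", line)
amount = float(match.group(1).replace(",", ""))
print(amount)  # 1892.24
```

Against a raw OCR character stream, that label and value could land anywhere; against the markdown, they are always on one line.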
For more on how this works with different invoice layouts, see the invoice data extraction guide and the invoice OCR API reference.
## Step 3: Parse fields with regex + LLM fallback

Most invoice fields follow predictable patterns. Regex handles 80-90% of cases. For the rest, send the markdown to an LLM for structured extraction.

### Regex parsing

```python
import re


def parse_with_regex(markdown: str) -> dict:
    """Extract invoice fields using regex patterns."""
    data = {
        "vendor": None,
        "invoice_number": None,
        "invoice_date": None,
        "due_date": None,
        "line_items": [],
        "subtotal": None,
        "tax": None,
        "total": None,
    }
    # Vendor name
    vendor_match = re.search(
        r"\*\*Vendor:\*\*\s*(.+)",
        markdown,
    )
    if vendor_match:
        data["vendor"] = vendor_match.group(1).strip()
    # Invoice number — handles "#INV-xxx", "Invoice #xxx", "Invoice Number: xxx"
    inv_match = re.search(
        r"(?:Invoice\s*#?|Invoice\s+Number:?\s*)([A-Z0-9\-]+)",
        markdown,
        re.IGNORECASE,
    )
    if inv_match:
        data["invoice_number"] = inv_match.group(1).strip()
    # Dates
    date_pattern = r"\d{4}-\d{2}-\d{2}|\d{1,2}/\d{1,2}/\d{2,4}|\w+ \d{1,2},? \d{4}"
    inv_date_match = re.search(
        rf"\*\*Invoice\s+date:\*\*\s*({date_pattern})",
        markdown,
        re.IGNORECASE,
    )
    if inv_date_match:
        data["invoice_date"] = inv_date_match.group(1).strip()
    due_date_match = re.search(
        rf"\*\*Due\s+date:\*\*\s*({date_pattern})",
        markdown,
        re.IGNORECASE,
    )
    if due_date_match:
        data["due_date"] = due_date_match.group(1).strip()
    # Line items from markdown table
    table_rows = re.findall(
        r"\|\s*([^|]+?)\s*\|\s*(\d+)\s*\|\s*\$?([\d,]+\.?\d*)\s*\|\s*\$?([\d,]+\.?\d*)\s*\|",
        markdown,
    )
    for row in table_rows:
        description, qty, unit_price, total = row
        # Skip header rows and separator rows
        if description.strip("-") == "" or "description" in description.lower():
            continue
        data["line_items"].append({
            "description": description.strip(),
            "quantity": int(qty),
            "unit_price": float(unit_price.replace(",", "")),
            "total": float(total.replace(",", "")),
        })
    # Totals
    for field, pattern in [
        ("subtotal", r"\*\*Subtotal:\*\*\s*\$?([\d,]+\.?\d*)"),
        ("tax", r"\*\*Tax[^:]*:\*\*\s*\$?([\d,]+\.?\d*)"),
        ("total", r"\*\*Total:\*\*\s*\$?([\d,]+\.?\d*)"),
    ]:
        match = re.search(pattern, markdown, re.IGNORECASE)
        if match:
            data[field] = float(match.group(1).replace(",", ""))
    return data
```
### LLM fallback

When regex misses fields (unusual formatting, non-English invoices, handwritten notes), fall back to an LLM:

```python
import json

from openai import OpenAI


def parse_with_llm(markdown: str) -> dict:
    """Extract invoice fields using GPT-4o as a fallback."""
    client = OpenAI()
    response = client.chat.completions.create(
        model="gpt-4o",
        temperature=0,
        response_format={"type": "json_object"},
        messages=[
            {
                "role": "system",
                "content": (
                    "Extract invoice data from the following markdown. "
                    "Return a JSON object with these fields: "
                    "vendor, invoice_number, invoice_date, due_date, "
                    "line_items (array of {description, quantity, unit_price, total}), "
                    "subtotal, tax, total. "
                    "Use null for any field you cannot find."
                ),
            },
            {
                "role": "user",
                "content": markdown,
            },
        ],
    )
    return json.loads(response.choices[0].message.content)
```
### Combined parser

Use regex first, then fill in gaps with the LLM:

```python
def parse_invoice(markdown: str) -> dict:
    """Parse invoice fields: regex first, LLM fallback for missing fields."""
    data = parse_with_regex(markdown)
    # Check which fields are still missing
    missing = [k for k, v in data.items() if v is None or v == []]
    if missing:
        print(f"Regex missed: {missing}. Falling back to LLM...")
        llm_data = parse_with_llm(markdown)
        # Fill in only the missing fields
        for field in missing:
            if llm_data.get(field) is not None:
                data[field] = llm_data[field]
    return data
```
This two-pass approach keeps costs low. The LLM only runs when regex can’t handle the format, and even then it only fills gaps — it doesn’t re-extract fields you already have.
## Step 4: Output structured JSON

Combine extraction and parsing into the process_invoice function:

```python
import json
from datetime import datetime
from pathlib import Path


def process_invoice(pdf_path: str, api_key: str = "demo_public_key") -> dict:
    """Full pipeline: PDF → markdown → parsed data → JSON."""
    print(f"Processing: {pdf_path}")
    # Extract
    markdown = extract_markdown(pdf_path, api_key=api_key)
    print(f"Extracted {len(markdown)} chars of markdown")
    # Parse
    invoice_data = parse_invoice(markdown)
    # Add metadata
    invoice_data["source_file"] = pdf_path
    invoice_data["processed_at"] = datetime.now().isoformat()
    # Save JSON alongside the PDF (with_suffix only touches the extension,
    # unlike str.replace, which would also rewrite ".pdf" elsewhere in the path)
    json_path = str(Path(pdf_path).with_suffix(".json"))
    with open(json_path, "w") as f:
        json.dump(invoice_data, f, indent=2)
    print(f"Saved: {json_path}")
    print(json.dumps(invoice_data, indent=2))
    return invoice_data
```
Example output:
```json
{
  "vendor": "Acme Software Ltd",
  "invoice_number": "INV-2024-0842",
  "invoice_date": "2024-01-15",
  "due_date": "2024-02-14",
  "line_items": [
    {
      "description": "Enterprise License Q1",
      "quantity": 1,
      "unit_price": 1200.00,
      "total": 1200.00
    },
    {
      "description": "Additional seats (x5)",
      "quantity": 5,
      "unit_price": 49.00,
      "total": 245.00
    },
    {
      "description": "Premium support",
      "quantity": 1,
      "unit_price": 299.00,
      "total": 299.00
    }
  ],
  "subtotal": 1744.00,
  "tax": 148.24,
  "total": 1892.24,
  "source_file": "/invoices/incoming/acme-jan-2024.pdf",
  "processed_at": "2024-01-20T14:32:01.123456"
}
```
## Step 5: Push to CSV or accounting API

### Option A: Append to CSV

The simplest output. Good for small teams or as an intermediate step before import.
```python
import csv
from pathlib import Path


def append_to_csv(invoice_data: dict, csv_path: str = "invoices.csv"):
    """Append parsed invoice data to a CSV file."""
    file_exists = Path(csv_path).exists()
    # Flatten line items into a summary
    line_items_summary = "; ".join(
        f"{item['description']} (x{item['quantity']}): ${item['total']:.2f}"
        for item in invoice_data.get("line_items", [])
    )
    row = {
        "vendor": invoice_data.get("vendor"),
        "invoice_number": invoice_data.get("invoice_number"),
        "invoice_date": invoice_data.get("invoice_date"),
        "due_date": invoice_data.get("due_date"),
        "subtotal": invoice_data.get("subtotal"),
        "tax": invoice_data.get("tax"),
        "total": invoice_data.get("total"),
        "line_items": line_items_summary,
        "source_file": invoice_data.get("source_file"),
        "processed_at": invoice_data.get("processed_at"),
    }
    with open(csv_path, "a", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=row.keys())
        if not file_exists:
            writer.writeheader()
        writer.writerow(row)
    print(f"Appended to {csv_path}")
```
### Option B: Push to QuickBooks Online

QuickBooks uses OAuth 2.0. Once authenticated, creating a bill (their term for a payable invoice) is a single API call:
```python
import requests


def push_to_quickbooks(
    invoice_data: dict,
    access_token: str,
    realm_id: str,
    base_url: str = "https://quickbooks.api.intuit.com",
):
    """Create a bill in QuickBooks Online from parsed invoice data."""
    # Build line items in QuickBooks format
    qb_lines = []
    for item in invoice_data.get("line_items", []):
        qb_lines.append({
            "DetailType": "AccountBasedExpenseLineDetail",
            "Amount": item["total"],
            "Description": item["description"],
            "AccountBasedExpenseLineDetail": {
                "AccountRef": {"value": "7"},  # Replace with your expense account ID
            },
        })
    bill = {
        "VendorRef": {
            "name": invoice_data["vendor"],
        },
        "Line": qb_lines,
        "DueDate": invoice_data.get("due_date"),
        "TxnDate": invoice_data.get("invoice_date"),
        "DocNumber": invoice_data.get("invoice_number"),
        "TotalAmt": invoice_data.get("total"),
    }
    response = requests.post(
        f"{base_url}/v3/company/{realm_id}/bill",
        headers={
            "Authorization": f"Bearer {access_token}",
            "Content-Type": "application/json",
            "Accept": "application/json",
        },
        json=bill,
    )
    response.raise_for_status()
    result = response.json()
    print(f"Created QuickBooks bill: {result['Bill']['Id']}")
    return result
```
### Option C: Push to Xero

Xero's API uses a similar pattern. Invoices of type ACCPAY represent bills:
```python
import requests


def push_to_xero(
    invoice_data: dict,
    access_token: str,
    tenant_id: str,
):
    """Create a bill in Xero from parsed invoice data."""
    xero_line_items = []
    for item in invoice_data.get("line_items", []):
        xero_line_items.append({
            "Description": item["description"],
            "Quantity": item["quantity"],
            "UnitAmount": item["unit_price"],
            "AccountCode": "400",  # Replace with your expense account code
        })
    invoice_payload = {
        "Type": "ACCPAY",
        "Contact": {
            "Name": invoice_data["vendor"],
        },
        "Date": invoice_data.get("invoice_date"),
        "DueDate": invoice_data.get("due_date"),
        "InvoiceNumber": invoice_data.get("invoice_number"),
        "LineItems": xero_line_items,
    }
    response = requests.post(
        "https://api.xero.com/api.xro/2.0/Invoices",
        headers={
            "Authorization": f"Bearer {access_token}",
            "Content-Type": "application/json",
            "Xero-Tenant-Id": tenant_id,
        },
        json={"Invoices": [invoice_payload]},
    )
    response.raise_for_status()
    result = response.json()
    invoice_id = result["Invoices"][0]["InvoiceID"]
    print(f"Created Xero bill: {invoice_id}")
    return result
```
## Full pipeline: copy-paste and run

Here's the complete script. Save it as invoice_pipeline.py, set your API keys, and point it at a folder:
"""
Invoice processing pipeline:
Watch folder → pdftomarkdown → regex + LLM parse → JSON → CSV
"""
import csv
import json
import re
import time
import base64
from datetime import datetime
from pathlib import Path
from pdftomarkdown import convert
from watchdog.observers import Observer
from watchdog.events import FileSystemEventHandler
from openai import OpenAI
# ── Configuration ──────────────────────────────────────────────
WATCH_FOLDER = "./invoices/incoming"
OUTPUT_CSV = "./invoices/processed.csv"
PDFTOMARKDOWN_API_KEY = "demo_public_key" # Replace with your key
# ───────────────────────────────────────────────────────────────
def extract_markdown(pdf_path: str) -> str:
result = convert(pdf_path, api_key=PDFTOMARKDOWN_API_KEY)
return result.markdown
def parse_with_regex(markdown: str) -> dict:
data = {
"vendor": None,
"invoice_number": None,
"invoice_date": None,
"due_date": None,
"line_items": [],
"subtotal": None,
"tax": None,
"total": None,
}
vendor_match = re.search(r"\*\*Vendor:\*\*\s*(.+)", markdown)
if vendor_match:
data["vendor"] = vendor_match.group(1).strip()
inv_match = re.search(
r"(?:Invoice\s*#?|Invoice\s+Number:?\s*)([A-Z0-9\-]+)",
markdown,
re.IGNORECASE,
)
if inv_match:
data["invoice_number"] = inv_match.group(1).strip()
date_pattern = r"\d{4}-\d{2}-\d{2}|\d{1,2}/\d{1,2}/\d{2,4}|\w+ \d{1,2},? \d{4}"
inv_date = re.search(
rf"\*\*Invoice\s+date:\*\*\s*({date_pattern})", markdown, re.IGNORECASE
)
if inv_date:
data["invoice_date"] = inv_date.group(1).strip()
due_date = re.search(
rf"\*\*Due\s+date:\*\*\s*({date_pattern})", markdown, re.IGNORECASE
)
if due_date:
data["due_date"] = due_date.group(1).strip()
table_rows = re.findall(
r"\|\s*([^|]+?)\s*\|\s*(\d+)\s*\|\s*\$?([\d,]+\.?\d*)\s*\|\s*\$?([\d,]+\.?\d*)\s*\|",
markdown,
)
for row in table_rows:
desc, qty, unit_price, total = row
if desc.strip("-") == "" or "description" in desc.lower():
continue
data["line_items"].append({
"description": desc.strip(),
"quantity": int(qty),
"unit_price": float(unit_price.replace(",", "")),
"total": float(total.replace(",", "")),
})
for field, pattern in [
("subtotal", r"\*\*Subtotal:\*\*\s*\$?([\d,]+\.?\d*)"),
("tax", r"\*\*Tax[^:]*:\*\*\s*\$?([\d,]+\.?\d*)"),
("total", r"\*\*Total:\*\*\s*\$?([\d,]+\.?\d*)"),
]:
match = re.search(pattern, markdown, re.IGNORECASE)
if match:
data[field] = float(match.group(1).replace(",", ""))
return data
def parse_with_llm(markdown: str) -> dict:
client = OpenAI()
response = client.chat.completions.create(
model="gpt-4o",
temperature=0,
response_format={"type": "json_object"},
messages=[
{
"role": "system",
"content": (
"Extract invoice data from the following markdown. "
"Return JSON with: vendor, invoice_number, invoice_date, "
"due_date, line_items (array of {description, quantity, "
"unit_price, total}), subtotal, tax, total. "
"Use null for missing fields."
),
},
{"role": "user", "content": markdown},
],
)
return json.loads(response.choices[0].message.content)
def parse_invoice(markdown: str) -> dict:
data = parse_with_regex(markdown)
missing = [k for k, v in data.items() if v is None or v == []]
if missing:
print(f" Regex missed: {missing} — using LLM fallback")
llm_data = parse_with_llm(markdown)
for field in missing:
if llm_data.get(field) is not None:
data[field] = llm_data[field]
return data
def append_to_csv(invoice_data: dict, csv_path: str = OUTPUT_CSV):
Path(csv_path).parent.mkdir(parents=True, exist_ok=True)
file_exists = Path(csv_path).exists()
line_items_summary = "; ".join(
f"{item['description']} (x{item['quantity']}): ${item['total']:.2f}"
for item in invoice_data.get("line_items", [])
)
row = {
"vendor": invoice_data.get("vendor"),
"invoice_number": invoice_data.get("invoice_number"),
"invoice_date": invoice_data.get("invoice_date"),
"due_date": invoice_data.get("due_date"),
"subtotal": invoice_data.get("subtotal"),
"tax": invoice_data.get("tax"),
"total": invoice_data.get("total"),
"line_items": line_items_summary,
"source_file": invoice_data.get("source_file"),
"processed_at": invoice_data.get("processed_at"),
}
with open(csv_path, "a", newline="") as f:
writer = csv.DictWriter(f, fieldnames=row.keys())
if not file_exists:
writer.writeheader()
writer.writerow(row)
print(f" Appended to {csv_path}")
def process_invoice(pdf_path: str):
print(f"\nProcessing: {pdf_path}")
markdown = extract_markdown(pdf_path)
print(f" Extracted {len(markdown)} chars of markdown")
invoice_data = parse_invoice(markdown)
invoice_data["source_file"] = pdf_path
invoice_data["processed_at"] = datetime.now().isoformat()
# Save JSON
json_path = pdf_path.replace(".pdf", ".json")
with open(json_path, "w") as f:
json.dump(invoice_data, f, indent=2)
print(f" Saved JSON: {json_path}")
# Append to CSV
append_to_csv(invoice_data)
print(f" Done: {invoice_data.get('vendor')} — ${invoice_data.get('total')}")
class InvoiceHandler(FileSystemEventHandler):
def on_created(self, event):
if event.is_directory:
return
if event.src_path.endswith(".pdf"):
process_invoice(event.src_path)
def main():
Path(WATCH_FOLDER).mkdir(parents=True, exist_ok=True)
print(f"Invoice pipeline running. Drop PDFs into: {WATCH_FOLDER}")
print(f"Output CSV: {OUTPUT_CSV}")
print("Press Ctrl+C to stop.\n")
observer = Observer()
observer.schedule(InvoiceHandler(), WATCH_FOLDER, recursive=False)
observer.start()
try:
while True:
time.sleep(1)
except KeyboardInterrupt:
print("\nShutting down...")
observer.stop()
observer.join()
if __name__ == "__main__":
main()
Run it:

```bash
python invoice_pipeline.py
```
Then drop a PDF into ./invoices/incoming/ and watch the pipeline execute.
## Production considerations
**Error handling.** The script above is linear for clarity. In production, wrap process_invoice in a try/except and move failed PDFs to an ./invoices/errors/ directory with the traceback logged alongside.
**Deduplication.** Check invoice_number against your CSV or database before inserting. Duplicate invoices are a real problem in AP automation — the same PDF gets emailed, downloaded, and forwarded.
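A minimal check against the CSV from Step 5 might look like this (is_duplicate is an illustrative name; at real volumes you would query an indexed database instead of rescanning a file):

```python
import csv
from pathlib import Path


def is_duplicate(invoice_number: str, csv_path: str = "invoices.csv") -> bool:
    """Return True if this invoice_number already appears in the CSV."""
    path = Path(csv_path)
    if not invoice_number or not path.exists():
        return False  # nothing processed yet, or no number to compare
    with path.open(newline="") as f:
        return any(
            row.get("invoice_number") == invoice_number
            for row in csv.DictReader(f)
        )
```

Call it after parsing and skip append_to_csv (and the accounting push) when it returns True.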
**Rate limits.** The pdftomarkdown demo key is for testing. For production volumes, sign in with GitHub to get 100 pages/month free, or contact us for higher limits.
**Concurrency.** For high-volume processing, replace the watchdog loop with a queue (Redis, SQS, or even multiprocessing.Queue) and run multiple workers. The pdftomarkdown API handles concurrent requests.
**Validation.** Cross-check that sum(line_items.total) matches the subtotal, and that subtotal + tax matches the total. Flag discrepancies for human review.
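A sketch of that check, with a small tolerance for floating-point rounding (validate_invoice is an illustrative name):

```python
def validate_invoice(data: dict, tolerance: float = 0.01) -> list:
    """Return a list of discrepancy warnings; an empty list means the math checks out."""
    warnings = []
    items_total = sum(item["total"] for item in data.get("line_items", []))
    subtotal = data.get("subtotal")
    tax = data.get("tax")
    total = data.get("total")
    # Line items should add up to the subtotal
    if subtotal is not None and abs(items_total - subtotal) > tolerance:
        warnings.append(
            f"Line items sum to {items_total:.2f} but subtotal is {subtotal:.2f}"
        )
    # Subtotal plus tax should equal the grand total
    if None not in (subtotal, tax, total) and abs(subtotal + tax - total) > tolerance:
        warnings.append(
            f"Subtotal + tax = {subtotal + tax:.2f} but total is {total:.2f}"
        )
    return warnings
```

Route any invoice with a non-empty warning list to a review folder instead of straight into the accounting system.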
## Why this works better than template-based extraction
Traditional invoice extractors require you to define field coordinates per vendor. New vendor? New template. Vendor updates their layout? Template breaks.
pdftomarkdown uses a vision-language model that reads the document visually. It recognizes tables, headers, and key-value pairs by layout — not by fixed coordinates. The markdown output is consistent enough for regex to work across different invoice formats, and the LLM fallback catches edge cases.
The result is a pipeline that handles new vendors automatically, with no per-vendor configuration.
## Next steps
- Read the invoice data extraction guide for more on handling complex invoice layouts.
- See the invoice OCR API reference for details on supported PDF formats and accuracy benchmarks.
- Read the API documentation for authentication, file upload, and batch processing options.
- Replace demo_public_key with a free API key — sign in with GitHub to get 100 pages/month, no credit card required.
The demo key works for testing single-page invoices. For production AP automation, get your API key and start processing.