pdfToMarkdown team

Automate Invoice Processing with Python: A Step-by-Step Guide

tutorial · python · invoices · automation · accounting

Accounts payable teams process invoices manually because existing automation tools are either too expensive, too rigid, or both. Template-based extractors break when a vendor updates their layout. OCR tools flatten tables into useless character streams. Enterprise platforms cost six figures.

This tutorial builds a complete invoice processing pipeline in Python. It watches a folder for new PDF invoices, extracts structured data via the pdftomarkdown API, parses key fields, and pushes the results to your accounting system. Every code block runs. No toy examples.

The pipeline

Folder watch (watchdog) → PDF detected → pdftomarkdown API → markdown → regex + LLM parsing → structured JSON → CSV / accounting API

By the end, you’ll have a script that runs in the background and automatically processes any invoice PDF dropped into a directory.

Prerequisites

Install the dependencies:

pip install pdftomarkdown watchdog openai

Set your OpenAI API key (used for LLM fallback parsing):

export OPENAI_API_KEY="sk-..."

You don’t need a pdftomarkdown account to follow along. The demo key works for single-page invoices.

Step 1: Watch a folder for new PDF invoices

The watchdog library monitors a directory for filesystem events. We’ll watch for new .pdf files and trigger processing when one appears.

import time
from pathlib import Path
from watchdog.observers import Observer
from watchdog.events import FileSystemEventHandler


class InvoiceHandler(FileSystemEventHandler):
    def on_created(self, event):
        if event.is_directory:
            return
        if event.src_path.lower().endswith(".pdf"):  # case-insensitive: catches .PDF too
            print(f"New invoice detected: {event.src_path}")
            process_invoice(event.src_path)


def start_watching(folder: str):
    """Watch a folder for new PDF files."""
    Path(folder).mkdir(parents=True, exist_ok=True)
    observer = Observer()
    observer.schedule(InvoiceHandler(), folder, recursive=False)
    observer.start()
    print(f"Watching {folder} for new invoices...")
    try:
        while True:
            time.sleep(1)
    except KeyboardInterrupt:
        observer.stop()
    observer.join()

Drop a PDF into the watched folder, and process_invoice() fires automatically. We’ll build that function next.
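One caveat: on_created fires the moment the file appears, which can be before whatever is writing it (a browser download, an email client's save-to-disk) has finished. A simple guard is to wait for the file size to stop changing before processing. The function name and thresholds below are our own; tune them to your environment:

```python
import time
from pathlib import Path


def wait_until_stable(path: str, checks: int = 3, interval: float = 0.5) -> bool:
    """Return True once the file size has held steady for `checks` reads."""
    last_size = -1
    stable = 0
    while stable < checks:
        try:
            size = Path(path).stat().st_size
        except FileNotFoundError:
            return False  # file was moved or deleted mid-write
        if size == last_size:
            stable += 1
        else:
            stable = 0
            last_size = size
        time.sleep(interval)
    return True
```

Call it at the top of process_invoice (or inside on_created) and skip the file if it returns False.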

Step 2: Extract invoice content with pdftomarkdown

Send the PDF to the pdftomarkdown API and get back clean markdown with tables, headers, and key-value pairs intact.

from pdftomarkdown import convert


def extract_markdown(pdf_path: str, api_key: str = "demo_public_key") -> str:
    """Convert an invoice PDF to structured markdown."""
    result = convert(pdf_path, api_key=api_key)
    return result.markdown

If you prefer raw HTTP (useful for debugging or non-Python integrations):

import requests
import base64
from pathlib import Path


def extract_markdown_http(pdf_path: str, api_key: str = "demo_public_key") -> str:
    """Convert an invoice PDF to markdown via the REST API."""
    pdf_bytes = Path(pdf_path).read_bytes()
    pdf_base64 = base64.b64encode(pdf_bytes).decode("utf-8")

    response = requests.post(
        "https://pdftomarkdown.dev/v1/convert",
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
        json={
            "input": {
                "pdf_base64": pdf_base64,
            }
        },
    )
    response.raise_for_status()
    return response.json()["output"]["markdown"]

Example markdown output

A typical invoice comes back looking like this:

# Invoice #INV-2024-0842

**Vendor:** Acme Software Ltd
**Bill to:** Widgets Inc, 123 Main St, Suite 400, San Francisco, CA 94102
**Invoice date:** 2024-01-15
**Due date:** 2024-02-14
**Payment terms:** Net 30

## Line Items

| Description           | Qty | Unit Price |     Total |
|-----------------------|-----|------------|-----------|
| Enterprise License Q1 |   1 |  $1,200.00 | $1,200.00 |
| Additional seats (x5) |   5 |     $49.00 |   $245.00 |
| Premium support       |   1 |    $299.00 |   $299.00 |

**Subtotal:** $1,744.00
**Tax (8.5%):** $148.24
**Total:** $1,892.24

The table structure is preserved. The key-value pairs are on separate lines. This is what makes regex parsing reliable — you’re working with structured text, not a character stream.

For more on how this works with different invoice layouts, see the invoice data extraction guide and the invoice OCR API reference.

Step 3: Parse fields with regex + LLM fallback

Most invoice fields follow predictable patterns. Regex handles 80-90% of cases. For the rest, send the markdown to an LLM for structured extraction.

Regex parsing

import re
import json


def parse_with_regex(markdown: str) -> dict:
    """Extract invoice fields using regex patterns."""
    data = {
        "vendor": None,
        "invoice_number": None,
        "invoice_date": None,
        "due_date": None,
        "line_items": [],
        "subtotal": None,
        "tax": None,
        "total": None,
    }

    # Vendor name
    vendor_match = re.search(
        r"\*\*Vendor:\*\*\s*(.+)",
        markdown,
    )
    if vendor_match:
        data["vendor"] = vendor_match.group(1).strip()

    # Invoice number: handles "Invoice #INV-xxx" and "Invoice Number: xxx".
    # The "#" is required in the first alternative so that, with IGNORECASE,
    # "Invoice date:" can't match and capture "date" as the number.
    inv_match = re.search(
        r"(?:Invoice\s*#|Invoice\s+Number:?\s*)([A-Z0-9\-]+)",
        markdown,
        re.IGNORECASE,
    )
    if inv_match:
        data["invoice_number"] = inv_match.group(1).strip()

    # Dates
    date_pattern = r"\d{4}-\d{2}-\d{2}|\d{1,2}/\d{1,2}/\d{2,4}|\w+ \d{1,2},? \d{4}"

    inv_date_match = re.search(
        rf"\*\*Invoice\s+date:\*\*\s*({date_pattern})",
        markdown,
        re.IGNORECASE,
    )
    if inv_date_match:
        data["invoice_date"] = inv_date_match.group(1).strip()

    due_date_match = re.search(
        rf"\*\*Due\s+date:\*\*\s*({date_pattern})",
        markdown,
        re.IGNORECASE,
    )
    if due_date_match:
        data["due_date"] = due_date_match.group(1).strip()

    # Line items from markdown table
    table_rows = re.findall(
        r"\|\s*([^|]+?)\s*\|\s*(\d+)\s*\|\s*\$?([\d,]+\.?\d*)\s*\|\s*\$?([\d,]+\.?\d*)\s*\|",
        markdown,
    )
    for row in table_rows:
        description, qty, unit_price, total = row
        # Skip header rows and separator rows
        if description.strip("-") == "" or "description" in description.lower():
            continue
        data["line_items"].append({
            "description": description.strip(),
            "quantity": int(qty),
            "unit_price": float(unit_price.replace(",", "")),
            "total": float(total.replace(",", "")),
        })

    # Totals
    for field, pattern in [
        ("subtotal", r"\*\*Subtotal:\*\*\s*\$?([\d,]+\.?\d*)"),
        ("tax", r"\*\*Tax[^:]*:\*\*\s*\$?([\d,]+\.?\d*)"),
        ("total", r"\*\*Total:\*\*\s*\$?([\d,]+\.?\d*)"),
    ]:
        match = re.search(pattern, markdown, re.IGNORECASE)
        if match:
            data[field] = float(match.group(1).replace(",", ""))

    return data
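Before wiring everything together, it's worth sanity-checking individual patterns in isolation. Here's the totals pattern against a few formats (the sample strings are made up):

```python
import re

total_pattern = r"\*\*Total:\*\*\s*\$?([\d,]+\.?\d*)"

samples = [
    "**Total:** $1,892.24",   # currency symbol + thousands separator
    "**Total:** 1892.24",     # no currency symbol
    "**total:** $500",        # lowercase label, whole-dollar amount
]
for s in samples:
    m = re.search(total_pattern, s, re.IGNORECASE)
    print(m.group(1) if m else "no match")
# → 1,892.24
# → 1892.24
# → 500
```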

LLM fallback

When regex misses fields (unusual formatting, non-English invoices, handwritten notes), fall back to an LLM:

from openai import OpenAI


def parse_with_llm(markdown: str) -> dict:
    """Extract invoice fields using GPT-4o as fallback."""
    client = OpenAI()

    response = client.chat.completions.create(
        model="gpt-4o",
        temperature=0,
        response_format={"type": "json_object"},
        messages=[
            {
                "role": "system",
                "content": (
                    "Extract invoice data from the following markdown. "
                    "Return a JSON object with these fields: "
                    "vendor, invoice_number, invoice_date, due_date, "
                    "line_items (array of {description, quantity, unit_price, total}), "
                    "subtotal, tax, total. "
                    "Use null for any field you cannot find."
                ),
            },
            {
                "role": "user",
                "content": markdown,
            },
        ],
    )

    return json.loads(response.choices[0].message.content)

Combined parser

Use regex first, then fill in gaps with the LLM:

def parse_invoice(markdown: str) -> dict:
    """Parse invoice fields: regex first, LLM fallback for missing fields."""
    data = parse_with_regex(markdown)

    # Check which fields are still missing
    missing = [k for k, v in data.items() if v is None or v == []]

    if missing:
        print(f"Regex missed: {missing}. Falling back to LLM...")
        llm_data = parse_with_llm(markdown)

        # Fill in only the missing fields
        for field in missing:
            if llm_data.get(field) is not None:
                data[field] = llm_data[field]

    return data

This two-pass approach keeps costs low. The LLM only runs when regex can’t handle the format, and even then it only fills gaps — it doesn’t re-extract fields you already have.

Step 4: Output structured JSON

Combine extraction and parsing into the process_invoice function:

from datetime import datetime
from pathlib import Path


def process_invoice(pdf_path: str, api_key: str = "demo_public_key") -> dict:
    """Full pipeline: PDF → markdown → parsed data → JSON."""
    print(f"Processing: {pdf_path}")

    # Extract
    markdown = extract_markdown(pdf_path, api_key=api_key)
    print(f"Extracted {len(markdown)} chars of markdown")

    # Parse
    invoice_data = parse_invoice(markdown)

    # Add metadata
    invoice_data["source_file"] = pdf_path
    invoice_data["processed_at"] = datetime.now().isoformat()

    # Save JSON alongside the PDF
    json_path = str(Path(pdf_path).with_suffix(".json"))
    with open(json_path, "w") as f:
        json.dump(invoice_data, f, indent=2)

    print(f"Saved: {json_path}")
    print(json.dumps(invoice_data, indent=2))

    return invoice_data

Example output:

{
  "vendor": "Acme Software Ltd",
  "invoice_number": "INV-2024-0842",
  "invoice_date": "2024-01-15",
  "due_date": "2024-02-14",
  "line_items": [
    {
      "description": "Enterprise License Q1",
      "quantity": 1,
      "unit_price": 1200.00,
      "total": 1200.00
    },
    {
      "description": "Additional seats (x5)",
      "quantity": 5,
      "unit_price": 49.00,
      "total": 245.00
    },
    {
      "description": "Premium support",
      "quantity": 1,
      "unit_price": 299.00,
      "total": 299.00
    }
  ],
  "subtotal": 1744.00,
  "tax": 148.24,
  "total": 1892.24,
  "source_file": "/invoices/incoming/acme-jan-2024.pdf",
  "processed_at": "2024-01-20T14:32:01.123456"
}

Step 5: Push to CSV or accounting API

Option A: Append to CSV

The simplest output. Good for small teams or as an intermediate step before import.

import csv
from pathlib import Path


def append_to_csv(invoice_data: dict, csv_path: str = "invoices.csv"):
    """Append parsed invoice data to a CSV file."""
    file_exists = Path(csv_path).exists()

    # Flatten line items into a summary
    line_items_summary = "; ".join(
        f"{item['description']} (x{item['quantity']}): ${item['total']:.2f}"
        for item in invoice_data.get("line_items", [])
    )

    row = {
        "vendor": invoice_data.get("vendor"),
        "invoice_number": invoice_data.get("invoice_number"),
        "invoice_date": invoice_data.get("invoice_date"),
        "due_date": invoice_data.get("due_date"),
        "subtotal": invoice_data.get("subtotal"),
        "tax": invoice_data.get("tax"),
        "total": invoice_data.get("total"),
        "line_items": line_items_summary,
        "source_file": invoice_data.get("source_file"),
        "processed_at": invoice_data.get("processed_at"),
    }

    with open(csv_path, "a", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=row.keys())
        if not file_exists:
            writer.writeheader()
        writer.writerow(row)

    print(f"Appended to {csv_path}")

Option B: Push to QuickBooks Online

QuickBooks uses OAuth 2.0. Once authenticated, creating a bill (their term for a payable invoice) is a single API call. One wrinkle: the VendorRef field takes a QuickBooks vendor ID, not a free-text name, so resolve the parsed vendor name to an ID first:

import requests


def push_to_quickbooks(
    invoice_data: dict,
    access_token: str,
    realm_id: str,
    vendor_id: str,
    base_url: str = "https://quickbooks.api.intuit.com",
):
    """Create a bill in QuickBooks Online from parsed invoice data."""
    # Build line items in QuickBooks format
    qb_lines = []
    for item in invoice_data.get("line_items", []):
        qb_lines.append({
            "DetailType": "AccountBasedExpenseLineDetail",
            "Amount": item["total"],
            "Description": item["description"],
            "AccountBasedExpenseLineDetail": {
                "AccountRef": {"value": "7"},  # Replace with your expense account ID
            },
        })

    bill = {
        # QuickBooks references vendors by ID; resolve the parsed vendor
        # name to a vendor ID via the query endpoint before this call
        "VendorRef": {
            "value": vendor_id,
        },
        "Line": qb_lines,
        "DueDate": invoice_data.get("due_date"),
        "TxnDate": invoice_data.get("invoice_date"),
        "DocNumber": invoice_data.get("invoice_number"),
        "TotalAmt": invoice_data.get("total"),
    }

    response = requests.post(
        f"{base_url}/v3/company/{realm_id}/bill",
        headers={
            "Authorization": f"Bearer {access_token}",
            "Content-Type": "application/json",
            "Accept": "application/json",
        },
        json=bill,
    )
    response.raise_for_status()
    result = response.json()
    print(f"Created QuickBooks bill: {result['Bill']['Id']}")
    return result
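QuickBooks references vendors by ID (VendorRef.value), so resolving the parsed vendor name to an ID is one extra call against the query endpoint. A sketch (the query-escaping helper is our own; consult Intuit's docs for the full query grammar):

```python
import requests


def build_vendor_query(vendor_name: str) -> str:
    """Build a QuickBooks query string, escaping single quotes in the name."""
    safe = vendor_name.replace("'", "\\'")
    return f"select * from Vendor where DisplayName = '{safe}'"


def lookup_vendor_id(
    vendor_name: str,
    access_token: str,
    realm_id: str,
    base_url: str = "https://quickbooks.api.intuit.com",
):
    """Return the QuickBooks ID for a vendor, or None if no match."""
    response = requests.get(
        f"{base_url}/v3/company/{realm_id}/query",
        headers={
            "Authorization": f"Bearer {access_token}",
            "Accept": "application/json",
        },
        params={"query": build_vendor_query(vendor_name)},
    )
    response.raise_for_status()
    vendors = response.json().get("QueryResponse", {}).get("Vendor", [])
    return vendors[0]["Id"] if vendors else None
```

If the vendor doesn't exist yet, create it via the vendor endpoint before creating the bill.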

Option C: Push to Xero

Xero’s API uses a similar pattern. Invoices of type ACCPAY represent bills:

import requests


def push_to_xero(
    invoice_data: dict,
    access_token: str,
    tenant_id: str,
):
    """Create a bill in Xero from parsed invoice data."""
    xero_line_items = []
    for item in invoice_data.get("line_items", []):
        xero_line_items.append({
            "Description": item["description"],
            "Quantity": item["quantity"],
            "UnitAmount": item["unit_price"],
            "AccountCode": "400",  # Replace with your expense account code
        })

    invoice_payload = {
        "Type": "ACCPAY",
        "Contact": {
            "Name": invoice_data["vendor"],
        },
        "Date": invoice_data.get("invoice_date"),
        "DueDate": invoice_data.get("due_date"),
        "InvoiceNumber": invoice_data.get("invoice_number"),
        "LineItems": xero_line_items,
    }

    response = requests.post(
        "https://api.xero.com/api.xro/2.0/Invoices",
        headers={
            "Authorization": f"Bearer {access_token}",
            "Content-Type": "application/json",
            "Xero-Tenant-Id": tenant_id,
        },
        json={"Invoices": [invoice_payload]},
    )
    response.raise_for_status()
    result = response.json()
    invoice_id = result["Invoices"][0]["InvoiceID"]
    print(f"Created Xero bill: {invoice_id}")
    return result

Full pipeline: copy-paste and run

Here’s the complete script. Save it as invoice_pipeline.py, set your API keys, and point it at a folder:

"""
Invoice processing pipeline:
  Watch folder → pdftomarkdown → regex + LLM parse → JSON → CSV
"""
import csv
import json
import re
import time
import base64
from datetime import datetime
from pathlib import Path

from pdftomarkdown import convert
from watchdog.observers import Observer
from watchdog.events import FileSystemEventHandler
from openai import OpenAI

# ── Configuration ──────────────────────────────────────────────
WATCH_FOLDER = "./invoices/incoming"
OUTPUT_CSV = "./invoices/processed.csv"
PDFTOMARKDOWN_API_KEY = "demo_public_key"  # Replace with your key
# ───────────────────────────────────────────────────────────────


def extract_markdown(pdf_path: str) -> str:
    result = convert(pdf_path, api_key=PDFTOMARKDOWN_API_KEY)
    return result.markdown


def parse_with_regex(markdown: str) -> dict:
    data = {
        "vendor": None,
        "invoice_number": None,
        "invoice_date": None,
        "due_date": None,
        "line_items": [],
        "subtotal": None,
        "tax": None,
        "total": None,
    }

    vendor_match = re.search(r"\*\*Vendor:\*\*\s*(.+)", markdown)
    if vendor_match:
        data["vendor"] = vendor_match.group(1).strip()

    inv_match = re.search(
        r"(?:Invoice\s*#|Invoice\s+Number:?\s*)([A-Z0-9\-]+)",
        markdown,
        re.IGNORECASE,
    )
    if inv_match:
        data["invoice_number"] = inv_match.group(1).strip()

    date_pattern = r"\d{4}-\d{2}-\d{2}|\d{1,2}/\d{1,2}/\d{2,4}|\w+ \d{1,2},? \d{4}"

    inv_date = re.search(
        rf"\*\*Invoice\s+date:\*\*\s*({date_pattern})", markdown, re.IGNORECASE
    )
    if inv_date:
        data["invoice_date"] = inv_date.group(1).strip()

    due_date = re.search(
        rf"\*\*Due\s+date:\*\*\s*({date_pattern})", markdown, re.IGNORECASE
    )
    if due_date:
        data["due_date"] = due_date.group(1).strip()

    table_rows = re.findall(
        r"\|\s*([^|]+?)\s*\|\s*(\d+)\s*\|\s*\$?([\d,]+\.?\d*)\s*\|\s*\$?([\d,]+\.?\d*)\s*\|",
        markdown,
    )
    for row in table_rows:
        desc, qty, unit_price, total = row
        if desc.strip("-") == "" or "description" in desc.lower():
            continue
        data["line_items"].append({
            "description": desc.strip(),
            "quantity": int(qty),
            "unit_price": float(unit_price.replace(",", "")),
            "total": float(total.replace(",", "")),
        })

    for field, pattern in [
        ("subtotal", r"\*\*Subtotal:\*\*\s*\$?([\d,]+\.?\d*)"),
        ("tax", r"\*\*Tax[^:]*:\*\*\s*\$?([\d,]+\.?\d*)"),
        ("total", r"\*\*Total:\*\*\s*\$?([\d,]+\.?\d*)"),
    ]:
        match = re.search(pattern, markdown, re.IGNORECASE)
        if match:
            data[field] = float(match.group(1).replace(",", ""))

    return data


def parse_with_llm(markdown: str) -> dict:
    client = OpenAI()
    response = client.chat.completions.create(
        model="gpt-4o",
        temperature=0,
        response_format={"type": "json_object"},
        messages=[
            {
                "role": "system",
                "content": (
                    "Extract invoice data from the following markdown. "
                    "Return JSON with: vendor, invoice_number, invoice_date, "
                    "due_date, line_items (array of {description, quantity, "
                    "unit_price, total}), subtotal, tax, total. "
                    "Use null for missing fields."
                ),
            },
            {"role": "user", "content": markdown},
        ],
    )
    return json.loads(response.choices[0].message.content)


def parse_invoice(markdown: str) -> dict:
    data = parse_with_regex(markdown)
    missing = [k for k, v in data.items() if v is None or v == []]
    if missing:
        print(f"  Regex missed: {missing} — using LLM fallback")
        llm_data = parse_with_llm(markdown)
        for field in missing:
            if llm_data.get(field) is not None:
                data[field] = llm_data[field]
    return data


def append_to_csv(invoice_data: dict, csv_path: str = OUTPUT_CSV):
    Path(csv_path).parent.mkdir(parents=True, exist_ok=True)
    file_exists = Path(csv_path).exists()

    line_items_summary = "; ".join(
        f"{item['description']} (x{item['quantity']}): ${item['total']:.2f}"
        for item in invoice_data.get("line_items", [])
    )

    row = {
        "vendor": invoice_data.get("vendor"),
        "invoice_number": invoice_data.get("invoice_number"),
        "invoice_date": invoice_data.get("invoice_date"),
        "due_date": invoice_data.get("due_date"),
        "subtotal": invoice_data.get("subtotal"),
        "tax": invoice_data.get("tax"),
        "total": invoice_data.get("total"),
        "line_items": line_items_summary,
        "source_file": invoice_data.get("source_file"),
        "processed_at": invoice_data.get("processed_at"),
    }

    with open(csv_path, "a", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=row.keys())
        if not file_exists:
            writer.writeheader()
        writer.writerow(row)

    print(f"  Appended to {csv_path}")


def process_invoice(pdf_path: str):
    print(f"\nProcessing: {pdf_path}")

    markdown = extract_markdown(pdf_path)
    print(f"  Extracted {len(markdown)} chars of markdown")

    invoice_data = parse_invoice(markdown)
    invoice_data["source_file"] = pdf_path
    invoice_data["processed_at"] = datetime.now().isoformat()

    # Save JSON
    json_path = str(Path(pdf_path).with_suffix(".json"))
    with open(json_path, "w") as f:
        json.dump(invoice_data, f, indent=2)
    print(f"  Saved JSON: {json_path}")

    # Append to CSV
    append_to_csv(invoice_data)

    print(f"  Done: {invoice_data.get('vendor')} — ${invoice_data.get('total')}")


class InvoiceHandler(FileSystemEventHandler):
    def on_created(self, event):
        if event.is_directory:
            return
        if event.src_path.lower().endswith(".pdf"):
            process_invoice(event.src_path)


def main():
    Path(WATCH_FOLDER).mkdir(parents=True, exist_ok=True)
    print(f"Invoice pipeline running. Drop PDFs into: {WATCH_FOLDER}")
    print(f"Output CSV: {OUTPUT_CSV}")
    print("Press Ctrl+C to stop.\n")

    observer = Observer()
    observer.schedule(InvoiceHandler(), WATCH_FOLDER, recursive=False)
    observer.start()

    try:
        while True:
            time.sleep(1)
    except KeyboardInterrupt:
        print("\nShutting down...")
        observer.stop()
    observer.join()


if __name__ == "__main__":
    main()

Run it:

python invoice_pipeline.py

Then drop a PDF into ./invoices/incoming/ and watch the pipeline execute.

Production considerations

Error handling. The script above is linear for clarity. In production, wrap process_invoice in a try/except and move failed PDFs to an ./invoices/errors/ directory with the traceback logged alongside.
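A sketch of that wrapper. It takes the processing function as an argument so it drops into the pipeline unchanged (the directory layout is our convention):

```python
import shutil
import traceback
from pathlib import Path


def process_safely(pdf_path: str, process, error_dir: str = "./invoices/errors"):
    """Run `process(pdf_path)`; on failure, quarantine the PDF with a traceback."""
    try:
        return process(pdf_path)
    except Exception:
        errors = Path(error_dir)
        errors.mkdir(parents=True, exist_ok=True)
        dest = errors / Path(pdf_path).name
        shutil.move(pdf_path, dest)
        # Log the traceback alongside the quarantined file for later triage
        dest.with_suffix(".error.txt").write_text(traceback.format_exc())
        print(f"Failed: {pdf_path}, moved to {dest}")
        return None
```

In InvoiceHandler.on_created, call process_safely(event.src_path, process_invoice) instead of calling process_invoice directly.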

Deduplication. Check invoice_number against your CSV or database before inserting. Duplicate invoices are a real problem in AP automation — the same PDF gets emailed, downloaded, and forwarded.
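A minimal check against the CSV from Step 5 (for real volume, an indexed database column is the better home for this):

```python
import csv
from pathlib import Path


def already_processed(invoice_number, csv_path: str = "invoices.csv") -> bool:
    """True if this invoice number already appears in the output CSV."""
    path = Path(csv_path)
    if invoice_number is None or not path.exists():
        return False
    with path.open(newline="") as f:
        return any(
            row.get("invoice_number") == invoice_number
            for row in csv.DictReader(f)
        )
```

Call it after parsing and skip the CSV append (and any accounting-API push) when it returns True.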

Rate limits. The pdftomarkdown demo key is for testing. For production volumes, sign in with GitHub to get 100 pages/month free, or contact us for higher limits.

Concurrency. For high-volume processing, replace the watchdog loop with a queue (Redis, SQS, or even multiprocessing.Queue) and run multiple workers. The pdftomarkdown API handles concurrent requests.
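An in-process version of that pattern with threads and a queue.Queue, as a sketch: the handler enqueues paths instead of processing inline, and N workers drain the queue. Swap the queue for Redis or SQS to scale across machines.

```python
import queue
import threading


def run_workers(process, jobs: queue.Queue, n_workers: int = 4):
    """Drain `jobs` with n_workers threads; enqueue one None per worker to stop."""

    def worker():
        while True:
            pdf_path = jobs.get()
            try:
                if pdf_path is None:  # shutdown sentinel
                    return
                process(pdf_path)
            finally:
                jobs.task_done()

    threads = [threading.Thread(target=worker, daemon=True) for _ in range(n_workers)]
    for t in threads:
        t.start()
    jobs.join()  # block until every enqueued job (and sentinel) is done
```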

Validation. Cross-check that sum(line_items.total) matches the subtotal, and that subtotal + tax matches the total. Flag discrepancies for human review.
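A sketch of those cross-checks (the one-cent tolerance absorbs rounding in per-line tax or currency conversion):

```python
import math


def validate_totals(invoice_data: dict, tolerance: float = 0.01) -> list:
    """Return discrepancy messages; an empty list means the math checks out."""
    problems = []
    items = invoice_data.get("line_items") or []
    subtotal = invoice_data.get("subtotal")
    tax = invoice_data.get("tax")
    total = invoice_data.get("total")

    if items and subtotal is not None:
        line_sum = sum(item["total"] for item in items)
        if not math.isclose(line_sum, subtotal, abs_tol=tolerance):
            problems.append(
                f"line items sum to {line_sum:.2f} but subtotal is {subtotal:.2f}"
            )

    if None not in (subtotal, tax, total):
        if not math.isclose(subtotal + tax, total, abs_tol=tolerance):
            problems.append(
                f"subtotal + tax = {subtotal + tax:.2f} but total is {total:.2f}"
            )

    return problems
```

For the Acme invoice above this returns an empty list; route anything nonempty to human review instead of pushing it downstream.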

Why this works better than template-based extraction

Traditional invoice extractors require you to define field coordinates per vendor. New vendor? New template. Vendor updates their layout? Template breaks.

pdftomarkdown uses a vision-language model that reads the document visually. It recognizes tables, headers, and key-value pairs by layout — not by fixed coordinates. The markdown output is consistent enough for regex to work across different invoice formats, and the LLM fallback catches edge cases.

The result is a pipeline that handles new vendors automatically, with no per-vendor configuration.

Next steps

The demo key works for testing single-page invoices. For production AP automation, get your API key and start processing.