
PDF Table Extraction API

One endpoint. POST a PDF, get clean markdown back — tables, headings, and lists preserved exactly as they appear on the page.

$ curl -X POST https://pdftomarkdown.dev/v1/convert \
  -H "Authorization: Bearer demo_public_key" \
  -H "Content-Type: application/json" \
  -d '{"input":{"pdf_url":"https://www.w3.org/WAI/ER/tests/xhtml/testfiles/resources/pdf/dummy.pdf"}}'
{
  "markdown": "# Document\n\n| Column A | Column B |\n|---|---|\n| Value 1 | Value 2 |\n\n**Summary text here**",
  "pages": 3,
  "request_id": "req_abc123"
}
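
The same request can be built from Python with only the standard library. This is a minimal sketch: the endpoint, headers, and payload shape are copied from the curl call above, while `build_convert_request` is a hypothetical helper name and not part of any SDK. The request is constructed here but not sent.

```python
import json
import urllib.request

API_URL = "https://pdftomarkdown.dev/v1/convert"  # endpoint from the curl example


def build_convert_request(pdf_url, api_key="demo_public_key"):
    """Build (but do not send) the POST request shown in the curl example.

    Hypothetical helper for illustration; not part of the official SDK.
    """
    payload = {"input": {"pdf_url": pdf_url}}
    return urllib.request.Request(
        API_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


req = build_convert_request(
    "https://www.w3.org/WAI/ER/tests/xhtml/testfiles/resources/pdf/dummy.pdf"
)
# To send it: pass req to urllib.request.urlopen() and json.load() the response.
```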

Tables are where every other PDF tool falls apart

Text extraction from PDFs is a mostly solved problem. Tables are not. The moment your document has a multi-level header, merged cells, or a column of right-aligned numbers with a subtotal row, rule-based tools start guessing — and guessing wrong.

pdfToMarkdown uses a vision-language model that looks at the rendered page, not the raw character stream. It sees the grid lines, the alignment, the header hierarchy. It reconstructs the table the way you would if you were reading the page yourself.

What garbled table extraction looks like

Here is what a typical text extraction library produces from a financial statement table:

Revenue 2023 2022 Change
Product revenue 142,300 118,500 20.1%
Service revenue 38,700 35,200 9.9%
Total revenue 181,000 153,700 17.8%
Cost of revenue 94,100 82,600
Gross profit 86,900 71,100
Gross margin 48.0% 46.3%

No cell boundaries. No alignment. Headers and data rows are indistinguishable. Good luck feeding that into a downstream parser.
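
To see why a downstream parser struggles, here is a small sketch that applies the most obvious heuristic, splitting each line on whitespace, to the extracted text above. It only illustrates the failure mode; it is not how any particular library works.

```python
raw_lines = [
    "Revenue 2023 2022 Change",
    "Product revenue 142,300 118,500 20.1%",
    "Service revenue 38,700 35,200 9.9%",
    "Total revenue 181,000 153,700 17.8%",
    "Cost of revenue 94,100 82,600",
    "Gross profit 86,900 71,100",
    "Gross margin 48.0% 46.3%",
]

# Naive approach: treat every run of whitespace as a column boundary.
token_counts = [len(line.split()) for line in raw_lines]
print(token_counts)  # [4, 5, 5, 5, 5, 4, 4]
```

"Product revenue" and "Cost of revenue" both split into five tokens, yet the first has three value columns and the second has two. Token position alone cannot tell you which column a number belongs to.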

Here is the same table from pdfToMarkdown:

| | 2023 | 2022 | Change |
|---|---|---|---|
| **Product revenue** | $142,300 | $118,500 | 20.1% |
| **Service revenue** | $38,700 | $35,200 | 9.9% |
| **Total revenue** | **$181,000** | **$153,700** | **17.8%** |
| Cost of revenue | $94,100 | $82,600 | |
| **Gross profit** | **$86,900** | **$71,100** | |
| Gross margin | 48.0% | 46.3% | |

Pipe-delimited markdown. Every cell in the right column. Subtotal rows distinguished with bold. Ready for LLM extraction, pandas, or direct rendering.
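
Because emphasis survives as plain markdown, subtotal rows can be picked out with a few lines of string handling. The sketch below assumes a subtotal row is one where every non-empty cell is wrapped in `**`, which matches the table above but is an assumption about other documents. `split_row` and `is_bold` are hypothetical helper names.

```python
table_md = """\
| | 2023 | 2022 | Change |
|---|---|---|---|
| **Product revenue** | $142,300 | $118,500 | 20.1% |
| **Service revenue** | $38,700 | $35,200 | 9.9% |
| **Total revenue** | **$181,000** | **$153,700** | **17.8%** |
| Cost of revenue | $94,100 | $82,600 | |
| **Gross profit** | **$86,900** | **$71,100** | |
| Gross margin | 48.0% | 46.3% | |
"""


def split_row(line):
    """Split one table row into stripped cells (assumes leading/trailing pipes)."""
    return [cell.strip() for cell in line.strip()[1:-1].split("|")]


def is_bold(cell):
    """True if the cell text is wrapped in markdown bold markers."""
    return len(cell) > 4 and cell.startswith("**") and cell.endswith("**")


lines = table_md.strip().splitlines()
body = [split_row(line) for line in lines[2:]]  # skip header and separator rows

# Assumption: a subtotal row has every non-empty cell in bold.
subtotals = [
    row[0].strip("*")
    for row in body
    if all(is_bold(cell) for cell in row if cell)
]
print(subtotals)  # ['Total revenue', 'Gross profit']
```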

The tables that break rule-based tools

Multi-header tables

Financial reports and spec sheets commonly use two or three levels of column headers — “Q1 / Revenue / Actual vs Budget”. Libraries like tabula and camelot flatten these into a single header row or silently drop the top level. pdfToMarkdown preserves the hierarchy.

Merged cells and spanning rows

A product comparison table where a category label spans three rows? pdfplumber gives you the label once and empty strings for the next two rows, or worse, shifts every subsequent cell. The vision model understands the span and repeats or nests the label correctly.

Tables without visible gridlines

Many professional documents use whitespace-only alignment with no drawn borders. Rule-based tools rely on detecting lines or character alignment thresholds. When columns are close together or font sizes vary, detection fails. A vision-language model does not depend on detecting lines — it reads the visual layout directly.

Tables that span multiple pages

When a table continues across a page break, most tools treat each page as an independent extraction. You get two separate fragments with the header repeated (or not). pdfToMarkdown handles continuation tables and produces a single coherent markdown table.
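
For comparison, this is roughly the stitching you would otherwise write by hand when a tool returns one fragment per page. It is a sketch under simple assumptions: both fragments are already clean pipe-delimited markdown, and a repeated header matches the first one exactly. `stitch_fragments` and the sample rows are made up for illustration.

```python
def stitch_fragments(first, second):
    """Join two markdown table fragments, dropping a repeated header if present."""
    first_lines = first.strip().splitlines()
    second_lines = second.strip().splitlines()
    header = first_lines[:2]  # header row plus separator row
    if second_lines[:2] == header:
        second_lines = second_lines[2:]
    return "\n".join(first_lines + second_lines)


# Made-up fragments standing in for the two halves of a split table.
page_1 = "| Item | Qty |\n|---|---|\n| Bolt | 40 |"
page_2 = "| Item | Qty |\n|---|---|\n| Nut | 55 |"

stitched = stitch_fragments(page_1, page_2)
print(stitched)
```

Even this small version has to handle both the "header repeated" and "header missing" cases, and it would still fail if the two fragments were extracted with different column splits.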

Complex table examples

Product spec sheet

| Parameter | Unit | Model A | Model B | Model C |
|---|---|---|---|---|
| Operating voltage | V | 3.3–5.0 | 1.8–3.3 | 3.3 |
| Current draw (active) | mA | 12 | 8 | 22 |
| Current draw (sleep) | µA | 15 | 3 | 120 |
| Temperature range | °C | -40 to +85 | -40 to +125 | 0 to +70 |
| Interface | — | SPI, I²C | SPI | UART, SPI |
| Package | — | QFN-24 | WLCSP-16 | SOIC-8 |

Financial balance sheet

| Assets (in thousands) | Dec 31, 2023 | Dec 31, 2022 |
|---|---|---|
| **Current assets** | | |
| Cash and equivalents | $45,200 | $38,100 |
| Accounts receivable, net | $22,800 | $19,400 |
| Inventory | $8,300 | $7,100 |
| **Total current assets** | **$76,300** | **$64,600** |
| **Non-current assets** | | |
| Property and equipment, net | $31,400 | $28,900 |
| Goodwill | $12,600 | $12,600 |
| **Total assets** | **$120,300** | **$106,100** |

Why camelot, tabula, and pdfplumber struggle

These are good libraries. They work well on simple, well-formed tables with visible gridlines and a single header row. But they share the same fundamental limitation: they work from the raw PDF character stream and try to infer table structure from character positions and line objects.

This breaks when:

  • The PDF was generated from a scan — there are no character positions, only an image. Tabula and camelot cannot process scanned PDFs at all without a separate OCR step.
  • Column alignment is ambiguous — when two columns have similar x-coordinates, heuristic-based splitting produces wrong cell assignments.
  • The table uses visual formatting instead of lines — alternating row colors, bold headers, indentation for sub-rows. None of these are “lines” in the PDF spec.
  • Headers span multiple rows: camelot flattens them, and tabula sometimes drops them entirely.

pdfToMarkdown sidesteps all of this. The vision-language model processes the rendered page image. It does not parse PDF operators or guess at column boundaries. It reads the table.

Extract tables from any PDF in one API call

curl -X POST https://pdftomarkdown.dev/v1/convert \
  -H "Authorization: Bearer demo_public_key" \
  -H "Content-Type: application/json" \
  -d '{"input":{"pdf_url":"https://www.w3.org/WAI/ER/tests/xhtml/testfiles/resources/pdf/dummy.pdf"}}'

Or with the Python SDK:

from pdftomarkdown import convert

result = convert("financial-report.pdf")

# Every table is valid pipe-delimited markdown
for line in result.markdown.split("\n"):
    if line.startswith("|"):
        print(line)

The response contains the full document as markdown. Tables are returned as standard pipe-delimited markdown tables that render correctly in GitHub, Notion, Obsidian, and any markdown viewer.
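
A cheap sanity check before passing tables downstream is to confirm that each one is rectangular, meaning every row has the same number of cells as the header. The sketch below uses an excerpt of the balance sheet from this page. `split_cells` and `table_is_rectangular` are hypothetical helpers, and they assume cells contain no escaped pipe characters.

```python
def split_cells(line):
    """Split a markdown table row into cells (assumes leading and trailing pipes)."""
    return [cell.strip() for cell in line.strip()[1:-1].split("|")]


def table_is_rectangular(table_lines):
    """Return True when every row has as many cells as the header row."""
    widths = {len(split_cells(line)) for line in table_lines}
    return len(widths) == 1


balance_sheet = [
    "| Assets (in thousands) | Dec 31, 2023 | Dec 31, 2022 |",
    "|---|---|---|",
    "| **Current assets** | | |",
    "| Cash and equivalents | $45,200 | $38,100 |",
]
damaged = balance_sheet + ["| Inventory | $8,300 |"]  # one cell missing

print(table_is_rectangular(balance_sheet))  # True
print(table_is_rectangular(damaged))        # False
```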

Feeding extracted tables into a data pipeline

The markdown table format is easy to parse programmatically:

import io
import pandas as pd
from pdftomarkdown import convert

result = convert("quarterly-report.pdf")

# Collect every markdown table in the document
tables = []
current_table = []
for line in result.markdown.split("\n"):
    if line.startswith("|"):
        current_table.append(line)
    elif current_table:
        tables.append("\n".join(current_table))
        current_table = []
if current_table:  # flush a table that ends the document
    tables.append("\n".join(current_table))

# Parse the first pipe-delimited table into pandas
df = pd.read_table(io.StringIO(tables[0]), sep="|", skipinitialspace=True)
df = df.dropna(axis=1, how="all").iloc[1:]  # drop empty edge columns and the separator row
df.columns = df.columns.str.strip()  # header cells otherwise keep trailing spaces

No custom parsing logic for each document layout. The table structure is already solved.
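
Cells still arrive as strings such as `$142,300` or `20.1%`, so a pipeline usually needs one more step to turn them into numbers. The helper below is a minimal sketch for the formats shown on this page. `to_number` is a hypothetical name, percentages are kept in percentage points, and other conventions (for example negatives written in parentheses) are not handled.

```python
def to_number(cell):
    """Convert a table cell like '$142,300' or '20.1%' to a float.

    Returns None for an empty cell. Percent cells stay in percentage
    points (20.1, not 0.201). Bold markers around the value are ignored.
    """
    text = cell.strip().strip("*").strip()
    if not text:
        return None
    cleaned = text.replace("$", "").replace(",", "").rstrip("%")
    return float(cleaned)


row = ["**Total revenue**", "**$181,000**", "**$153,700**", "**17.8%**"]
values = [to_number(cell) for cell in row[1:]]
print(values)  # [181000.0, 153700.0, 17.8]
```

Applied to a DataFrame column, the same function can be passed to `Series.map`.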

Pricing

Both tiers are free. No credit card required.

Hacker

Free, no signup

  • Public demo key — copy & paste
  • Only page 1 is processed
  • 1 request/min per IP
  • Watermark in output
View docs →

Developer

Free, GitHub login

  • Personal API key
  • 100 pages/month
  • Multi-page PDFs
  • No watermark
Get API key →

Try it with a table-heavy PDF

The Hacker tier needs no account. It converts page 1 only and adds a watermark. Upgrade to the Developer tier to remove the watermark and unlock full multi-page PDFs.