pdfToMarkdown team

How to Extract Tables from PDFs in Python


Extracting tables from PDFs in Python sounds like it should be a solved problem. There are at least three well-known libraries for it. But if you’ve tried any of them on a real-world document — a financial statement with subtotals, a product catalog with merged category rows, a spec sheet with multi-level headers — you know how quickly they fall apart.

This post walks through the three main Python table extraction tools, shows where each one breaks, and demonstrates an approach that handles the hard cases.

The test documents

To make this comparison concrete, we’ll use two types of tables that appear constantly in real PDFs:

1. A financial income statement with multi-level column headers (Year > Quarter), bold subtotal rows, and right-aligned dollar amounts:

                        2023                    2022
                   Q3          Q4          Q3          Q4
Revenue
  Product      $42,100     $48,300     $35,200     $39,800
  Services     $11,400     $13,200     $9,800      $10,500
Total Revenue  $53,500     $61,500     $45,000     $50,300
COGS           $27,800     $31,200     $23,400     $26,100
Gross Profit   $25,700     $30,300     $21,600     $24,200
Gross Margin    48.0%       49.3%       48.0%       48.1%

2. A product catalog table with merged cells — a category label spanning multiple product rows, with specs across several columns:

Category    | Product       | Voltage   | Current | Interface
------------+---------------+-----------+---------+----------
Sensors     | TMP-102       | 1.4–3.6V  | 10 µA   | I²C
            | BMP-380       | 1.7–3.6V  | 3.4 µA  | SPI, I²C
            | LIS3DH        | 1.71–3.6V | 11 µA   | SPI, I²C
Wireless    | nRF52840      | 1.7–5.5V  | 4.8 mA  | BLE 5.0
            | ESP32-C3      | 3.0–3.6V  | 80 mA   | Wi-Fi, BLE
Actuators   | DRV8833       | 2.7–10.8V | 1.8 A   | PWM

Both are common layouts. Both are hard for rule-based parsers.

Attempt 1: Camelot

Camelot is one of the most popular PDF table extraction libraries. It works by detecting line segments (lattice mode) or text alignment patterns (stream mode) to find table boundaries.

import camelot

# Lattice mode — looks for drawn gridlines
tables = camelot.read_pdf("financial_report.pdf", pages="1", flavor="lattice")
print(f"Found {len(tables)} tables")

if len(tables) > 0:
    print(tables[0].df.to_string())

Result on the financial statement

Camelot’s lattice mode finds nothing — the table uses whitespace alignment and alternating row shading, not drawn gridlines. Switching to stream mode:

tables = camelot.read_pdf("financial_report.pdf", pages="1", flavor="stream")
print(tables[0].df.to_string())

Output:

     0              1          2          3          4
0           2023                  2022
1              Q3        Q4        Q3        Q4
2  Revenue
3  Product  42,100    48,300    35,200    39,800
4  Services 11,400    13,200     9,800    10,500
5  Total Revenue53,500  61,500   45,000    50,300
6  COGS     27,800    31,200    23,400    26,100
7  Gross Profit25,700  30,300   21,600    24,200
8  Gross Margin48.0%    49.3%    48.0%     48.1%

The multi-level header (“2023” spanning Q3/Q4) is flattened into separate rows with no relationship between them. “Total Revenue”, “Gross Profit”, and “Gross Margin” each merge with their first value into a single cell. Dollar signs are stripped. The hierarchical indentation of “Product” and “Services” under “Revenue” is lost.

Result on the product catalog

Stream mode produces:

     0           1          2        3         4
0  SensorsTMP-102    1.4–3.6V  10 µA     I²C
1          BMP-380   1.7–3.6V  3.4 µA    SPI, I²C
2          LIS3DH    1.71–3.6V 11 µA     SPI, I²C
3  WirelessnRF52840  1.7–5.5V  4.8 mA    BLE 5.0
4          ESP32-C3  3.0–3.6V  80 mA     Wi-Fi, BLE
5  ActuatorsDRV8833  2.7–10.8V 1.8 A     PWM

The merged “Category” cell contents get concatenated with the first product name: “SensorsTMP-102”, “WirelessnRF52840”, “ActuatorsDRV8833”. Rows 1–2 and 4 have empty category columns, which is technically correct, but the concatenation in the first row of each group makes downstream parsing unreliable.

Attempt 2: tabula-py

tabula-py is the Python wrapper for Tabula, a Java-based PDF table extractor.

import tabula

# Extract all tables from the first page
tables = tabula.read_pdf("financial_report.pdf", pages=1, multiple_tables=True)

for i, table in enumerate(tables):
    print(f"--- Table {i} ---")
    print(table.to_string())

Result on the financial statement

--- Table 0 ---
        Unnamed: 0  2023 Unnamed: 1  2022 Unnamed: 2
0              NaN    Q3         Q4    Q3         Q4
1          Revenue   NaN        NaN   NaN        NaN
2  Product          42100      48300 35200      39800
3  Services         11400      13200  9800      10500
4  Total Revenue    53500      61500 45000      50300
5             COGS  27800      31200 23400      26100
6     Gross Profit  25700      30300 21600      24200
7     Gross Margin  48.0%      49.3% 48.0%      48.1%

The multi-level header is split across two rows, and the parent-child relationship between “2023” and its Q3/Q4 columns is lost: the second column under each year is just an unnamed column. pandas reads “2023” and “2022” as column headers but “Q3”/“Q4” as data row 0, so the DataFrame has the wrong shape. Dollar signs and thousands separators are gone as well. Feeding this into any automated pipeline requires manual header reconstruction.

Result on the product catalog

tabula drops the merged category cells entirely:

--- Table 0 ---
  Unnamed: 0    Product   Voltage  Current  Interface
0        NaN    TMP-102  1.4–3.6V    10 µA        I²C
1        NaN    BMP-380  1.7–3.6V   3.4 µA   SPI, I²C
2        NaN     LIS3DH 1.71–3.6V    11 µA   SPI, I²C
3        NaN   nRF52840  1.7–5.5V   4.8 mA    BLE 5.0
4        NaN   ESP32-C3  3.0–3.6V    80 mA  Wi-Fi, BLE
5        NaN    DRV8833 2.7–10.8V    1.8 A        PWM

Every value in the first column is NaN, and the “Category” header itself has been replaced by “Unnamed: 0”. The likely cause is that each merged cell spans several rows in the PDF layout, so tabula’s heuristics cannot assign its text to a single row and column and end up discarding it.

Attempt 3: pdfplumber

pdfplumber takes a different approach — it exposes low-level PDF objects (characters, lines, rectangles) and provides table-finding heuristics on top.

import pdfplumber

with pdfplumber.open("financial_report.pdf") as pdf:
    page = pdf.pages[0]
    tables = page.extract_tables()

    for i, table in enumerate(tables):
        print(f"--- Table {i} ---")
        for row in table:
            print(row)

Result on the financial statement

--- Table 0 ---
['', '2023', '', '2022', '']
['', 'Q3', 'Q4', 'Q3', 'Q4']
['Revenue', '', '', '', '']
['Product', '$42,100', '$48,300', '$35,200', '$39,800']
['Services', '$11,400', '$13,200', '$9,800', '$10,500']
['Total Revenue', '$53,500', '$61,500', '$45,000', '$50,300']
['COGS', '$27,800', '$31,200', '$23,400', '$26,100']
['Gross Profit', '$25,700', '$30,300', '$21,600', '$24,200']
['Gross Margin', '48.0%', '49.3%', '48.0%', '48.1%']

pdfplumber does the best job of the three. It preserves dollar signs and keeps cells separate. But the multi-level header is still two flat rows. The “2023” cell spans two columns visually, but pdfplumber returns it in one cell with the adjacent cell empty. You need custom code to figure out that “Q4” in position 2 belongs under “2023”, not under the empty string directly above it.

This is workable for one document. It’s not workable if you’re processing hundreds of documents with different header layouts.
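For a single known layout, that custom code can be short. The sketch below is one possible version, not part of pdfplumber: `combine_header_rows` is a hypothetical helper, and it assumes each parent label sits in the leftmost column of the span it covers, as in the output above. A layout that centers the year over its columns would need different logic.

```python
def combine_header_rows(parent_row, child_row):
    """Merge a two-row header into one flat list of labels.

    Assumption: a parent label applies to its own column and to every
    blank column to its right, until the next parent label appears.
    """
    labels = []
    current_parent = ""
    for parent, child in zip(parent_row, child_row):
        if parent:  # skips both "" and None cells
            current_parent = parent
        # Columns without a child label (the row-label column) stay blank.
        labels.append(f"{current_parent} {child}".strip() if child else "")
    return labels


header = combine_header_rows(
    ["", "2023", "", "2022", ""],
    ["", "Q3", "Q4", "Q3", "Q4"],
)
print(header)  # ['', '2023 Q3', '2023 Q4', '2022 Q3', '2022 Q4']
```

The left-to-right fill is the fragile part: it encodes one header layout, so every new layout needs its own variant.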

Result on the product catalog

--- Table 0 ---
['Sensors', 'TMP-102', '1.4–3.6V', '10 µA', 'I²C']
['', 'BMP-380', '1.7–3.6V', '3.4 µA', 'SPI, I²C']
['', 'LIS3DH', '1.71–3.6V', '11 µA', 'SPI, I²C']
['Wireless', 'nRF52840', '1.7–5.5V', '4.8 mA', 'BLE 5.0']
['', 'ESP32-C3', '3.0–3.6V', '80 mA', 'Wi-Fi, BLE']
['Actuators', 'DRV8833', '2.7–10.8V', '1.8 A', 'PWM']

The category label appears on the first row of each group, then empty strings for subsequent rows. This is the “correct” raw extraction, but it means the category-product relationship is implicit. If you need to know that BMP-380 is a sensor, you have to write forward-fill logic. If the table structure is different in the next PDF — say, the category is in a separate header row instead of a merged cell — that forward-fill logic breaks.
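The forward-fill step itself is small. Here is a minimal sketch over the rows shown above; `forward_fill_first_column` is a hypothetical helper name, and it assumes the category always lives in column 0.

```python
def forward_fill_first_column(rows):
    """Copy the last non-empty category down into blank category cells."""
    filled = []
    last_category = ""
    for row in rows:
        # An empty string or None means "same category as the row above".
        category = row[0] or last_category
        last_category = category
        filled.append([category, *row[1:]])
    return filled


rows = [
    ["Sensors", "TMP-102", "1.4–3.6V", "10 µA", "I²C"],
    ["", "BMP-380", "1.7–3.6V", "3.4 µA", "SPI, I²C"],
    ["", "LIS3DH", "1.71–3.6V", "11 µA", "SPI, I²C"],
    ["Wireless", "nRF52840", "1.7–5.5V", "4.8 mA", "BLE 5.0"],
    ["", "ESP32-C3", "3.0–3.6V", "80 mA", "Wi-Fi, BLE"],
    ["Actuators", "DRV8833", "2.7–10.8V", "1.8 A", "PWM"],
]

for row in forward_fill_first_column(rows):
    print(row[0], row[1])
```

The helper works only because it knows which column holds the merged cells, which is exactly the layout-specific knowledge that does not carry over to the next PDF.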

Why all three tools share the same limitation

Camelot, tabula, and pdfplumber all work from the raw PDF character stream. They parse the internal PDF operators to find character positions, then apply heuristics to detect column boundaries, row separators, and cell groupings.

This approach works on simple, well-formed tables with visible gridlines and single-row headers. It breaks on:

  • Multi-level headers — the tools see characters at different y-coordinates and have no way to infer parent-child header relationships.
  • Merged cells — a cell spanning multiple rows is a visual concept. In the PDF character stream, it’s just text at a position. There’s no “merge” instruction.
  • Tables without gridlines — Camelot’s lattice mode requires drawn lines. Stream mode and pdfplumber’s heuristics infer boundaries from whitespace gaps, and those guesses fail when column spacing is tight or inconsistent.
  • Scanned PDFs — all three tools require text-native PDFs. Scanned documents produce zero output without a separate OCR step.
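To make the whitespace-gap failure concrete, here is a toy version of the idea. It is a simplified illustration, not the actual algorithm used by any of the three libraries; `find_column_gaps`, the `min_gap` threshold, and the coordinates are all made up for the example.

```python
def find_column_gaps(word_spans, min_gap=10.0):
    """Return empty vertical bands at least min_gap wide.

    word_spans holds the (x0, x1) horizontal extent of every word in the
    table, in PDF points. Each band returned is a candidate column separator.
    """
    gaps = []
    covered_until = None
    for x0, x1 in sorted(word_spans):
        if covered_until is not None and x0 - covered_until >= min_gap:
            gaps.append((covered_until, x0))
        covered_until = x1 if covered_until is None else max(covered_until, x1)
    return gaps


# Hypothetical layout: a label column followed by two numeric columns.
spans = [(50, 90), (50, 120), (150, 190), (152, 190), (220, 260)]
print(find_column_gaps(spans))  # [(120, 150), (190, 220)]

# One long label reaching x=145 shrinks the first gap below the threshold,
# so the label column and the first value column are no longer separated.
print(find_column_gaps(spans + [(50, 145)]))  # [(190, 220)]
```

A single wide label is enough to erase a column boundary, which is the same kind of failure as the “Total Revenue53,500” cell in the Camelot output earlier.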

pdfToMarkdown: vision-based table extraction

pdfToMarkdown takes a fundamentally different approach. Instead of parsing PDF internals, it renders each page as an image and processes it with a vision-language model that reads the table the way a human would.

The model sees the grid structure, the alignment, the header hierarchy, the bold subtotals. It doesn’t need drawn gridlines or character-position heuristics.

Using the API directly

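import requests

# POST the PDF's URL to the convert endpoint using the public demo key.
response = requests.post(
    "https://pdftomarkdown.dev/v1/convert",
    headers={
        "Authorization": "Bearer demo_public_key",
        "Content-Type": "application/json",
    },
    json={
        "input": {
            "pdf_url": "https://example.com/financial_report.pdf"
        }
    },
    timeout=60,  # don't hang indefinitely on a slow conversion
)
response.raise_for_status()  # surface HTTP errors before reading the body

result = response.json()
print(result["output"]["markdown"])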

Using the Python SDK

from pdftomarkdown import convert

result = convert("financial_report.pdf", api_key="demo_public_key")
print(result.markdown)

Result on the financial statement

| | 2023 | | 2022 | |
|---|---|---|---|---|
| | **Q3** | **Q4** | **Q3** | **Q4** |
| Revenue | | | | |
| Product | $42,100 | $48,300 | $35,200 | $39,800 |
| Services | $11,400 | $13,200 | $9,800 | $10,500 |
| **Total Revenue** | **$53,500** | **$61,500** | **$45,000** | **$50,300** |
| COGS | $27,800 | $31,200 | $23,400 | $26,100 |
| **Gross Profit** | **$25,700** | **$30,300** | **$21,600** | **$24,200** |
| Gross Margin | 48.0% | 49.3% | 48.0% | 48.1% |

The multi-level header is preserved as two rows, with the parent year labels in the correct column positions. Subtotal rows are distinguished with bold. Dollar signs, percentages, and alignment are all intact. This is valid pipe-delimited markdown that renders correctly anywhere.

Result on the product catalog

| Category | Product | Voltage | Current | Interface |
|---|---|---|---|---|
| **Sensors** | TMP-102 | 1.4–3.6V | 10 µA | I²C |
| | BMP-380 | 1.7–3.6V | 3.4 µA | SPI, I²C |
| | LIS3DH | 1.71–3.6V | 11 µA | SPI, I²C |
| **Wireless** | nRF52840 | 1.7–5.5V | 4.8 mA | BLE 5.0 |
| | ESP32-C3 | 3.0–3.6V | 80 mA | Wi-Fi, BLE |
| **Actuators** | DRV8833 | 2.7–10.8V | 1.8 A | PWM |

Category labels are bold, correctly placed in the first row of each group, and subsequent rows leave the category cell empty — matching the visual structure of the source document. The table has correct column headers, consistent cell boundaries, and renders as a proper table in any markdown viewer.

Parsing the output into pandas

The markdown table format is easy to convert into a DataFrame:

import io
import pandas as pd
from pdftomarkdown import convert

result = convert("financial_report.pdf", api_key="demo_public_key")

# Extract markdown tables from the full document
tables = []
current_table = []
for line in result.markdown.split("\n"):
    if line.strip().startswith("|"):
        current_table.append(line.strip())
    elif current_table:
        tables.append("\n".join(current_table))
        current_table = []
if current_table:
    tables.append("\n".join(current_table))

# Parse the first table into a DataFrame
rows = []
for line in tables[0].split("\n"):
    cells = [c.strip() for c in line.split("|")[1:-1]]  # strip outer pipes
    rows.append(cells)

# Drop the separator row: every cell is only dashes and optional
# alignment colons (---, :---, :---:)
header = rows[0]
data = [r for r in rows[1:] if not all(c and set(c) <= set("-:") for c in r)]

df = pd.DataFrame(data, columns=header)
print(df)

No custom column-position logic. No forward-fill heuristics. The table structure is already solved by the time you get the response.
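The cells in that DataFrame are still strings such as `**$53,500**` or `48.0%`. If you need numbers, a small cleanup pass helps. The helper below is a sketch under simple assumptions: `parse_cell` is a hypothetical name, amounts use `$` and comma thousands separators as in the tables above, and percentages should become fractions. It does not handle negatives written in parentheses.

```python
def parse_cell(cell):
    """Convert a markdown table cell to a float where possible.

    Strips bold markers, dollar signs, and thousands separators.
    Percentages are returned as fractions (48.0% -> 0.48).
    Non-numeric cells come back as cleaned strings.
    """
    text = cell.replace("**", "").strip()
    is_percent = text.endswith("%")
    number = text.rstrip("%").replace("$", "").replace(",", "")
    try:
        value = float(number)
    except ValueError:
        return text
    return value / 100 if is_percent else value


print(parse_cell("**$53,500**"))        # 53500.0
print(parse_cell("48.0%"))              # 0.48
print(parse_cell("**Total Revenue**"))  # Total Revenue
```

One way to apply it to a whole DataFrame is column by column, for example `df[col].map(parse_cell)`.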

Side-by-side summary

| | Camelot | tabula-py | pdfplumber | pdfToMarkdown |
|---|---|---|---|---|
| Multi-level headers | Flattened | Split into data rows | Two flat rows, no parent-child link | Preserved with correct column positions |
| Merged cells | Concatenated with adjacent text | Dropped (NaN) | Empty strings, requires forward-fill | Correctly placed with visual grouping |
| Tables without gridlines | Lattice fails; stream guesses | Heuristic detection, inconsistent | Heuristic detection, better than tabula | Vision-based, no gridlines needed |
| Scanned PDFs | Not supported | Not supported | Not supported | Full OCR built in |
| Dollar signs / formatting | Often stripped | Often stripped | Preserved | Preserved with bold for subtotals |
| Output format | pandas DataFrame | pandas DataFrame | List of lists | Markdown (pipe-delimited) |

When to use the rule-based tools

Camelot, tabula, and pdfplumber are still useful for:

  • Simple tables with drawn gridlines and single-row headers — they’re fast, free, and run locally.
  • High-volume batch jobs where you control the PDF format and can guarantee consistent table layouts.
  • Environments where you can’t make external API calls — all three run entirely on your machine.

If your tables are complex, inconsistent across documents, or come from scanned PDFs, the rule-based approach will cost you more time in debugging heuristics than the API call costs in latency.

Get started

The demo key (demo_public_key) works immediately — no signup, no account creation. It’s limited to 1 page per PDF with a watermark, but it’s enough to test the table extraction quality on your own documents.

For production use, sign in with GitHub to get 100 pages/month free, no credit card required.

curl -X POST https://pdftomarkdown.dev/v1/convert \
  -H "Authorization: Bearer demo_public_key" \
  -H "Content-Type: application/json" \
  -d '{"input":{"pdf_url":"https://your-pdf-url-here.com/report.pdf"}}'

Read more about table extraction capabilities on the PDF Table Extraction API page, or check the API documentation for the full endpoint reference.