How to Extract Tables from PDFs in Python
Extracting tables from PDFs in Python sounds like it should be a solved problem. There are at least three well-known libraries for it. But if you’ve tried any of them on a real-world document — a financial statement with subtotals, a product catalog with merged category rows, a spec sheet with multi-level headers — you know how quickly they fall apart.
This post walks through the three main Python table extraction tools, shows where each one breaks, and demonstrates an approach that handles the hard cases.
The test documents
To make this comparison concrete, we’ll use two types of tables that appear constantly in real PDFs:
1. A financial income statement with multi-level column headers (Year > Quarter), bold subtotal rows, and right-aligned dollar amounts:
                      2023                2022
                      Q3        Q4        Q3        Q4
Revenue
  Product        $42,100   $48,300   $35,200   $39,800
  Services       $11,400   $13,200    $9,800   $10,500
Total Revenue    $53,500   $61,500   $45,000   $50,300
COGS             $27,800   $31,200   $23,400   $26,100
Gross Profit     $25,700   $30,300   $21,600   $24,200
Gross Margin       48.0%     49.3%     48.0%     48.1%
2. A product catalog table with merged cells — a category label spanning multiple product rows, with specs across several columns:
Category    | Product   | Voltage   | Current | Interface
------------+-----------+-----------+---------+-----------
Sensors     | TMP-102   | 1.4–3.6V  | 10 µA   | I²C
            | BMP-380   | 1.7–3.6V  | 3.4 µA  | SPI, I²C
            | LIS3DH    | 1.71–3.6V | 11 µA   | SPI, I²C
Wireless    | nRF52840  | 1.7–5.5V  | 4.8 mA  | BLE 5.0
            | ESP32-C3  | 3.0–3.6V  | 80 mA   | Wi-Fi, BLE
Actuators   | DRV8833   | 2.7–10.8V | 1.8 A   | PWM
Both are common layouts. Both are hard for rule-based parsers.
Attempt 1: Camelot
Camelot is one of the most popular PDF table extraction libraries. It works by detecting line segments (lattice mode) or text alignment patterns (stream mode) to find table boundaries.
import camelot
# Lattice mode — looks for drawn gridlines
tables = camelot.read_pdf("financial_report.pdf", pages="1", flavor="lattice")
print(f"Found {len(tables)} tables")
if len(tables) > 0:
    print(tables[0].df.to_string())
Result on the financial statement
Camelot’s lattice mode finds nothing — the table uses whitespace alignment and alternating row shading, not drawn gridlines. Switching to stream mode:
tables = camelot.read_pdf("financial_report.pdf", pages="1", flavor="stream")
print(tables[0].df.to_string())
Output:
0 1 2 3 4
0 2023 2022
1 Q3 Q4 Q3 Q4
2 Revenue
3 Product 42,100 48,300 35,200 39,800
4 Services 11,400 13,200 9,800 10,500
5 Total Revenue53,500 61,500 45,000 50,300
6 COGS 27,800 31,200 23,400 26,100
7 Gross Profit25,700 30,300 21,600 24,200
8 Gross Margin48.0% 49.3% 48.0% 48.1%
The multi-level header (“2023” spanning Q3/Q4) is flattened into separate rows with no relationship between them. The longer row labels (“Total Revenue”, “Gross Profit”, “Gross Margin”) each merge with their first value into a single cell. Dollar signs are stripped. The hierarchical indentation of “Product” and “Services” under “Revenue” is lost.
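One possible stopgap for the fused label-and-value cells is a regular expression that peels a trailing number off the label. The sketch below is a workaround under stated assumptions, not a Camelot feature: `split_fused_cell` is a hypothetical helper, and it assumes the label itself never ends in a digit.

```python
import re

# Hypothetical helper, not part of Camelot.
# Assumes a text label followed directly by a number such as 53,500 or 48.0%.
FUSED = re.compile(r"^(?P<label>.*?[^\d,.\s])\s*(?P<value>\d[\d,]*(?:\.\d+)?%?)$")


def split_fused_cell(cell):
    """Return (label, value) if a number is fused onto a label, else (cell, None)."""
    match = FUSED.match(cell)
    if match is None:
        return cell, None
    return match.group("label"), match.group("value")


print(split_fused_cell("Total Revenue53,500"))  # ('Total Revenue', '53,500')
print(split_fused_cell("Gross Margin48.0%"))    # ('Gross Margin', '48.0%')
print(split_fused_cell("Product"))              # ('Product', None)
```

This only suits the first column of the financial rows. Applied to a header cell such as "Q3" or a part number such as "TMP-102", it would split in the wrong place, which is the kind of per-document patching that does not generalize.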
Result on the product catalog
Stream mode produces:
0 1 2 3 4
0 SensorsTMP-102 1.4–3.6V 10 µA I²C
1 BMP-380 1.7–3.6V 3.4 µA SPI, I²C
2 LIS3DH 1.71–3.6V 11 µA SPI, I²C
3 WirelessnRF52840 1.7–5.5V 4.8 mA BLE 5.0
4 ESP32-C3 3.0–3.6V 80 mA Wi-Fi, BLE
5 ActuatorsDRV8833 2.7–10.8V 1.8 A PWM
The merged “Category” cell contents get concatenated with the first product name: “SensorsTMP-102”, “WirelessnRF52840”, “ActuatorsDRV8833”. Rows 1–2 and 4 have empty category columns, which is technically correct, but the concatenation in the first row of each group makes downstream parsing unreliable.
Attempt 2: tabula-py
tabula-py is the Python wrapper for Tabula, a Java-based PDF table extractor.
import tabula
# Extract all tables from the first page
tables = tabula.read_pdf("financial_report.pdf", pages=1, multiple_tables=True)
for i, table in enumerate(tables):
    print(f"--- Table {i} ---")
    print(table.to_string())
Result on the financial statement
--- Table 0 ---
Unnamed: 0 2023 Unnamed: 1 2022 Unnamed: 2
0 NaN Q3 Q4 Q3 Q4
1 Revenue NaN NaN NaN NaN
2 Product 42100 48300 35200 39800
3 Services 11400 13200 9800 10500
4 Total Revenue 53500 61500 45000 50300
5 COGS 27800 31200 23400 26100
6 Gross Profit 25700 30300 21600 24200
7 Gross Margin 48.0% 49.3% 48.0% 48.1%
The multi-level header is split across two rows, but the parent-child relationship between “2023” and its Q3/Q4 columns is lost: the second quarter under each year lands in a separate “Unnamed” column. pandas reads “2023” and “2022” as column headers but “Q3”/“Q4” as data row 0, so the DataFrame has the wrong shape. Feeding this into any automated pipeline requires manual header reconstruction.
Result on the product catalog
tabula drops the merged category cells entirely:
--- Table 0 ---
Unnamed: 0 Product Voltage Current Interface
0 NaN TMP-102 1.4–3.6V 10 µA I²C
1 NaN BMP-380 1.7–3.6V 3.4 µA SPI, I²C
2 NaN LIS3DH 1.71–3.6V 11 µA SPI, I²C
3 NaN nRF52840 1.7–5.5V 4.8 mA BLE 5.0
4 NaN ESP32-C3 3.0–3.6V 80 mA Wi-Fi, BLE
5 NaN DRV8833 2.7–10.8V 1.8 A PWM
Every row in the first column is NaN, and the “Category” header itself has become “Unnamed: 0”. The merged cells are effectively invisible to tabula: each one spans multiple rows in the PDF layout, and tabula’s heuristics appear unable to assign its text to a single row and column, so the content is discarded.
Attempt 3: pdfplumber
pdfplumber takes a different approach — it exposes low-level PDF objects (characters, lines, rectangles) and provides table-finding heuristics on top.
import pdfplumber
with pdfplumber.open("financial_report.pdf") as pdf:
    page = pdf.pages[0]
    tables = page.extract_tables()
    for i, table in enumerate(tables):
        print(f"--- Table {i} ---")
        for row in table:
            print(row)
Result on the financial statement
--- Table 0 ---
['', '2023', '', '2022', '']
['', 'Q3', 'Q4', 'Q3', 'Q4']
['Revenue', '', '', '', '']
['Product', '$42,100', '$48,300', '$35,200', '$39,800']
['Services', '$11,400', '$13,200', '$9,800', '$10,500']
['Total Revenue', '$53,500', '$61,500', '$45,000', '$50,300']
['COGS', '$27,800', '$31,200', '$23,400', '$26,100']
['Gross Profit', '$25,700', '$30,300', '$21,600', '$24,200']
['Gross Margin', '48.0%', '49.3%', '48.0%', '48.1%']
pdfplumber does the best job of the three. It preserves dollar signs and keeps cells separate. But the multi-level header is still two flat rows. The “2023” cell spans two columns visually, but pdfplumber returns it in one cell with the adjacent cell empty. You need custom code to figure out that “Q4” in position 2 belongs under “2023” in position 1, not under the empty string directly above it.
This is workable for one document. It’s not workable if you’re processing hundreds of documents with different header layouts.
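For a single known layout, that custom code can be short. Here is a minimal sketch, assuming each parent label sits in the leftmost column of its span, as it does in the output above. `merge_header_rows` is a hypothetical helper, not part of pdfplumber.

```python
def merge_header_rows(parent_row, child_row):
    """Combine a spanning header row with the row beneath it.

    Assumes each parent label occupies the leftmost column of its span, so an
    empty parent cell inherits the nearest non-empty label to its left.
    """
    merged = []
    current_parent = ""
    for parent, child in zip(parent_row, child_row):
        if parent:
            current_parent = parent
        # Attach a parent only where a child label exists; the row-label
        # column has neither and stays empty.
        label = f"{current_parent} {child}".strip() if child else parent
        merged.append(label)
    return merged


parent = ['', '2023', '', '2022', '']
child = ['', 'Q3', 'Q4', 'Q3', 'Q4']
print(merge_header_rows(parent, child))
# ['', '2023 Q3', '2023 Q4', '2022 Q3', '2022 Q4']
```

The assumption is the fragile part. If another document centers a parent label over three columns, the label lands in the middle cell and the left-to-right fill assigns the first child to the wrong parent.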
Result on the product catalog
--- Table 0 ---
['Sensors', 'TMP-102', '1.4–3.6V', '10 µA', 'I²C']
['', 'BMP-380', '1.7–3.6V', '3.4 µA', 'SPI, I²C']
['', 'LIS3DH', '1.71–3.6V', '11 µA', 'SPI, I²C']
['Wireless', 'nRF52840', '1.7–5.5V', '4.8 mA', 'BLE 5.0']
['', 'ESP32-C3', '3.0–3.6V', '80 mA', 'Wi-Fi, BLE']
['Actuators', 'DRV8833', '2.7–10.8V', '1.8 A', 'PWM']
The category label appears on the first row of each group, then empty strings for subsequent rows. This is the “correct” raw extraction, but it means the category-product relationship is implicit. If you need to know that BMP-380 is a sensor, you have to write forward-fill logic. If the table structure is different in the next PDF — say, the category is in a separate header row instead of a merged cell — that forward-fill logic breaks.
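The forward-fill logic itself is simple when the layout matches this one. The sketch below works on the list-of-lists shape shown above and assumes the category always lives in column 0; `forward_fill_first_column` is a hypothetical helper name.

```python
def forward_fill_first_column(rows):
    """Copy the last non-empty value in column 0 down into empty cells below it."""
    filled = []
    last_seen = ""
    for row in rows:
        if row[0]:
            last_seen = row[0]
        filled.append([last_seen] + list(row[1:]))
    return filled


catalog = [
    ['Sensors', 'TMP-102', '1.4–3.6V', '10 µA', 'I²C'],
    ['', 'BMP-380', '1.7–3.6V', '3.4 µA', 'SPI, I²C'],
    ['', 'LIS3DH', '1.71–3.6V', '11 µA', 'SPI, I²C'],
    ['Wireless', 'nRF52840', '1.7–5.5V', '4.8 mA', 'BLE 5.0'],
    ['', 'ESP32-C3', '3.0–3.6V', '80 mA', 'Wi-Fi, BLE'],
    ['Actuators', 'DRV8833', '2.7–10.8V', '1.8 A', 'PWM'],
]

for row in forward_fill_first_column(catalog):
    print(row[0], row[1])
```

After the fill, BMP-380 and LIS3DH carry "Sensors" and ESP32-C3 carries "Wireless". The function has no way to tell a merged cell from a genuinely blank one, so it is only safe on layouts you have inspected.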
Why all three tools share the same limitation
Camelot, tabula, and pdfplumber all work from the raw PDF character stream. They parse the internal PDF operators to find character positions, then apply heuristics to detect column boundaries, row separators, and cell groupings.
This approach works on simple, well-formed tables with visible gridlines and single-row headers. It breaks on:
- Multi-level headers — the tools see characters at different y-coordinates and have no way to infer parent-child header relationships.
- Merged cells — a cell spanning multiple rows is a visual concept. In the PDF character stream, it’s just text at a position. There’s no “merge” instruction.
- Tables without gridlines — camelot’s lattice mode requires drawn lines. Stream mode and pdfplumber’s heuristics guess at boundaries from whitespace gaps, which fail when column spacing is tight or inconsistent.
- Scanned PDFs — all three tools require text-native PDFs. Scanned documents produce zero output without a separate OCR step.
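To make the whitespace-gap idea concrete, here is a toy illustration of column detection from word positions. It is not the algorithm any of the three libraries actually uses. The function name, the `min_gap` threshold, and all coordinates are invented for the example.

```python
def column_gaps(rows, min_gap=5):
    """Return (start, end) x-ranges that no word overlaps in any row.

    rows: list of rows, each a list of (text, x0, x1) word boxes.
    min_gap: hypothetical threshold; narrower gaps are ignored.
    """
    # Pool the horizontal extent of every word in the table region.
    spans = sorted((x0, x1) for row in rows for _, x0, x1 in row)

    # Merge overlapping extents into occupied bands.
    bands = []
    for x0, x1 in spans:
        if bands and x0 <= bands[-1][1]:
            bands[-1][1] = max(bands[-1][1], x1)
        else:
            bands.append([x0, x1])

    # Whitespace between neighbouring bands is a candidate column boundary.
    return [
        (left[1], right[0])
        for left, right in zip(bands, bands[1:])
        if right[0] - left[1] >= min_gap
    ]


# Made-up coordinates: a label column followed by two number columns.
loose = [
    [("Product", 10, 50), ("$42,100", 100, 140), ("$48,300", 160, 200)],
    [("Total Revenue", 10, 80), ("$53,500", 100, 140), ("$61,500", 160, 200)],
]
# Same table, but the long label now ends 2 units before the first number.
tight = [
    [("Product", 10, 50), ("$42,100", 100, 140), ("$48,300", 160, 200)],
    [("Total Revenue", 10, 98), ("$53,500", 100, 140), ("$61,500", 160, 200)],
]

print(column_gaps(loose))  # [(80, 100), (140, 160)]
print(column_gaps(tight))  # [(140, 160)]
```

In the tight case the boundary between the label column and the first number column disappears, so the two would be read as one column. That is one plausible way a fused cell like the ones in the stream-mode output can arise.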
pdfToMarkdown: vision-based table extraction
pdfToMarkdown takes a fundamentally different approach. Instead of parsing PDF internals, it renders each page as an image and processes it with a vision-language model that reads the table the way a human would.
The model sees the grid structure, the alignment, the header hierarchy, the bold subtotals. It doesn’t need drawn gridlines or character-position heuristics.
Using the API directly
import requests

response = requests.post(
    "https://pdftomarkdown.dev/v1/convert",
    headers={
        "Authorization": "Bearer demo_public_key",
        "Content-Type": "application/json",
    },
    json={
        "input": {
            "pdf_url": "https://example.com/financial_report.pdf"
        }
    },
)
response.raise_for_status()  # fail on HTTP errors instead of parsing an error body
result = response.json()
print(result["output"]["markdown"])
Using the Python SDK
from pdftomarkdown import convert
result = convert("financial_report.pdf", api_key="demo_public_key")
print(result.markdown)
Result on the financial statement
| | 2023 | | 2022 | |
|---|---|---|---|---|
| | **Q3** | **Q4** | **Q3** | **Q4** |
| Revenue | | | | |
| Product | $42,100 | $48,300 | $35,200 | $39,800 |
| Services | $11,400 | $13,200 | $9,800 | $10,500 |
| **Total Revenue** | **$53,500** | **$61,500** | **$45,000** | **$50,300** |
| COGS | $27,800 | $31,200 | $23,400 | $26,100 |
| **Gross Profit** | **$25,700** | **$30,300** | **$21,600** | **$24,200** |
| Gross Margin | 48.0% | 49.3% | 48.0% | 48.1% |
The multi-level header is preserved as two rows, with the parent year labels in the correct column positions. Subtotal rows are distinguished with bold. Dollar signs and percentages are intact, and every value sits in the correct column. This is valid pipe-delimited markdown that renders in any viewer with pipe-table support, such as GitHub-flavored Markdown.
Result on the product catalog
| Category | Product | Voltage | Current | Interface |
|---|---|---|---|---|
| **Sensors** | TMP-102 | 1.4–3.6V | 10 µA | I²C |
| | BMP-380 | 1.7–3.6V | 3.4 µA | SPI, I²C |
| | LIS3DH | 1.71–3.6V | 11 µA | SPI, I²C |
| **Wireless** | nRF52840 | 1.7–5.5V | 4.8 mA | BLE 5.0 |
| | ESP32-C3 | 3.0–3.6V | 80 mA | Wi-Fi, BLE |
| **Actuators** | DRV8833 | 2.7–10.8V | 1.8 A | PWM |
Category labels are bold and correctly placed in the first row of each group, and subsequent rows leave the category cell empty, matching the visual structure of the source document. The table has correct column headers and consistent cell boundaries, and it renders as a proper table in any markdown viewer that supports pipe tables.
Parsing the output into pandas
The markdown table format is easy to convert into a DataFrame:
import pandas as pd
from pdftomarkdown import convert

result = convert("financial_report.pdf", api_key="demo_public_key")

# Collect each run of consecutive pipe-prefixed lines as one markdown table
tables = []
current_table = []
for line in result.markdown.split("\n"):
    if line.strip().startswith("|"):
        current_table.append(line.strip())
    elif current_table:
        tables.append("\n".join(current_table))
        current_table = []
if current_table:
    tables.append("\n".join(current_table))

# Parse the first table into a DataFrame
rows = []
for line in tables[0].split("\n"):
    cells = [c.strip() for c in line.split("|")[1:-1]]  # strip outer pipes
    rows.append(cells)

# The first row is the header; drop the separator row (the ---|--- line)
header = rows[0]
data = [r for r in rows[1:] if not all(c.startswith("-") for c in r)]
df = pd.DataFrame(data, columns=header)
print(df)
No custom column-position logic. No forward-fill heuristics. The table structure is already solved by the time you get the response.
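The cells are still strings at this point, and subtotal cells carry the markdown bold markers. A small cleaning step, sketched here with a hypothetical `parse_cell` helper, strips the markers and converts currency and percentage strings to numbers. It assumes US-style formatting with a leading dollar sign and comma thousands separators, and it does not handle negative amounts.

```python
import re


def parse_cell(cell):
    """Strip markdown bold markers, then convert currency and percent strings.

    Returns a float for values like "$53,500" or "48.0%" (percentages stay in
    percentage points) and the cleaned string for anything else.
    """
    text = cell.replace("**", "").strip()
    if re.fullmatch(r"\$[\d,]+(?:\.\d+)?", text):
        return float(text[1:].replace(",", ""))
    if re.fullmatch(r"\d+(?:\.\d+)?%", text):
        return float(text[:-1])
    return text


row = ["**Total Revenue**", "**$53,500**", "**$61,500**", "**$45,000**", "**$50,300**"]
print([parse_cell(c) for c in row])
# ['Total Revenue', 53500.0, 61500.0, 45000.0, 50300.0]
```

It can be applied cell by cell to the parsed rows before the DataFrame is built, so the numeric columns arrive as floats rather than strings.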
Side-by-side summary
| | Camelot | tabula-py | pdfplumber | pdfToMarkdown |
|---|---|---|---|---|
| Multi-level headers | Flattened | Split into data rows | Two flat rows, no parent-child link | Preserved with correct column positions |
| Merged cells | Concatenated with adjacent text | Dropped (NaN) | Empty strings, requires forward-fill | Correctly placed with visual grouping |
| Tables without gridlines | Lattice fails; stream guesses | Heuristic detection, inconsistent | Heuristic detection, better than tabula | Vision-based, no gridlines needed |
| Scanned PDFs | Not supported | Not supported | Not supported | Full OCR built in |
| Dollar signs / formatting | Often stripped | Often stripped | Preserved | Preserved with bold for subtotals |
| Output format | pandas DataFrame | pandas DataFrame | List of lists | Markdown (pipe-delimited) |
When to use the rule-based tools
Camelot, tabula, and pdfplumber are still useful for:
- Simple tables with drawn gridlines and single-row headers — they’re fast, free, and run locally.
- High-volume batch jobs where you control the PDF format and can guarantee consistent table layouts.
- Environments where you can’t make external API calls — all three run entirely on your machine.
If your tables are complex, inconsistent across documents, or come from scanned PDFs, the rule-based approach will cost you more time in debugging heuristics than the API call costs in latency.
Get started
The demo key (demo_public_key) works immediately — no signup, no account creation. It’s limited to 1 page per PDF with a watermark, but it’s enough to test the table extraction quality on your own documents.
For production use, sign in with GitHub to get 100 pages/month free, no credit card required.
curl -X POST https://pdftomarkdown.dev/v1/convert \
  -H "Authorization: Bearer demo_public_key" \
  -H "Content-Type: application/json" \
  -d '{"input":{"pdf_url":"https://your-pdf-url-here.com/report.pdf"}}'
Read more about table extraction capabilities on the PDF Table Extraction API page, or check the API documentation for the full endpoint reference.