Legal PDFs are uniquely hard to parse

Legal documents have structure that matters: section numbers, defined terms, cross-references, schedules, and exhibits. A flat text extraction loses all of this. You end up with a wall of text where “Section 12.3(b)(i)” is indistinguishable from ordinary paragraph text.

pdfToMarkdown preserves the structure of legal documents. Section headings become markdown headings. Numbered lists and sub-clauses become nested lists. Tables in schedules and annexes are preserved as markdown tables.

What a parsed contract looks like

Input: a standard NDA in PDF format.

Output:

# Non-Disclosure Agreement

**Effective Date:** January 15, 2024
**Between:** Acme Corp ("Disclosing Party") and Widgets Inc ("Receiving Party")

## 1. Definitions

**1.1 "Confidential Information"** means any information disclosed by the Disclosing Party
to the Receiving Party, either directly or indirectly, in writing, orally or by inspection of
tangible objects, that is designated as "Confidential"...

## 2. Obligations of Receiving Party

The Receiving Party agrees to:

1. Hold Confidential Information in strict confidence
2. Not disclose Confidential Information to third parties without prior written consent
3. Use Confidential Information solely for the Purpose described in Section 3

## 3. Purpose

The Receiving Party may use the Confidential Information solely for the purpose of
evaluating a potential business relationship between the parties ("Purpose").

...

## Schedule A — Permitted Recipients

| Name          | Role              | Department       |
|---------------|-------------------|------------------|
| Jane Smith    | VP Engineering    | Product          |
| John Doe      | Legal Counsel     | Legal            |

Clause numbering, defined terms in bold, and schedule tables are all preserved.
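Because defined terms come through in bold, they can be pulled out with a regular expression. The sketch below is illustrative only: the pattern is inferred from the sample output above (a bold span holding an optional clause number and a quoted term), not from a documented output format, and `extract_defined_terms` is a hypothetical helper name.

```python
import re

# Assumed shape, taken from the sample above: **1.1 "Confidential Information"**
# Group 1 is the optional clause number, group 2 is the quoted term.
# Both straight and curly quotes are accepted.
DEFINED_TERM = re.compile(r'\*\*(?:(\d+(?:\.\d+)*)\s+)?["“]([^"”]+)["”]\*\*')


def extract_defined_terms(markdown):
    """Map each bolded, quoted term to the clause number that defines it (or None)."""
    return {m.group(2): m.group(1) for m in DEFINED_TERM.finditer(markdown)}


sample = '**1.1 "Confidential Information"** means any information disclosed by the Disclosing Party'
terms = extract_defined_terms(sample)
print(terms)  # {'Confidential Information': '1.1'}
```

Terms defined inline without bold, such as the party names in the sample's opening lines, would not be matched by this pattern and would need a separate rule.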

Use cases in legal tech

Contract review and summarization

Feed the markdown output to an LLM to identify key dates, obligations, termination clauses, and defined terms. The preserved structure helps models locate specific provisions reliably.

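from pdftomarkdown import convert

result = convert("nda.pdf")

# Level-2 headings mark clause boundaries, so split on them.
# The first piece is the title block that precedes the first "## " heading;
# the remaining pieces are clauses, each starting with its heading text
# (the "## " marker itself is consumed by the split).
preamble, *clauses = result.markdown.split("\n## ")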
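Splitting on the literal string `"\n## "` discards the heading marker and depends on exact whitespace. A slightly more defensive variant is sketched here on a plain markdown string, so it makes no assumptions about the client library. `split_sections` is a hypothetical helper, and it assumes clause headings are level-2 (`## `) as in the sample output.

```python
import re


def split_sections(markdown):
    """Split on level-2 headings; text before the first one is keyed 'preamble'."""
    # With one capture group, re.split returns
    # [text_before, title1, body1, title2, body2, ...].
    parts = re.split(r"(?m)^## +(.+)$", markdown)
    sections = {"preamble": parts[0].strip()}
    for title, body in zip(parts[1::2], parts[2::2]):
        # Note: a repeated heading title would overwrite the earlier entry.
        sections[title.strip()] = body.strip()
    return sections


sample = (
    "# Non-Disclosure Agreement\n\n"
    "## 1. Definitions\n\nDefinitions text\n\n"
    "## 3. Purpose\n\nPurpose text\n"
)
sections = split_sections(sample)
print(list(sections))  # ['preamble', '1. Definitions', '3. Purpose']
```

Keying by heading text lets a caller pass only the relevant clause to a model, for example `sections["3. Purpose"]`.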

E-discovery document review

Process large batches of filings, depositions, and exhibits. Because headings and page structure are preserved, full-text search becomes more precise — you can search within specific sections.
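One way to search within specific sections is to scan the markdown line by line and remember the most recent heading. The following is a minimal sketch under that assumption; `search_in_sections` and its parameters are hypothetical names, and a production e-discovery index would normally use a proper search engine instead.

```python
import re

HEADING = re.compile(r"^#{1,6}\s+(.*)$")


def search_in_sections(markdown, term, section_filter=""):
    """Yield (heading, line_number, line) for lines that contain `term`.

    Only lines under a heading whose text contains `section_filter` are
    reported. Matching is case-insensitive; an empty filter matches all.
    """
    current = ""
    needle = term.lower()
    wanted = section_filter.lower()
    for number, line in enumerate(markdown.splitlines(), start=1):
        match = HEADING.match(line)
        if match:
            current = match.group(1).strip()  # any heading level resets the context
            continue
        if needle in line.lower() and wanted in current.lower():
            yield current, number, line.strip()


doc = (
    "## 2. Obligations of Receiving Party\n\n"
    "1. Hold Confidential Information in strict confidence\n\n"
    "## 3. Purpose\n\n"
    "The Receiving Party may use the Confidential Information solely for the Purpose.\n"
)
hits = list(search_in_sections(doc, "confidential", section_filter="Obligations"))
```

Here `hits` contains only the line under "2. Obligations of Receiving Party", even though the term also appears in Section 3.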

Regulatory filing extraction

Extract data from SEC filings, planning applications, or patent documents. Tables in exhibits and schedules are converted to markdown tables that can be further parsed.
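As an illustration of that further parsing, a pipe-delimited markdown table can be turned into a list of row dictionaries with the standard library alone. This is a sketch, not part of the API: `parse_markdown_table` is a hypothetical helper, and it assumes a simple table with a header row, one separator row, and no escaped pipes inside cells.

```python
def parse_markdown_table(table):
    """Turn a simple pipe-delimited markdown table into a list of row dicts."""
    rows = []
    for line in table.strip().splitlines():
        line = line.strip()
        if not line.startswith("|"):
            continue  # skip anything that is not a table row
        rows.append([cell.strip() for cell in line.strip("|").split("|")])
    if len(rows) < 2:
        return []
    header, body = rows[0], rows[2:]  # rows[1] is the |---|---| separator row
    return [dict(zip(header, row)) for row in body]


schedule_a = """
| Name          | Role              | Department       |
|---------------|-------------------|------------------|
| Jane Smith    | VP Engineering    | Product          |
| John Doe      | Legal Counsel     | Legal            |
"""
recipients = parse_markdown_table(schedule_a)
print(recipients[0]["Role"])  # VP Engineering
```

Each row becomes a dictionary keyed by column header, which is convenient for loading into a spreadsheet library or a database.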

Legal database ingestion

Convert a library of contracts to markdown, chunk by section, and index in a vector database for similarity search and contract comparison.
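The chunking step might look like the sketch below, which packs paragraphs into chunks that never cross a section boundary and tags each chunk with its heading. The function name, the `max_chars` default, and the chunk dictionary layout are all assumptions for illustration; embedding and the vector database call are deliberately left out because they depend on the store you use.

```python
import re


def chunk_by_section(markdown, source, max_chars=1500):
    """Return chunks tagged with source file and section heading.

    Paragraphs are packed together up to roughly `max_chars`; a single
    paragraph longer than the limit is kept whole rather than cut.
    """
    chunks = []
    heading = "preamble"
    buffer = []

    def flush():
        if buffer:
            chunks.append({"source": source, "section": heading, "text": "\n\n".join(buffer)})
            buffer.clear()

    for para in re.split(r"\n\s*\n", markdown.strip()):
        para = para.strip()
        if not para:
            continue
        if para.startswith("## "):
            flush()  # close the previous section before switching headings
            first_line, _, para = para[3:].partition("\n")
            heading = first_line.strip()
            para = para.strip()
            if not para:
                continue
        if buffer and sum(len(p) for p in buffer) + len(para) > max_chars:
            flush()
        buffer.append(para)
    flush()
    return chunks


sample = "# NDA\n\nIntro\n\n## 1. Definitions\n\nAlpha\n\nBeta\n\n## 2. Purpose\n\nGamma\n"
chunks = chunk_by_section(sample, source="nda.pdf")
```

Storing the section heading alongside each chunk means a similarity hit can be reported as, say, "nda.pdf, 1. Definitions" rather than as an anonymous text fragment.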

Handles complex legal document formats

Multi-page contracts with running headers and footers (stripped cleanly)
Exhibits and schedules with dense tables
Court filings with caption blocks and docket information
Patent documents with claims in numbered list format
Scanned legacy contracts — image-only PDFs
Mixed documents with both native text and scanned pages

PDF Parsing API — general structured text extraction
OCR API for Developers — technical overview of the API
API documentation — full endpoint reference with Python and curl examples

Pricing

Both tiers are free. No credit card required.

Hacker

Free, no signup

Public demo key — copy & paste
Only page 1 is processed
1 request/min per IP
Watermark in output

Developer

Free, GitHub login

Personal API key
100 pages/month
Multi-page PDFs
No watermark

Start parsing contracts free

The Hacker tier needs no account, but it converts page 1 only and adds a watermark. Switch to the Developer tier (also free, with a GitHub login) to remove the watermark and process full multi-page PDFs.
