Legal Document OCR API
One endpoint. POST a PDF, get clean markdown back — tables, headings, and lists preserved exactly as they appear on the page.
$ curl -X POST https://pdftomarkdown.dev/v1/convert \
-H "Authorization: Bearer demo_public_key" \
-H "Content-Type: application/json" \
-d '{"input":{"pdf_url":"https://www.w3.org/WAI/ER/tests/xhtml/testfiles/resources/pdf/dummy.pdf"}}'
{
  "markdown": "# Document\n\n| Column A | Column B |\n|---|---|\n| Value 1 | Value 2 |\n\n**Summary text here**",
  "pages": 3,
  "request_id": "req_abc123"
}
Legal PDFs are uniquely hard to parse
Legal documents have structure that matters: section numbers, defined terms, cross-references, schedules, and exhibits. A flat text extraction loses all of this. You end up with a wall of text where “Section 12.3(b)(i)” is indistinguishable from ordinary paragraph text.
pdfToMarkdown preserves the structure of legal documents. Section headings become markdown headings. Numbered lists and sub-clauses become nested lists. Tables in schedules and annexes are preserved as markdown tables.
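Once clause numbers survive extraction, cross-references such as "Section 12.3(b)(i)" can be picked out with a pattern match. The sketch below is illustrative only: the `find_section_refs` helper and its regular expression are assumptions about one common numbering style, not part of the pdfToMarkdown API, and they will miss references phrased differently (for example "Clause 4" or "Article IV").

```python
import re

# Hypothetical pattern: "Section", a dotted number, then optional
# parenthesised sub-clause markers such as (b) or (i).
SECTION_REF = re.compile(
    r"\bSection\s+\d+(?:\.\d+)*(?:\([a-z0-9]+\))*",
    re.IGNORECASE,
)


def find_section_refs(markdown: str) -> list[str]:
    """Return every section cross-reference found in the text, in order."""
    return SECTION_REF.findall(markdown)


if __name__ == "__main__":
    sample = "Subject to Section 12.3(b)(i), use is limited to the Purpose in Section 3."
    print(find_section_refs(sample))
```

The same list can be compared against the headings in the converted markdown to flag references that point at a section the document does not contain.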
What a parsed contract looks like
Input: a standard NDA in PDF format.
Output:
# Non-Disclosure Agreement
**Effective Date:** January 15, 2024
**Between:** Acme Corp ("Disclosing Party") and Widgets Inc ("Receiving Party")
## 1. Definitions
**1.1 "Confidential Information"** means any information disclosed by the Disclosing Party
to the Receiving Party, either directly or indirectly, in writing, orally or by inspection of
tangible objects, that is designated as "Confidential"...
## 2. Obligations of Receiving Party
The Receiving Party agrees to:
1. Hold Confidential Information in strict confidence
2. Not disclose Confidential Information to third parties without prior written consent
3. Use Confidential Information solely for the Purpose described in Section 3
## 3. Purpose
The Receiving Party may use the Confidential Information solely for the purpose of
evaluating a potential business relationship between the parties ("Purpose").
...
## Schedule A — Permitted Recipients
| Name | Role | Department |
|---------------|-------------------|------------------|
| Jane Smith | VP Engineering | Product |
| John Doe | Legal Counsel | Legal |
Clause numbering, bolded defined terms, and schedule tables are all preserved.
Use cases in legal tech
Contract review and summarization
Feed the markdown output to an LLM to identify key dates, obligations, termination clauses, and defined terms. The preserved structure helps models locate specific provisions reliably.
from pdftomarkdown import convert

result = convert("nda.pdf")

# Section headings make it easy to split by clause.
# The first chunk is the preamble before the first "## " heading, so keep it separate.
preamble, *clauses = result.markdown.split("\n## ")
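Defined terms can often be pulled out before the text ever reaches a model, which gives a quick checklist to validate the model's answer against. This is a rough sketch that works on a plain markdown string; the `defined_terms` helper and both patterns are assumptions based on the two drafting styles visible in the sample NDA above, and real contracts will use other styles (including curly quotes) that these patterns do not cover.

```python
import re

# Assumed drafting patterns, taken from the sample output:
#   Acme Corp ("Disclosing Party")
#   **1.1 "Confidential Information"** means ...
PAREN_TERM = re.compile(r'\(\s*(?:the\s+)?"([^"\n]+)"\s*\)')
MEANS_TERM = re.compile(r'"([^"\n]+)"\**\s+means\b')


def defined_terms(markdown: str) -> list[str]:
    """Collect defined terms, de-duplicated, parenthetical definitions first."""
    seen: list[str] = []
    for pattern in (PAREN_TERM, MEANS_TERM):
        for term in pattern.findall(markdown):
            if term not in seen:
                seen.append(term)
    return seen


if __name__ == "__main__":
    sample = (
        '**Between:** Acme Corp ("Disclosing Party") and Widgets Inc ("Receiving Party")\n'
        '**1.1 "Confidential Information"** means any information disclosed...\n'
    )
    print(defined_terms(sample))
```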
E-discovery document review
Process large batches of filings, depositions, and exhibits. Because headings and page structure are preserved, full-text search becomes more precise — you can search within specific sections.
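Section-scoped search can be approximated by tracking the most recent heading while scanning the markdown line by line. The `search_sections` function below is a hypothetical helper written for illustration, not something the API provides; it treats any line starting with `#` as a heading and does plain case-insensitive substring matching.

```python
from typing import Optional


def search_sections(
    markdown: str, term: str, section: Optional[str] = None
) -> list[tuple[str, str]]:
    """Return (heading, line) pairs for lines containing `term`.

    If `section` is given, only lines under headings containing that
    substring are searched. Heading lines themselves are not searched.
    """
    needle = term.lower()
    wanted = section.lower() if section else None
    current = "(preamble)"
    hits: list[tuple[str, str]] = []
    for line in markdown.splitlines():
        if line.startswith("#"):
            current = line.lstrip("#").strip()
            continue
        if wanted is not None and wanted not in current.lower():
            continue
        if needle in line.lower():
            hits.append((current, line.strip()))
    return hits


if __name__ == "__main__":
    doc = (
        "# Non-Disclosure Agreement\n"
        "## 2. Obligations of Receiving Party\n"
        "1. Hold Confidential Information in strict confidence\n"
        "## 3. Purpose\n"
        "Evaluating a potential business relationship.\n"
    )
    for heading, line in search_sections(doc, "confidential", section="Obligations"):
        print(f"[{heading}] {line}")
```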
Regulatory filing extraction
Extract data from SEC filings, planning applications, or patent documents. Tables in exhibits and schedules are converted to markdown tables that can be further parsed.
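As an example of that further parsing, a pipe-delimited markdown table can be turned into a list of row dictionaries with a few lines of standard-library Python. This is a minimal sketch: `parse_markdown_table` is a hypothetical helper, and it assumes simple tables like Schedule A above, with no escaped pipes inside cells.

```python
def _is_separator(cells: list[str]) -> bool:
    """True for the |---|---| row that follows the header."""
    return all(cell and set(cell) <= set("-:") for cell in cells)


def parse_markdown_table(block: str) -> list[dict[str, str]]:
    """Convert one markdown table into a list of {header: value} dicts."""
    rows = []
    for line in block.strip().splitlines():
        line = line.strip()
        if not line.startswith("|"):
            continue  # ignore anything that is not a table row
        cells = [cell.strip() for cell in line.strip("|").split("|")]
        if _is_separator(cells):
            continue
        rows.append(cells)
    if not rows:
        return []
    header, body = rows[0], rows[1:]
    return [dict(zip(header, row)) for row in body]


if __name__ == "__main__":
    schedule_a = """
| Name | Role | Department |
|---------------|-------------------|------------------|
| Jane Smith | VP Engineering | Product |
| John Doe | Legal Counsel | Legal |
"""
    for row in parse_markdown_table(schedule_a):
        print(row["Name"], "-", row["Role"])
```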
Legal database ingestion
Convert a library of contracts to markdown, chunk by section, and index in a vector database for similarity search and contract comparison.
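The chunking step might look like the sketch below, which emits one record per `## ` section with the heading kept as metadata. The `chunk_by_section` name and the record fields (`source`, `heading`, `text`) are illustrative assumptions; the embedding and indexing calls are left out because they depend on whichever vector database is used.

```python
def chunk_by_section(markdown: str, source: str) -> list[dict[str, str]]:
    """Split converted markdown into one chunk per second-level heading."""
    chunks: list[dict[str, str]] = []
    heading = "Preamble"  # text before the first "## " heading
    lines: list[str] = []

    def add_chunk(title: str, body: list[str]) -> None:
        text = "\n".join(body).strip()
        if text:  # skip empty sections
            chunks.append({"source": source, "heading": title, "text": text})

    for line in markdown.splitlines():
        if line.startswith("## "):
            add_chunk(heading, lines)
            heading, lines = line[3:].strip(), []
        else:
            lines.append(line)
    add_chunk(heading, lines)
    return chunks


if __name__ == "__main__":
    doc = (
        "# Non-Disclosure Agreement\n"
        "## 1. Definitions\n"
        "Confidential Information means...\n"
        "## 3. Purpose\n"
        "Evaluating a potential business relationship.\n"
    )
    for chunk in chunk_by_section(doc, source="nda.pdf"):
        print(chunk["heading"], "->", len(chunk["text"]), "chars")
```

Keeping the heading alongside each chunk lets search results be traced back to a specific clause rather than to the whole contract.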
Handles complex legal document formats
- Multi-page contracts with running headers and footers (stripped cleanly)
- Exhibits and schedules with dense tables
- Court filings with caption blocks and docket information
- Patent documents with claims in numbered list format
- Scanned legacy contracts — image-only PDFs
- Mixed documents with both native text and scanned pages
Related pages
- PDF Parsing API — general structured text extraction
- OCR API for Developers — technical overview of the API
- API documentation — full endpoint reference with Python and curl examples
Pricing
Both tiers are free. No credit card required.
Hacker
Free, no signup
- Public demo key — copy & paste
- Only page 1 is processed
- 1 request/min per IP
- Watermark in output
Developer
Free, GitHub login
- Personal API key
- 100 pages/month
- Multi-page PDFs
- No watermark
Start parsing contracts free
The Hacker tier needs no account: it converts page 1 only and adds a watermark. Sign in with GitHub for the Developer tier to remove the watermark and unlock full multi-page PDFs.
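For a quick trial from Python instead of curl, the request shown at the top of this page can be rebuilt with the standard library alone. The endpoint, headers, demo key, and JSON body mirror that curl example; the `build_request` and `convert_url` helper names are made up for this sketch, and the response is assumed to be the JSON object shown earlier (`markdown`, `pages`, `request_id`).

```python
import json
import urllib.request

API_URL = "https://pdftomarkdown.dev/v1/convert"


def build_request(pdf_url: str, api_key: str = "demo_public_key") -> urllib.request.Request:
    """Build the same POST request as the curl example, without sending it."""
    body = json.dumps({"input": {"pdf_url": pdf_url}}).encode("utf-8")
    return urllib.request.Request(
        API_URL,
        data=body,
        method="POST",
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
    )


def convert_url(pdf_url: str, api_key: str = "demo_public_key", timeout: float = 60) -> dict:
    """Send the request and return the decoded JSON response."""
    with urllib.request.urlopen(build_request(pdf_url, api_key), timeout=timeout) as resp:
        return json.load(resp)


if __name__ == "__main__":
    result = convert_url(
        "https://www.w3.org/WAI/ER/tests/xhtml/testfiles/resources/pdf/dummy.pdf"
    )
    print(result["markdown"])
```

With the public demo key, remember the limit of 1 request per minute per IP when running this in a loop.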