
Legal Document OCR API

One endpoint. POST a PDF, get clean markdown back — tables, headings, and lists preserved exactly as they appear on the page.

$ curl -X POST https://pdftomarkdown.dev/v1/convert \
  -H "Authorization: Bearer demo_public_key" \
  -H "Content-Type: application/json" \
  -d '{"input":{"pdf_url":"https://www.w3.org/WAI/ER/tests/xhtml/testfiles/resources/pdf/dummy.pdf"}}'
{
  "markdown": "# Document\n\n| Column A | Column B |\n|---|---|\n| Value 1 | Value 2 |\n\n**Summary text here**",
  "pages": 3,
  "request_id": "req_abc123"
}
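The same request can be made from Python with only the standard library. This is a minimal sketch: the endpoint, headers, and body shape are copied from the curl example above, while the helper names `build_convert_request` and `convert_pdf` and the timeout value are illustrative assumptions, not part of any official client.

```python
import json
import urllib.request

API_URL = "https://pdftomarkdown.dev/v1/convert"  # endpoint from the curl example


def build_convert_request(pdf_url: str, api_key: str = "demo_public_key") -> urllib.request.Request:
    """Build (but do not send) the POST request shown in the curl example."""
    body = json.dumps({"input": {"pdf_url": pdf_url}}).encode("utf-8")
    return urllib.request.Request(
        API_URL,
        data=body,
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


def convert_pdf(pdf_url: str, api_key: str = "demo_public_key", timeout: float = 60.0) -> dict:
    """Send the request and decode the JSON response body (hypothetical helper)."""
    request = build_convert_request(pdf_url, api_key)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))
```

Calling `convert_pdf(url)["markdown"]` should give the markdown string, assuming the response has the shape of the sample above.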

Legal documents have structure that matters: section numbers, defined terms, cross-references, schedules, and exhibits. A flat text extraction loses all of this. You end up with a wall of text where “Section 12.3(b)(i)” is indistinguishable from ordinary paragraph text.

pdfToMarkdown preserves the structure of legal documents. Section headings become markdown headings. Numbered lists and sub-clauses become nested lists. Tables in schedules and annexes are preserved as markdown tables.

What a parsed contract looks like

Input: a standard NDA in PDF format.

Output:

# Non-Disclosure Agreement

**Effective Date:** January 15, 2024
**Between:** Acme Corp ("Disclosing Party") and Widgets Inc ("Receiving Party")

## 1. Definitions

**1.1 "Confidential Information"** means any information disclosed by the Disclosing Party
to the Receiving Party, either directly or indirectly, in writing, orally or by inspection of
tangible objects, that is designated as "Confidential"...

## 2. Obligations of Receiving Party

The Receiving Party agrees to:

1. Hold Confidential Information in strict confidence
2. Not disclose Confidential Information to third parties without prior written consent
3. Use Confidential Information solely for the Purpose described in Section 3

## 3. Purpose

The Receiving Party may use the Confidential Information solely for the purpose of
evaluating a potential business relationship between the parties ("Purpose").

...

## Schedule A — Permitted Recipients

| Name          | Role              | Department       |
|---------------|-------------------|------------------|
| Jane Smith    | VP Engineering    | Product          |
| John Doe      | Legal Counsel     | Legal            |

Clause numbering, bolded defined terms, and schedule tables are all preserved.

Contract review and summarization

Feed the markdown output to an LLM to identify key dates, obligations, termination clauses, and defined terms. The preserved structure helps models locate specific provisions reliably.

from pdftomarkdown import convert

result = convert("nda.pdf")

# Section headings make it easy to split by clause
clauses = [s for s in result.markdown.split("\n## ") if s]
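
If you need the heading text and level alongside each clause, a line-by-line splitter is a little more robust than a plain string split. The sketch below works on any markdown string; `split_sections` is an illustrative helper, not part of the pdftomarkdown package, and it assumes headings use the `#` style shown in the sample output.

```python
import re

HEADING = re.compile(r"^(#{1,6})\s+(.*)$")


def split_sections(markdown: str) -> list[dict]:
    """Split markdown into sections, one per heading line."""
    sections = []
    current = {"level": 0, "title": "", "body": []}  # text before the first heading
    for line in markdown.splitlines():
        match = HEADING.match(line)
        if match:
            sections.append(current)
            current = {"level": len(match.group(1)), "title": match.group(2).strip(), "body": []}
        else:
            current["body"].append(line)
    sections.append(current)

    result = []
    for section in sections:
        body = "\n".join(section["body"]).strip()
        if section["level"] == 0 and not body:
            continue  # drop an empty preamble
        result.append({"level": section["level"], "title": section["title"], "body": body})
    return result


sample = """# Non-Disclosure Agreement

**Effective Date:** January 15, 2024

## 1. Definitions

Confidential Information means any information disclosed by the Disclosing Party.

## 2. Obligations of Receiving Party

1. Hold Confidential Information in strict confidence
"""

sections = split_sections(sample)
titles = [s["title"] for s in sections]
```

Each section keeps its heading level, so the document title (level 1) can be told apart from numbered clauses (level 2) before anything is sent to a model.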

E-discovery document review

Process large batches of filings, depositions, and exhibits. Because headings and page structure are preserved, full-text search becomes more precise — you can search within specific sections.
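
As a rough illustration of section-scoped search, the sketch below scans converted markdown and returns only matches that fall under a heading you name. `search_in_section` and the sample filing are hypothetical; a production review pipeline would more likely sit on a real search index.

```python
import re


def search_in_section(markdown: str, section_pattern: str, query: str) -> list[tuple[str, str]]:
    """Return (section title, line) pairs for lines containing `query`,
    limited to sections whose heading matches `section_pattern`."""
    heading = re.compile(r"^#{1,6}\s+(.*)$")
    wanted = re.compile(section_pattern, re.IGNORECASE)
    hits = []
    current_title = None  # heading of the section currently being scanned
    in_scope = False
    for line in markdown.splitlines():
        match = heading.match(line)
        if match:
            # Every heading starts a new section; nesting is ignored in this sketch
            current_title = match.group(1).strip()
            in_scope = bool(wanted.search(current_title))
            continue
        if in_scope and query.lower() in line.lower():
            hits.append((current_title, line.strip()))
    return hits


filing = """# Motion to Compel

## Background

The contract requires delivery within 30 days.

## Argument

Defendant failed to make delivery as required.

## Exhibit A

Delivery log attached.
"""

hits = search_in_section(filing, r"^Argument", "delivery")
```

Here the word "delivery" appears in three sections, but only the line under "Argument" is returned.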

Regulatory filing extraction

Extract data from SEC filings, planning applications, or patent documents. Tables in exhibits and schedules are converted to markdown tables that can be further parsed.
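
One way to parse those tables further is to turn each markdown table into a list of row dictionaries. This is a minimal sketch, assuming simple pipe tables like the Schedule A example above (no escaped pipes inside cells); `parse_markdown_tables` is an illustrative helper, not part of the API.

```python
def parse_markdown_tables(markdown: str) -> list[list[dict]]:
    """Parse pipe tables into lists of row dicts keyed by header cell."""

    def cells(line: str) -> list[str]:
        return [c.strip() for c in line.strip().strip("|").split("|")]

    def is_separator(line: str) -> bool:
        # A separator row looks like |---|:---:|---| (dashes, optional colons)
        return all(p and set(p) <= set("-: ") and "-" in p for p in cells(line))

    tables = []
    lines = markdown.splitlines()
    i = 0
    while i < len(lines) - 1:
        if lines[i].lstrip().startswith("|") and is_separator(lines[i + 1]):
            header = cells(lines[i])
            rows = []
            i += 2  # skip header and separator
            while i < len(lines) and lines[i].lstrip().startswith("|"):
                rows.append(dict(zip(header, cells(lines[i]))))
                i += 1
            tables.append(rows)
        else:
            i += 1
    return tables


schedule = """## Schedule A

| Name          | Role              | Department       |
|---------------|-------------------|------------------|
| Jane Smith    | VP Engineering    | Product          |
| John Doe      | Legal Counsel     | Legal            |
"""

tables = parse_markdown_tables(schedule)
```

Each row becomes a dictionary such as `{"Name": "John Doe", "Role": "Legal Counsel", "Department": "Legal"}`, which is easy to load into a spreadsheet or database.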

Convert a library of contracts to markdown, chunk by section, and index in a vector database for similarity search and contract comparison.
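
The chunk-and-index flow can be sketched without any external service. In the example below, the word-count `embed` function and the in-memory list are stand-ins for a real embedding model and vector database, and every name is an assumption made for illustration.

```python
import math
import re
from collections import Counter


def chunk_by_section(doc_id: str, markdown: str) -> list[dict]:
    """One chunk per '## ' section, tagged with the document id and heading."""
    chunks = []
    for part in re.split(r"(?m)^## ", markdown)[1:]:  # [0] is text before the first section
        title, _, body = part.partition("\n")
        chunks.append({"doc": doc_id, "section": title.strip(), "text": body.strip()})
    return chunks


def embed(text: str) -> Counter:
    """Toy word-count vector; a real pipeline would call an embedding model here."""
    return Counter(re.findall(r"[a-z]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def most_similar(query: str, chunks: list[dict], k: int = 1) -> list[dict]:
    """Rank chunks by cosine similarity to the query and keep the top k."""
    q = embed(query)
    return sorted(chunks, key=lambda c: cosine(q, embed(c["text"])), reverse=True)[:k]


nda = """# NDA

## 1. Confidentiality

The Receiving Party shall hold Confidential Information in strict confidence.

## 2. Term

This Agreement terminates two years after the Effective Date.
"""

msa = """# MSA

## 1. Payment

Invoices are payable within thirty days of receipt.
"""

index = chunk_by_section("nda", nda) + chunk_by_section("msa", msa)
best = most_similar("when does the agreement terminate", index)[0]
```

Because each chunk carries its document id and section heading, a search result points back to a specific clause rather than to a whole contract.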

  • Multi-page contracts with running headers and footers (stripped cleanly)
  • Exhibits and schedules with dense tables
  • Court filings with caption blocks and docket information
  • Patent documents with claims in numbered list format
  • Scanned legacy contracts — image-only PDFs
  • Mixed documents with both native text and scanned pages

Pricing

Both tiers are free. No credit card required.

Hacker

Free, no signup

  • Public demo key — copy & paste
  • Only page 1 is processed
  • 1 request/min per IP
  • Watermark in output

Developer

Free, GitHub login

  • Personal API key
  • 100 pages/month
  • Multi-page PDFs
  • No watermark

Start parsing contracts free

The free Hacker tier needs no account, but it converts only page 1 and adds a watermark. Move to the Developer tier (also free, with a GitHub login) to remove the watermark and convert full multi-page PDFs.