# pdfToMarkdown

> Convert any PDF to clean, structured markdown with one API call.

pdfToMarkdown is a REST API that converts PDF files to markdown text. It is built for developers who need to extract structured content from PDFs for use in LLM pipelines, RAG systems, document automation, and data extraction workflows.

The API uses a vision-language model (not a Tesseract/OCR wrapper) to understand document layout, preserve tables, handle multi-column text, extract math expressions, and return clean markdown.

Website: https://pdftomarkdown.dev
API base URL: https://pdftomarkdown.dev
Docs: https://pdftomarkdown.dev/docs

---

## What pdfToMarkdown handles

- Invoices and receipts
- Research papers (including math and equations)
- Legal contracts and agreements
- Financial reports and tables
- Technical manuals and documentation
- Scanned documents (image-based PDFs)
- Multi-column layouts
- Complex tables

---

## API endpoint

```
POST https://pdftomarkdown.dev/v1/convert
```

**Authentication:** Bearer token in the `Authorization` header.

**Request (JSON, URL input):**
```bash
curl -X POST https://pdftomarkdown.dev/v1/convert \
  -H "Authorization: Bearer YOUR_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{"input":{"pdf_url":"https://www.w3.org/WAI/ER/tests/xhtml/testfiles/resources/pdf/dummy.pdf"}}'
```

**Request (JSON, base64 input):**
```bash
curl -X POST https://pdftomarkdown.dev/v1/convert \
  -H "Authorization: Bearer YOUR_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{"input":{"pdf_base64":"JVBERi0xLjcKJ..."}}'
```
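
The two curl requests above can also be built from Python with only the standard library. The following is a minimal sketch, not the official SDK: the helper names `build_payload` and `build_request` are hypothetical, and only the endpoint, the headers, and the `pdf_url` / `pdf_base64` fields come from this document.

```python
import base64
import json
import urllib.request

# Endpoint as documented above.
API_URL = "https://pdftomarkdown.dev/v1/convert"


def build_payload(pdf_bytes=None, pdf_url=None):
    """Return the JSON body for a base64 input or a URL input (hypothetical helper)."""
    if (pdf_bytes is None) == (pdf_url is None):
        raise ValueError("pass exactly one of pdf_bytes or pdf_url")
    if pdf_url is not None:
        return {"input": {"pdf_url": pdf_url}}
    encoded = base64.b64encode(pdf_bytes).decode("ascii")
    return {"input": {"pdf_base64": encoded}}


def build_request(api_key, payload):
    """Build, but do not send, the POST request with the documented headers."""
    return urllib.request.Request(
        API_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


# To actually send it (network call, not executed here):
#   with urllib.request.urlopen(build_request(key, payload)) as resp:
#       body = resp.read().decode("utf-8")
```

Keeping payload construction separate from sending makes the request easy to inspect or unit-test without touching the network.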

**Response:**
```json
{
  "markdown": "# Invoice\n\nDate: 2024-01-15\nInvoice #: INV-2024-0042\n\n| Item | Qty | Price |\n|---|---|---|\n| API Pro Plan | 1 | $49.00 |\n\n**Total: $49.00**",
  "pages": 3,
  "request_id": "req_abc123"
}
```

Successful responses always use the same top-level shape: `markdown`, `pages`, and `request_id`.

If you set `input.include_raw=true`, the API adds one extra `raw` field for debugging. Otherwise model internals are omitted.
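
Because the success shape is fixed, a client can validate it up front. The sketch below assumes the response body is available as a JSON string; `ConvertResult` and `parse_success` are illustrative names, not part of the API or the SDK, and the type of `raw` is left open because this document does not specify it.

```python
import json
from dataclasses import dataclass
from typing import Any, Optional

REQUIRED_FIELDS = {"markdown", "pages", "request_id"}


@dataclass
class ConvertResult:
    markdown: str
    pages: int
    request_id: str
    raw: Optional[Any] = None  # only present when input.include_raw was true


def parse_success(body: str) -> ConvertResult:
    """Parse a successful response body and check the documented top-level fields."""
    data = json.loads(body)
    if not isinstance(data, dict):
        raise ValueError("expected a JSON object")
    missing = REQUIRED_FIELDS - set(data)
    if missing:
        raise ValueError(f"response is missing fields: {sorted(missing)}")
    return ConvertResult(
        markdown=data["markdown"],
        pages=int(data["pages"]),
        request_id=data["request_id"],
        raw=data.get("raw"),
    )
```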

---

## Pricing tiers

### Tier 1: Hacker (no signup required)

- **API key:** public demo key (displayed on the website)
- **Limit:** page 1 only; multi-page PDFs are truncated server-side
- **Rate limit:** 1 request per minute per IP
- **Watermark:** output markdown includes a `> Processed by pdfToMarkdown.dev` footer
- **Response header:** `X-PdfToMarkdown-Page-Cap: 1`
- **Use case:** instant testing, no account needed
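
Since Tier 1 truncates silently on the server, a client may want to check the page-cap header before trusting the output. This is a rough sketch under two assumptions: the response headers are available as a plain mapping, and the caller already knows how many pages the source PDF has. The function names are hypothetical.

```python
from typing import Mapping, Optional

# Header name as documented for the Hacker tier.
PAGE_CAP_HEADER = "X-PdfToMarkdown-Page-Cap"


def page_cap(headers: Mapping[str, str]) -> Optional[int]:
    """Return the page cap from the response headers, or None if no cap is present."""
    wanted = PAGE_CAP_HEADER.lower()
    for name, value in headers.items():
        # HTTP header names are case-insensitive.
        if name.lower() == wanted:
            return int(value)
    return None


def was_truncated(headers: Mapping[str, str], source_page_count: int) -> bool:
    """True if a cap is present and the source PDF has more pages than the cap."""
    cap = page_cap(headers)
    return cap is not None and source_page_count > cap
```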

### Tier 2: Developer (free, GitHub login)

- **API key:** personal key issued after GitHub OAuth login
- **Limit:** 100 pages per month
- **Multi-page PDFs:** supported
- **Watermark:** none
- **Quota reset:** 1st of each month
- **Use case:** Free Developer Preview for real projects; no credit card required

To get a Developer API key: visit https://pdftomarkdown.dev/auth/github

---

## Python SDK

```bash
pip install pdftomarkdown
```

```python
from pdftomarkdown import convert

# Tier 1: no API key needed
result = convert("document.pdf")
print(result.markdown)

# Tier 2: use env var PDFTOMARKDOWN_API_KEY, or pass directly
result = convert("document.pdf", api_key="YOUR_API_KEY")
print(result.markdown)
print(f"Processed {result.pages} pages")

# Convert from URL
result = convert(url="https://www.w3.org/WAI/ER/tests/xhtml/testfiles/resources/pdf/dummy.pdf")
print(result.markdown)
```

The SDK reads `PDFTOMARKDOWN_API_KEY` from the environment automatically.

---

## Error responses

| HTTP status | Meaning |
|---|---|
| 400 | Request payload rejected upstream |
| 401 | Missing or invalid API key |
| 422 | Invalid source URL, TLS failure, unreachable source PDF, or unreadable PDF |
| 429 | `rate_limited` (Tier 1) or `quota_exceeded` (Tier 2); includes a `Retry-After` header |
| 502 | Upstream worker unreachable, invalid, or failed while processing |
| 504 | Upstream timeout |

All API errors use the same JSON shape: `error`, `message`, and `request_id`.

Any `429` also includes `retry_after_seconds`, and quota exhaustion also includes `reset_at`.

Example error:
```json
{
  "error": "rate_limited",
  "message": "The public Hacker tier allows 1 request per IP every 60 seconds.",
  "request_id": "req_rate123",
  "retry_after_seconds": 60
}
```
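
One way a client might turn these error responses into a retry decision is sketched below. Treating 429, 502, and 504 as retryable and falling back to a 5-second delay are assumptions made for illustration, not documented API behavior; `retry_delay` is a hypothetical helper.

```python
import json
from typing import Optional

# Assumption: these statuses are worth retrying; 400, 401, and 422 are not.
RETRYABLE_STATUSES = {429, 502, 504}


def retry_delay(status: int, body: str, default: float = 5.0) -> Optional[float]:
    """Return the seconds to wait before retrying, or None if a retry will not help."""
    if status not in RETRYABLE_STATUSES:
        return None
    try:
        data = json.loads(body)
    except ValueError:
        data = {}  # body was empty or not JSON
    if status == 429 and isinstance(data, dict) and "retry_after_seconds" in data:
        # Documented field on every 429 response.
        return float(data["retry_after_seconds"])
    return default
```

For a `quota_exceeded` response the returned delay can be long, so a caller may prefer to surface `reset_at` to the user instead of sleeping.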

---

## Comparisons

- **vs Mathpix:** pdfToMarkdown is simpler, has no per-page pricing on the free tier, and is not LaTeX-focused; it returns clean markdown for general-purpose use.
- **vs LlamaParse:** pdfToMarkdown is framework-agnostic (plain REST plus a Python SDK), with no LlamaIndex lock-in.
- **vs Unstructured:** pdfToMarkdown is a zero-setup cloud API; no local Docker install is required.

---

## Links

- Main page: https://pdftomarkdown.dev
- API docs: https://pdftomarkdown.dev/docs
- Blog: https://pdftomarkdown.dev/blog
- Get API key (GitHub login): https://pdftomarkdown.dev/auth/github
- OCR API overview: https://pdftomarkdown.dev/ocr-api
- PDF parsing: https://pdftomarkdown.dev/pdf-parsing
- Invoice OCR: https://pdftomarkdown.dev/invoice-ocr
- Legal document OCR: https://pdftomarkdown.dev/legal-doc-ocr
- Research paper OCR: https://pdftomarkdown.dev/research-paper-ocr