Research Paper to Markdown API
One endpoint. POST a PDF, get clean markdown back — tables, headings, and lists preserved exactly as they appear on the page.
$ curl -X POST https://pdftomarkdown.dev/v1/convert \
-H "Authorization: Bearer demo_public_key" \
-H "Content-Type: application/json" \
-d '{"input":{"pdf_url":"https://www.w3.org/WAI/ER/tests/xhtml/testfiles/resources/pdf/dummy.pdf"}}'
{
"markdown": "# Document\n\n| Column A | Column B |\n|---|---|\n| Value 1 | Value 2 |\n\n**Summary text here**",
"pages": 3,
"request_id": "req_abc123"
}
Academic PDFs are the hardest documents to parse
Research papers are typeset in two-column layouts with inline equations, figure captions that span columns, and dense citation lists. Most text extraction libraries handle this poorly: they merge columns into a mangled stream, drop equations entirely, or lose table structure.
pdfToMarkdown uses a vision-language model trained on document understanding. It reads the page as a layout — distinguishing columns, equations, and captions — and produces markdown that faithfully represents the document’s structure.
What a parsed paper looks like
Input: a PDF of a machine learning paper.
Output:
# Attention Is All You Need
**Authors:** Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones,
Aidan N. Gomez, Łukasz Kaiser, Illia Polosukhin
**Abstract:** The dominant sequence transduction models are based on complex recurrent or
convolutional neural networks that include an encoder and a decoder. The best performing
models also connect the encoder and decoder through an attention mechanism...
## 3. Model Architecture
### 3.1 Encoder and Decoder Stacks
The encoder is composed of a stack of N = 6 identical layers. Each layer has two
sub-layers. The first is a multi-head self-attention mechanism, and the second is a
simple, position-wise fully connected feed-forward network.
**Equation (1):**
$$\text{Attention}(Q, K, V) = \text{softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right)V$$
### 3.2 Attention
| Model | N | d_model | d_ff | h | d_k | d_v | P_drop | ε_ls | Train steps |
|-------------|-----|---------|------|----|-----|-----|--------|------|-------------|
| base | 6 | 512 | 2048 | 8 | 64 | 64 | 0.1 | 0.1 | 100K |
| big | 6 | 1024 | 4096 | 16 | 64 | 64 | 0.3 | 0.1 | 300K |
Multi-column layout collapsed correctly. Equations preserved in LaTeX syntax. Tables intact.
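Because tables come back as ordinary pipe tables, downstream code can turn them into records without going back to the PDF. The following is a minimal sketch, assuming the table uses the pipe layout shown in the sample output; `parse_pipe_table` is a hypothetical helper written for illustration, not part of the API.

```python
def parse_pipe_table(markdown: str) -> list[dict[str, str]]:
    """Parse the first pipe table in a markdown string into a list of row dicts.

    Hypothetical helper. Assumes every row starts and ends with "|" and that
    the second row is the |---|---| separator. Escaped pipes inside cells
    are not handled.
    """
    table_lines = []
    for line in markdown.splitlines():
        stripped = line.strip()
        if stripped.startswith("|") and stripped.endswith("|"):
            table_lines.append(stripped)
        elif table_lines:
            break  # the first table has ended

    if len(table_lines) < 2:
        return []

    def cells(row: str) -> list[str]:
        return [cell.strip() for cell in row.strip("|").split("|")]

    header = cells(table_lines[0])
    # table_lines[1] is the separator row, so data starts at index 2.
    return [dict(zip(header, cells(row))) for row in table_lines[2:]]
```

All values come back as strings; converting `512` or `0.1` to numbers is left to the caller, since the column types depend on the paper.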
Use cases in research tooling
Building a paper Q&A system
Parse papers to markdown, chunk by section, and index in a vector database. Section headings make chunking natural and improve retrieval quality.
from pdftomarkdown import convert

result = convert("attention_is_all_you_need.pdf")
# Split cleanly on section boundaries
sections = result.markdown.split("\n## ")
for section in sections:
    heading, *body = section.split("\n", 1)
    text = body[0] if body else ""
    # index(heading, text) into your vector db
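The split above keeps the title block as its own section and leaves `###` subsections inside their parent. If you want the heading level recorded and the `#` markers stripped, a slightly more defensive version might look like the sketch below. It works on any markdown string and uses only the standard library; `chunk_by_heading` and its `max_level` parameter are hypothetical names, not part of the SDK.

```python
import re

# A markdown ATX heading: 1-6 "#" characters, whitespace, then the title.
HEADING = re.compile(r"^(#{1,6})\s+(.*)$")


def chunk_by_heading(markdown: str, max_level: int = 2) -> list[dict]:
    """Start a new chunk at every heading of level <= max_level.

    Deeper headings (e.g. "###" when max_level=2) stay inside the parent
    chunk. Limitation: "#" lines inside fenced code blocks would be
    mistaken for headings.
    """
    chunks = []
    current = {"heading": "", "level": 0, "lines": []}

    def flush(chunk: dict) -> None:
        # Drop the empty preamble that exists before the first heading.
        if chunk["heading"] or any(ln.strip() for ln in chunk["lines"]):
            chunks.append({
                "heading": chunk["heading"],
                "level": chunk["level"],
                "text": "\n".join(chunk["lines"]).strip(),
            })

    for line in markdown.splitlines():
        match = HEADING.match(line)
        if match and len(match.group(1)) <= max_level:
            flush(current)
            current = {
                "heading": match.group(2).strip(),
                "level": len(match.group(1)),
                "lines": [],
            }
        else:
            current["lines"].append(line)
    flush(current)
    return chunks
```

Each chunk carries its heading as metadata, which you can store alongside the embedding so retrieved passages can be traced back to a section.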
Literature review automation
Feed a directory of PDFs, extract abstracts and conclusions, and use an LLM to synthesize the content across papers.
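As a rough sketch of the extraction step: the code below assumes the abstract appears as a `**Abstract:**` paragraph, as in the sample output earlier on this page, and that the conclusion sits under a `##` heading containing the word "Conclusion". Both are assumptions about the output format, and `extract_abstract` and `extract_section` are hypothetical helpers.

```python
import re


def extract_abstract(markdown: str) -> str:
    """Return the text after "**Abstract:**" up to the next blank line or heading."""
    match = re.search(
        r"\*\*Abstract:\*\*\s*(.*?)(?=\n\s*\n|\n#|\Z)",
        markdown,
        flags=re.DOTALL,
    )
    # Collapse hard line wraps into single spaces.
    return " ".join(match.group(1).split()) if match else ""


def extract_section(markdown: str, keyword: str) -> str:
    """Return the body of the first "## " section whose heading contains keyword.

    The body runs until the next "## " heading, so "###" subsections are kept.
    """
    pattern = rf"^##\s+[^\n]*{re.escape(keyword)}[^\n]*\n(.*?)(?=^##\s|\Z)"
    match = re.search(
        pattern,
        markdown,
        flags=re.DOTALL | re.MULTILINE | re.IGNORECASE,
    )
    return match.group(1).strip() if match else ""
```

Both functions return an empty string when nothing matches, so a batch job can log the miss and move on rather than crash on a paper with unusual formatting.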
Dataset creation
Convert a corpus of scientific papers to markdown for fine-tuning or evaluation datasets. The structured output is cleaner than raw text extraction.
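One common way to package such a corpus is JSON Lines, one paper per line. The sketch below assumes you already have `(paper_id, markdown)` pairs from earlier conversions; `write_jsonl` and the `id` / `text` field names are illustrative choices, not something the API prescribes.

```python
import json


def write_jsonl(records, out_path: str) -> int:
    """Write (paper_id, markdown) pairs to a JSON Lines file.

    Returns the number of records written. Papers whose markdown is empty
    after stripping whitespace are skipped.
    """
    count = 0
    with open(out_path, "w", encoding="utf-8") as fh:
        for paper_id, markdown in records:
            text = markdown.strip()
            if not text:
                continue  # skip empty conversions
            # ensure_ascii=False keeps symbols such as ε readable in the file.
            fh.write(json.dumps({"id": paper_id, "text": text}, ensure_ascii=False) + "\n")
            count += 1
    return count
```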
Citation extraction
The markdown output preserves numbered reference lists at the end of papers in a consistent format that’s easy to parse downstream.
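The exact reference format is not documented on this page, so treat the following as a sketch under stated assumptions: the list sits under a heading named "References" or "Bibliography", and each entry starts with `[n]` or `n.`. `parse_references` is a hypothetical helper.

```python
import re

# Assumed entry formats: "[12] Author. Title..." or "12. Author. Title..."
REF_LINE = re.compile(r"^\s*(?:\[(\d+)\]|(\d+)\.)\s+(.+)$")
REF_HEADING = re.compile(
    r"^#{1,6}\s+(?:[\d.]+\s+)?(References|Bibliography)\s*$", re.IGNORECASE
)


def parse_references(markdown: str) -> dict[int, str]:
    """Map reference number -> reference text for the reference section.

    Limitation: a wrapped continuation line that itself begins with
    "<digits>." would be misread as a new entry.
    """
    refs: dict[int, str] = {}
    current = None
    in_refs = False
    for line in markdown.splitlines():
        if REF_HEADING.match(line):
            in_refs = True
            continue
        if not in_refs:
            continue
        if line.startswith("#"):
            break  # the next section (e.g. an appendix) ends the list
        match = REF_LINE.match(line)
        if match:
            current = int(match.group(1) or match.group(2))
            refs[current] = match.group(3).strip()
        elif current is not None and line.strip():
            refs[current] += " " + line.strip()  # wrapped continuation line
    return refs
```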
Handles the full range of academic documents
- Two-column and single-column layouts
- Inline LaTeX equations (preserved in $…$ format)
- Figure captions and table captions
- Author affiliation blocks and abstract formatting
- Appendices and supplementary materials
- Conference and journal paper formats (NeurIPS, ICML, ACL, Nature, arXiv)
- Scanned historical papers and low-resolution PDFs
- Non-English papers (French, German, Chinese, Japanese)
Process an arXiv paper in one command
curl -s -X POST https://pdftomarkdown.dev/v1/convert \
-H "Authorization: Bearer $PDFTOMARKDOWN_API_KEY" \
-H "Content-Type: application/json" \
-d '{"input":{"pdf_url":"https://arxiv.org/pdf/1706.03762"}}' | jq -r '.markdown'
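The same request can be made from Python with only the standard library. This is a sketch that mirrors the curl command above: the endpoint, headers, and JSON body are taken from it, while the function names and the 120-second timeout are assumptions.

```python
import json
import urllib.request

API_URL = "https://pdftomarkdown.dev/v1/convert"  # endpoint from the curl example


def build_request(pdf_url: str, api_key: str) -> urllib.request.Request:
    """Build the POST request without sending it."""
    body = json.dumps({"input": {"pdf_url": pdf_url}}).encode("utf-8")
    return urllib.request.Request(
        API_URL,
        data=body,
        method="POST",
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
    )


def convert_url(pdf_url: str, api_key: str, timeout: float = 120.0) -> str:
    """Send the request and return the "markdown" field of the JSON response.

    The timeout is an assumed value; tune it for long papers.
    """
    request = build_request(pdf_url, api_key)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        payload = json.load(response)
    return payload["markdown"]


# Usage (needs a real key in the PDFTOMARKDOWN_API_KEY environment variable):
#   import os
#   print(convert_url("https://arxiv.org/pdf/1706.03762",
#                     os.environ["PDFTOMARKDOWN_API_KEY"]))
```

`urlopen` raises `urllib.error.HTTPError` on non-2xx responses, so wrap the call in a `try` block if you need to handle rate limits or invalid keys gracefully.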
Related pages
- PDF Parsing API — general document parsing overview
- OCR API for Developers — technical API reference
- Blog: Why PDF to Markdown — how vision models outperform traditional OCR
- API documentation — complete endpoint reference
Pricing
Both tiers are free. No credit card required.
Hacker
Free, no signup
- Public demo key — copy & paste
- Only page 1 is processed
- 1 request/min per IP
- Watermark in output
Developer
Free, GitHub login
- Personal API key
- 100 pages/month
- Multi-page PDFs
- No watermark
Parse academic PDFs free
The Hacker tier needs no account. It converts page 1 only and adds a watermark. Sign in with GitHub for the Developer tier to remove the watermark and unlock full multi-page PDFs.