
Research Paper to Markdown API

One endpoint. POST a PDF URL, get clean markdown back, with tables, headings, and lists preserved as they are laid out on the page.

$ curl -X POST https://pdftomarkdown.dev/v1/convert \
  -H "Authorization: Bearer demo_public_key" \
  -H "Content-Type: application/json" \
  -d '{"input":{"pdf_url":"https://www.w3.org/WAI/ER/tests/xhtml/testfiles/resources/pdf/dummy.pdf"}}'
{
  "markdown": "# Document\n\n| Column A | Column B |\n|---|---|\n| Value 1 | Value 2 |\n\n**Summary text here**",
  "pages": 3,
  "request_id": "req_abc123"
}
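The same call can be sketched in Python with only the standard library. The endpoint, headers, and payload shape are taken from the curl example above; the function names and the 60-second timeout are illustrative assumptions, not part of any official client.

```python
import json
import urllib.request

API_URL = "https://pdftomarkdown.dev/v1/convert"  # endpoint from the curl example


def build_convert_request(pdf_url: str, api_key: str) -> urllib.request.Request:
    """Build (but do not send) the POST request shown in the curl example."""
    payload = json.dumps({"input": {"pdf_url": pdf_url}}).encode("utf-8")
    return urllib.request.Request(
        API_URL,
        data=payload,
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


def convert_url(pdf_url: str, api_key: str, timeout: float = 60.0) -> dict:
    """Send the request and return the decoded JSON response.

    The timeout value is an assumption; tune it for long documents.
    """
    request = build_convert_request(pdf_url, api_key)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))
```

Calling `convert_url(pdf_url, "demo_public_key")["markdown"]` should then give the markdown string from the response body shown above.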

Academic PDFs are the hardest documents to parse

Research papers are often typeset in two-column layouts with inline equations, figure captions that span columns, and dense citation lists. Plain text extraction libraries tend to handle this poorly: they merge columns into a mangled stream, drop equations entirely, or lose table structure.

pdfToMarkdown uses a vision-language model trained on document understanding. It reads the page as a layout — distinguishing columns, equations, and captions — and produces markdown that faithfully represents the document’s structure.

What a parsed paper looks like

Input: a PDF of a machine learning paper.

Output:

# Attention Is All You Need

**Authors:** Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones,
Aidan N. Gomez, Łukasz Kaiser, Illia Polosukhin

**Abstract:** The dominant sequence transduction models are based on complex recurrent or
convolutional neural networks that include an encoder and a decoder. The best performing
models also connect the encoder and decoder through an attention mechanism...

## 3. Model Architecture

### 3.1 Encoder and Decoder Stacks

The encoder is composed of a stack of N = 6 identical layers. Each layer has two
sub-layers. The first is a multi-head self-attention mechanism, and the second is a
simple, position-wise fully connected feed-forward network.

**Equation (1):**

$$\text{Attention}(Q, K, V) = \text{softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right)V$$

### 3.2 Attention

| Model       | N   | d_model | d_ff | h  | d_k | d_v | P_drop | ε_ls | Train steps |
|-------------|-----|---------|------|----|-----|-----|--------|------|-------------|
| base        | 6   | 512     | 2048 | 8  | 64  | 64  | 0.1    | 0.1  | 100K        |
| big         | 6   | 1024    | 4096 | 16 | 64  | 64  | 0.3    | 0.1  | 300K        |

Multi-column layout collapsed correctly. Equations preserved in LaTeX syntax. Tables intact.
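Because display equations come back wrapped in `$$…$$`, they can be pulled out of the markdown with a small regular expression. This is a rough sketch that assumes every display equation uses that delimiter, as in the sample above; the function name is hypothetical.

```python
import re

# Non-greedy match between a pair of $$ delimiters, allowing line breaks inside.
DISPLAY_MATH = re.compile(r"\$\$(.+?)\$\$", flags=re.DOTALL)


def extract_display_equations(markdown: str) -> list[str]:
    """Return the LaTeX source of every $$...$$ block, in document order."""
    return [body.strip() for body in DISPLAY_MATH.findall(markdown)]
```

Inline `$…$` math is left alone, since a single dollar sign never satisfies the `$$` delimiter.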

Use cases in research tooling

Building a paper Q&A system

Parse papers to markdown, chunk by section, and index in a vector database. Section headings make chunking natural and improve retrieval quality.

from pdftomarkdown import convert

result = convert("attention_is_all_you_need.pdf")

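# Split on level-2 headings ("## "); "### " subsections stay inside their parent section.
# The first chunk is the preamble (title, authors, abstract), not a numbered section.
sections = result.markdown.split("\n## ")
for section in sections:
    # First line of each chunk is the heading text; the rest is the section body.
    heading, *body = section.split("\n", 1)
    text = body[0].strip() if body else ""
    # index(heading, text) into your vector db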
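If you only have the markdown string, a slightly more defensive splitter can be sketched with a regular expression. It assumes, as in the sample output, that top-level sections use `## ` headings; the function name is illustrative.

```python
import re


def split_sections(markdown: str) -> list[tuple[str, str]]:
    """Return (heading, body) pairs for each level-2 section ('## ...').

    Text before the first '## ' heading is returned with an empty heading.
    """
    # '^## +' only matches at the start of a line, so '### ' subsections
    # are not split out and remain part of their parent section's body.
    parts = re.split(r"^## +(.+)$", markdown, flags=re.MULTILINE)
    # parts = [preamble, heading1, body1, heading2, body2, ...]
    sections = []
    preamble = parts[0].strip()
    if preamble:
        sections.append(("", preamble))
    for heading, body in zip(parts[1::2], parts[2::2]):
        sections.append((heading.strip(), body.strip()))
    return sections
```

One caveat: a line starting with `## ` inside a fenced code block would also be treated as a heading, so papers with embedded markdown may need extra handling.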

Literature review automation

Feed a directory of PDFs, extract abstracts and conclusions, and use an LLM to synthesize the content across papers.
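A minimal sketch of the extraction step, assuming the abstract appears as a `**Abstract:**` paragraph (as in the sample output) and the conclusion is a `## ` section whose heading contains "Conclusion". Both are assumptions about the output format, and the function name is hypothetical; the LLM synthesis step is left out.

```python
import re

ABSTRACT = re.compile(r"\*\*Abstract:\*\*\s*(.+?)(?=\n\s*\n|\Z)", flags=re.DOTALL)
CONCLUSION = re.compile(
    r"^## +[\d. ]*Conclusions?\b[^\n]*\n(.*?)(?=^## |\Z)",
    flags=re.DOTALL | re.MULTILINE | re.IGNORECASE,
)


def extract_abstract_and_conclusion(markdown: str) -> dict[str, str]:
    """Return the abstract and conclusion text, or empty strings if not found."""
    abstract = ""
    match = ABSTRACT.search(markdown)
    if match:
        # Collapse hard line breaks inside the abstract paragraph.
        abstract = " ".join(match.group(1).split())
    conclusion = ""
    match = CONCLUSION.search(markdown)
    if match:
        # Everything up to the next level-2 heading, subsections included.
        conclusion = match.group(1).strip()
    return {"abstract": abstract, "conclusion": conclusion}
```

Returning empty strings instead of raising keeps a batch run going when a paper has no recognizable conclusion section.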

Dataset creation

Convert a corpus of scientific papers to markdown for fine-tuning or evaluation datasets. The structured output is cleaner than raw text extraction.
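One common way to store such a corpus is JSON Lines, one record per paper. The sketch below assumes you already hold a mapping from source filename to converted markdown; the record fields (`source`, `text`) and function names are illustrative choices, not a required schema.

```python
import json
from pathlib import Path


def to_jsonl_lines(documents: dict[str, str]) -> list[str]:
    """Serialize {source: markdown} into one JSON string per non-empty document."""
    lines = []
    for source, markdown in sorted(documents.items()):
        text = markdown.strip()
        if not text:
            continue  # skip papers that produced no output
        # ensure_ascii=False keeps names like "Łukasz" readable in the file.
        lines.append(json.dumps({"source": source, "text": text}, ensure_ascii=False))
    return lines


def write_jsonl(documents: dict[str, str], out_path: str) -> None:
    """Write the records to out_path, one JSON object per line."""
    Path(out_path).write_text("\n".join(to_jsonl_lines(documents)) + "\n", encoding="utf-8")
```

Sorting by source name keeps the output file stable across runs, which makes dataset diffs easier to review.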

Citation extraction

The markdown output preserves numbered reference lists at the end of papers in a consistent format that’s easy to parse downstream.
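As a rough sketch of downstream parsing, the function below assumes each entry starts on its own line with either `[n]` or `n.` and that wrapped lines belong to the previous entry. The exact reference format in the output is not specified here, so treat the pattern as an assumption to adjust.

```python
import re

# An entry starts with "[12] ..." or "12. ..." at the beginning of a line.
REF_START = re.compile(r"^\s*(?:\[(\d+)\]|(\d+)\.)\s+(.*)$")


def parse_references(block: str) -> dict[int, str]:
    """Map reference number to its text, joining wrapped continuation lines."""
    refs: dict[int, str] = {}
    current = None
    for line in block.splitlines():
        match = REF_START.match(line)
        if match:
            current = int(match.group(1) or match.group(2))
            refs[current] = match.group(3).strip()
        elif current is not None and line.strip():
            # Continuation of the previous entry.
            refs[current] += " " + line.strip()
    return refs
```

A continuation line that itself begins with a number and a period (for example a year) would be misread as a new entry, so real reference lists may need a stricter pattern.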

Handles the full range of academic documents

  • Two-column and single-column layouts
  • Inline LaTeX equations (preserved in $…$ format)
  • Figure captions and table captions
  • Author affiliation blocks and abstract formatting
  • Appendices and supplementary materials
  • Conference and journal paper formats (NeurIPS, ICML, ACL, Nature, arXiv)
  • Scanned historical papers and low-resolution PDFs
  • Non-English papers (French, German, Chinese, Japanese)

Process an arXiv paper in one command

curl -s -X POST https://pdftomarkdown.dev/v1/convert \
  -H "Authorization: Bearer $PDFTOMARKDOWN_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{"input":{"pdf_url":"https://arxiv.org/pdf/1706.03762"}}' | jq -r '.markdown'
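To script this over many papers, the request body can be generated from an arXiv identifier. This sketch assumes new-style IDs (such as `1706.03762`, optionally with a version suffix) and reuses the `https://arxiv.org/pdf/<id>` URL form from the command above; the function name is hypothetical and old-style IDs are not handled.

```python
import json
import re

# New-style arXiv identifiers: YYMM.NNNN or YYMM.NNNNN, optional version (v2).
ARXIV_ID = re.compile(r"^\d{4}\.\d{4,5}(v\d+)?$")


def arxiv_payload(arxiv_id: str) -> str:
    """Return the JSON request body for converting the given arXiv paper."""
    if not ARXIV_ID.match(arxiv_id):
        raise ValueError(f"not a new-style arXiv id: {arxiv_id!r}")
    pdf_url = f"https://arxiv.org/pdf/{arxiv_id}"
    return json.dumps({"input": {"pdf_url": pdf_url}})
```

The returned string can be passed directly as the `-d` argument of the curl command.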

Pricing

Both tiers are free. No credit card required.

Hacker

Free, no signup

  • Public demo key — copy & paste
  • Only page 1 is processed
  • 1 request/min per IP
  • Watermark in output

Developer

Free, GitHub login

  • Personal API key
  • 100 pages/month
  • Multi-page PDFs
  • No watermark

Parse academic PDFs free

The Hacker tier needs no account. It converts page 1 only and adds a watermark. Move to the Developer tier (also free, with a GitHub login) to remove the watermark and unlock full multi-page PDFs.