pdfToMarkdown vs Unstructured: The Right Tool for Your Pipeline
Unstructured.io has become a popular choice for document parsing in LLM pipelines — it’s open source, handles many file formats, and has good community support. But “open source” and “best choice” aren’t the same thing. This post breaks down where Unstructured fits well and where a simpler API like pdfToMarkdown is the better option.
What is Unstructured?
Unstructured is an open-source Python library (with an optional hosted API) that partitions documents into structured elements. Given a PDF, it returns a list of typed objects — Title, NarrativeText, Table, ListItem, etc. — rather than a single markdown string.
from unstructured.partition.pdf import partition_pdf
elements = partition_pdf("document.pdf")
for el in elements:
    print(type(el).__name__, el.text[:80])
# NarrativeText This agreement is entered into between...
# Table | Column A | Column B |
# Title Section 2. Definitions
It supports PDF, DOCX, PPTX, HTML, images, and more. You can run it locally with no API calls, or use Unstructured’s cloud API.
What is pdfToMarkdown?
pdfToMarkdown is an HTTP API that takes a PDF and returns a clean markdown string. There’s no SDK required, no local dependencies to install, no compute to provision. Send a file, get back text.
curl -X POST https://pdftomarkdown.dev/v1/convert \
-H "Authorization: Bearer demo_public_key" \
-H "Content-Type: application/json" \
-d '{"input":{"pdf_url":"https://www.w3.org/WAI/ER/tests/xhtml/testfiles/resources/pdf/dummy.pdf"}}'
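For Python callers, the same request can be sketched with only the standard library. The endpoint, header, and payload shape are copied from the curl command above; the helper names `build_convert_request` and `convert_url` and the `timeout` default are illustrative assumptions, not part of any official client.

```python
import json
import urllib.request

API_URL = "https://pdftomarkdown.dev/v1/convert"  # endpoint from the curl example


def build_convert_request(pdf_url: str, api_key: str = "demo_public_key") -> urllib.request.Request:
    """Build the same POST request as the curl command, without sending it."""
    payload = json.dumps({"input": {"pdf_url": pdf_url}}).encode("utf-8")
    return urllib.request.Request(
        API_URL,
        data=payload,
        method="POST",
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
    )


def convert_url(pdf_url: str, api_key: str = "demo_public_key", timeout: float = 60.0) -> dict:
    """Send the request and return the decoded JSON body (hypothetical helper)."""
    request = build_convert_request(pdf_url, api_key)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))
```

Calling `convert_url(...)` performs a real network request, so it is kept separate from the builder, which can be inspected or unit-tested offline.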
The fundamental trade-off: control vs. simplicity
The difference between these tools comes down to one question: do you need to control document structure at the element level, or do you just need clean text?
Unstructured gives you typed elements with metadata — useful if your pipeline needs to handle different content types differently (e.g., skip tables, only process narrative text, handle captions separately).
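As a rough illustration of that element-level handling, here is a minimal sketch that keeps narrative content and drops tables. `FakeElement` is a hypothetical stand-in so the snippet runs without Unstructured installed; it assumes only that each element exposes a `category` string and a `text` string.

```python
from dataclasses import dataclass


@dataclass
class FakeElement:
    """Hypothetical stand-in for a parsed element: a category name plus its text."""
    category: str
    text: str


def select_text(elements, keep=("Title", "NarrativeText", "ListItem")):
    """Join the text of elements whose category is in `keep`, dropping the rest."""
    return "\n\n".join(el.text for el in elements if el.category in keep)


elements = [
    FakeElement("Title", "Section 2. Definitions"),
    FakeElement("NarrativeText", "This agreement is entered into between..."),
    FakeElement("Table", "| Column A | Column B |"),
]
print(select_text(elements))  # the Table element is skipped
```

The same filter is harder to apply to a single markdown string, which is the control you give up for simplicity.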
pdfToMarkdown gives you a single markdown string — useful if you want to feed the whole document into an LLM, embed it in a vector store, or render it in a UI.
For most LLM use cases, the markdown string is what you actually want.
Installation and setup
Unstructured:
# Basic install — limited format support
pip install unstructured
# Full install with all format dependencies (heavy)
pip install "unstructured[all-docs]"
# Additional system dependencies required:
# - poppler-utils (PDF rendering)
# - tesseract-ocr (OCR)
# - libmagic (file type detection)
# On macOS:
brew install poppler tesseract libmagic
For scanned PDFs with OCR, you also need:
pip install "unstructured[paddlepaddle]"
# or
pip install "unstructured[tesseract]"
Getting Unstructured running correctly on a new machine — especially in Docker or CI — often takes 30-60 minutes. The full Docker image is over 8GB.
pdfToMarkdown:
pip install pdftomarkdown
Done. No system dependencies, no model downloads, no GPU required.
Local vs. cloud
Unstructured’s main advantage is that you can run it entirely locally. For teams with strict data privacy requirements — healthcare, legal, finance — this matters. Your documents never leave your infrastructure.
pdfToMarkdown processes documents on our servers. If your documents contain sensitive data, you should evaluate whether that fits your security requirements.
If you need local processing and have the engineering bandwidth to set it up, Unstructured’s local mode is the right choice.
Output format comparison
Unstructured returns a list of element objects:
elements = partition_pdf("invoice.pdf")
# You get: [Title, NarrativeText, Table, ListItem, ...]
# To get text, you iterate and serialize yourself:
text = "\n\n".join(str(el) for el in elements)
Tables are particularly tricky — Unstructured returns Table elements, but reconstructing proper markdown tables from them requires extra work.
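To give a sense of that extra work, here is a sketch that turns a simple HTML table into a markdown pipe table using only the standard library. It assumes you have the table as an HTML string (Unstructured can expose one on a Table element's `metadata.text_as_html` when table structure inference is enabled, and it may be `None` otherwise). The helper name `html_table_to_markdown` is made up for this example, and merged cells or nested tables are not handled.

```python
from html.parser import HTMLParser


class _TableRows(HTMLParser):
    """Collect the cell text of each <tr> in an HTML table."""

    def __init__(self):
        super().__init__()
        self.rows = []     # finished rows, each a list of cell strings
        self._row = None   # row currently being read
        self._cell = None  # text fragments of the current cell

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            # Collapse internal whitespace so each cell stays on one line.
            self._row.append(" ".join("".join(self._cell).split()))
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row, self._cell = None, None


def html_table_to_markdown(html: str) -> str:
    """Render a simple HTML table as a markdown table; the first row is the header."""
    parser = _TableRows()
    parser.feed(html)
    rows = [r for r in parser.rows if r]
    if not rows:
        return ""
    width = max(len(r) for r in rows)
    rows = [r + [""] * (width - len(r)) for r in rows]  # pad ragged rows

    def line(cells):
        return "| " + " | ".join(c.replace("|", "\\|") for c in cells) + " |"

    return "\n".join([line(rows[0]), "|" + "---|" * width] + [line(r) for r in rows[1:]])
```

You would then call this for each Table element and splice the result back into the joined text at the right position.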
pdfToMarkdown returns markdown directly:
{
"markdown": "# Invoice #12345\n\n**Date:** 2024-01-15\n\n## Line Items\n\n| Description | Qty | Price |\n|---|---|---|\n| Widget A | 5 | $50.00 |\n\n**Total: $250.00**",
"pages": 1,
"request_id": "req_abc123"
}
The markdown is immediately usable in LLM prompts, vector embeddings, or UI rendering.
Chunking strategy
One area where Unstructured’s element model shines is semantic chunking. Because it knows the type of each element, you can implement smarter chunking strategies:
from unstructured.chunking.title import chunk_by_title
chunks = chunk_by_title(elements)
# Chunks respect section boundaries; text is only split mid-element when a single element exceeds the size limit
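The idea behind title-based chunking can be pictured with a simplified sketch: start a new chunk at every Title, and also when adding the next element would push the chunk past a size limit. This is an illustration of the concept, not Unstructured's actual implementation; `FakeElement` and `group_by_title` are hypothetical names.

```python
from dataclasses import dataclass


@dataclass
class FakeElement:
    """Hypothetical stand-in for a parsed element: a category name plus its text."""
    category: str
    text: str


def group_by_title(elements, max_characters=1500):
    """Group element texts into chunks that break at Titles and at the size limit."""
    chunks, current, size = [], [], 0
    for el in elements:
        starts_section = el.category == "Title"
        too_big = size + len(el.text) > max_characters
        # Close the running chunk before a new section or an overflow.
        if current and (starts_section or too_big):
            chunks.append("\n\n".join(current))
            current, size = [], 0
        current.append(el.text)
        size += len(el.text)
    if current:
        chunks.append("\n\n".join(current))
    return chunks
```

Because breaks only happen between elements, a paragraph is never cut in half in this sketch; a real implementation also has to deal with single elements larger than the limit.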
With pdfToMarkdown, you chunk the markdown yourself. But markdown’s header hierarchy makes this straightforward:
from langchain.text_splitter import MarkdownHeaderTextSplitter
from pdftomarkdown import convert
result = convert("document.pdf")
splitter = MarkdownHeaderTextSplitter(
    headers_to_split_on=[("#", "H1"), ("##", "H2"), ("###", "H3")]
)
chunks = splitter.split_text(result.markdown)
Heading-based chunking gives results comparable to Unstructured’s title-based chunking for most documents with clear headings, and it is simpler to implement.
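If you would rather not pull in LangChain, heading-based splitting is small enough to sketch by hand. The version below is a minimal stand-in, not LangChain's implementation: it splits at H1 to H3 headings, keeps the heading path for each chunk, and ignores `#` lines inside fenced code. The function name `split_by_headings` is an assumption for this example.

```python
import re

HEADING = re.compile(r"^(#{1,3})\s+(.*)$")
FENCE = "`" * 3  # marker that opens or closes a fenced code block


def split_by_headings(markdown: str):
    """Split markdown at H1-H3 headings; each chunk keeps its heading path."""
    chunks = []
    path = {}  # heading level -> current heading text
    body = []  # lines collected since the last heading

    def flush():
        text = "\n".join(body).strip()
        if text:
            chunks.append({"headers": dict(path), "text": text})
        body.clear()

    in_fence = False
    for raw in markdown.splitlines():
        if raw.lstrip().startswith(FENCE):
            in_fence = not in_fence
        match = None if in_fence else HEADING.match(raw)
        if match:
            flush()
            level = len(match.group(1))
            # A new heading replaces any heading at the same or a deeper level.
            for stale in [k for k in path if k >= level]:
                del path[stale]
            path[level] = match.group(2).strip()
        else:
            body.append(raw)
    flush()
    return chunks
```

Each returned dict carries the enclosing headings, which plays the same role as the header metadata LangChain attaches to its chunks.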
Performance and throughput
Running Unstructured locally means your throughput depends on your hardware. On a CPU-only machine, processing a 20-page scanned PDF can take 30+ seconds. With a GPU, it’s faster but requires more infrastructure.
pdfToMarkdown offloads compute to our infrastructure. You get GPU-accelerated OCR without managing any compute. Response times are typically under 10 seconds for a 10-page document.
Cost of ownership
| | Unstructured (local) | Unstructured API | pdfToMarkdown |
|---|---|---|---|
| Infrastructure | You pay | Per-page billing | Free up to 100 pages/month |
| Setup time | 30-60 min | Account + key | Zero (public demo key) |
| Maintenance | Ongoing | None | None |
| Data privacy | Full control | Docs leave your infra | Docs leave your infra |
| Docker image size | 8GB+ | N/A | N/A |
For small teams and side projects, the local Unstructured setup is often abandoned after the initial pain. The hosted Unstructured API avoids that setup but bills per page, which adds pricing friction similar to Mathpix.
Side-by-side: building a PDF Q&A tool
Here’s the same task with each tool:
With Unstructured:
from unstructured.partition.pdf import partition_pdf
from unstructured.chunking.title import chunk_by_title
from langchain.schema import Document
# 30-60 min setup to get here ^^^
elements = partition_pdf("report.pdf", strategy="hi_res") # slow on CPU
chunks = chunk_by_title(elements, max_characters=1500)
docs = [Document(page_content=str(c)) for c in chunks]
With pdfToMarkdown:
from pdftomarkdown import convert
from langchain.text_splitter import MarkdownHeaderTextSplitter
from langchain.schema import Document
result = convert("report.pdf") # API call, ~5s
splitter = MarkdownHeaderTextSplitter(headers_to_split_on=[
("#", "H1"), ("##", "H2"), ("###", "H3")
])
chunks = splitter.split_text(result.markdown)
docs = [Document(page_content=c.page_content, metadata=c.metadata) for c in chunks]
Both approaches produce usable document chunks. The pdfToMarkdown version skips the 30-60 minutes of local setup, returns in roughly 5 seconds per call, and its chunks carry header metadata automatically.
When Unstructured wins
- You need full data privacy — documents can’t leave your infrastructure
- You’re processing many non-PDF formats — DOCX, PPTX, HTML, EPUB all in one library
- You need element-level control — filtering by element type, custom handling per block
- You have engineering bandwidth to maintain the infrastructure
- You’re already running GPU instances for model inference anyway
When pdfToMarkdown wins
- You want to be productive immediately — no setup, no dependencies
- Your documents are PDFs — optimized pipeline for this format
- You’re prototyping or in early product stages — validate before you invest in infrastructure
- You want clean markdown to feed into LLMs or render in a UI
- Team size is small and maintaining OCR infrastructure is a poor use of time
Bottom line
| | Unstructured | pdfToMarkdown |
|---|---|---|
| Best for | Privacy-constrained, multi-format, element-level control | PDF-first, quick integration, LLM pipelines |
| Setup | 30-60 min + system deps | Zero |
| Output | Typed element list | Clean markdown string |
| Self-hosted | Yes | No |
| Data privacy | Full (local mode) | Docs processed on our servers |
| Free tier | Open source (self-hosted) | Demo key + 100 pages/month |
| Chunking | Title-based, element-aware | Heading-based markdown chunking |
If your documents are sensitive and you have the infra, run Unstructured locally. If you want to go from PDF to an LLM-ready string in 30 seconds, try pdfToMarkdown — the demo key works right now, no signup.