Unstructured.io has become a popular choice for document parsing in LLM pipelines — it’s open source, handles many file formats, and has good community support. But “open source” and “best choice” aren’t the same thing. This post breaks down where Unstructured fits well and where a simpler API like pdfToMarkdown is the better option.

What is Unstructured?

Unstructured is an open-source Python library (with an optional hosted API) that partitions documents into structured elements. Given a PDF, it returns a list of typed objects — Title, NarrativeText, Table, ListItem, etc. — rather than a single markdown string.

from unstructured.partition.pdf import partition_pdf

elements = partition_pdf("document.pdf")
for el in elements:
    print(type(el).__name__, el.text[:80])
# NarrativeText  This agreement is entered into between...
# Table           | Column A | Column B |
# Title           Section 2. Definitions

It supports PDF, DOCX, PPTX, HTML, images, and more. You can run it locally with no API calls, or use Unstructured’s cloud API.

What is pdfToMarkdown?

pdfToMarkdown is an HTTP API that takes a PDF and returns a clean markdown string. There’s no SDK required, no local dependencies to install, no compute to provision. Send a file, get back text.

curl -X POST https://pdftomarkdown.dev/v1/convert \
  -H "Authorization: Bearer demo_public_key" \
  -H "Content-Type: application/json" \
  -d '{"input":{"pdf_url":"https://www.w3.org/WAI/ER/tests/xhtml/testfiles/resources/pdf/dummy.pdf"}}'
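The same request can be made from Python with only the standard library. This is a minimal sketch that mirrors the curl command above; the helper names `build_convert_request` and `convert_url` are made up for illustration, and the response is assumed to be a JSON object.

```python
import json
import urllib.request

API_URL = "https://pdftomarkdown.dev/v1/convert"  # endpoint taken from the curl example


def build_convert_request(pdf_url: str, api_key: str) -> urllib.request.Request:
    """Build the same POST request as the curl command, without sending it."""
    body = json.dumps({"input": {"pdf_url": pdf_url}}).encode("utf-8")
    return urllib.request.Request(
        API_URL,
        data=body,
        method="POST",
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
    )


def convert_url(pdf_url: str, api_key: str, timeout: float = 60.0) -> dict:
    """Send the request and decode the JSON response body (assumed to be JSON)."""
    request = build_convert_request(pdf_url, api_key)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))
```

Calling `convert_url(pdf_url, "demo_public_key")` performs the network request; building the request separately makes it easy to inspect headers and payload before sending anything.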

The fundamental trade-off: control vs. simplicity

The difference between these tools comes down to one question: do you need to control document structure at the element level, or do you just need clean text?

Unstructured gives you typed elements with metadata — useful if your pipeline needs to handle different content types differently (e.g., skip tables, only process narrative text, handle captions separately).
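As a rough illustration of per-type handling, the sketch below keeps only prose-like elements and drops tables. To stay runnable without Unstructured installed, it uses a small stand-in `Element` dataclass (hypothetical); it assumes real elements expose a `category` string and a `text` attribute in the same way.

```python
from dataclasses import dataclass
from typing import Iterable, List, Tuple


@dataclass
class Element:
    """Stand-in for an Unstructured element (hypothetical, for illustration only)."""
    category: str
    text: str


def prose_only(
    elements: Iterable[Element],
    keep: Tuple[str, ...] = ("Title", "NarrativeText", "ListItem"),
) -> List[str]:
    """Return the text of elements whose category is in `keep`; everything else is skipped."""
    return [el.text for el in elements if el.category in keep]


sample = [
    Element("Title", "Section 2. Definitions"),
    Element("NarrativeText", "This agreement is entered into between..."),
    Element("Table", "Column A Column B"),
]
print(prose_only(sample))
```

With real output from `partition_pdf`, the same function should work as long as the elements carry those two attributes.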

pdfToMarkdown gives you a single markdown string — useful if you want to feed the whole document into an LLM, embed it in a vector store, or render it in a UI.

For most LLM use cases, the markdown string is what you actually want.

Installation and setup

Unstructured:

# Basic install — limited format support
pip install unstructured

# Full install with all format dependencies (heavy)
pip install "unstructured[all-docs]"

# Additional system dependencies required:
# - poppler-utils (PDF rendering)
# - tesseract-ocr (OCR)
# - libmagic (file type detection)
# On macOS:
brew install poppler tesseract libmagic

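For scanned PDFs you also need an OCR engine. Tesseract (the system package installed above) is the default, and PaddleOCR is an optional alternative. Extra names have changed between releases, so check the install docs for your version:

# PDF extras, using the system Tesseract for OCR
pip install "unstructured[pdf]"
# or, to use PaddleOCR instead
pip install "unstructured[paddleocr]"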

Getting Unstructured running correctly on a new machine — especially in Docker or CI — often takes 30-60 minutes. The full Docker image is over 8GB.

pdfToMarkdown:

pip install pdftomarkdown

Done. No system dependencies, no model downloads, no GPU required.

Local vs. cloud

Unstructured’s main advantage is that you can run it entirely locally. For teams with strict data privacy requirements — healthcare, legal, finance — this matters. Your documents never leave your infrastructure.

pdfToMarkdown processes documents on our servers. If your documents contain sensitive data, you should evaluate whether that fits your security requirements.

If you need local processing and have the engineering bandwidth to set it up, Unstructured’s local mode is the right choice.

Output format comparison

Unstructured returns a list of element objects:

elements = partition_pdf("invoice.pdf")
# You get: [Title, NarrativeText, Table, ListItem, ...]
# To get text, you iterate and serialize yourself:
text = "\n\n".join(str(el) for el in elements)

Tables are particularly tricky — Unstructured returns Table elements, but reconstructing proper markdown tables from them requires extra work.
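To give a sense of that extra work, here is a minimal sketch that turns a simple HTML table into a markdown pipe table using only the standard library. It assumes the table arrives as an HTML string (Unstructured can attach one to Table elements as `metadata.text_as_html` when table structure inference is enabled) and that the table has no merged cells. The function name `html_table_to_markdown` is illustrative.

```python
from html.parser import HTMLParser
from typing import List, Optional


class _TableRows(HTMLParser):
    """Collect the cell text of every <tr> in an HTML table."""

    def __init__(self) -> None:
        super().__init__()
        self.rows: List[List[str]] = []
        self._cell: Optional[List[str]] = None  # text pieces of the open cell

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self.rows.append([])
        elif tag in ("td", "th") and self.rows:
            self._cell = []

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            text = " ".join("".join(self._cell).split())  # collapse whitespace
            self.rows[-1].append(text.replace("|", "\\|"))  # escape pipes
            self._cell = None


def html_table_to_markdown(html: str) -> str:
    """Render a simple HTML table (no row or column spans) as a markdown pipe table."""
    parser = _TableRows()
    parser.feed(html)
    parser.close()
    rows = [row for row in parser.rows if row]
    if not rows:
        return ""
    width = max(len(row) for row in rows)
    rows = [row + [""] * (width - len(row)) for row in rows]  # pad short rows
    lines = ["| " + " | ".join(rows[0]) + " |", "|" + "---|" * width]
    lines += ["| " + " | ".join(row) + " |" for row in rows[1:]]
    return "\n".join(lines)
```

The first row is treated as the header, which is a simplification: tables without a header row, or with spans, need more handling than this.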

pdfToMarkdown returns markdown directly:

{
  "markdown": "# Invoice #12345\n\n**Date:** 2024-01-15\n\n## Line Items\n\n| Description | Qty | Price |\n|---|---|---|\n| Widget A | 5 | $50.00 |\n\n**Total: $250.00**",
  "pages": 1,
  "request_id": "req_abc123"
}

The markdown is immediately usable in LLM prompts, vector embeddings, or UI rendering.
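For example, dropping the `markdown` field into a prompt takes a few lines. This sketch parses a shortened copy of the sample response above; the prompt wording and the `build_prompt` helper are illustrative assumptions, not part of either tool.

```python
import json

# Shortened copy of the sample response body shown above.
raw = '{"markdown": "# Invoice #12345\\n\\n**Total: $250.00**", "pages": 1, "request_id": "req_abc123"}'


def build_prompt(response_body: str, question: str) -> str:
    """Wrap the converted markdown and a question in a simple prompt template."""
    payload = json.loads(response_body)
    return (
        "Answer the question using only the document below.\n\n"
        f"--- document ({payload['pages']} page(s)) ---\n"
        f"{payload['markdown']}\n"
        "--- end document ---\n\n"
        f"Question: {question}"
    )


print(build_prompt(raw, "What is the invoice total?"))
```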

Chunking strategy

One area where Unstructured’s element model shines is semantic chunking. Because it knows the type of each element, you can implement smarter chunking strategies:

from unstructured.chunking.title import chunk_by_title

chunks = chunk_by_title(elements)
# Each Title element starts a new chunk, so chunks follow section boundaries;
# only text longer than max_characters is split further

With pdfToMarkdown, you chunk the markdown yourself. But markdown’s header hierarchy makes this straightforward:

from langchain.text_splitter import MarkdownHeaderTextSplitter
from pdftomarkdown import convert

result = convert("document.pdf")
splitter = MarkdownHeaderTextSplitter(
    headers_to_split_on=[("#", "H1"), ("##", "H2"), ("###", "H3")]
)
chunks = splitter.split_text(result.markdown)

When the converter detects headings correctly, heading-based chunking produces roughly the same section boundaries as Unstructured’s title-based chunking for most documents, and it takes less code.
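To show what heading-based splitting does under the hood, here is a dependency-free sketch. It is not LangChain's implementation, only an approximation of the idea: start a new chunk at every H1 to H3 heading and record the heading path as metadata. The function name `split_by_headings` is made up for this example.

```python
import re
from typing import Dict, List

HEADING = re.compile(r"^(#{1,3})\s+(.*)$")  # matches H1-H3 only
FENCE = "`" * 3  # code fence marker; headings inside fences are ignored


def split_by_headings(markdown: str) -> List[Dict[str, object]]:
    """Split markdown into chunks, starting a new chunk at every H1-H3 heading."""
    chunks: List[Dict[str, object]] = []
    path: Dict[int, str] = {}  # heading level -> current heading text
    lines: List[str] = []
    in_fence = False

    def flush() -> None:
        text = "\n".join(lines).strip()
        if text:  # sections with no body text produce no chunk
            chunks.append({"headers": dict(path), "text": text})
        lines.clear()

    for line in markdown.splitlines():
        if line.lstrip().startswith(FENCE):
            in_fence = not in_fence
        match = None if in_fence else HEADING.match(line)
        if match:
            flush()
            level = len(match.group(1))
            # A new heading replaces its own level and everything deeper.
            for stale in [k for k in path if k >= level]:
                del path[stale]
            path[level] = match.group(2).strip()
        else:
            lines.append(line)
    flush()
    return chunks
```

Each chunk carries the headings it sits under, which is the same kind of metadata you would attach to a vector-store record. Long sections are not split further here; a size-based splitter would still be needed for those.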

Performance and throughput

Running Unstructured locally means your throughput depends on your hardware. On a CPU-only machine, processing a 20-page scanned PDF can take 30+ seconds. With a GPU, it’s faster but requires more infrastructure.

pdfToMarkdown offloads compute to our infrastructure. You get GPU-accelerated OCR without managing any compute. Response times are typically under 10 seconds for a 10-page document.
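Numbers like these depend heavily on the documents and hardware involved, so it is worth measuring on your own files. A minimal timing helper might look like the sketch below; `time_conversion` is a hypothetical name, and `convert_fn` can be any callable that takes a file path.

```python
import time
from typing import Callable, Dict


def time_conversion(convert_fn: Callable[[str], object], path: str, pages: int) -> Dict[str, float]:
    """Time one call to convert_fn(path) and report total seconds and seconds per page."""
    if pages <= 0:
        raise ValueError("pages must be a positive integer")
    start = time.perf_counter()
    convert_fn(path)
    elapsed = time.perf_counter() - start
    return {"seconds": elapsed, "seconds_per_page": elapsed / pages}
```

For instance, `time_conversion(lambda p: partition_pdf(p, strategy="hi_res"), "report.pdf", pages=20)` would time a local run, and passing the API client's convert function would time the hosted path, network latency included.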

Cost of ownership

| | Unstructured (local) | Unstructured API | pdfToMarkdown |
|---|---|---|---|
| Infrastructure | You pay | Per-page billing | Free up to 100 pages/month |
| Setup time | 30-60 min | Account + key | Zero (public demo key) |
| Maintenance | Ongoing | None | None |
| Data privacy | Full control | Docs leave your infra | Docs leave your infra |
| Docker image size | 8GB+ | N/A | N/A |

For small teams and side projects, the local Unstructured setup is often abandoned after the initial pain. The hosted Unstructured API removes the setup cost but bills per page, the same pricing friction you get with paid OCR services such as Mathpix.

Side-by-side: building a PDF Q&A tool

Here’s the same task with each tool:

With Unstructured:

from unstructured.partition.pdf import partition_pdf
from unstructured.chunking.title import chunk_by_title
from langchain.schema import Document

# 30-60 min setup to get here ^^^

elements = partition_pdf("report.pdf", strategy="hi_res")  # slow on CPU
chunks = chunk_by_title(elements, max_characters=1500)
docs = [Document(page_content=str(c)) for c in chunks]

With pdfToMarkdown:

from pdftomarkdown import convert
from langchain.text_splitter import MarkdownHeaderTextSplitter
from langchain.schema import Document

result = convert("report.pdf")  # API call, ~5s
splitter = MarkdownHeaderTextSplitter(headers_to_split_on=[
    ("#", "H1"), ("##", "H2"), ("###", "H3")
])
chunks = splitter.split_text(result.markdown)
docs = [Document(page_content=c.page_content, metadata=c.metadata) for c in chunks]

Both approaches produce usable document chunks. The pdfToMarkdown version needs one pip install and a roughly 5-second API call, versus 30-60 minutes of setup for Unstructured, and its chunks carry header metadata automatically.

When Unstructured wins

You need full data privacy — documents can’t leave your infrastructure
You’re processing many non-PDF formats — DOCX, PPTX, HTML, EPUB all in one library
You need element-level control — filtering by element type, custom handling per block
You have engineering bandwidth to maintain the infrastructure
You’re already running GPU instances for model inference anyway

When pdfToMarkdown wins

You want to be productive immediately — no setup, no dependencies
Your documents are PDFs — optimized pipeline for this format
You’re prototyping or in early product stages — validate before you invest in infrastructure
You want clean markdown to feed into LLMs or render in a UI
Team size is small and maintaining OCR infrastructure is a poor use of time

Bottom line

| | Unstructured | pdfToMarkdown |
|---|---|---|
| Best for | Privacy-constrained, multi-format, element-level control | PDF-first, quick integration, LLM pipelines |
| Setup | 30-60 min + system deps | Zero |
| Output | Typed element list | Clean markdown string |
| Self-hosted | Yes | No |
| Data privacy | Full (local mode) | Docs processed on our servers |
| Free tier | Open source (self-hosted) | Demo key + 100 pages/month |
| Chunking | Title-based, element-aware | Heading-based markdown chunking |

If your documents are sensitive and you have the infra, run Unstructured locally. If you want to go from PDF to an LLM-ready string in 30 seconds, try pdfToMarkdown — the demo key works right now, no signup.