pdfToMarkdown team

pdfToMarkdown vs Unstructured: The Right Tool for Your Pipeline

Tags: comparison, unstructured, open-source, document-parsing

Unstructured.io has become a popular choice for document parsing in LLM pipelines — it’s open source, handles many file formats, and has good community support. But “open source” and “best choice” aren’t the same thing. This post breaks down where Unstructured fits well and where a simpler API like pdfToMarkdown is the better option.

What is Unstructured?

Unstructured is an open-source Python library (with an optional hosted API) that partitions documents into structured elements. Given a PDF, it returns a list of typed objects — Title, NarrativeText, Table, ListItem, etc. — rather than a single markdown string.

from unstructured.partition.pdf import partition_pdf

elements = partition_pdf("document.pdf")
for el in elements:
    print(type(el).__name__, el.text[:80])
# NarrativeText  This agreement is entered into between...
# Table           | Column A | Column B |
# Title           Section 2. Definitions

It supports PDF, DOCX, PPTX, HTML, images, and more. You can run it locally with no API calls, or use Unstructured’s cloud API.

What is pdfToMarkdown?

pdfToMarkdown is an HTTP API that takes a PDF and returns a clean markdown string. There’s no SDK required, no local dependencies to install, no compute to provision. Send a file, get back text.

curl -X POST https://pdftomarkdown.dev/v1/convert \
  -H "Authorization: Bearer demo_public_key" \
  -H "Content-Type: application/json" \
  -d '{"input":{"pdf_url":"https://www.w3.org/WAI/ER/tests/xhtml/testfiles/resources/pdf/dummy.pdf"}}'
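If you would rather make the same call from Python without installing anything, here is a minimal sketch using only the standard library. The endpoint, headers, and JSON body are copied from the curl command above; the function name `build_convert_request` is ours and purely illustrative.

```python
import json
import urllib.request

API_URL = "https://pdftomarkdown.dev/v1/convert"  # same endpoint as the curl example


def build_convert_request(pdf_url, api_key="demo_public_key"):
    """Build (but do not send) the POST request shown in the curl example."""
    body = json.dumps({"input": {"pdf_url": pdf_url}}).encode("utf-8")
    return urllib.request.Request(
        API_URL,
        data=body,
        method="POST",
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
    )


# Sending it is a network call, so it is left commented out in this sketch:
# req = build_convert_request("https://example.com/some.pdf")
# with urllib.request.urlopen(req, timeout=60) as resp:
#     payload = json.load(resp)
```

Keeping request construction separate from sending makes it easy to inspect or unit-test the request without touching the network.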

The fundamental trade-off: control vs. simplicity

The difference between these tools comes down to one question: do you need to control document structure at the element level, or do you just need clean text?

Unstructured gives you typed elements with metadata — useful if your pipeline needs to handle different content types differently (e.g., skip tables, only process narrative text, handle captions separately).

pdfToMarkdown gives you a single markdown string — useful if you want to feed the whole document into an LLM, embed it in a vector store, or render it in a UI.

For most LLM use cases, the markdown string is what you actually want.
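To make the "control" side of the trade-off concrete, here is a small sketch of element-level filtering. It keys off the class name, the same way the `type(el).__name__` loop earlier in this post does. The stand-in classes below are hypothetical placeholders so the snippet runs without Unstructured installed; with the real library you would pass in the list returned by `partition_pdf`.

```python
def keep_types(elements, wanted):
    """Return only the elements whose class name is in `wanted`."""
    wanted = set(wanted)
    return [el for el in elements if type(el).__name__ in wanted]


# Hypothetical stand-ins for Unstructured element classes (illustration only).
class _Element:
    def __init__(self, text):
        self.text = text


class Title(_Element):
    pass


class NarrativeText(_Element):
    pass


class Table(_Element):
    pass


elements = [
    Title("Section 2. Definitions"),
    NarrativeText("This agreement is entered into between..."),
    Table("| Column A | Column B |"),
]

# Skip tables and keep only headings and prose.
prose = keep_types(elements, {"Title", "NarrativeText"})
body = "\n\n".join(el.text for el in prose)
```

If you never need this kind of per-type branching, the single markdown string is usually the simpler starting point.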

Installation and setup

Unstructured:

# Basic install — limited format support
pip install unstructured

# Full install with all format dependencies (heavy)
pip install "unstructured[all-docs]"

# Additional system dependencies required:
# - poppler-utils (PDF rendering)
# - tesseract-ocr (OCR)
# - libmagic (file type detection)
# On macOS:
brew install poppler tesseract libmagic

For scanned PDFs with OCR, you also need:

pip install "unstructured[paddlepaddle]"
# or
pip install "unstructured[tesseract]"

Getting Unstructured running correctly on a new machine — especially in Docker or CI — often takes 30-60 minutes. The full Docker image is over 8GB.

pdfToMarkdown:

pip install pdftomarkdown

Done. No system dependencies, no model downloads, no GPU required.

Local vs. cloud

Unstructured’s main advantage is that you can run it entirely locally. For teams with strict data privacy requirements — healthcare, legal, finance — this matters. Your documents never leave your infrastructure.

pdfToMarkdown processes documents on our servers. If your documents contain sensitive data, you should evaluate whether that fits your security requirements.

If you need local processing and have the engineering bandwidth to set it up, Unstructured’s local mode is the right choice.

Output format comparison

Unstructured returns a list of element objects:

elements = partition_pdf("invoice.pdf")
# You get: [Title, NarrativeText, Table, ListItem, ...]
# To get text, you iterate and serialize yourself:
text = "\n\n".join(str(el) for el in elements)

Tables are particularly tricky — Unstructured returns Table elements, but reconstructing proper markdown tables from them requires extra work.
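As a rough idea of what that extra work looks like: Unstructured can attach an HTML rendering of a table to `element.metadata.text_as_html` when table structure inference is enabled. Assuming you have that HTML string, the sketch below turns a simple table into a markdown table using only the standard library. The helper name `html_table_to_markdown` is ours, and it deliberately ignores merged cells (`rowspan`/`colspan`) and nested tables.

```python
from html.parser import HTMLParser


class _TableRows(HTMLParser):
    """Collect the text of each <td>/<th> cell, grouped by <tr> row."""

    def __init__(self):
        super().__init__()
        self.rows = []
        self._row = None
        self._cell = None

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            # Collapse internal whitespace so each cell stays on one line.
            self._row.append(" ".join("".join(self._cell).split()))
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None


def html_table_to_markdown(html):
    """Convert a simple HTML table to a markdown table (first row = header)."""
    parser = _TableRows()
    parser.feed(html)
    parser.close()
    rows = [r for r in parser.rows if r]
    if not rows:
        return ""
    width = max(len(r) for r in rows)
    padded = [r + [""] * (width - len(r)) for r in rows]

    def fmt(cells):
        return "| " + " | ".join(c.replace("|", "\\|") for c in cells) + " |"

    lines = [fmt(padded[0]), "|" + "---|" * width]
    lines += [fmt(r) for r in padded[1:]]
    return "\n".join(lines)


html = (
    "<table><tr><th>Description</th><th>Qty</th></tr>"
    "<tr><td>Widget A</td><td>5</td></tr></table>"
)
table_md = html_table_to_markdown(html)
```

Real-world tables with merged or multi-line cells need more care than this, which is part of why the serialization step is easy to underestimate.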

pdfToMarkdown returns markdown directly:

{
  "markdown": "# Invoice #12345\n\n**Date:** 2024-01-15\n\n## Line Items\n\n| Description | Qty | Price |\n|---|---|---|\n| Widget A | 5 | $50.00 |\n\n**Total: $250.00**",
  "pages": 1,
  "request_id": "req_abc123"
}

The markdown is immediately usable in LLM prompts, vector embeddings, or UI rendering.
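If you call the HTTP API directly rather than through a client library, a thin wrapper around the response keeps the rest of your pipeline tidy. This sketch assumes the response body has exactly the three fields shown in the example above (`markdown`, `pages`, `request_id`); the `ConvertResult` dataclass and `parse_convert_response` helper are illustrative names, not part of any published SDK.

```python
import json
from dataclasses import dataclass


@dataclass
class ConvertResult:
    markdown: str
    pages: int
    request_id: str


def parse_convert_response(raw):
    """Parse a JSON response body (str or bytes) into a ConvertResult."""
    data = json.loads(raw)
    missing = [k for k in ("markdown", "pages", "request_id") if k not in data]
    if missing:
        raise ValueError(f"response is missing fields: {missing}")
    return ConvertResult(data["markdown"], int(data["pages"]), data["request_id"])


raw = '{"markdown": "# Invoice #12345", "pages": 1, "request_id": "req_abc123"}'
result = parse_convert_response(raw)
prompt = f"Summarize this document:\n\n{result.markdown}"
```

Failing loudly on a missing field is a deliberate choice here: an empty prompt is harder to debug than an exception.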

Chunking strategy

One area where Unstructured’s element model shines is semantic chunking. Because it knows the type of each element, you can implement smarter chunking strategies:

from unstructured.chunking.title import chunk_by_title

chunks = chunk_by_title(elements)
# Chunks respect section boundaries; paragraphs stay whole unless one exceeds the size limit

With pdfToMarkdown, you chunk the markdown yourself. But markdown’s header hierarchy makes this straightforward:

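# Newer LangChain releases expose this class from the langchain_text_splitters package
from langchain.text_splitter import MarkdownHeaderTextSplitter
from pdftomarkdown import convert

result = convert("document.pdf")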
splitter = MarkdownHeaderTextSplitter(
    headers_to_split_on=[("#", "H1"), ("##", "H2"), ("###", "H3")]
)
chunks = splitter.split_text(result.markdown)

For most documents with clear headings, heading-based chunking produces much the same section boundaries as Unstructured’s title-based chunking, and it takes less code to implement.
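To show what heading-based splitting does under the hood, here is a dependency-free sketch of the idea. It is not LangChain’s implementation, just a simplified stand-in: it only recognizes `#`, `##`, and `###` headings, and it does not special-case headings that appear inside fenced code blocks.

```python
import re

HEADING = re.compile(r"^(#{1,3})\s+(.*\S)\s*$")


def split_by_headings(markdown):
    """Split markdown into chunks, each tagged with its heading path."""
    chunks = []
    path = {}  # heading level -> current title at that level
    buf = []

    def flush():
        text = "\n".join(buf).strip()
        if text:
            chunks.append({"headers": dict(path), "text": text})
        buf.clear()

    for line in markdown.splitlines():
        match = HEADING.match(line)
        if match:
            flush()
            level = len(match.group(1))
            # A new heading closes any open headings at the same or deeper level.
            for key in [k for k in path if k >= level]:
                del path[key]
            path[level] = match.group(2)
        else:
            buf.append(line)
    flush()
    return chunks


doc = "# Invoice\n\nIntro\n\n## Line Items\n\nWidget A\n\n## Total\n\n$250.00"
chunks = split_by_headings(doc)
```

Each chunk carries the headings it sits under, which is the same kind of metadata you would later use to filter or cite results in a retrieval pipeline.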

Performance and throughput

Running Unstructured locally means your throughput depends on your hardware. On a CPU-only machine, processing a 20-page scanned PDF can take 30+ seconds. With a GPU, it’s faster but requires more infrastructure.

pdfToMarkdown offloads compute to our infrastructure. You get GPU-accelerated OCR without managing any compute. Response times are typically under 10 seconds for a 10-page document.
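The numbers above will vary with your documents and hardware, so it is worth measuring on your own files. Here is a small timing sketch that works with any callable, whether that is `partition_pdf` or an API client. The helper names `timed` and `pages_per_second` are ours.

```python
import time


def timed(fn, *args, **kwargs):
    """Call fn and return (result, elapsed_seconds)."""
    start = time.perf_counter()
    result = fn(*args, **kwargs)
    return result, time.perf_counter() - start


def pages_per_second(pages, seconds):
    """Throughput for one document; infinite if the call took no measurable time."""
    return pages / seconds if seconds > 0 else float("inf")


# Example with a trivial stand-in for a real conversion call:
result, elapsed = timed(sum, [1, 2, 3])

# A 20-page document that takes 40 seconds works out to 0.5 pages/second.
rate = pages_per_second(20, 40.0)
```

Run it over a representative sample rather than a single file, since scanned and born-digital PDFs can behave very differently.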

Cost of ownership

| | Unstructured (local) | Unstructured API | pdfToMarkdown |
|---|---|---|---|
| Infrastructure | You pay | Per-page billing | Free up to 100 pages/month |
| Setup time | 30-60 min | Account + key | Zero (public demo key) |
| Maintenance | Ongoing | None | None |
| Data privacy | Full control | Docs leave your infra | Docs leave your infra |
| Docker image size | 8GB+ | N/A | N/A |

For small teams and side projects, the local Unstructured setup is often abandoned after the initial setup pain. The hosted Unstructured API avoids that setup but brings per-page billing, the same kind of pricing friction as Mathpix.

Side-by-side: building a PDF Q&A tool

Here’s the same task with each tool:

With Unstructured:

from unstructured.partition.pdf import partition_pdf
from unstructured.chunking.title import chunk_by_title
from langchain.schema import Document

# 30-60 min setup to get here ^^^

elements = partition_pdf("report.pdf", strategy="hi_res")  # slow on CPU
chunks = chunk_by_title(elements, max_characters=1500)
docs = [Document(page_content=str(c)) for c in chunks]

With pdfToMarkdown:

from pdftomarkdown import convert
from langchain.text_splitter import MarkdownHeaderTextSplitter
from langchain.schema import Document

result = convert("report.pdf")  # API call, ~5s
splitter = MarkdownHeaderTextSplitter(headers_to_split_on=[
    ("#", "H1"), ("##", "H2"), ("###", "H3")
])
chunks = splitter.split_text(result.markdown)
docs = [Document(page_content=c.page_content, metadata=c.metadata) for c in chunks]

Both approaches produce usable document chunks. The pdfToMarkdown version needs no setup beyond a pip install and returns in about 5 seconds, versus the 30-60 minutes of setup Unstructured typically needs, and its chunks carry header metadata automatically.

When Unstructured wins

  • You need full data privacy — documents can’t leave your infrastructure
  • You’re processing many non-PDF formats — DOCX, PPTX, HTML, EPUB all in one library
  • You need element-level control — filtering by element type, custom handling per block
  • You have engineering bandwidth to maintain the infrastructure
  • You’re already running GPU instances for model inference anyway

When pdfToMarkdown wins

  • You want to be productive immediately — no setup, no dependencies
  • Your documents are PDFs — optimized pipeline for this format
  • You’re prototyping or in early product stages — validate before you invest in infrastructure
  • You want clean markdown to feed into LLMs or render in a UI
  • Team size is small and maintaining OCR infrastructure is a poor use of time

Bottom line

| | Unstructured | pdfToMarkdown |
|---|---|---|
| Best for | Privacy-constrained, multi-format, element-level control | PDF-first, quick integration, LLM pipelines |
| Setup | 30-60 min + system deps | Zero |
| Output | Typed element list | Clean markdown string |
| Self-hosted | Yes | No |
| Data privacy | Full (local mode) | Docs processed on our servers |
| Free tier | Open source (self-hosted) | Demo key + 100 pages/month |
| Chunking | Title-based, element-aware | Heading-based markdown chunking |

If your documents are sensitive and you have the infra, run Unstructured locally. If you want to go from PDF to an LLM-ready string in 30 seconds, try pdfToMarkdown — the demo key works right now, no signup.