pdfToMarkdown team

pdfToMarkdown vs LlamaParse: PDF Parsing for LLM Pipelines

Tags: comparison, llamaparse, rag, llm

If you’re building a RAG pipeline or LLM application that ingests PDFs, you’ve likely encountered LlamaParse — the PDF parsing service from the LlamaIndex team. It’s a capable tool, but it comes with trade-offs that make it the wrong choice for many projects.

This comparison covers what both tools do, where each shines, and how to choose.

What is LlamaParse?

LlamaParse is a cloud-based document parsing service built by LlamaIndex, the team behind the llama_index Python package. It’s designed primarily for RAG (retrieval-augmented generation) workflows: taking PDFs and converting them into formats suitable for vector embedding and retrieval.

It launched in 2024 as a paid add-on to the LlamaIndex ecosystem, with a free tier of 1,000 pages/day.

What is pdfToMarkdown?

pdfToMarkdown is a standalone API that converts PDFs to clean markdown. It’s not tied to any particular framework. You send a PDF, you get back markdown. Use it with LlamaIndex, LangChain, raw OpenAI calls, or anything else.

The core difference: ecosystem lock-in

LlamaParse is designed to work inside the LlamaIndex framework. The native Python interface is:

from llama_parse import LlamaParse

parser = LlamaParse(result_type="markdown")
documents = parser.load_data("document.pdf")
# returns LlamaIndex Document objects

This is convenient if you’re already using LlamaIndex. But if you’re not — or if you want to switch frameworks later — you’re dealing with an extra dependency and a specific object model.

pdfToMarkdown returns plain text. It works with any stack:

from pdftomarkdown import convert

result = convert("document.pdf")
markdown = result.markdown  # just a string

# Use it with anything, e.g. LangChain's splitters
# (pip install langchain-text-splitters)
from langchain_text_splitters import MarkdownTextSplitter
chunks = MarkdownTextSplitter().split_text(markdown)

Pricing comparison

| | LlamaParse | pdfToMarkdown |
|---|---|---|
| Free tier | 1,000 pages/day | 100 pages/month |
| Authentication | API key (account required) | Public key (no account) or GitHub login |
| Paid plans | Credits-based | TBD |
| Credit card required | Yes for paid | No |

LlamaParse’s free tier is more generous in raw page count, but requires account creation and is structured around LlamaCloud credits. pdfToMarkdown’s free tier (via GitHub login) has lower limits but no credit card friction and works immediately with the public demo key.
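To put the two quotas side by side, here is a rough back-of-the-envelope sketch that uses only the free-tier figures quoted above. The helper name periods_needed and the 3,000-page corpus are illustrative assumptions, and the quotas themselves may change.

```python
import math

# Free-tier quotas as quoted in this article (they may change over time).
LLAMAPARSE_PAGES_PER_DAY = 1_000
PDFTOMARKDOWN_PAGES_PER_MONTH = 100


def periods_needed(total_pages: int, pages_per_period: int) -> int:
    """Number of quota periods needed to process total_pages on a free tier."""
    if total_pages < 0 or pages_per_period <= 0:
        raise ValueError("total_pages must be >= 0 and pages_per_period > 0")
    return math.ceil(total_pages / pages_per_period)


corpus_pages = 3_000  # hypothetical corpus size
days_on_llamaparse = periods_needed(corpus_pages, LLAMAPARSE_PAGES_PER_DAY)
months_on_pdftomarkdown = periods_needed(corpus_pages, PDFTOMARKDOWN_PAGES_PER_MONTH)
```

For this hypothetical corpus the result is 3 days on one free tier and 30 months on the other, so the free tiers are best compared for evaluation work rather than bulk ingestion.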

Output quality

Both tools use modern vision-language models to understand document layout, so both handle tables, columns, and mixed content better than traditional OCR.

LlamaParse has a feature called “instruction following” — you can give natural-language instructions alongside your document:

parser = LlamaParse(
    result_type="markdown",
    parsing_instruction="Extract only the financial tables, ignore boilerplate text."
)

This is genuinely powerful for specific extraction tasks where you know exactly what you want.

pdfToMarkdown takes the opposite approach: return a clean, complete markdown representation of the document and let you process it however you want. For most use cases — especially when you don’t know the document structure in advance — this is more flexible.
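As one illustration of that post-processing, the "financial tables only" task from the LlamaParse example can be approximated after conversion with a few lines of plain Python. This is a minimal sketch, assuming the markdown contains pipe-style tables whose rows start with "|"; extract_tables is a hypothetical helper, not part of either library.

```python
def extract_tables(markdown: str) -> list[str]:
    """Return each pipe-style markdown table as one string.

    Assumption: every table row starts with "|" and a table has at
    least two consecutive rows (header plus separator).
    """
    tables: list[str] = []
    current: list[str] = []
    for line in markdown.splitlines():
        if line.strip().startswith("|"):
            current.append(line.strip())
        else:
            # A non-table line closes any table collected so far.
            if len(current) >= 2:
                tables.append("\n".join(current))
            current = []
    if len(current) >= 2:
        tables.append("\n".join(current))
    return tables


sample = """# Q3 Report

Boilerplate introduction.

| Item    | Amount |
|---------|--------|
| Revenue | 120    |
| Costs   | 80     |

Closing remarks.
"""

tables = extract_tables(sample)
```

Because the full document is still available, the same converted markdown can feed a different filter later without re-parsing the PDF.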

API design

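LlamaParse is SDK-first: the usual entry point is the llama-parse Python package, and the common quick-start also patches the event loop with nest_asyncio (needed in notebooks, where a loop is already running):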

# LlamaParse — you need to pip install llama-parse
import nest_asyncio
nest_asyncio.apply()

from llama_parse import LlamaParse
parser = LlamaParse(api_key="llx-...", result_type="markdown")
docs = parser.load_data("file.pdf")

pdfToMarkdown works from a single HTTP request:

curl -X POST https://pdftomarkdown.dev/v1/convert \
  -H "Authorization: Bearer demo_public_key" \
  -H "Content-Type: application/json" \
  -d '{"input":{"pdf_url":"https://www.w3.org/WAI/ER/tests/xhtml/testfiles/resources/pdf/dummy.pdf"}}'

For prototyping, debugging, or use in non-Python environments (Node.js, Go, Ruby, shell scripts), the HTTP API is much more accessible.
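The same request can be made from Python with only the standard library. This is a sketch under stated assumptions: the endpoint, headers, and payload are copied from the curl command above, while the function names are hypothetical and the response is assumed to be JSON (its field names are not shown in this article).

```python
import json
import urllib.request

API_URL = "https://pdftomarkdown.dev/v1/convert"  # endpoint from the curl example


def build_convert_request(
    pdf_url: str, api_key: str = "demo_public_key"
) -> urllib.request.Request:
    """Build the same POST request as the curl command, without sending it."""
    body = json.dumps({"input": {"pdf_url": pdf_url}}).encode("utf-8")
    return urllib.request.Request(
        API_URL,
        data=body,
        method="POST",
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
    )


def convert_url(pdf_url: str, api_key: str = "demo_public_key") -> dict:
    """Send the request and return the decoded JSON body.

    Assumption: the API answers with JSON. Inspect the returned dict
    before relying on any particular field name.
    """
    request = build_convert_request(pdf_url, api_key)
    with urllib.request.urlopen(request, timeout=60) as response:
        return json.loads(response.read().decode("utf-8"))
```

Separating request construction from sending keeps the payload easy to inspect or log before any network call is made.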

When to use LlamaParse

  • You’re already using LlamaIndex and want seamless integration
  • You need instruction-following extraction (specific fields, filtered content)
  • You’re processing large volumes that still fit within the 1,000 pages/day free tier
  • You want LlamaCloud’s other features (vector stores, managed indexes)

When to use pdfToMarkdown

  • You want framework-agnostic markdown output
  • You’re testing a new project and don’t want to commit to an ecosystem
  • You need to integrate with LangChain, raw OpenAI, or a custom stack
  • You want to test the API instantly without account creation
  • You prefer a simpler pricing model without ecosystem coupling

Framework agnosticism in practice

Here’s how pdfToMarkdown’s output integrates with LangChain, something LlamaParse makes awkward:

from pdftomarkdown import convert
from langchain_text_splitters import MarkdownHeaderTextSplitter

# Convert
result = convert("report.pdf", api_key="your-key")

# Split by markdown headers for better chunking
splitter = MarkdownHeaderTextSplitter(
    headers_to_split_on=[("#", "h1"), ("##", "h2"), ("###", "h3")]
)

# split_text already returns standard LangChain Document objects,
# with the matched headers stored in each chunk's metadata
documents = splitter.split_text(result.markdown)

And with raw OpenAI:

from pdftomarkdown import convert
from openai import OpenAI

client = OpenAI()
result = convert("contract.pdf", api_key="your-key")

response = client.chat.completions.create(
    model="gpt-4o",
    messages=[
        {"role": "user", "content": f"Summarize this contract:\n\n{result.markdown}"}
    ]
)

No special adapters, no framework types — just text.
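Because the output is a plain string, header-based chunking does not need a framework at all. The following is a minimal sketch, assuming ATX-style headers (#, ##, ###) in the converted markdown; split_by_headers is a hypothetical helper, and it does not special-case "#" lines inside fenced code blocks.

```python
import re

# Matches "# Title", "## Title" and "### Title" (levels 1 to 3 only).
HEADER_RE = re.compile(r"^(#{1,3})\s+(.*)$")


def split_by_headers(markdown: str) -> list[dict]:
    """Split markdown into sections, one per level 1-3 header.

    Text before the first header is kept as a section with title None.
    """
    raw = [{"title": None, "lines": []}]
    for line in markdown.splitlines():
        match = HEADER_RE.match(line)
        if match:
            raw.append({"title": match.group(2).strip(), "lines": []})
        else:
            raw[-1]["lines"].append(line)

    sections = []
    for section in raw:
        text = "\n".join(section["lines"]).strip()
        # Drop the untitled preamble only when it is empty.
        if section["title"] is not None or text:
            sections.append({"title": section["title"], "text": text})
    return sections


sample = "Preamble\n\n# Intro\nHello\n\n## Details\nMore text\n"
sections = split_by_headers(sample)
```

Each section is a plain dict, so it can be passed straight to an embedding call or wrapped in whatever document type a given stack expects.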

Bottom line

| | LlamaParse | pdfToMarkdown |
|---|---|---|
| Best for | LlamaIndex pipelines | Any stack, framework-agnostic use |
| Free tier | 1,000 pages/day (account required) | Demo key (no signup) + 100/month with GitHub |
| Instruction following | Yes | No |
| API simplicity | SDK-first | HTTP-first |
| Output | LlamaIndex Documents | Plain markdown string |
| Ecosystem coupling | High (LlamaCloud) | None |

If you’re building inside the LlamaIndex ecosystem, LlamaParse is the obvious choice. If you want a clean HTTP API that returns standard markdown and works with any framework, pdfToMarkdown is one curl command away.