pdfToMarkdown vs LlamaParse for RAG: A Deeper Comparison
We published a general comparison of pdfToMarkdown and LlamaParse earlier this year. That post covers API design, pricing tiers, and when each tool makes sense. This post goes deeper on one specific use case: RAG pipelines.
If you’re converting PDFs into chunks, embedding those chunks, storing them in a vector database, and retrieving them at query time — the choice of PDF parser affects every downstream step. Here’s how these two tools compare when RAG is the goal.
The framework lock-in problem
RAG pipelines evolve. You might start with LangChain, switch to LlamaIndex, then move to a custom orchestration layer once you understand your requirements. The parser you choose shouldn’t constrain that evolution.
LlamaParse returns LlamaIndex Document objects by default:
```python
from llama_parse import LlamaParse

parser = LlamaParse(api_key="llx-...", result_type="markdown")
documents = parser.load_data("quarterly_report.pdf")

# documents is a list of llama_index.core.schema.Document
# To use with LangChain or anything else, you need to convert:
texts = [doc.text for doc in documents]
```
You can extract the raw text, but now you’re paying for a LlamaIndex-coupled dependency to get a string. Your CI installs llama-parse, which pulls in llama-index-core, which pulls in its own dependency tree. On a clean venv:
```bash
pip install llama-parse
# Installs llama-index-core, llama-cloud, and 15+ transitive dependencies
```
pdfToMarkdown returns a string over HTTP. No SDK required, no framework opinions:
```bash
curl -X POST https://pdftomarkdown.dev/v1/convert \
  -H "Authorization: Bearer demo_public_key" \
  -H "Content-Type: application/json" \
  -d '{"input":{"pdf_url":"https://example.com/quarterly_report.pdf"}}'
```
Or, if you prefer a client library, the optional Python package wraps the same endpoint:

```python
from pdftomarkdown import convert

result = convert("quarterly_report.pdf", api_key="demo_public_key")
markdown = result.markdown  # str — use it anywhere
```
This matters for RAG specifically because RAG pipelines already have heavy dependency trees (vector DB clients, embedding models, orchestration frameworks). Adding LlamaIndex as a transitive dependency for your parser creates version conflicts and bloat you don’t need.
Markdown quality and its effect on embeddings
The quality of your PDF-to-text conversion directly determines your retrieval quality. Bad parsing produces bad chunks, bad chunks produce bad embeddings, and bad embeddings produce irrelevant retrieval. There’s no fixing this downstream.
Here’s the same page from a financial report converted by both tools:
LlamaParse output:
```markdown
# Q3 2024 Financial Highlights

Revenue for the third quarter was $142.8M, representing a
15% increase year-over-year. Operating expenses were $98.3M.

|Metric|Q3 2024|Q3 2023|Change|
|---|---|---|---|
|Revenue|$142.8M|$124.2M|+15%|
|Operating Expenses|$98.3M|$91.1M|+8%|
|Net Income|$31.2M|$22.8M|+37%|
|EPS|$1.24|$0.91|+36%|

Adjusted EBITDA margin expanded to 31.1% from 26.7% in the
prior year period.
```
pdfToMarkdown output:
```markdown
# Q3 2024 Financial Highlights

Revenue for the third quarter was $142.8M, representing a 15% increase year-over-year. Operating expenses were $98.3M.

| Metric | Q3 2024 | Q3 2023 | Change |
|---|---|---|---|
| Revenue | $142.8M | $124.2M | +15% |
| Operating Expenses | $98.3M | $91.1M | +8% |
| Net Income | $31.2M | $22.8M | +37% |
| EPS | $1.24 | $0.91 | +36% |

Adjusted EBITDA margin expanded to 31.1% from 26.7% in the prior year period.
```
Both handle this case well. The structural differences are minor — spacing in table cells, line wrapping in paragraphs. But these small differences compound during chunking.
Where it matters for embeddings: pdfToMarkdown preserves paragraph boundaries more consistently, which means heading-based chunking produces cleaner splits. LlamaParse sometimes introduces line breaks mid-paragraph (artifacts from the underlying extraction), which can cause text splitters to create fragments that embed poorly.
When a chunk contains a complete thought — “Revenue was $142.8M, up 15% YoY” — the embedding captures the semantic relationship between the metric and its context. When a chunk ends with “Revenue for the third quarter was $142.8M, representing a” — the embedding is degraded.
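One way to make this concrete is to check whether each chunk ends on a natural boundary before embedding it. A minimal sketch, assuming a simple punctuation heuristic (the function and sample strings below are illustrative, not output from either tool):

```python
def ends_complete(chunk: str) -> bool:
    """Heuristic: a chunk whose last line ends with terminal punctuation,
    a table row, or is a heading likely holds a complete thought."""
    last_line = chunk.rstrip().rsplit("\n", 1)[-1]
    return last_line.endswith((".", "!", "?", "|")) or last_line.startswith("#")

clean = "Revenue for the third quarter was $142.8M, up 15% YoY."
broken = "Revenue for the third quarter was $142.8M, representing a"

print(ends_complete(clean))   # True
print(ends_complete(broken))  # False
```

Running a check like this over your chunk set is a cheap way to spot parser-induced fragmentation before it reaches the embedding model.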
Side-by-side: the same RAG pipeline
Here’s a complete RAG pipeline using each tool. Same PDF, same vector store, same retrieval query.
With LlamaParse + LlamaIndex:
```python
from llama_parse import LlamaParse
from llama_index.core import VectorStoreIndex
from llama_index.core.node_parser import MarkdownNodeParser
from llama_index.embeddings.openai import OpenAIEmbedding

# Parse
parser = LlamaParse(api_key="llx-...", result_type="markdown")
documents = parser.load_data("quarterly_report.pdf")

# Chunk (LlamaIndex-specific node parser)
node_parser = MarkdownNodeParser()
nodes = node_parser.get_nodes_from_documents(documents)

# Index and query (LlamaIndex-specific)
index = VectorStoreIndex(nodes, embed_model=OpenAIEmbedding())
query_engine = index.as_query_engine()
response = query_engine.query("What was the Q3 revenue?")
print(response)
```
This works, but every component is a LlamaIndex class. Switching to LangChain, Haystack, or a custom pipeline means rewriting the chunking, indexing, and query layers.
With pdfToMarkdown + any framework:
```python
import requests
from openai import OpenAI

client = OpenAI()

# Parse — plain HTTP, no framework dependency
result = requests.post(
    "https://pdftomarkdown.dev/v1/convert",
    headers={"Authorization": "Bearer demo_public_key"},
    json={"input": {"pdf_url": "https://example.com/quarterly_report.pdf"}},
).json()
markdown = result["output"]["markdown"]

# Chunk by markdown headers
sections = []
current = ""
for line in markdown.split("\n"):
    if line.startswith("## ") and current:
        sections.append(current.strip())
        current = line + "\n"
    else:
        current += line + "\n"
if current.strip():
    sections.append(current.strip())

# Embed with OpenAI directly
embeddings = client.embeddings.create(
    model="text-embedding-3-small",
    input=sections,
).data

# Store in any vector DB — Pinecone, Weaviate, pgvector, Qdrant, whatever
vectors = [
    {"id": f"chunk_{i}", "values": e.embedding, "metadata": {"text": sections[i]}}
    for i, e in enumerate(embeddings)
]
```
No framework lock-in. Swap the vector DB client, change the embedding model, restructure the chunking — the parser doesn’t care.
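The framework-free pipeline above stops at indexing; retrieval is then just a similarity search you can run against any store. A minimal in-memory sketch using cosine similarity (the toy 3-dimensional vectors stand in for real embeddings):

```python
import numpy as np

def top_k(query_vec, vectors, k=2):
    """Rank stored chunks by cosine similarity to the query embedding."""
    ids, rows = zip(*[(v["id"], v["values"]) for v in vectors])
    mat = np.array(rows, dtype=float)
    q = np.array(query_vec, dtype=float)
    sims = mat @ q / (np.linalg.norm(mat, axis=1) * np.linalg.norm(q))
    order = np.argsort(-sims)[:k]
    return [ids[i] for i in order]

# Toy "embeddings" for illustration only
vectors = [
    {"id": "chunk_0", "values": [1.0, 0.0, 0.0]},
    {"id": "chunk_1", "values": [0.0, 1.0, 0.0]},
    {"id": "chunk_2", "values": [0.9, 0.1, 0.0]},
]
print(top_k([1.0, 0.0, 0.0], vectors))  # ['chunk_0', 'chunk_2']
```

In production you would hand this query to your vector DB client instead, but the point stands: nothing about the parser dictates how retrieval works.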
With pdfToMarkdown + LangChain (if you prefer a framework):
```python
from pdftomarkdown import convert
from langchain.text_splitter import MarkdownHeaderTextSplitter
from langchain_openai import OpenAIEmbeddings
from langchain_community.vectorstores import FAISS

# Parse
result = convert("quarterly_report.pdf", api_key="demo_public_key")

# Chunk
splitter = MarkdownHeaderTextSplitter(
    headers_to_split_on=[("#", "h1"), ("##", "h2"), ("###", "h3")]
)
chunks = splitter.split_text(result.markdown)

# Embed and store
vectorstore = FAISS.from_documents(chunks, OpenAIEmbeddings())

# Query
docs = vectorstore.similarity_search("What was the Q3 revenue?")
print(docs[0].page_content)
```
Same parser, different framework. That’s the point.
Pricing at scale
RAG pipelines process volume. A single knowledge base might contain thousands of PDFs totaling hundreds of thousands of pages. Pricing per page is the metric that matters.
| | LlamaParse | pdfToMarkdown |
|---|---|---|
| Free tier | 1,000 pages/day | 100 pages/month (GitHub login) |
| Free (no account) | N/A | Demo key (1 page/PDF, watermark) |
| Paid pricing | $0.003/page (LlamaCloud credits) | TBD |
| Account required | Yes | No (demo key) or GitHub login |
| Credit card required | Yes (for paid) | No |
For prototyping and development, both tools are free. But the experience is different:
- LlamaParse requires creating a LlamaCloud account, generating an API key, and managing credits. If you hit the daily limit, your pipeline stops.
- pdfToMarkdown works immediately with demo_public_key. No signup, no credit management. When you outgrow the demo key's limits, sign in with GitHub for 100 pages/month.
For high-volume production pipelines (10,000+ pages/month), you’ll be on a paid plan with either tool. LlamaParse’s credit-based pricing adds cognitive overhead — you’re converting between credits and pages and tracking burn rate. A per-page or flat-rate model is simpler to budget.
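Per-page pricing makes the budgeting arithmetic trivial. A back-of-envelope sketch using the $0.003/page figure from the table above (the monthly volumes are hypothetical):

```python
def monthly_cost(pages_per_month: int, price_per_page: float = 0.003) -> float:
    """Estimate monthly parsing spend at a flat per-page rate."""
    return round(pages_per_month * price_per_page, 2)

print(monthly_cost(10_000))   # 30.0
print(monthly_cost(250_000))  # 750.0
```

With a credit system you would first need to convert each of those page counts into credits before you could compare plans.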
Chunking strategies for RAG
How you chunk parsed output is critical for retrieval quality. The parser’s output format constrains your chunking options.
LlamaParse output + LlamaIndex chunking:
LlamaParse is optimized for LlamaIndex’s MarkdownNodeParser, which splits on heading boundaries and assigns metadata. This works well, but you’re locked into LlamaIndex’s node abstraction. If you want to use LangChain’s RecursiveCharacterTextSplitter or a custom chunker, you first need to extract raw text from the LlamaIndex Document objects.
pdfToMarkdown output + any chunker:
Because pdfToMarkdown returns a plain markdown string, you can use any chunking strategy:
```python
from pdftomarkdown import convert

result = convert("document.pdf", api_key="demo_public_key")

# Option 1: Heading-based (best for structured documents)
from langchain.text_splitter import MarkdownHeaderTextSplitter

chunks = MarkdownHeaderTextSplitter(
    headers_to_split_on=[("#", "h1"), ("##", "h2")]
).split_text(result.markdown)

# Option 2: Recursive character splitting (best for long prose)
from langchain.text_splitter import RecursiveCharacterTextSplitter

chunks = RecursiveCharacterTextSplitter(
    chunk_size=1000, chunk_overlap=200
).split_text(result.markdown)

# Option 3: Semantic chunking with an embedding model
from langchain_experimental.text_splitter import SemanticChunker
from langchain_openai import OpenAIEmbeddings

chunker = SemanticChunker(OpenAIEmbeddings())
chunks = chunker.split_text(result.markdown)

# Option 4: Your own logic
chunks = result.markdown.split("\n## ")
```
The parser doesn’t constrain the chunker. This flexibility matters because the best chunking strategy depends on your documents and your queries — it’s not something you can decide before you’ve experimented.
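A cheap way to compare strategies during that experimentation is to look at chunk-size distributions: tiny fragments and giant chunks both embed poorly. A minimal sketch (the sample markdown and split logic are illustrative):

```python
def chunk_stats(chunks):
    """Summarize a chunking run: count plus min/mean/max chunk lengths."""
    lengths = [len(c) for c in chunks]
    return {
        "count": len(lengths),
        "min": min(lengths),
        "mean": sum(lengths) // len(lengths),
        "max": max(lengths),
    }

markdown = "## A\nShort section.\n## B\n" + "Long section. " * 50
by_heading = ["## " + s for s in markdown.split("## ") if s]
print(chunk_stats(by_heading))
```

Run the same summary over each candidate strategy's output and the outliers tell you where to look first.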
Metadata preservation
Good RAG systems store metadata alongside chunks for filtering and hybrid search. What metadata does each parser give you?
LlamaParse attaches metadata via LlamaIndex’s Document model:
```python
documents = parser.load_data("report.pdf")
print(documents[0].metadata)
# {'file_name': 'report.pdf', 'file_type': 'application/pdf', ...}
```
The metadata is LlamaIndex-shaped. Useful inside LlamaIndex, awkward outside it.
pdfToMarkdown returns conversion metadata in the API response:
```json
{
  "output": {
    "markdown": "...",
    "pages": 12
  },
  "request_id": "req_abc123"
}
```
It’s your responsibility to attach document-level metadata (filename, source URL, date) to your chunks. This is more work, but it means you control the metadata schema exactly — which is what you want when you’re building filtered retrieval across diverse document sources.
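Attaching that metadata is only a few lines. A sketch, assuming you track the source filename and ingestion date yourself (the record schema here is illustrative, not an API contract):

```python
def build_records(chunks, source, doc_date):
    """Pair each chunk with a metadata dict whose schema you fully control."""
    return [
        {
            "id": f"{source}#chunk_{i}",
            "text": chunk,
            "metadata": {"source": source, "date": doc_date, "chunk_index": i},
        }
        for i, chunk in enumerate(chunks)
    ]

records = build_records(
    ["# Q3 2024 Financial Highlights\n...", "| Metric | Q3 2024 | ..."],
    source="quarterly_report.pdf",
    doc_date="2024-10-15",
)
print(records[0]["id"])  # quarterly_report.pdf#chunk_0
```

Because the schema is yours, adding a field for filtered retrieval later (department, document type, fiscal quarter) is a one-line change rather than a fight with a framework's metadata model.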
When LlamaParse is the better choice for RAG
- You’re building your entire RAG stack on LlamaIndex and plan to stay there
- You need LlamaParse’s instruction-following mode to extract specific sections before embedding (e.g., “Only extract the Methods section from research papers”)
- You’re using LlamaCloud’s managed vector store and want the tightest integration
- Your pipeline is Python-only and you don’t mind the dependency tree
When pdfToMarkdown is the better choice for RAG
- You want to choose your chunking strategy, embedding model, and vector store independently
- You’re building in a non-Python environment (Node.js, Go, or polyglot)
- You want to prototype a RAG pipeline in minutes without account creation
- You’re evaluating multiple parsers and want the simplest integration to A/B test
- You need the same parser to work across LangChain, LlamaIndex, Haystack, or a custom stack
- Your team doesn’t want to own the LlamaIndex dependency for a single parsing step
Bottom line
For RAG specifically, the parser choice comes down to whether you want an integrated experience within one framework or a composable building block that works with any stack.
LlamaParse gives you a smooth experience if you’ve committed to LlamaIndex. pdfToMarkdown gives you a string that works everywhere.
| | LlamaParse | pdfToMarkdown |
|---|---|---|
| RAG framework | LlamaIndex only (native) | Any framework or no framework |
| Chunking | LlamaIndex node parsers | Any text splitter |
| Embedding workflow | LlamaIndex embedding classes | Direct API calls to any provider |
| Instruction following | Yes | No |
| Setup to first RAG query | 15-30 min (account, SDK, config) | 2 min (curl or pip install) |
| Dependency footprint | Heavy (llama-index-core + tree) | Minimal (requests or zero with curl) |
If you’re building RAG and haven’t committed to a framework yet, start with the tool that doesn’t force you to commit. Try pdfToMarkdown with the demo key right now — no signup, no SDK:
```bash
curl -X POST https://pdftomarkdown.dev/v1/convert \
  -H "Authorization: Bearer demo_public_key" \
  -H "Content-Type: application/json" \
  -d '{"input":{"pdf_url":"https://your-pdf-url.com/document.pdf"}}'
```
Pipe the markdown into your chunker of choice and see how it embeds. Read the docs to go further.