How to Build a RAG Pipeline with PDF Documents
Most RAG tutorials start with clean text files. Real projects start with PDFs — financial reports, research papers, legal contracts, technical manuals. The gap between “RAG tutorial” and “RAG that works on actual documents” is PDF parsing.
This tutorial builds a complete RAG pipeline: PDF in, accurate answers out. No toy examples. Every code block runs.
The pipeline
PDF → pdftomarkdown (convert) → Markdown → chunk by headings → embed with OpenAI → store in ChromaDB → retrieve → LLM answer
Each step matters. Bad parsing at the start means bad retrieval at the end. We’ll go through each one.
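The pipeline above is just function composition, which is why each stage's output quality caps the quality of everything downstream. Here is a minimal sketch with stubbed stage bodies; the function names and return values are illustrative placeholders, not a real API, and each stage gets its real implementation in the steps below.

```python
def parse_pdf(source: str) -> str:
    """Stage 1: PDF in, markdown out (stubbed)."""
    return "# Title\n\nBody text."

def chunk_markdown(markdown: str) -> list[str]:
    """Stage 2: split markdown into section-sized chunks (stubbed)."""
    return [section for section in markdown.split("\n\n") if section]

def retrieve(chunks: list[str], query: str) -> list[str]:
    """Stages 3-4: embed, store, and fetch relevant chunks (stubbed)."""
    return chunks[:2]

def answer(context: list[str], query: str) -> str:
    """Stage 5: LLM answer grounded in retrieved context (stubbed)."""
    return f"Answer to {query!r} based on {len(context)} chunks."

# Garbage out of parse_pdf becomes garbage chunks, garbage retrieval,
# and finally a garbage answer: there is no later stage that repairs it.
chunks = chunk_markdown(parse_pdf("report.pdf"))
print(answer(retrieve(chunks, "key findings?"), "key findings?"))
```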
Prerequisites
Install the dependencies:
pip install pdftomarkdown langchain langchain-openai langchain-chroma chromadb openai
Set your OpenAI API key:
export OPENAI_API_KEY="sk-..."
You don’t need a pdftomarkdown account to follow along — the demo key works for single-page documents.
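It also helps to fail fast when the OpenAI key is missing; otherwise the first error surfaces deep inside the embedding call, where it is harder to diagnose. This is a small hypothetical helper, not part of any SDK:

```python
import os

def require_env(name: str) -> str:
    """Return the named environment variable, failing loudly if unset."""
    value = os.environ.get(name)
    if not value:
        raise RuntimeError(f"Set {name} before running the pipeline.")
    return value
```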
Step 1: Convert PDF to markdown
The first step is the most important. If your PDF extraction loses table structure, mangles headings, or flattens the document into a wall of text, your chunks will be incoherent and your retrieval will fail.
pdftomarkdown returns clean markdown with preserved heading hierarchy, tables, and lists — exactly what you need for structured chunking.
from pdftomarkdown import convert

# Convert a PDF to markdown via the API
result = convert(
    "https://www.w3.org/WAI/ER/tests/xhtml/testfiles/resources/pdf/dummy.pdf",
    api_key="demo_public_key",
)
markdown = result.markdown
print(markdown[:500])
If you prefer raw HTTP:
import requests

response = requests.post(
    "https://pdftomarkdown.dev/v1/convert",
    headers={
        "Authorization": "Bearer demo_public_key",
        "Content-Type": "application/json",
    },
    json={
        "input": {
            "pdf_url": "https://www.w3.org/WAI/ER/tests/xhtml/testfiles/resources/pdf/dummy.pdf"
        }
    },
)
data = response.json()
markdown = data["output"]["markdown"]
Both approaches return the same markdown. The SDK is more convenient; the HTTP call is useful for debugging or non-Python environments.
Step 2: Chunk by markdown headings
This is where markdown parsing pays off. Instead of naive fixed-size chunking (which splits mid-sentence and mid-table), we split on heading boundaries. Each chunk corresponds to a logical section of the document.
from langchain_text_splitters import MarkdownHeaderTextSplitter

headers_to_split_on = [
    ("#", "h1"),
    ("##", "h2"),
    ("###", "h3"),
]
splitter = MarkdownHeaderTextSplitter(headers_to_split_on=headers_to_split_on)
chunks = splitter.split_text(markdown)
for chunk in chunks:
    print(f"Section: {chunk.metadata}")
    print(f"Content: {chunk.page_content[:100]}...")
    print("---")
Each chunk carries metadata about which section it came from. This metadata flows through to retrieval, so when the LLM answers a question, you know where in the document the answer came from.
Handling long sections
Some document sections are longer than you want a single chunk to be: an overly long chunk dilutes its embedding, and in extreme cases can exceed the embedding model's context window. Add a recursive character splitter as a second pass:
from langchain_text_splitters import RecursiveCharacterTextSplitter
from langchain_core.documents import Document

# First pass: split by headings
header_chunks = splitter.split_text(markdown)

# Second pass: split long sections into smaller pieces
text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=1000,
    chunk_overlap=200,
)

final_chunks = []
for chunk in header_chunks:
    if len(chunk.page_content) > 1000:
        for sc in text_splitter.split_text(chunk.page_content):
            final_chunks.append(Document(page_content=sc, metadata=chunk.metadata))
    else:
        final_chunks.append(chunk)

print(f"Total chunks: {len(final_chunks)}")
The heading metadata is preserved on every sub-chunk. This is important — it means retrieval results carry their section context regardless of how small the chunk is.
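To make that concrete, here is a minimal stand-in for the two-pass split, using plain dicts instead of LangChain Document objects and a naive fixed-size splitter for the second pass. The names here are illustrative, not LangChain's API:

```python
def split_long(section: dict, size: int = 20) -> list[dict]:
    """Split an oversized section, copying its metadata onto every piece."""
    text = section["page_content"]
    if len(text) <= size:
        return [section]
    # Every sub-chunk gets a copy of the parent section's heading metadata.
    return [
        {"page_content": text[i:i + size], "metadata": dict(section["metadata"])}
        for i in range(0, len(text), size)
    ]

section = {
    "page_content": "A long section that will be split into pieces.",
    "metadata": {"h1": "Report", "h2": "Findings"},
}
sub_chunks = split_long(section)
print(f"{len(sub_chunks)} sub-chunks, all tagged {sub_chunks[0]['metadata']}")
```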
Step 3: Embed and store in ChromaDB
ChromaDB is an open-source vector database that runs locally. No infrastructure required.
from langchain_openai import OpenAIEmbeddings
from langchain_chroma import Chroma

embeddings = OpenAIEmbeddings(model="text-embedding-3-small")
vectorstore = Chroma.from_documents(
    documents=final_chunks,
    embedding=embeddings,
    collection_name="pdf_rag",
    persist_directory="./chroma_db",
)
print(f"Stored {len(final_chunks)} chunks in ChromaDB")
text-embedding-3-small is the cheapest OpenAI embedding model and works well for most document types. Use text-embedding-3-large if you need higher retrieval accuracy on complex technical content.
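Embedding cost is easy to estimate before you commit to a model. This back-of-envelope sketch uses the rough heuristic of about 4 characters per token; the $0.02 per million tokens figure is text-embedding-3-small's rate as of this writing, so check current pricing:

```python
def estimate_embedding_cost(chunks: list[str], usd_per_million_tokens: float = 0.02) -> float:
    """Approximate embedding cost in USD, assuming ~4 characters per token."""
    total_tokens = sum(len(text) // 4 for text in chunks)
    return total_tokens / 1_000_000 * usd_per_million_tokens

sample = ["x" * 1000] * 500  # 500 chunks of roughly 250 tokens each
print(f"~${estimate_embedding_cost(sample):.4f}")  # ~$0.0025
```

Even a few hundred documents usually cost well under a dollar to embed, so re-embedding after a chunking change is cheap.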
Step 4: Retrieve relevant chunks
Build a retriever that returns the top-k most similar chunks for a given query:
retriever = vectorstore.as_retriever(
    search_type="similarity",
    search_kwargs={"k": 5},
)

# Test retrieval
query = "What are the key findings?"
relevant_docs = retriever.invoke(query)

for i, doc in enumerate(relevant_docs):
    print(f"\n--- Chunk {i+1} (Section: {doc.metadata}) ---")
    print(doc.page_content[:200])
If your retrieval results look off, the problem is almost always in step 1 (parsing) or step 2 (chunking) — not in the embedding model. Fix upstream before tuning retrieval parameters.
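A quick way to check upstream quality is the chunk length distribution: very short chunks usually mean mangled parsing, and very long ones mean the heading split never fired. This sketch uses plain (metadata, text) pairs as stand-ins for Document objects, and the 50-character threshold is an arbitrary starting point:

```python
def chunk_report(chunks: list[tuple[dict, str]]) -> dict:
    """Summarize chunk lengths to spot parsing and chunking problems."""
    lengths = sorted(len(text) for _, text in chunks)
    return {
        "count": len(lengths),
        "min": lengths[0],
        "median": lengths[len(lengths) // 2],
        "max": lengths[-1],
        # Tiny chunks are usually parsing debris, not real sections.
        "suspect_short": sum(1 for n in lengths if n < 50),
    }

chunks = [({"h2": "A"}, "x" * 30), ({"h2": "B"}, "y" * 400), ({"h2": "C"}, "z" * 900)]
print(chunk_report(chunks))
```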
Step 5: Generate answers with an LLM
Wire the retriever to an LLM with LangChain’s chain interface:
from langchain_openai import ChatOpenAI
from langchain.chains import RetrievalQA
from langchain_core.prompts import PromptTemplate

llm = ChatOpenAI(model="gpt-4o", temperature=0)

prompt_template = PromptTemplate(
    input_variables=["context", "question"],
    template="""Use the following context from a PDF document to answer the question.
If the answer is not in the context, say "I don't have enough information to answer this."

Context:
{context}

Question: {question}

Answer:""",
)

qa_chain = RetrievalQA.from_chain_type(
    llm=llm,
    chain_type="stuff",
    retriever=retriever,
    chain_type_kwargs={"prompt": prompt_template},
    return_source_documents=True,
)

response = qa_chain.invoke({"query": "What are the main points of this document?"})
print("Answer:", response["result"])
print("\nSources:")
for doc in response["source_documents"]:
    print(f"  - Section: {doc.metadata}")
Full pipeline: copy-paste and run
Here’s the complete pipeline in a single script:
"""
RAG pipeline: PDF → pdftomarkdown → LangChain → ChromaDB → GPT-4o
"""
from pdftomarkdown import convert
from langchain.text_splitter import (
MarkdownHeaderTextSplitter,
RecursiveCharacterTextSplitter,
)
from langchain.schema import Document
from langchain_openai import OpenAIEmbeddings, ChatOpenAI
from langchain_chroma import Chroma
from langchain.chains import RetrievalQA
from langchain.prompts import PromptTemplate
def build_rag_pipeline(pdf_source: str, api_key: str = "demo_public_key"):
"""Build a RAG pipeline from a PDF source (URL or file path)."""
# 1. Convert PDF to markdown
print("Converting PDF to markdown...")
result = convert(pdf_source, api_key=api_key)
markdown = result.markdown
print(f"Got {len(markdown)} characters of markdown")
# 2. Chunk by headings
header_splitter = MarkdownHeaderTextSplitter(
headers_to_split_on=[
("#", "h1"),
("##", "h2"),
("###", "h3"),
]
)
header_chunks = header_splitter.split_text(markdown)
# Sub-split long sections
text_splitter = RecursiveCharacterTextSplitter(
chunk_size=1000,
chunk_overlap=200,
)
chunks = []
for chunk in header_chunks:
if len(chunk.page_content) > 1000:
sub_chunks = text_splitter.split_text(chunk.page_content)
chunks.extend([
Document(page_content=sc, metadata=chunk.metadata)
for sc in sub_chunks
])
else:
chunks.append(Document(
page_content=chunk.page_content,
metadata=chunk.metadata,
))
print(f"Created {len(chunks)} chunks")
# 3. Embed and store
print("Embedding and storing in ChromaDB...")
embeddings = OpenAIEmbeddings(model="text-embedding-3-small")
vectorstore = Chroma.from_documents(
documents=chunks,
embedding=embeddings,
collection_name="pdf_rag",
)
# 4. Build QA chain
retriever = vectorstore.as_retriever(
search_type="similarity",
search_kwargs={"k": 5},
)
llm = ChatOpenAI(model="gpt-4o", temperature=0)
prompt = PromptTemplate(
input_variables=["context", "question"],
template="""Use the following context from a PDF document to answer the question.
If the answer is not in the context, say "I don't have enough information to answer this."
Context:
{context}
Question: {question}
Answer:"""
)
qa_chain = RetrievalQA.from_chain_type(
llm=llm,
chain_type="stuff",
retriever=retriever,
chain_type_kwargs={"prompt": prompt},
return_source_documents=True,
)
return qa_chain
def ask(chain, question: str) -> str:
"""Ask a question against the RAG pipeline."""
response = chain.invoke({"query": question})
print(f"\nQ: {question}")
print(f"A: {response['result']}")
print(f"Sources: {[doc.metadata for doc in response['source_documents']]}")
return response["result"]
if __name__ == "__main__":
pdf_url = "https://www.w3.org/WAI/ER/tests/xhtml/testfiles/resources/pdf/dummy.pdf"
chain = build_rag_pipeline(pdf_url)
ask(chain, "What is this document about?")
ask(chain, "What are the key points?")
Processing multiple PDFs
Real projects ingest more than one document. Extend the pipeline to handle a batch:
pdf_urls = [
    "https://example.com/report-q1.pdf",
    "https://example.com/report-q2.pdf",
    "https://example.com/report-q3.pdf",
]

all_chunks = []
for url in pdf_urls:
    result = convert(url, api_key="your-api-key")
    header_chunks = header_splitter.split_text(result.markdown)
    for chunk in header_chunks:
        # Add source URL to metadata for provenance tracking
        chunk.metadata["source"] = url
        all_chunks.append(chunk)

vectorstore = Chroma.from_documents(
    documents=all_chunks,
    embedding=embeddings,
    collection_name="multi_pdf_rag",
)
Adding the source URL to metadata means your answers can cite which document they came from — critical for any production RAG system.
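As a sketch of what that citation step can look like, here is a small helper that groups retrieved chunks by their source metadata and lists the supporting sections. Plain dicts stand in for LangChain Document objects, and the function name is hypothetical:

```python
def format_citations(docs: list[dict]) -> list[str]:
    """Group retrieved chunks by source URL and list their h2 sections."""
    by_source: dict[str, set] = {}
    for doc in docs:
        meta = doc["metadata"]
        sections = by_source.setdefault(meta["source"], set())
        if "h2" in meta:
            sections.add(meta["h2"])
    return [
        f"{src} ({', '.join(sorted(secs)) or 'n/a'})"
        for src, secs in sorted(by_source.items())
    ]

docs = [
    {"metadata": {"source": "report-q1.pdf", "h2": "Revenue"}},
    {"metadata": {"source": "report-q1.pdf", "h2": "Outlook"}},
    {"metadata": {"source": "report-q2.pdf"}},
]
print(format_citations(docs))
```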
Why parsing quality matters for RAG
The quality of your RAG pipeline is bounded by the quality of your PDF parsing. Here’s what happens when parsing goes wrong:
- Mangled tables become incoherent text chunks that embed poorly and retrieve incorrectly.
- Lost headings mean you can’t chunk by section, so you fall back to naive fixed-size splitting that cuts through sentences and paragraphs.
- Merged columns in multi-column PDFs produce garbled text that confuses both the embedding model and the LLM.
pdftomarkdown uses vision-language models to understand document layout the way a human does — reading columns in order, preserving table structure, maintaining heading hierarchy. The markdown it produces is ready for structured chunking without post-processing.
See the RAG integration guide for more on optimizing PDF-to-RAG workflows.
Next steps
- Read the API documentation for authentication, file upload, and batch processing options.
- Explore the RAG guide for advanced chunking strategies and retrieval tuning.
- Replace demo_public_key with a free API key — sign in with GitHub to get 100 pages/month, no credit card required.
The demo key works for testing. For production pipelines, get your API key and start building.