The Definitive Guide to PDF Structure (For Developers Who Hate PDFs)
You’ve tried pdfplumber. You’ve tried PyMuPDF. You’ve tried piping a PDF through pdftotext and grepping the output. And every time, something is wrong — garbled characters, merged columns, missing text, tables that come out as word soup.
The problem isn’t your code. The problem is PDF itself.
This guide explains what’s actually inside a PDF file, why every text extraction approach fights the format, and what that means for developers who need to get data out of documents.
PDF is a page description language, not a document format
This is the single most important thing to understand. PDF was created by Adobe in 1993 as a way to describe how a page looks when printed. It is a set of instructions for placing ink on paper.
It is not a document format. It doesn’t store “paragraphs.” It doesn’t store “headings.” It doesn’t store “tables.” It stores coordinates and glyphs.
Think of it this way: HTML says “here is a heading, here is a paragraph, here is a table.” PDF says “draw this glyph at position (72, 750), then draw this glyph at position (78, 750), then draw this glyph at position (84, 750).”
The difference is fundamental, and it’s the root cause of every PDF extraction problem you’ve ever encountered.
What’s actually inside a PDF file
A PDF file is a collection of numbered objects linked together by a cross-reference table. Open any PDF in a hex editor and you’ll see something like this at the top:
%PDF-1.7
And something like this near the end:
xref
0 8
0000000000 65535 f
0000000009 00000 n
0000000058 00000 n
0000000115 00000 n
0000000266 00000 n
0000000346 00000 n
0000000450 00000 n
0000000550 00000 n
trailer
<< /Size 8 /Root 1 0 R >>
startxref
625
%%EOF
The cross-reference table (xref) is a lookup table. It tells a PDF reader the byte offset of every object in the file. This is why PDF readers can jump to any page instantly — they don’t parse the file sequentially.
Each object is referenced by number. 1 0 R means “object 1, generation 0.” The /Root object is the starting point of the document tree. From there, you follow references: root -> pages -> page -> content stream.
This structure is designed for efficient random-access rendering. It is not designed for sequential text extraction.
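The lookup mechanics can be sketched in a few lines. This is a deliberately simplified parser, assuming a single classic (uncompressed) xref section — real PDFs may use compressed cross-reference streams or multiple incremental sections:

```python
import re

def parse_xref_offsets(data: bytes) -> list[int]:
    """Return the byte offset of each in-use object in the last xref table.

    Simplified sketch: assumes one classic (uncompressed) xref section.
    Real PDFs may use cross-reference streams or incremental updates.
    """
    # The file ends with: startxref\n<offset>\n%%EOF
    m = re.search(rb"startxref\s+(\d+)\s+%%EOF\s*$", data)
    if not m:
        raise ValueError("no startxref found")
    xref_pos = int(m.group(1))

    # Each entry is a fixed-width line: 10-digit byte offset, 5-digit
    # generation number, and a flag ('n' = in use, 'f' = free).
    offsets = []
    for line in data[xref_pos:].splitlines():
        entry = re.fullmatch(rb"(\d{10}) (\d{5}) ([nf])\s*", line)
        if entry and entry.group(3) == b"n":
            offsets.append(int(entry.group(1)))
        elif line.strip() == b"trailer":
            break
    return offsets
```

A production parser also has to follow `/Prev` pointers in the trailer to earlier xref sections — see the incremental-update discussion later in this guide.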
Content streams: where the “text” lives
The actual visible content of a PDF page lives in content streams. These are sequences of PostScript-like operators that draw things on the page. Here’s what a content stream for a simple line of text looks like:
BT
/F1 12 Tf
72 750 Td
(Hello, World) Tj
ET
Breaking this down:
- BT / ET — begin/end a text block
- /F1 12 Tf — select font F1 at 12pt
- 72 750 Td — move to position (72, 750), in points from the bottom-left corner
- (Hello, World) Tj — draw the string “Hello, World”
That’s the simple case. Real content streams look more like this:
BT
/F1 10 Tf
1 0 0 1 72 750 Tm
[(H) 20 (ello) -15 (, ) 10 (W) 30 (or) -10 (ld)] TJ
ET
The TJ operator takes an array where numbers represent kerning adjustments in thousandths of a text unit. The text “Hello, World” is split into fragments with manual spacing tweaks between them. Your extraction code has to reassemble the text from these fragments, applying the kerning values to figure out where word boundaries are.
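The reassembly logic looks roughly like this. The -180 threshold is an assumed heuristic, not anything the spec defines — different extractors pick different cutoffs, and all of them misfire on some documents:

```python
def assemble_tj(array, space_threshold=-180):
    """Reassemble a TJ operand array into a plain string.

    Numbers are kerning adjustments in thousandths of a text-space unit,
    *subtracted* from the horizontal displacement -- so a large negative
    value pushes the next fragment further right. The threshold here is
    an assumed heuristic: gaps wider than it are treated as word breaks.
    """
    out = []
    for item in array:
        if isinstance(item, str):
            out.append(item)
        elif item < space_threshold:  # big rightward gap: likely a word break
            out.append(" ")
    return "".join(out)
```

On the array above, the small tweaks (20, -15, 10, …) stay below the threshold and the fragments concatenate back into “Hello, World”; a TJ array like `[(Hello) -250 (World)]` would get a space inserted instead.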
And that’s still simple. A two-column layout doesn’t use any “column” concept. It just places text objects on the left side of the page, then places more text objects on the right side. Your code has to figure out from the x/y coordinates that these are two separate columns and reconstruct the reading order.
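A minimal reading-order reconstruction might look like this — assuming exactly two columns split at the page midline, with words as (x, y, text) tuples like those pdfplumber or PyMuPDF produce. The assumption breaks immediately on three-column layouts, full-width headers, and sidebars, which is the point:

```python
def reading_order(words, page_width):
    """Sort extracted words into reading order for a two-column page.

    `words` are (x, y, text) tuples with y measured top-down. Assumes
    exactly two columns split at the page midline -- a fragile heuristic
    that fails on headers, footers, sidebars, and anything else.
    """
    mid = page_width / 2
    left = sorted((w for w in words if w[0] < mid), key=lambda w: (w[1], w[0]))
    right = sorted((w for w in words if w[0] >= mid), key=lambda w: (w[1], w[0]))
    return [w[2] for w in left + right]
```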
Font encoding: why you get garbled characters
Here’s one of the most common PDF extraction failures. You extract text and get something like:
Wkh txlfn eurzq ira mxpsv ryhu wkh od}b grj
Or worse:
☞✎✁ ✂✄☎✆✝ ✞✟✠✡☛ ☞✠☛
This happens because of how PDF handles font encoding. In a PDF, a character code doesn’t necessarily map to the character you’d expect. Every font in a PDF can define its own encoding — a custom mapping from byte values to glyphs.
The font object might look like this:
5 0 obj
<< /Type /Font
/Subtype /Type1
/BaseFont /ABCDEF+CustomFont
/Encoding /WinAnsiEncoding
/ToUnicode 6 0 R
>>
endobj
The critical piece is /ToUnicode. This is a CMap (character map) that translates the font’s internal glyph codes to Unicode code points. If this CMap is missing or incomplete, text extraction produces garbage.
Many PDF generators — especially older ones, or tools that convert from other formats — subset their fonts and use arbitrary glyph IDs. The letter “A” might be stored as glyph ID 43. The letter “B” might be glyph ID 7. Without a correct ToUnicode mapping, no amount of parsing will recover the original text.
This is also why copy-paste from a PDF sometimes gives you nonsense even though the text looks fine on screen. The renderer knows which glyph to draw for each code. But the mapping back to Unicode is broken or absent.
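A ToUnicode CMap is itself a small text program. The core of it is a `bfchar` section mapping glyph codes to Unicode code points, which can be parsed with a sketch like this (simplified: real CMaps also use `bfrange` sections and multi-byte UTF-16 destinations):

```python
import re

def parse_bfchar(cmap_text: str) -> dict[int, str]:
    """Parse the bfchar entries of a ToUnicode CMap.

    Simplified sketch: handles single <src> <dst> pairs only. Real CMaps
    also contain bfrange sections and UTF-16 surrogate pairs.
    """
    mapping = {}
    for block in re.findall(r"beginbfchar(.*?)endbfchar", cmap_text, re.S):
        for src, dst in re.findall(r"<([0-9A-Fa-f]+)>\s*<([0-9A-Fa-f]+)>", block):
            mapping[int(src, 16)] = chr(int(dst, 16))
    return mapping
```

Given a CMap entry like `<002B> <0041>`, glyph code 43 maps to U+0041 (“A”) — exactly the arbitrary-glyph-ID situation described above. Without such an entry, that 43 is unrecoverable.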
No semantic structure: there are no “paragraphs”
HTML has <h1>, <p>, <table>, <li>. PDF has none of these.
When you look at a PDF and see a heading, what’s actually stored is something like:
BT
/F2 18 Tf
72 700 Td
(Introduction) Tj
ET
BT
/F1 11 Tf
72 680 Td
(This paper presents a novel approach to...) Tj
ET
There’s no marker that says “the first line is a heading.” There’s just a font change (F2 at 18pt vs. F1 at 11pt) and a position change. Your code has to infer the semantics:
- Is F2 at 18pt a heading? Probably. But maybe it’s a pull quote, a figure caption, or a decorative element.
- Is the gap between y=700 and y=680 a paragraph break or just line spacing? Depends on the document.
- Are these two text blocks part of the same section? You have to guess based on spatial proximity and font consistency.
Every PDF extraction library is essentially a collection of heuristics for guessing what the document structure is from position and font data. Different libraries make different guesses. None of them are always right.
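To make that concrete, here is the kind of heuristic such libraries apply, reduced to its simplest form. The input shape (text, font-size pairs) and the 1.3× cutoff are assumptions for illustration — real extractors work from richer span data like PyMuPDF’s `page.get_text("dict")` output:

```python
from collections import Counter

def classify_headings(spans):
    """Guess which text spans are headings from font size alone.

    `spans` are (text, font_size) pairs. Assumes the body size is the
    most common size on the page and anything noticeably larger is a
    heading -- exactly the kind of fragile guess described above. A pull
    quote or drop cap will be misclassified every time.
    """
    body_size = Counter(size for _, size in spans).most_common(1)[0][0]
    return [(text, "heading" if size >= body_size * 1.3 else "body")
            for text, size in spans]
```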
Tagged PDF is an attempt to fix this. It adds an optional structure tree with semantic tags like /P for paragraphs, /H1 for headings, and /Table for tables; the PDF/UA accessibility standard builds on it and requires it. In practice, very few PDFs are properly tagged. Most PDF generators ignore tags entirely. You cannot rely on them.
Image-only pages: scanned documents
Everything above assumes the PDF contains actual text objects. Scanned documents don’t.
A scanned PDF is just a sequence of raster images wrapped in a PDF container. The content stream looks like:
q
612 0 0 792 0 0 cm
/Im1 Do
Q
That’s it. q/Q save/restore the graphics state. cm sets a transformation matrix (scaling the image to fill the page). /Im1 Do draws image object Im1. There are no text operators at all.
If you run pdfplumber or PyMuPDF on this page, you’ll get an empty string. The PDF contains zero text — only pixels.
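You can detect this case cheaply before deciding how to process a page, by checking the decompressed content stream for text operators. A minimal sketch (assuming the stream is already decoded — real streams are usually Flate-compressed — and ignoring text inside Form XObjects):

```python
import re

def has_text_operators(stream: bytes) -> bool:
    """Check whether a (decompressed) content stream draws any text.

    Looks for a BT token, which begins a text block. Simplified sketch:
    assumes a decoded stream and ignores text hidden in Form XObjects;
    it can also be fooled by 'BT' inside a literal string.
    """
    return re.search(rb"(?:^|[\s>\]])BT[\s/]", stream) is not None
```

The text stream from earlier in this guide matches; the image-only stream above does not, which signals that OCR is the only option for that page.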
To get text from scanned PDFs, you need OCR. But traditional OCR (Tesseract, etc.) gives you a flat text dump with no structure — no headings, no tables, no formatting. You’re back to guessing the layout from coordinates.
Form fields: AcroForms
PDFs can contain interactive form fields — text inputs, checkboxes, dropdowns, radio buttons. These are defined using AcroForms, a separate subsystem within the PDF spec.
Form data lives in annotation objects, not in the page’s content stream:
10 0 obj
<< /Type /Annot
/Subtype /Widget
/FT /Tx
/T (FirstName)
/V (Jane)
/Rect [100 700 250 720]
>>
endobj
The field name is /T (FirstName) and the value is /V (Jane). But this data is separate from the page’s rendered content. Some extraction tools only read the content stream and miss form data entirely. Others extract both but don’t indicate which text came from a form field vs. static content.
If you’re extracting data from filled-out forms — insurance claims, tax documents, applications — you need to handle AcroForms explicitly. Most general-purpose PDF text extractors don’t.
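For illustration, here is what pulling name/value pairs out of a decoded Widget annotation looks like at the byte level. This is a toy: it assumes literal-string values and already-decompressed object text, and real code should use a proper library instead — pypdf, for example, exposes the same data through `PdfReader.get_fields()`:

```python
import re

def parse_form_fields(obj_text: str) -> dict[str, str]:
    """Pull /T (field name) and /V (value) pairs from annotation text.

    Toy sketch: assumes literal-string values in decoded object text.
    Real extraction needs a full object parser; use a library like
    pypdf (PdfReader.get_fields()) in practice.
    """
    names = re.findall(r"/T \(([^)]*)\)", obj_text)
    values = re.findall(r"/V \(([^)]*)\)", obj_text)
    return dict(zip(names, values))
```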
Digital signatures
PDFs support cryptographic digital signatures. A signed PDF contains a signature dictionary that covers a specific byte range of the file:
<< /Type /Sig
/Filter /Adobe.PPKLite
/SubFilter /adbe.pkcs7.detached
/ByteRange [0 840 960 240]
/Contents <308201...>
>>
The /ByteRange specifies which bytes of the file are covered by the signature. The /Contents holds the actual PKCS#7 signature data.
For extraction purposes, digital signatures are mostly irrelevant — you care about the text, not the cryptographic envelope. But they add complexity to the file structure and can confuse naive parsers. Some signed PDFs also use incremental updates (appending changes to the end of the file rather than rewriting it), which means the cross-reference table has multiple sections. Extraction tools need to handle this or they’ll read stale data.
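Incremental updates are easy to detect: each append-on-save adds another xref section terminated by its own %%EOF marker. Counting those markers gives a rough revision count — rough, because linearized (“fast web view”) files legitimately contain an extra marker too:

```python
def count_revisions(data: bytes) -> int:
    """Approximate how many times a PDF has been incrementally updated.

    Each incremental save appends a new xref section ending in %%EOF,
    so the marker count roughly tracks revisions. Treat it as a signal,
    not a fact: linearized files contain an extra %%EOF by design.
    """
    return data.count(b"%%EOF")
```

A count above one on a signed document is worth investigating: it means content was appended after some earlier revision, possibly after the signature was applied.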
Why PDF/A exists
PDF/A is an ISO standard (ISO 19005) for long-term archival of documents. It’s a constrained subset of PDF that bans features which make reliable processing harder:
- No JavaScript
- No external font references (all fonts must be embedded)
- No encryption
- Requires embedded color profiles
- Requires ToUnicode mappings for all fonts (at the stricter conformance levels, A and U; level B only guarantees visual fidelity)
That last point is the big one for text extraction. A PDF/A-compliant document guarantees that you can map every glyph back to Unicode. This eliminates the garbled-character problem described above.
If you control the PDF generation pipeline and care about downstream extraction quality, generate PDF/A. It won’t solve the layout reconstruction problem, but it will at least give you correct characters.
What this means for developers
Every PDF extraction approach runs into the same wall:
- Text-based extraction (PyMuPDF, pdfplumber, PDFMiner) reads the content stream directly. It gives you characters and coordinates. You write heuristics to reconstruct layout. It works until it doesn’t — a new document format, a different column width, an unusual font encoding, and your pipeline breaks.
- OCR-based extraction (Tesseract, AWS Textract) renders the page and reads it as an image. This handles scanned documents but loses any text-native precision and produces flat, unstructured output.
- Hybrid approaches try text extraction first, then fall back to OCR. But they still produce unstructured text that requires post-processing.
The fundamental issue is that all these methods try to reconstruct semantic structure (headings, paragraphs, tables) from a format that deliberately doesn’t store it. You’re reverse-engineering the author’s intent from positioning data.
Vision-based approaches bypass the problem entirely
Modern vision-language models take a different approach. Instead of parsing the PDF’s internal structure, they render each page as an image and read it the way a human does.
A human looking at a page doesn’t need to know about content streams, font encodings, or cross-reference tables. They see a heading because it’s large and bold. They see a table because it has rows and columns. They see two columns because there’s a visible gap between them.
Vision-based extraction works the same way. It sidesteps every structural problem described in this guide:
- Garbled font encodings? Irrelevant — the model reads rendered glyphs, not byte codes.
- No semantic tags? Doesn’t matter — the model infers structure from visual layout.
- Scanned documents? No special case — everything is an image anyway.
- Form fields in separate annotation objects? They’re visible on the rendered page.
- Complex multi-column layouts? The model sees them the way you do.
This is why PDF parsing tools built on vision models produce dramatically better structured output than traditional text extraction libraries. They’re not fighting the format — they’re ignoring it.
Try it
pdfToMarkdown uses a vision-language model pipeline to convert PDFs into clean, structured markdown. It handles everything described in this guide — garbled fonts, scanned documents, complex layouts, form fields — through a single OCR API endpoint.
curl -X POST https://pdftomarkdown.dev/v1/convert \
-H "Authorization: Bearer demo_public_key" \
-H "Content-Type: application/json" \
-d '{"input":{"pdf_url":"https://example.com/your-document.pdf"}}'
No signup required for the demo key. Read the docs and test it on the PDFs that have been breaking your pipeline.