OCR: Turning Scanned Documents into Searchable Text

What Is OCR?

OCR — Optical Character Recognition — converts images of text into actual, selectable, searchable text. When you scan a document, the result is essentially a picture. OCR reads that picture and adds an invisible text layer on top, so you can:

Search for words within the document
Copy and paste text from scanned pages
Index documents for full-text search in your file system or document management tool

The Problem with Scanned PDFs

A scanned PDF looks like a normal document, but to your computer it's just a stack of images. Try to select text and nothing happens. Press Ctrl+F (Cmd+F on macOS) to search and you get no results. Want to quote a paragraph? You're retyping it manually.

This is a daily problem for anyone who works with:

Scanned contracts and agreements
Paper forms digitized by a scanner or phone camera
Legacy documents archived as image-only PDFs
Receipts and invoices from suppliers

How PulpPDF Does OCR

PulpPDF uses your operating system's built-in text recognition engine:

Platform	OCR Engine	Availability (no extra install needed)
macOS 10.15+	Apple Vision Framework	Built into macOS
Windows 10 1809+	Windows.Media.Ocr	Built into Windows
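
The table can be read as a simple lookup. The sketch below is only an illustration of that mapping, not PulpPDF's actual detection code: the function name pick_ocr_engine and the version tuples are assumptions made for the example. The Windows build number 17763 is the build that corresponds to Windows 10 version 1809.

```python
from typing import Optional, Tuple

# Minimum versions taken from the table above.
MIN_MACOS = (10, 15)               # macOS 10.15
MIN_WINDOWS = (10, 0, 17763)       # Windows 10 version 1809 is build 17763


def pick_ocr_engine(system: str, version: Tuple[int, ...]) -> Optional[str]:
    """Return the native OCR engine for a platform, or None if unsupported.

    `system` follows the values of platform.system(): "Darwin" for macOS
    and "Windows" for Windows. `version` is the OS version as a tuple of
    integers. This helper is hypothetical and exists only for illustration.
    """
    if system == "Darwin" and version >= MIN_MACOS:
        return "Apple Vision Framework"
    if system == "Windows" and version >= MIN_WINDOWS:
        return "Windows.Media.Ocr"
    return None


if __name__ == "__main__":
    print(pick_ocr_engine("Darwin", (12, 6)))
    print(pick_ocr_engine("Windows", (10, 0, 17134)))
```

Versions are compared as tuples so that, for example, macOS 11 and later sort above 10.15 without any string parsing.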

The Process

Page analysis: PulpPDF checks each page for existing text. Pages that already have selectable text are skipped.
Rendering: Scanned pages are rendered at 300 DPI for accurate recognition.
Recognition: The native OCR engine identifies characters, words, and their positions.
Text layer: An invisible text layer is added to the page, with each word placed at the position the engine reported so it lines up with the visible text in the image.

The original scan is preserved. OCR only adds — it doesn't modify the visual content.
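
To make the rendering and text-layer steps concrete, here is a minimal sketch of the arithmetic involved. It is not PulpPDF's implementation. It assumes the OCR engine reports word boxes in pixels with the origin at the top-left of the rendered image (engines differ on this), and it relies on the fact that PDF page coordinates are measured in points (1/72 inch) from the bottom-left corner. The names Box, render_size_px, and pixel_box_to_pdf are hypothetical.

```python
from dataclasses import dataclass
from typing import Tuple

POINTS_PER_INCH = 72   # PDF unit: 1 point = 1/72 inch
RENDER_DPI = 300       # rendering resolution stated in the process above


def render_size_px(width_pt: float, height_pt: float,
                   dpi: int = RENDER_DPI) -> Tuple[int, int]:
    """Pixel size of a page rendered at `dpi`, given its size in points."""
    return (round(width_pt * dpi / POINTS_PER_INCH),
            round(height_pt * dpi / POINTS_PER_INCH))


@dataclass
class Box:
    x: float
    y: float
    width: float
    height: float


def pixel_box_to_pdf(box: Box, page_height_pt: float,
                     dpi: int = RENDER_DPI) -> Box:
    """Map a pixel box (top-left origin) to PDF points (bottom-left origin).

    Assumption: the recognizer returned `box` in image pixels. The vertical
    axis is flipped because PDF y grows upward from the bottom of the page.
    """
    x = box.x * POINTS_PER_INCH / dpi
    width = box.width * POINTS_PER_INCH / dpi
    height = box.height * POINTS_PER_INCH / dpi
    bottom_edge_pt = (box.y + box.height) * POINTS_PER_INCH / dpi
    return Box(x, page_height_pt - bottom_edge_pt, width, height)


if __name__ == "__main__":
    # US Letter is 612 x 792 points.
    print(render_size_px(612, 792))
    print(pixel_box_to_pdf(Box(300, 300, 600, 150), page_height_pt=792))
```

A US Letter page (612 x 792 points) becomes a 2550 x 3300 pixel image at 300 DPI, and a word found 300 pixels from the top-left corner lands 72 points (one inch) in from the left edge of the page.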

Smart Per-Page Detection

PulpPDF doesn't blindly OCR every page. It checks each one individually:

A 20-page PDF where 15 pages have text and 5 are scanned? Only those 5 get OCR.
A fully text-based PDF? OCR is skipped entirely — no wasted processing.
A fully scanned document? Every page gets OCR.
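
The three cases above come down to one filter over per-page flags. The sketch below assumes some earlier step has already decided whether each page contains selectable text. How PulpPDF makes that decision is not described here, and pages_needing_ocr is a hypothetical name.

```python
from typing import List


def pages_needing_ocr(has_text: List[bool]) -> List[int]:
    """Return 1-based page numbers of pages with no selectable text.

    `has_text[i]` is True when page i + 1 already has a text layer.
    """
    return [number for number, text in enumerate(has_text, start=1) if not text]


if __name__ == "__main__":
    # A 20-page PDF where pages 3, 7, 8, 12 and 20 are scans.
    scanned = {3, 7, 8, 12, 20}
    flags = [page not in scanned for page in range(1, 21)]
    print(pages_needing_ocr(flags))
```

An empty result means OCR is skipped entirely, and a result listing every page corresponds to the fully scanned case.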

OCR + Compression

You can combine OCR with any compression preset:

None + OCR: Add searchable text without compressing the file or altering its visible content
Balanced + OCR: Compress and make searchable in one pass
Ultra + OCR: Rasterize pages for minimum size, then add the text layer back

The OCR step runs after compression, so the text layer is always based on the final output.
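
The ordering rule can be sketched as a small planning function. This is an illustration only: the Preset enum, the step labels, and plan_steps are assumptions for the example, and only the three presets named above are modeled.

```python
from enum import Enum
from typing import List


class Preset(Enum):
    NONE = "none"
    BALANCED = "balanced"
    ULTRA = "ultra"


def plan_steps(preset: Preset, ocr: bool) -> List[str]:
    """Return processing steps in order: compression first, OCR last."""
    steps: List[str] = []
    if preset is not Preset.NONE:
        steps.append(f"compress:{preset.value}")
    if ocr:
        # OCR always comes after compression, so the text layer is
        # built from the pages as they appear in the final output.
        steps.append("ocr")
    return steps


if __name__ == "__main__":
    for preset in Preset:
        print(preset.name, plan_steps(preset, ocr=True))
```

Running OCR last matters most for a preset that rasterizes pages, because any text present before rasterization no longer exists as text afterwards.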

Limitations

OCR accuracy depends on scan quality. Blurry or low-resolution scans produce worse results.
Handwriting recognition is limited — OCR works best with printed text.
Complex layouts (multi-column, tables overlapping images) can confuse the engine.
The invisible text layer adds a small amount of file size (typically 1-5% of the compressed file).