OCR: Turning Scanned Documents into Searchable Text

What OCR is, how it works in PulpPDF, and why adding a searchable text layer to scanned PDFs saves hours of manual work.

What Is OCR?

OCR — Optical Character Recognition — converts images of text into actual, selectable, searchable text. When you scan a document, the result is essentially a picture. OCR reads that picture and adds an invisible text layer on top, so you can:

  • Search for words within the document
  • Copy and paste text from scanned pages
  • Index documents for full-text search in your file system or document management tool

The Problem with Scanned PDFs

A scanned PDF looks like a normal document, but to your computer it's just a stack of images. Try to select text and nothing happens. Press Ctrl+F (Cmd+F on macOS) to search and you get no results. Want to quote a paragraph? You're retyping it by hand.

This is a daily problem for anyone who works with:

  • Scanned contracts and agreements
  • Paper forms digitized by a scanner or phone camera
  • Legacy documents archived as image-only PDFs
  • Receipts and invoices from suppliers

How PulpPDF Does OCR

PulpPDF uses your operating system's built-in text recognition engine:

Platform                   OCR Engine               No extra install needed
macOS 10.15+               Apple Vision Framework   Built into macOS
Windows 10 version 1809+   Windows.Media.Ocr        Built into Windows
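
The table above boils down to a platform and version check. Here is a minimal Python sketch of that check. The name pick_engine and the tuple-based version argument are hypothetical, chosen for illustration, and this is not PulpPDF's actual code. It assumes that Windows 10 version 1809 is identified by build number 17763.

```python
# Minimum OS versions taken from the table above.
MIN_MACOS = (10, 15)
MIN_WINDOWS_BUILD = 17763  # assumption: build number of Windows 10 version 1809


def pick_engine(system, version):
    """Return the native OCR engine name, or None if unsupported.

    system  -- "Darwin" or "Windows", in the style of platform.system()
    version -- tuple of ints, e.g. (10, 15, 7) on macOS
               or (10, 0, 17763) on Windows (major, minor, build)
    """
    if system == "Darwin" and version[:2] >= MIN_MACOS:
        return "Apple Vision Framework"
    if (
        system == "Windows"
        and len(version) >= 3
        and version[0] >= 10
        and version[2] >= MIN_WINDOWS_BUILD
    ):
        return "Windows.Media.Ocr"
    return None
```

In a real program the inputs could come from Python's standard platform module (platform.system(), platform.mac_ver(), platform.version()), parsed into integer tuples.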

The Process

  1. Page analysis — PulpPDF checks each page for existing text. Pages that already have selectable text are skipped.
  2. Rendering — Scanned pages are rendered at 300 DPI for accurate recognition.
  3. Recognition — The native OCR engine identifies characters, words, and their positions.
  4. Text layer — An invisible text layer is added to the page, perfectly aligned with the visible text in the image.
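
Steps 2 and 4 come down to unit conversion. PDF page sizes are measured in points (72 per inch), while an OCR engine reports positions relative to the rendered image. The Python sketch below shows the arithmetic. The function names are hypothetical, and the pixel-based box with a top-left origin is an assumption for illustration (some engines report normalized or bottom-left coordinates instead). It is not PulpPDF's actual code.

```python
POINTS_PER_INCH = 72  # default PDF user-space units per inch
RENDER_DPI = 300      # rendering resolution named in step 2


def render_size_px(width_pt, height_pt, dpi=RENDER_DPI):
    """Pixel dimensions of a page rendered at the given DPI."""
    scale = dpi / POINTS_PER_INCH
    return round(width_pt * scale), round(height_pt * scale)


def pixel_box_to_pdf(box_px, page_height_pt, dpi=RENDER_DPI):
    """Convert an OCR box (x, y, w, h) in pixels, origin top-left,
    to a PDF rectangle (x0, y0, x1, y1) in points, origin bottom-left."""
    x, y, w, h = box_px
    scale = POINTS_PER_INCH / dpi
    x0 = x * scale
    x1 = (x + w) * scale
    y1 = page_height_pt - y * scale        # top edge of the box
    y0 = page_height_pt - (y + h) * scale  # bottom edge of the box
    return x0, y0, x1, y1
```

For example, a US Letter page (612 x 792 points) renders to 2550 x 3300 pixels at 300 DPI, and a word found 150 pixels below the top of that image sits 36 points below the top edge of the PDF page.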

The original scan is preserved. OCR only adds — it doesn't modify the visual content.

Smart Per-Page Detection

PulpPDF doesn't blindly OCR every page. It checks each one individually:

  • A 20-page PDF where 15 pages have text and 5 are scanned? Only those 5 get OCR.
  • A fully text-based PDF? OCR is skipped entirely — no wasted processing.
  • A fully scanned document? Every page gets OCR.
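
The per-page check can be sketched as a simple filter. In this Python example each page is represented only by the text that can already be extracted from it. The helper pages_needing_ocr is hypothetical, and treating any non-whitespace text as "has text" is an assumption about how such a check might work, not a description of PulpPDF's internals.

```python
def pages_needing_ocr(page_texts):
    """Return the 0-based indices of pages with no existing selectable text.

    page_texts -- one string per page: the text extracted from that page
                  before OCR (empty for an image-only scan).
    """
    return [i for i, text in enumerate(page_texts) if not text.strip()]


# The 20-page example from above: 15 text pages followed by 5 scanned pages.
document = ["Existing text"] * 15 + [""] * 5
print(pages_needing_ocr(document))  # [15, 16, 17, 18, 19]
```

An empty result means the document is fully text-based and OCR can be skipped. A result covering every index means the document is fully scanned.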

OCR + Compression

You can combine OCR with any compression preset:

  • None + OCR — Add searchable text without changing the existing page content
  • Balanced + OCR — Compress and make searchable in one pass
  • Ultra + OCR — Rasterize pages for minimum size, then add text layer back

The OCR step runs after compression, so the text layer is always based on the final output.
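
The ordering can be sketched as a small planning function. The three preset names come from the list above. The function plan_steps, the lowercase spellings, and the step labels are hypothetical, chosen only to show that OCR is scheduled last.

```python
# Only the presets named in this section; lowercase spelling is an assumption.
PRESETS = ("none", "balanced", "ultra")


def plan_steps(preset, ocr):
    """Return the ordered processing steps for a preset and an OCR choice."""
    if preset not in PRESETS:
        raise ValueError(f"unknown preset: {preset!r}")
    steps = []
    if preset != "none":
        steps.append(f"compress:{preset}")
    if ocr:
        # OCR goes last so the text layer is built from the final pages.
        steps.append("ocr")
    return steps


print(plan_steps("ultra", ocr=True))  # ['compress:ultra', 'ocr']
```

With the "none" preset and OCR enabled, the plan contains only the OCR step.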

Limitations

  • OCR accuracy depends on scan quality. Blurry or low-resolution scans produce worse results.
  • Handwriting recognition is limited — OCR works best with printed text.
  • Complex layouts (multi-column, tables overlapping images) can confuse the engine.
  • The invisible text layer adds a small amount of file size (typically 1-5% of the compressed file).
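
To put the last point in concrete terms, here is a rough Python estimate based on the typical 1 to 5% range quoted above. The helper name is hypothetical, and real overhead varies with how much text each page holds.

```python
def text_layer_overhead(compressed_bytes, low_pct=1, high_pct=5):
    """Estimate the typical size range (in bytes) added by the text layer."""
    low = compressed_bytes * low_pct // 100
    high = compressed_bytes * high_pct // 100
    return low, high


# A 2 MB compressed file gains roughly 20 KB to 100 KB.
print(text_layer_overhead(2_000_000))  # (20000, 100000)
```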