OCR: Turning Scanned Documents into Searchable Text
What OCR is, how it works in PulpPDF, and why adding a searchable text layer to scanned PDFs saves hours of manual work.
What Is OCR?
OCR — Optical Character Recognition — converts images of text into actual, selectable, searchable text. When you scan a document, the result is essentially a picture. OCR reads that picture and adds an invisible text layer on top, so you can:
- Search for words within the document
- Copy and paste text from scanned pages
- Index documents for full-text search in your file system or document management tool
The Problem with Scanned PDFs
A scanned PDF looks like a normal document, but to your computer it's just a stack of images. Try to select text and nothing happens. Press Ctrl+F (Cmd+F on macOS) to search and you get no results. Want to quote a paragraph? You're retyping it manually.
This is a daily problem for anyone who works with:
- Scanned contracts and agreements
- Paper forms digitized by a scanner or phone camera
- Legacy documents archived as image-only PDFs
- Receipts and invoices from suppliers
How PulpPDF Does OCR
PulpPDF uses your operating system's built-in text recognition engine:
| Platform | OCR Engine | Availability |
|---|---|---|
| macOS 10.15+ | Apple Vision Framework | Built into macOS |
| Windows 10 1809+ | Windows.Media.Ocr | Built into Windows |
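As a rough illustration of the table above, the engine choice can be thought of as a lookup keyed by operating system. This is a minimal sketch, not PulpPDF's actual code: the function name `native_ocr_engine` and the dictionary are hypothetical, and the keys are the values Python's `platform.system()` reports.

```python
import platform
from typing import Optional

# Engine names come from the table above; the lookup itself is illustrative.
NATIVE_OCR_ENGINES = {
    "Darwin": "Apple Vision Framework",  # macOS 10.15+
    "Windows": "Windows.Media.Ocr",      # Windows 10 1809+
}

def native_ocr_engine(system_name: str) -> Optional[str]:
    """Return the built-in OCR engine for a platform, or None if it is not in the table."""
    return NATIVE_OCR_ENGINES.get(system_name)

if __name__ == "__main__":
    # platform.system() returns "Darwin" on macOS and "Windows" on Windows.
    print(native_ocr_engine(platform.system()))
```

Platforms that are not listed in the table simply return `None` in this sketch.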
The Process
- Page analysis: PulpPDF checks each page for existing text. Pages that already have selectable text are skipped.
- Rendering: Scanned pages are rendered at 300 DPI for accurate recognition.
- Recognition: The native OCR engine identifies characters, words, and their positions.
- Text layer: An invisible text layer is added to the page, with each word positioned over its counterpart in the scanned image.
The original scan is preserved. OCR only adds — it doesn't modify the visual content.
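To make the rendering and text-layer steps more concrete, here is a small sketch of the arithmetic involved. It assumes standard PDF units (72 points per inch) and assumes the OCR engine reports word boxes in pixels with the origin at the top-left corner. The `Box` type and both function names are hypothetical and are not taken from PulpPDF.

```python
from dataclasses import dataclass
from typing import Tuple

POINTS_PER_INCH = 72  # default PDF user-space units per inch
RENDER_DPI = 300      # rendering resolution mentioned above

def render_size_px(width_pt: float, height_pt: float, dpi: int = RENDER_DPI) -> Tuple[int, int]:
    """Pixel dimensions of a page rendered at the given DPI."""
    return (round(width_pt * dpi / POINTS_PER_INCH),
            round(height_pt * dpi / POINTS_PER_INCH))

@dataclass
class Box:
    x: float
    y: float
    width: float
    height: float

def pixel_box_to_pdf(box: Box, page_height_pt: float, dpi: int = RENDER_DPI) -> Box:
    """Convert a top-left-origin pixel box into a bottom-left-origin box in PDF points."""
    x = box.x * POINTS_PER_INCH / dpi
    width = box.width * POINTS_PER_INCH / dpi
    height = box.height * POINTS_PER_INCH / dpi
    # PDF's y axis points up, so flip the vertical position.
    y = page_height_pt - (box.y * POINTS_PER_INCH / dpi) - height
    return Box(x, y, width, height)

if __name__ == "__main__":
    # US Letter is 612 x 792 points.
    print(render_size_px(612, 792))
    print(pixel_box_to_pdf(Box(300, 300, 600, 150), page_height_pt=792))
```

A US Letter page becomes a 2550 x 3300 pixel image at 300 DPI, and a word found one inch from the top and left edges of that image maps back to a box one inch from the top and left edges of the PDF page.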
Smart Per-Page Detection
PulpPDF doesn't blindly OCR every page. It checks each one individually:
- A 20-page PDF where 15 pages have text and 5 are scanned? Only those 5 get OCR.
- A fully text-based PDF? OCR is skipped entirely — no wasted processing.
- A fully scanned document? Every page gets OCR.
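The per-page check can be pictured as a simple filter. The sketch below assumes that each page's extractable text is already available as a string and treats a page with no non-whitespace characters as scanned. How PulpPDF actually detects existing text is not described here, so `pages_needing_ocr` and its `min_chars` threshold are hypothetical.

```python
from typing import List

def pages_needing_ocr(page_texts: List[str], min_chars: int = 1) -> List[int]:
    """Return zero-based indices of pages that have no extractable text."""
    return [
        index
        for index, text in enumerate(page_texts)
        if len(text.strip()) < min_chars
    ]

if __name__ == "__main__":
    # 20 pages: the first 15 have text, the last 5 are scans.
    mixed = ["Some selectable text"] * 15 + [""] * 5
    print(pages_needing_ocr(mixed))
```

For the mixed 20-page example, only the last five page indices come back. A fully text-based document yields an empty list, so no OCR work is scheduled.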
OCR + Compression
You can combine OCR with any compression preset:
- None + OCR: Add searchable text without recompressing or altering the scanned pages
- Balanced + OCR: Compress and make searchable in one pass
- Ultra + OCR: Rasterize pages for minimum size, then add the text layer back
The OCR step runs after compression, so the text layer is always based on the final output.
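The ordering can be sketched as a tiny planning function. This only illustrates the "compress first, OCR last" rule stated above. The function `plan_steps` and the step labels are hypothetical, and the preset names are the ones listed in this section.

```python
from typing import List

def plan_steps(preset: str, ocr: bool) -> List[str]:
    """Return the processing steps in order: compression first, OCR last."""
    steps: List[str] = []
    if preset.lower() != "none":
        steps.append(f"compress ({preset.lower()})")
    if ocr:
        # OCR runs last so the text layer matches the final page images.
        steps.append("ocr")
    return steps

if __name__ == "__main__":
    for preset in ("None", "Balanced", "Ultra"):
        print(preset, "->", plan_steps(preset, ocr=True))
```

With the None preset, OCR is the only step. With any other preset, compression always comes before it.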
Limitations
- OCR accuracy depends on scan quality. Blurry or low-resolution scans produce worse results.
- Handwriting recognition is limited — OCR works best with printed text.
- Complex layouts (multi-column, tables overlapping images) can confuse the engine.
- The invisible text layer slightly increases file size (typically by 1-5% of the compressed file's size).
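To get a feel for the last point, the typical 1-5% range can be turned into a quick estimate. This is back-of-the-envelope arithmetic based only on the figure quoted above, and the helper name `text_layer_overhead_bytes` is made up for this example.

```python
from typing import Tuple

def text_layer_overhead_bytes(compressed_size: int) -> Tuple[int, int]:
    """Estimate the typical text-layer overhead range (1% to 5%) in bytes."""
    low = compressed_size * 1 // 100
    high = compressed_size * 5 // 100
    return low, high

if __name__ == "__main__":
    low, high = text_layer_overhead_bytes(2_000_000)
    print(f"Expect roughly {low:,} to {high:,} extra bytes")
```

For a 2 MB compressed file, that works out to roughly 20 KB to 100 KB of added text layer.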
