Make searchable PDFs

From Parallel Library Services

There is a range of tools you can use to make PDFs searchable (in effect, turning an image-only PDF into one with an OCR text layer that can be searched with a computer). This recipe covers tesseract itself and pdfsandwich.

Most of these tools make use of Tesseract, a powerful OCR engine.

Tesseract syntax

The simplest invocation of tesseract to OCR an image:

tesseract imagename outputbase

This uses English as the default language and plain text as the default output format, so the result is written to outputbase.txt.

You may notice that tesseract has a particular way of being run:

           ----main options---- --------------------additional options----------------------------
          |                    |                                                                  |
tesseract imagename outputbase [-l lang] [--oem ocrenginemode] [--psm pagesegmode] [configfiles...]

In particular, rather than an output filename, you give an output "base" name (the first part of the output file name, to which tesseract adds the extension itself), and then separately a "configuration" (such as pdf) which defines what kind(s) of output you want to produce.

For this recipe, we will focus on the main options. Excellent further documentation of the additional options can be found in the Tesseract project's own documentation.

For more information about the various command line options use tesseract --help or man tesseract.
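
As an illustration of the -l option, here is a minimal sketch that checks whether a language is installed before asking tesseract to use it. The helper names lang_available and ocr_with_lang are made up for this example; --list-langs and -l are regular tesseract options.

```shell
# Sketch: check that a language is installed before passing it to -l.
# lang_available and ocr_with_lang are hypothetical helper names.

# Succeeds if tesseract lists the given language code (e.g. eng, nld, fra).
lang_available() {
    # Some releases print the list on stderr, so merge both streams.
    tesseract --list-langs 2>&1 | grep -qx -- "$1"
}

# OCR one image to plain text in the given language.
# Usage: ocr_with_lang imagename outputbase lang
ocr_with_lang() {
    if lang_available "$3"
    then
        tesseract "$1" "$2" -l "$3"
    else
        echo "language '$3' is not installed" >&2
        return 1
    fi
}
```

For example, ocr_with_lang 01_01.jpg 01_01 nld would write 01_01.txt using the Dutch language data, provided it is installed.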

Use tesseract to convert a single JPEG into a PDF with searchable text

To convert a single JPEG called 01_01.jpg into a PDF named 01_01.pdf, you would use the command:

tesseract 01_01.jpg 01_01 pdf

NB: There is a SPACE between the basename (01_01) and the kind of output (pdf).

If you type:

tesseract 01_01.jpg 01_01.pdf

then tesseract takes 01_01.pdf as the output "basename" and defaults to text output, producing a text file named:

01_01.pdf.txt
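
One way to avoid this mistake is to derive the basename from the image name instead of typing it. The following is only a sketch; output_base and image_to_pdf are hypothetical helper names, not part of tesseract.

```shell
# Sketch: derive the output base from the image name so that the
# ".pdf.txt" mistake cannot happen.
# output_base and image_to_pdf are hypothetical helper names.

# Strip the last extension: scans/01_01.jpg -> scans/01_01
output_base() {
    case "${1##*/}" in
        *.*) printf '%s\n' "${1%.*}" ;;   # file name has an extension
        *)   printf '%s\n' "$1" ;;        # nothing to strip
    esac
}

# Write <base>.pdf next to the image.
image_to_pdf() {
    tesseract "$1" "$(output_base "$1")" pdf
}
```

With these defined, image_to_pdf 01_01.jpg runs tesseract 01_01.jpg 01_01 pdf.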

Joining PDFs (and preserving searchability)

Unfortunately, using ImageMagick to join different PDF files together removes any non-image material (such as tesseract's OCR output). Luckily, pdfunite (part of the Poppler package) can join searchable PDFs and preserves this information!

To try pdfunite on just two particular files:

pdfunite 01_01.pdf 01_02.pdf output.pdf

Then finally, you can use a wildcard to join all the single page PDFs together:

pdfunite 01_*.pdf output.pdf
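
To confirm that the joined file really kept its text layer, you can try extracting text from it. This is a sketch that assumes pdftotext is installed (it ships with Poppler, like pdfunite); has_text_layer is a hypothetical helper name.

```shell
# Sketch: confirm that a PDF still carries extractable text after joining.
# Assumes pdftotext (part of Poppler) is installed.
# has_text_layer is a hypothetical helper name.
has_text_layer() {
    # "-" sends the extracted text to stdout; succeed only if at least
    # one non-blank character comes out.
    pdftotext "$1" - 2>/dev/null | grep -q '[^[:space:]]'
}
```

For example, has_text_layer output.pdf && echo "searchable" prints searchable only when some text could be extracted.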

OCR script

Tesseract can only process one image at a time. This is not a bad thing. Following the command-line aesthetic of "doing one thing well", it does its thing. It's then up to the intrepid shell scripter (that's you) to put tesseract commands into a loop to process a whole bunch of input files.

So, to combine the different steps above in a loop and finally join the different PDFs into a single, searchable one using pdfunite:

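mkdir -p icons
for i in *.jpg
do
	echo "ocring $i..."
	# make a small thumbnail of the page
	convert "$i" -resize 200x200 "icons/$i"
	# OCR the page; writes e.g. 01_01.pdf for 01_01.jpg
	tesseract "$i" "$(basename -s .jpg "$i")" pdf
done
# join the single-page PDFs into one searchable PDF
pdfunite 01_*.pdf 01.pdf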
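
The loop above OCRs every JPEG but only joins the pages whose names start with 01_. If a folder holds several scanned documents, something like the following sketch could join each of them in turn. It assumes page PDFs are named <document>_<page>.pdf with zero-padded page numbers (01_01.pdf, 01_02.pdf, 02_01.pdf ...) and no spaces in the names; document_prefixes and join_all are hypothetical helper names.

```shell
# Sketch: join the pages of every scanned document in the current folder.
# Assumes names like 01_01.pdf, 01_02.pdf, 02_01.pdf (zero-padded, no spaces).
# document_prefixes and join_all are hypothetical helper names.

# Print each distinct document prefix (the part before the first "_").
document_prefixes() {
    for f in *_*.pdf
    do
        [ -e "$f" ] || continue      # the glob matched nothing
        printf '%s\n' "${f%%_*}"
    done | sort -u
}

# Join 01_*.pdf into 01.pdf, 02_*.pdf into 02.pdf, and so on.
join_all() {
    for p in $(document_prefixes)
    do
        pdfunite "$p"_*.pdf "$p.pdf"
    done
}
```

The shell expands the wildcard in alphabetical order, which is why the page numbers need to be zero-padded for the pages to end up in the right sequence.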

pdfsandwich

pdfsandwich is a command that just calls a number of other commands to turn an input PDF (without text information) into a PDF with text (aka searchable). It makes use of:

  • ImageMagick convert: to extract images from a source PDF
  • unpaper: to "fix" / clean up a scanned image so that OCR works better
  • tesseract: to do OCR and produce a single page PDF
  • pdfunite: to "rebind" the single page PDFs back into a multi-page PDF

When you run the command in "verbose" mode, the script outputs a "trace" of the commands it's using (a bit like using the -x option on the bash command):

pdfsandwich -verbose zine.pdf
pdfsandwich version 0.1.7
Checking for convert:
convert -version
Version: ImageMagick 6.9.10-23 Q16 x86_64 20190101 https://imagemagick.org
Copyright: © 1999-2019 ImageMagick Studio LLC
License: https://imagemagick.org/script/license.php
Features: Cipher DPC Modules OpenMP 
Delegates (built-in): bzlib djvu fftw fontconfig freetype heic jbig jng jp2 jpeg lcms lqr ltdl lzma openexr pangocairo png tiff webp wmf x xml zlib
Checking for unpaper:
unpaper -V
6.1
Checking for tesseract:
tesseract -v
tesseract 4.0.0
 leptonica-1.76.0
  libgif 5.1.4 : libjpeg 6b (libjpeg-turbo 1.5.2) : libpng 1.6.36 : libtiff 4.0.10 : zlib 1.2.11 : libwebp 0.6.1 : libopenjp2 2.3.0
 Found AVX2
 Found AVX
 Found SSE
Checking for gs:
gs -v
GPL Ghostscript 9.27 (2019-04-04)
Copyright (C) 2018 Artifex Software, Inc.  All rights reserved.
Checking for pdfinfo:
pdfinfo -v
Checking for pdfunite:
pdfunite -v
Input file: "zine.pdf"
Output file: "zine_ocr.pdf"
Number of pages in inputfile: 2
More threads than pages. Using 2 threads instead.

Parallel processing with 2 threads started.
Processing page order may differ from original page order.

Processing page 2.
Processing page 1.
identify -format "%w\n%h\n"  "/tmp/pdfsandwich_tmpbd9242/pdfsandwich_inputfile6e1885.pdf[0]" 
identify -format "%w\n%h\n"  "/tmp/pdfsandwich_tmpbd9242/pdfsandwich_inputfile6e1885.pdf[1]" 
convert -units PixelsPerInch  -type Bilevel -density 300x300  "/tmp/pdfsandwich_tmpbd9242/pdfsandwich_inputfile6e1885.pdf[1]" /tmp/pdfsandwich_tmpbd9242/pdfsandwich8228d4.pbm
convert -units PixelsPerInch  -type Bilevel -density 300x300  "/tmp/pdfsandwich_tmpbd9242/pdfsandwich_inputfile6e1885.pdf[0]" /tmp/pdfsandwich_tmpbd9242/pdfsandwich3d8ad8.pbm
unpaper --overwrite  --no-grayfilter --layout none /tmp/pdfsandwich_tmpbd9242/pdfsandwich8228d4.pbm /tmp/pdfsandwich_tmpbd9242/pdfsandwich104659_unpaper.pbm
unpaper --overwrite  --no-grayfilter --layout none /tmp/pdfsandwich_tmpbd9242/pdfsandwich3d8ad8.pbm /tmp/pdfsandwich_tmpbd9242/pdfsandwich7570dd_unpaper.pbm
Processing sheet #1: /tmp/pdfsandwich_tmpbd9242/pdfsandwich8228d4.pbm -> /tmp/pdfsandwich_tmpbd9242/pdfsandwich104659_unpaper.pbm
convert -units PixelsPerInch -density 300x300 /tmp/pdfsandwich_tmpbd9242/pdfsandwich104659_unpaper.pbm /tmp/pdfsandwich_tmpbd9242/pdfsandwiche7f914.tif
OMP_THREAD_LIMIT=1 tesseract /tmp/pdfsandwich_tmpbd9242/pdfsandwiche7f914.tif /tmp/pdfsandwich_tmpbd9242/pdfsandwich8ddf62  -l eng pdf 
Processing sheet #1: /tmp/pdfsandwich_tmpbd9242/pdfsandwich3d8ad8.pbm -> /tmp/pdfsandwich_tmpbd9242/pdfsandwich7570dd_unpaper.pbm
convert -units PixelsPerInch -density 300x300 /tmp/pdfsandwich_tmpbd9242/pdfsandwich7570dd_unpaper.pbm /tmp/pdfsandwich_tmpbd9242/pdfsandwich33bb2b.tif
OMP_THREAD_LIMIT=1 tesseract /tmp/pdfsandwich_tmpbd9242/pdfsandwich33bb2b.tif /tmp/pdfsandwich_tmpbd9242/pdfsandwich70ba74  -l eng pdf 
OCR pdf generated. Renaming output file to /tmp/pdfsandwich_tmpbd9242/pdfsandwichd39e00.pdf

OCR pdf generated. Renaming output file to /tmp/pdfsandwich_tmpbd9242/pdfsandwichbea9bf.pdf

OCR done. Writing "zine_ocr.pdf"
pdfunite /tmp/pdfsandwich_tmpbd9242/pdfsandwichd39e00.pdf /tmp/pdfsandwich_tmpbd9242/pdfsandwichbea9bf.pdf /tmp/pdfsandwich_tmpbd9242/pdfsandwich_outputc93999.pdf

zine_ocr.pdf generated.

Done.
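
Stripped of the temporary file names, the trace above boils down to four commands per page plus one final pdfunite. The following sketch replays those steps by hand. It is not pdfsandwich's own code: sandwich_page and sandwich are hypothetical names, the options are copied from the trace, and reading the page count from pdfinfo with awk is an assumption about how one might do it.

```shell
# Sketch of the steps visible in the trace; not pdfsandwich's own code.
# sandwich_page and sandwich are hypothetical names.

# Process one page. Pages are numbered from 0, like the "[0]" and "[1]"
# suffixes in the trace.
# Usage: sandwich_page input.pdf pagenumber workdir
sandwich_page() {
    # 1. Render the page as a 300 dpi black-and-white bitmap.
    convert -units PixelsPerInch -type Bilevel -density 300x300 \
        "$1[$2]" "$3/page$2.pbm" || return 1
    # 2. Clean up the scan.
    unpaper --overwrite --no-grayfilter --layout none \
        "$3/page$2.pbm" "$3/page$2_unpaper.pbm" || return 1
    # 3. Convert the cleaned bitmap to TIFF for tesseract.
    convert -units PixelsPerInch -density 300x300 \
        "$3/page$2_unpaper.pbm" "$3/page$2.tif" || return 1
    # 4. OCR into a single-page searchable PDF: workdir/page<N>.pdf
    tesseract "$3/page$2.tif" "$3/page$2" -l eng pdf
}

# Process every page in order, then rebind them.
# Usage: sandwich input.pdf output.pdf
sandwich() {
    pdf=$1
    out=$2
    dir=$(mktemp -d) || return 1
    pages=$(pdfinfo "$pdf" | awk '/^Pages:/ {print $2}')
    set --                       # collect the page PDFs in "$@"
    page=0
    while [ "$page" -lt "$pages" ]
    do
        sandwich_page "$pdf" "$page" "$dir" || return 1
        set -- "$@" "$dir/page$page.pdf"
        page=$((page + 1))
    done
    pdfunite "$@" "$out"
}
```

Unlike pdfsandwich, this sketch handles the pages one after another rather than in parallel, and it leaves its temporary folder behind.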

Other sources of scanned materials to choose from

Platforms