Difference between revisions of "Make searchable PDFs"

Revision as of 15:05, 8 December 2021

This recipe focuses on the main options. Excellent further documentation of the additional options can be found at https://guides.library.illinois.edu/c.php?g=347520&p=4121426.

Use tesseract to convert a single jpeg into a PDF with searchable text

Tesseract has a particular way of being run:

           ----main options---- --------additional options-------
          |                    |                                 |
tesseract imagename outputbase [-l lang] [--psm N] [configfile...]

In particular, rather than giving an output filename, you give an output "base" name (the first part of the output filename), followed separately by a "configuration" that defines what kind(s) of output you want to produce. For instance, to convert the single JPEG 01_01.jpg into a PDF named 01_01.pdf, you would use the command:

tesseract 01_01.jpg 01_01 pdf

NB: There is a SPACE between the basename (01_01) and the kind of output (pdf).

If you type:

tesseract 01_01.jpg 01_01.pdf

then tesseract uses 01_01.pdf as the output "basename" and defaults to text output, producing a text file named:

01_01.pdf.txt
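The naming rule can be sketched as a small shell function. This is only an illustration of the two cases discussed above (pdf and the default text output); tesseract_output_name is a hypothetical helper, not something that ships with tesseract.

```shell
#!/bin/sh
# Hypothetical helper (not part of tesseract): predict the file that
# "tesseract IMAGE OUTPUTBASE [CONFIG]" writes, following the rule above.
tesseract_output_name() {
    outputbase=$1
    config=${2:-txt}   # with no config given, tesseract defaults to text output
    printf '%s.%s\n' "$outputbase" "$config"
}

tesseract_output_name 01_01 pdf    # prints 01_01.pdf
tesseract_output_name 01_01.pdf    # prints 01_01.pdf.txt
```

The second call shows why typing 01_01.pdf as the output base ends up as a text file with a doubled extension.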

Joining PDFs (and preserving searchability)

Unfortunately, using ImageMagick to join PDF files together removes any non-image material (such as tesseract's OCR output). Luckily, pdfunite (part of the poppler package) is a program that can join searchable PDFs while preserving this information!

To check pdfunite on just two particular files, you might try:

pdfunite 01_01.pdf 01_02.pdf output.pdf

Then finally, you can use a wildcard to join all the single page PDFs together:

pdfunite 01_*.pdf output.pdf
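With a wildcard, the pages are joined in whatever order the shell expands the filenames, so it can be worth checking that order first. The sketch below uses empty placeholder files in a temporary directory (the unpadded names are made up for the demonstration) and does not call pdfunite at all.

```shell
#!/bin/sh
# Sketch: show the order in which the shell expands 01_*.pdf.
# Uses empty placeholder files in a temporary directory; no real PDFs are needed.
LC_ALL=C; export LC_ALL        # make the glob sort order predictable
demo_dir=$(mktemp -d)
touch "$demo_dir/01_1.pdf" "$demo_dir/01_2.pdf" "$demo_dir/01_10.pdf"

# Names are sorted character by character, so 01_10.pdf lands before 01_2.pdf.
glob_order=$(cd "$demo_dir" && echo 01_*.pdf)
echo "unpadded: $glob_order"

# With zero-padded page numbers the expansion follows page order.
rm "$demo_dir"/*.pdf
touch "$demo_dir/01_01.pdf" "$demo_dir/01_02.pdf" "$demo_dir/01_10.pdf"
padded_order=$(cd "$demo_dir" && echo 01_*.pdf)
echo "padded:   $padded_order"

rm -r "$demo_dir"
```

Zero-padded names such as 01_01.pdf, as used in this recipe, keep the pages in reading order.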

OCR script

Tesseract can only process one image at a time. This is not a bad thing. Following the command-line aesthetic of "doing one thing well", it does its thing. It's then up to the intrepid shell scripter (that's you) to put tesseract commands into a loop to process a whole bunch of input files.

So, to combine the different steps above in a loop and finally join the single-page PDFs into one searchable PDF using pdfunite:

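mkdir -p icons
for i in *.jpg
do
	echo "ocring $i..."
	# make a small thumbnail of the page
	convert "$i" -resize 200x200 "icons/$i"
	# OCR the page; the output base is the filename without .jpg
	tesseract "$i" "$(basename -s .jpg "$i")" pdf
done
# join the single-page PDFs into one searchable PDF
pdfunite 01_*.pdf 01.pdf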
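The loop uses basename to strip the .jpg extension and get tesseract's output base. As an alternative sketch, the shell's own parameter expansion can do the same without calling an external command; output_base is a hypothetical function name used only for this illustration.

```shell
#!/bin/sh
# Sketch: derive tesseract's output base from an image filename
# using parameter expansion instead of the basename command.
# Unlike basename, this keeps any leading directory part of the name.
output_base() {
    printf '%s\n' "${1%.jpg}"   # strip a trailing .jpg, if present
}

for i in 01_01.jpg 01_02.jpg
do
    echo "$i -> $(output_base "$i").pdf"
done
```

For files in the current directory, as in the loop above, both approaches give the same base name.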

pdfsandwich

pdfsandwich is a command that just calls a number of other commands to turn an input PDF (without text information) into a PDF with text (aka searchable). It makes use of:

ImageMagick convert: to extract images from a source PDF
unpaper: to "fix" / clean up a scanned image so it is easier to work with
tesseract: to do OCR and produce a single page PDF
pdfunite: to "rebind" the single page PDFs back into a multi-page PDF
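Since pdfsandwich relies on these programs being installed, a quick check beforehand can save some confusion. The following is a minimal sketch, assuming the four command names listed above; missing_tools is a hypothetical helper, not part of pdfsandwich.

```shell
#!/bin/sh
# Sketch: report which of the given commands are not on the PATH.
# missing_tools is a hypothetical helper name, not part of pdfsandwich.
missing_tools() {
    missing=""
    for tool in "$@"
    do
        # command -v succeeds only if the shell can find the command
        if ! command -v "$tool" >/dev/null 2>&1
        then
            missing="$missing $tool"
        fi
    done
    # print the names without the leading space
    echo "${missing# }"
}

needed=$(missing_tools convert unpaper tesseract pdfunite)
if [ -n "$needed" ]
then
    echo "not installed: $needed"
else
    echo "all tools found"
fi
```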

When you run the command in "verbose" mode, the script outputs a "trace" of the commands it's using (a bit like using the -x option on the bash command):

pdfsandwich -verbose zine.pdf

pdfsandwich version 0.1.7
Checking for convert:
convert -version
Version: ImageMagick 6.9.10-23 Q16 x86_64 20190101 https://imagemagick.org
Copyright: © 1999-2019 ImageMagick Studio LLC
License: https://imagemagick.org/script/license.php
Features: Cipher DPC Modules OpenMP 
Delegates (built-in): bzlib djvu fftw fontconfig freetype heic jbig jng jp2 jpeg lcms lqr ltdl lzma openexr pangocairo png tiff webp wmf x xml zlib
Checking for unpaper:
unpaper -V
6.1
Checking for tesseract:
tesseract -v
tesseract 4.0.0
 leptonica-1.76.0
  libgif 5.1.4 : libjpeg 6b (libjpeg-turbo 1.5.2) : libpng 1.6.36 : libtiff 4.0.10 : zlib 1.2.11 : libwebp 0.6.1 : libopenjp2 2.3.0
 Found AVX2
 Found AVX
 Found SSE
Checking for gs:
gs -v
GPL Ghostscript 9.27 (2019-04-04)
Copyright (C) 2018 Artifex Software, Inc.  All rights reserved.
Checking for pdfinfo:
pdfinfo -v
Checking for pdfunite:
pdfunite -v
Input file: "zine.pdf"
Output file: "zine_ocr.pdf"
Number of pages in inputfile: 2
More threads than pages. Using 2 threads instead.

Parallel processing with 2 threads started.
Processing page order may differ from original page order.

Processing page 2.
Processing page 1.
identify -format "%w\n%h\n"  "/tmp/pdfsandwich_tmpbd9242/pdfsandwich_inputfile6e1885.pdf[0]" 
identify -format "%w\n%h\n"  "/tmp/pdfsandwich_tmpbd9242/pdfsandwich_inputfile6e1885.pdf[1]" 
convert -units PixelsPerInch  -type Bilevel -density 300x300  "/tmp/pdfsandwich_tmpbd9242/pdfsandwich_inputfile6e1885.pdf[1]" /tmp/pdfsandwich_tmpbd9242/pdfsandwich8228d4.pbm
convert -units PixelsPerInch  -type Bilevel -density 300x300  "/tmp/pdfsandwich_tmpbd9242/pdfsandwich_inputfile6e1885.pdf[0]" /tmp/pdfsandwich_tmpbd9242/pdfsandwich3d8ad8.pbm
unpaper --overwrite  --no-grayfilter --layout none /tmp/pdfsandwich_tmpbd9242/pdfsandwich8228d4.pbm /tmp/pdfsandwich_tmpbd9242/pdfsandwich104659_unpaper.pbm
unpaper --overwrite  --no-grayfilter --layout none /tmp/pdfsandwich_tmpbd9242/pdfsandwich3d8ad8.pbm /tmp/pdfsandwich_tmpbd9242/pdfsandwich7570dd_unpaper.pbm
Processing sheet #1: /tmp/pdfsandwich_tmpbd9242/pdfsandwich8228d4.pbm -> /tmp/pdfsandwich_tmpbd9242/pdfsandwich104659_unpaper.pbm
convert -units PixelsPerInch -density 300x300 /tmp/pdfsandwich_tmpbd9242/pdfsandwich104659_unpaper.pbm /tmp/pdfsandwich_tmpbd9242/pdfsandwiche7f914.tif
OMP_THREAD_LIMIT=1 tesseract /tmp/pdfsandwich_tmpbd9242/pdfsandwiche7f914.tif /tmp/pdfsandwich_tmpbd9242/pdfsandwich8ddf62  -l eng pdf 
Processing sheet #1: /tmp/pdfsandwich_tmpbd9242/pdfsandwich3d8ad8.pbm -> /tmp/pdfsandwich_tmpbd9242/pdfsandwich7570dd_unpaper.pbm
convert -units PixelsPerInch -density 300x300 /tmp/pdfsandwich_tmpbd9242/pdfsandwich7570dd_unpaper.pbm /tmp/pdfsandwich_tmpbd9242/pdfsandwich33bb2b.tif
OMP_THREAD_LIMIT=1 tesseract /tmp/pdfsandwich_tmpbd9242/pdfsandwich33bb2b.tif /tmp/pdfsandwich_tmpbd9242/pdfsandwich70ba74  -l eng pdf 
OCR pdf generated. Renaming output file to /tmp/pdfsandwich_tmpbd9242/pdfsandwichd39e00.pdf

OCR pdf generated. Renaming output file to /tmp/pdfsandwich_tmpbd9242/pdfsandwichbea9bf.pdf

OCR done. Writing "zine_ocr.pdf"
pdfunite /tmp/pdfsandwich_tmpbd9242/pdfsandwichd39e00.pdf /tmp/pdfsandwich_tmpbd9242/pdfsandwichbea9bf.pdf /tmp/pdfsandwich_tmpbd9242/pdfsandwich_outputc93999.pdf

zine_ocr.pdf generated.

Done.

Other sources of scanned materials to choose from

Platforms
