Make searchable PDFs
For this recipe, we will focus on the main options. Excellent further documentation of additional options can be found here.
Use tesseract to convert a single jpeg into a PDF with searchable text
Tesseract has a particular way of being run:
          ----main options---- ------additional options------
          |                   |                             |
tesseract imagename outputbase [-l lang] [-psm] [configfile...]
In particular, rather than giving an output filename, you need to give an output "base" name (the first part of the output file name), and then separately a "configuration" which basically defines what kind(s) of output you want to produce. So for instance to convert the single JPEG (01_01.jpg) into a PDF named (01_01.pdf), you would use the command:
tesseract 01_01.jpg 01_01 pdf
NB: There is a SPACE between the basename, 01_01, and the kind of output, pdf.
If you type:
tesseract 01_01.jpg 01_01.pdf
then tesseract uses 01_01.pdf as the output "basename" and defaults to text output, producing a text file named:
01_01.pdf.txt
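To avoid typing the extension into the output base by mistake, the base can be derived from the image name. This is only a sketch: outputbase is a hypothetical helper (not part of tesseract), and the example prints the command rather than running it.

```shell
# outputbase IMAGE: print IMAGE without its last extension.
# (hypothetical helper, not a tesseract feature)
outputbase() {
    printf '%s\n' "${1%.*}"
}

img=01_01.jpg
base=$(outputbase "$img")

# Print the command that would be run; remove "echo" to actually run it.
echo tesseract "$img" "$base" pdf
# prints: tesseract 01_01.jpg 01_01 pdf
```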
Joining PDFs (and preserving searchability)
Unfortunately, using ImageMagick to join different PDF files together removes any non-image material (such as tesseract's OCR output). Luckily, pdfunite (part of the poppler package) is a program that can join searchable PDFs and preserve this information!
To check pdfunite on just two particular files, you might try:
pdfunite 01_01.pdf 01_02.pdf output.pdf
Then finally, you can use a wildcard to join all the single page PDFs together:
pdfunite 01_*.pdf output.pdf
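One way to confirm that the joined file is still searchable is to try extracting its text layer. A minimal sketch, assuming pdftotext (another tool from the poppler package, not mentioned above) is installed; has_text is a hypothetical helper name.

```shell
# has_text FILE.pdf: succeed if the PDF contains any extractable text.
# Assumes pdftotext from poppler; "-" sends the extracted text to stdout.
has_text() {
    pdftotext "$1" - 2>/dev/null | grep -q '[[:alnum:]]'
}

# Usage (not run here):
#   has_text output.pdf && echo "searchable" || echo "no text layer found"
```

A PDF joined with ImageMagick would be expected to fail this check, while one joined with pdfunite should pass.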
OCR script
Tesseract can only process one image at a time. This is not a bad thing. Following the command-line aesthetic of "doing one thing well", it does its thing. It's then up to the intrepid shell scripter (that's you) to put tesseract commands into a loop to process a whole bunch of input files.
So, to combine the steps above in a loop and finally join the different PDFs into a single, searchable one using pdfunite:
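mkdir -p icons
for i in *.jpg
do
    echo "ocring $i..."
    # make a small thumbnail of each page in icons/
    convert "$i" -resize 200x200 "icons/$i"
    # OCR the page; the output base is the filename without .jpg
    tesseract "$i" "$(basename -s .jpg "$i")" pdf
done
# join the single-page PDFs into one searchable PDF
pdfunite 01_*.pdf 01.pdf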
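The same loop can be wrapped in a function that stops as soon as tesseract fails and only joins pages that were actually produced. This is a sketch rather than a finished tool: ocr_all is a hypothetical name, the PREFIX_*.jpg naming is assumed from the examples above, and the thumbnail step is left out.

```shell
# ocr_all PREFIX: OCR every PREFIX_*.jpg in the current directory, then
# join the resulting single-page PDFs into PREFIX.pdf.
ocr_all() {
    prefix=$1
    set --                              # reuse "$@" as the list of page PDFs
    for img in "$prefix"_*.jpg; do
        [ -e "$img" ] || continue       # the glob matched nothing
        base=${img%.jpg}
        echo "ocring $img..."
        tesseract "$img" "$base" pdf || return 1
        set -- "$@" "$base.pdf"
    done
    if [ "$#" -eq 0 ]; then
        echo "no images found for prefix $prefix" >&2
        return 1
    fi
    pdfunite "$@" "$prefix.pdf"
}

# Usage (not run here):
#   ocr_all 01
```

Pages are joined in the order the shell expands the wildcard, so zero-padded names such as 01_01.jpg, 01_02.jpg keep the page order intact.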
pdfsandwich
pdfsandwich is a command that just calls a number of other commands to turn an input PDF (without text information) into a PDF with text (aka searchable). It makes use of:
- ImageMagick convert: to extract images from a source PDF
- unpaper: to "fix" / clean up a scanned image to work
- tesseract: to do OCR and produce a single page PDF
- pdfunite: to "rebind" the single page PDFs back into a multi-page PDF
When you run the command in "verbose" mode, the script outputs a "trace" of the commands it's using (a bit like using the -x option on the bash command):
pdfsandwich -verbose zine.pdf
pdfsandwich version 0.1.7
Checking for convert:
convert -version
Version: ImageMagick 6.9.10-23 Q16 x86_64 20190101 https://imagemagick.org
Copyright: © 1999-2019 ImageMagick Studio LLC
License: https://imagemagick.org/script/license.php
Features: Cipher DPC Modules OpenMP
Delegates (built-in): bzlib djvu fftw fontconfig freetype heic jbig jng jp2 jpeg lcms lqr ltdl lzma openexr pangocairo png tiff webp wmf x xml zlib
Checking for unpaper:
unpaper -V
6.1
Checking for tesseract:
tesseract -v
tesseract 4.0.0
leptonica-1.76.0
libgif 5.1.4 : libjpeg 6b (libjpeg-turbo 1.5.2) : libpng 1.6.36 : libtiff 4.0.10 : zlib 1.2.11 : libwebp 0.6.1 : libopenjp2 2.3.0
Found AVX2
Found AVX
Found SSE
Checking for gs:
gs -v
GPL Ghostscript 9.27 (2019-04-04)
Copyright (C) 2018 Artifex Software, Inc. All rights reserved.
Checking for pdfinfo:
pdfinfo -v
Checking for pdfunite:
pdfunite -v
Input file: "zine.pdf"
Output file: "zine_ocr.pdf"
Number of pages in inputfile: 2
More threads than pages. Using 2 threads instead.
Parallel processing with 2 threads started.
Processing page order may differ from original page order.
Processing page 2.
Processing page 1.
identify -format "%w\n%h\n" "/tmp/pdfsandwich_tmpbd9242/pdfsandwich_inputfile6e1885.pdf[0]"
identify -format "%w\n%h\n" "/tmp/pdfsandwich_tmpbd9242/pdfsandwich_inputfile6e1885.pdf[1]"
convert -units PixelsPerInch -type Bilevel -density 300x300 "/tmp/pdfsandwich_tmpbd9242/pdfsandwich_inputfile6e1885.pdf[1]" /tmp/pdfsandwich_tmpbd9242/pdfsandwich8228d4.pbm
convert -units PixelsPerInch -type Bilevel -density 300x300 "/tmp/pdfsandwich_tmpbd9242/pdfsandwich_inputfile6e1885.pdf[0]" /tmp/pdfsandwich_tmpbd9242/pdfsandwich3d8ad8.pbm
unpaper --overwrite --no-grayfilter --layout none /tmp/pdfsandwich_tmpbd9242/pdfsandwich8228d4.pbm /tmp/pdfsandwich_tmpbd9242/pdfsandwich104659_unpaper.pbm
unpaper --overwrite --no-grayfilter --layout none /tmp/pdfsandwich_tmpbd9242/pdfsandwich3d8ad8.pbm /tmp/pdfsandwich_tmpbd9242/pdfsandwich7570dd_unpaper.pbm
Processing sheet #1: /tmp/pdfsandwich_tmpbd9242/pdfsandwich8228d4.pbm -> /tmp/pdfsandwich_tmpbd9242/pdfsandwich104659_unpaper.pbm
convert -units PixelsPerInch -density 300x300 /tmp/pdfsandwich_tmpbd9242/pdfsandwich104659_unpaper.pbm /tmp/pdfsandwich_tmpbd9242/pdfsandwiche7f914.tif
OMP_THREAD_LIMIT=1 tesseract /tmp/pdfsandwich_tmpbd9242/pdfsandwiche7f914.tif /tmp/pdfsandwich_tmpbd9242/pdfsandwich8ddf62 -l eng pdf
Processing sheet #1: /tmp/pdfsandwich_tmpbd9242/pdfsandwich3d8ad8.pbm -> /tmp/pdfsandwich_tmpbd9242/pdfsandwich7570dd_unpaper.pbm
convert -units PixelsPerInch -density 300x300 /tmp/pdfsandwich_tmpbd9242/pdfsandwich7570dd_unpaper.pbm /tmp/pdfsandwich_tmpbd9242/pdfsandwich33bb2b.tif
OMP_THREAD_LIMIT=1 tesseract /tmp/pdfsandwich_tmpbd9242/pdfsandwich33bb2b.tif /tmp/pdfsandwich_tmpbd9242/pdfsandwich70ba74 -l eng pdf
OCR pdf generated. Renaming output file to /tmp/pdfsandwich_tmpbd9242/pdfsandwichd39e00.pdf
OCR pdf generated. Renaming output file to /tmp/pdfsandwich_tmpbd9242/pdfsandwichbea9bf.pdf
OCR done. Writing "zine_ocr.pdf"
pdfunite /tmp/pdfsandwich_tmpbd9242/pdfsandwichd39e00.pdf /tmp/pdfsandwich_tmpbd9242/pdfsandwichbea9bf.pdf /tmp/pdfsandwich_tmpbd9242/pdfsandwich_outputc93999.pdf
zine_ocr.pdf generated.
Done.
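Stripped of its temporary file names, the trace above runs the same four commands for each page before pdfunite rebinds the results. A rough sketch of that per-page step, with the flags copied from the trace; sandwich_page and the page naming scheme are hypothetical, and pdfsandwich itself does more (checks, parallel threads, cleanup).

```shell
# sandwich_page INPUT.pdf PAGE_INDEX WORKDIR
# Repeats, for one page, the commands seen in the pdfsandwich trace.
# PAGE_INDEX starts at 0, as in "inputfile.pdf[0]".
sandwich_page() {
    src=$1
    idx=$2
    page="$3/page$idx"

    # 1. render one page of the source PDF as a 300 dpi black-and-white bitmap
    convert -units PixelsPerInch -type Bilevel -density 300x300 \
        "${src}[$idx]" "$page.pbm" || return 1

    # 2. clean up the scan
    unpaper --overwrite --no-grayfilter --layout none \
        "$page.pbm" "${page}_unpaper.pbm" || return 1

    # 3. convert the cleaned bitmap to TIFF for tesseract
    convert -units PixelsPerInch -density 300x300 \
        "${page}_unpaper.pbm" "$page.tif" || return 1

    # 4. OCR the page into a single-page searchable PDF ($page.pdf)
    tesseract "$page.tif" "$page" -l eng pdf
}

# Usage (not run here):
#   mkdir -p work
#   sandwich_page zine.pdf 0 work
#   sandwich_page zine.pdf 1 work
#   pdfunite work/page0.pdf work/page1.pdf zine_ocr.pdf
```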
Other sources of scanned materials to choose from
- Factsheet 5
- https://archive.org/search.php?query=factsheet%20five
- https://archive.leftove.rs/documents/CLP