Add a text layer to a PDF with OCRmyPDF

From Parallel Library Services

Basic examples

ocrmypdf has built-in help.

ocrmypdf --help

Add an OCR layer and convert to PDF/A

ocrmypdf input.pdf output.pdf

Add an OCR layer and output a standard PDF

ocrmypdf --output-type pdf input.pdf output.pdf

Create a PDF/A with all color and grayscale images converted to JPEG

ocrmypdf --output-type pdfa --pdfa-image-compression jpeg input.pdf output.pdf

Modify a file in place

The file will only be overwritten if OCRmyPDF is successful.

ocrmypdf myfile.pdf myfile.pdf

Correct page rotation

OCRmyPDF will attempt to automatically correct the rotation of each page. This can help fix a scanning job that contains a mix of landscape and portrait pages.

ocrmypdf --rotate-pages myfile.pdf myfile.pdf

You can decrease (increase) the parameter --rotate-pages-threshold to make page rotation more (less) aggressive. The threshold is the ratio of how confident the OCR engine is that the page image should be rotated, compared to being left as it is. The default value is quite conservative; on some files OCRmyPDF may not attempt rotations at all unless it is very confident that the current rotation is wrong. A lower value such as 2.0 will produce more rotations, and more false positives. Run with -v1 to see the confidence level for each page and judge whether a different value would suit your files better.

If a page is only “just a little off horizontal”, like a crooked picture, use --deskew instead. Use --rotate-pages when the page is off by a cardinal angle (90, 180, or 270 degrees).
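The rotation options above can be combined in one run. The following is a minimal sketch, not an official recipe: fix_orientation is a hypothetical helper name, and the OCRMYPDF_BIN variable is an assumption introduced here so the command line can be previewed without running OCR. The flags themselves (-v1, --rotate-pages, --rotate-pages-threshold, --deskew) are the ones described in this section.

```shell
# fix_orientation is a hypothetical helper, not part of OCRmyPDF.
# Usage: fix_orientation THRESHOLD INPUT.pdf OUTPUT.pdf
# OCRMYPDF_BIN is an assumed override: set it to "echo" to print the
# arguments instead of running OCR.
fix_orientation() {
    # -v1 logs the rotation confidence for each page.
    # --rotate-pages corrects pages whose cardinal angle is wrong.
    # --deskew straightens pages that are slightly off horizontal.
    "${OCRMYPDF_BIN:-ocrmypdf}" -v1 \
        --rotate-pages --rotate-pages-threshold "$1" \
        --deskew "$2" "$3"
}
```

For a dry run, set OCRMYPDF_BIN=echo and call fix_orientation 2.0 scan.pdf fixed.pdf to see the arguments that would be passed. Once the logged confidence levels suggest a suitable threshold, unset the variable and run it for real.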

OCR languages other than English

OCRmyPDF assumes the document is in English unless told otherwise. OCR quality may be poor if the wrong language is used.

ocrmypdf -l fra LeParisien.pdf LeParisien.pdf
ocrmypdf -l eng+fra Bilingual-English-French.pdf Bilingual-English-French.pdf

Language packs must be installed for all languages specified. See Installing additional language packs.

Unfortunately, the Tesseract OCR engine cannot detect the language of a document when it is unknown, so you must specify it yourself.
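When a script handles documents in several languages, the value passed to -l has to be assembled from individual language codes joined with "+", as in eng+fra above. The sketch below shows one way to do that; join_langs is a hypothetical helper, not something OCRmyPDF provides. To check which language packs are installed, Tesseract's own tesseract --list-langs command can be used.

```shell
# join_langs is a hypothetical helper, not part of OCRmyPDF.
# It joins its arguments with "+" to form a value for the -l option.
join_langs() {
    # "$*" joins the arguments with the first character of IFS.
    # The subshell keeps the IFS change from leaking into the caller.
    ( IFS=+; printf '%s\n' "$*" )
}
```

It could then be used as ocrmypdf -l "$(join_langs eng fra)" input.pdf output.pdf, which is equivalent to passing -l eng+fra directly.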

Produce PDF and text file containing OCR text

This produces a file named “output.pdf” and a companion text file named “output.txt”.

ocrmypdf --sidecar output.txt input.pdf output.pdf

Note: The sidecar file contains the OCR text found by OCRmyPDF. If the document contains pages that already have text, that text will not appear in the sidecar. If the option --pages is used, only those pages on which OCR was performed will be included in the sidecar. If certain pages were skipped because of options like --skip-big or --tesseract-timeout, those pages will not be in the sidecar.

To extract all text from a PDF, whether generated from OCR or otherwise, use a program like Poppler’s pdftotext or pdfgrep.
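The two approaches above, a sidecar holding only the OCR text and a full text extraction with pdftotext, can be wrapped in small helpers. This is a sketch under stated assumptions: sidecar_path, ocr_with_sidecar, and extract_all_text are hypothetical names, the ".all.txt" suffix is an arbitrary choice, and OCRMYPDF_BIN and PDFTOTEXT_BIN are assumed override variables that allow a dry run with echo.

```shell
# All three functions are hypothetical helpers, not part of OCRmyPDF or Poppler.

# Derive a sidecar name from the output PDF: output.pdf -> output.txt
sidecar_path() {
    printf '%s\n' "${1%.pdf}.txt"
}

# Usage: ocr_with_sidecar INPUT.pdf OUTPUT.pdf
# Writes the OCR text next to the output PDF.
ocr_with_sidecar() {
    "${OCRMYPDF_BIN:-ocrmypdf}" --sidecar "$(sidecar_path "$2")" "$1" "$2"
}

# Usage: extract_all_text FILE.pdf
# Extracts all text, OCR-generated or not, to FILE.all.txt with pdftotext.
extract_all_text() {
    "${PDFTOTEXT_BIN:-pdftotext}" "$1" "${1%.pdf}.all.txt"
}
```

Keeping the two outputs in separate files makes it easy to compare what OCR added against the complete text of the document.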