Add a text layer to a PDF with OCRmyPDF
Basic examples
ocrmypdf has built-in help.
ocrmypdf --help
Add an OCR layer and convert to PDF/A
ocrmypdf input.pdf output.pdf
Add an OCR layer and output a standard PDF
ocrmypdf --output-type pdf input.pdf output.pdf
Create a PDF/A with all color and grayscale images converted to JPEG
ocrmypdf --output-type pdfa --pdfa-image-compression jpeg input.pdf output.pdf
Modify a file in place
The file will only be overwritten if OCRmyPDF is successful.
ocrmypdf myfile.pdf myfile.pdf
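When the in-place command is run from a script, checking the exit status shows whether the file was replaced. The following is a minimal sketch, assuming the ocrmypdf executable is on PATH; the helper names build_in_place_command and ocr_in_place are hypothetical and not part of OCRmyPDF.

```python
import subprocess


def build_in_place_command(path):
    """Return the argument list that uses the same path for input and output."""
    return ["ocrmypdf", str(path), str(path)]


def ocr_in_place(path):
    """Run ocrmypdf in place and report whether it exited with status 0."""
    # Assumption: the ocrmypdf executable is installed and on PATH.
    result = subprocess.run(build_in_place_command(path))
    return result.returncode == 0
```

A caller might use `if not ocr_in_place("myfile.pdf"): ...` to log the failure and continue, since the original file is left as it was.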
Correct page rotation
OCRmyPDF will attempt to automatically correct the rotation of each page. This can help fix a scanning job that contains a mix of landscape and portrait pages.
ocrmypdf --rotate-pages myfile.pdf myfile.pdf
You can decrease (increase) the parameter --rotate-pages-threshold to make page rotation more (less) aggressive. The threshold is the ratio of how confident the OCR engine is that the page image should be rotated, compared to kept the same. The default value is quite conservative; on some files OCRmyPDF may not attempt rotations at all unless it is very confident that the current rotation is wrong. A lower value such as 2.0 will produce more rotations, and more false positives. Run with -v1 to see the confidence level for each page and judge whether a different value would suit your files better.
If the page is “just a little off horizontal”, like a crooked picture, then you want --deskew. --rotate-pages is for when the cardinal angle is wrong.
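To experiment with different thresholds, it can help to assemble the rotation options programmatically. This is a sketch only: build_rotate_command is a hypothetical helper, and the flags it emits are the ones described above (--rotate-pages, --rotate-pages-threshold, -v1).

```python
import shlex


def build_rotate_command(path, threshold=None, verbose=False):
    """Build an in-place ocrmypdf command that corrects page rotation."""
    cmd = ["ocrmypdf", "--rotate-pages"]
    if threshold is not None:
        # A lower threshold produces more rotations and more false positives.
        cmd += ["--rotate-pages-threshold", str(threshold)]
    if verbose:
        # -v1 makes ocrmypdf report the rotation confidence for each page.
        cmd.append("-v1")
    cmd += [str(path), str(path)]
    return cmd


if __name__ == "__main__":
    # Print the command line for review instead of running it.
    print(shlex.join(build_rotate_command("myfile.pdf", threshold=2.0, verbose=True)))
```

Printing the command first makes it easy to compare a few threshold values on a copy of the file before settling on one.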
OCR languages other than English
OCRmyPDF assumes the document is in English unless told otherwise. OCR quality may be poor if the wrong language is used.
ocrmypdf -l fra LeParisien.pdf LeParisien.pdf
ocrmypdf -l eng+fra Bilingual-English-French.pdf Bilingual-English-French.pdf
Language packs must be installed for all languages specified. See Installing additional language packs.
Unfortunately, the Tesseract OCR engine has no ability to detect the language when it is unknown.
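Because every requested language pack must be installed, a script may want to check before starting a long job. The sketch below parses the output of tesseract --list-langs. It assumes that Tesseract prints a one-line header beginning with "List of available languages" followed by one language code per line on standard output, which may differ between Tesseract versions. The function names are hypothetical.

```python
import subprocess


def parse_list_langs(output):
    """Extract language codes from the text printed by `tesseract --list-langs`."""
    codes = set()
    for line in output.splitlines():
        line = line.strip()
        # Assumption: the header line starts with this phrase; skip it and blanks.
        if not line or line.startswith("List of available languages"):
            continue
        codes.add(line)
    return codes


def missing_languages(spec, installed):
    """Return the codes in a spec such as 'eng+fra' that are not installed."""
    return [code for code in spec.split("+") if code not in installed]


def installed_languages():
    """Ask the local Tesseract which language packs it can see."""
    # Assumption: the tesseract executable is on PATH and writes the list to stdout.
    result = subprocess.run(
        ["tesseract", "--list-langs"], capture_output=True, text=True, check=True
    )
    return parse_list_langs(result.stdout)
```

For example, `missing_languages("eng+fra", installed_languages())` returns an empty list when both packs are available.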
Produce PDF and text file containing OCR text
This produces a file named “output.pdf” and a companion text file named “output.txt”.
ocrmypdf --sidecar output.txt input.pdf output.pdf
Note: The sidecar file contains the OCR text found by OCRmyPDF. If the document contains pages that already have text, that text will not appear in the sidecar. If the option --pages is used, only those pages on which OCR was performed will be included in the sidecar. If certain pages were skipped because of options like --skip-big or --tesseract-timeout, those pages will not be in the sidecar.
To extract all text from a PDF, whether generated from OCR or otherwise, use a program like Poppler’s pdftotext or pdfgrep.
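The two approaches above can be sketched in Python as follows. This is an illustration under assumptions: the ocrmypdf and pdftotext executables are on PATH, the sidecar is read as UTF-8, and the helper names are hypothetical. Passing "-" as the output file asks pdftotext to write to standard output.

```python
import subprocess
from pathlib import Path


def build_sidecar_command(input_pdf, output_pdf):
    """Return the ocrmypdf command and the sidecar path placed next to the output PDF."""
    sidecar = Path(output_pdf).with_suffix(".txt")
    cmd = ["ocrmypdf", "--sidecar", str(sidecar), str(input_pdf), str(output_pdf)]
    return cmd, sidecar


def ocr_with_sidecar(input_pdf, output_pdf):
    """Run OCR and return only the text that OCR produced."""
    cmd, sidecar = build_sidecar_command(input_pdf, output_pdf)
    subprocess.run(cmd, check=True)  # raises CalledProcessError on failure
    # Assumption: the sidecar file is UTF-8 encoded.
    return sidecar.read_text(encoding="utf-8")


def extract_all_text(pdf_path):
    """Return all text in the PDF, from OCR or otherwise, using Poppler's pdftotext."""
    result = subprocess.run(
        ["pdftotext", str(pdf_path), "-"], capture_output=True, text=True, check=True
    )
    return result.stdout
```

Comparing the two results for the same file shows which text came from OCR and which was already present in the document.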