Flatbed scan a book and process it into a PDF
Latest revision as of 21:53, 2 November 2021
https://git.xpub.nl/pedrosaclout/Flatbed_Scanner_Workflow
A flatbed scanner is a commonly found piece of equipment that can be used to scan a book, essentially making a sequence of images from its pages. These scripts, written by Pedro Sá Couto, process the scans further to produce a PDF with a selectable text layer.
Dependencies
brew (macOS) or apt-get (Linux)
On macOS, you’ll need the command-line tools for Xcode installed:
xcode-select --install
Next, install Homebrew (the older Ruby-based installer has been replaced by a Bash script):
/bin/bash -c "$(curl -fsSL https://raw.githubusercontent.com/Homebrew/install/HEAD/install.sh)"
Run the following command once you’re done to ensure homebrew is installed and working properly:
brew doctor
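On Linux, install the dependencies with apt-get (pdfunite is part of poppler-utils, and pytesseract needs the Tesseract binary):
sudo apt-get install python3 python3-pip imagemagick poppler-utils tesseract-ocr
On macOS, install them with Homebrew (pip3 ships with python3, and pdfunite is part of poppler):
brew install python3 imagemagick poppler tesseract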
pip3
Install the necessary packages with pip3 (time and logging are part of Python's standard library and do not need to be installed):
sudo pip3 install pdf2image Pillow opencv-python pytesseract
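Before running the scripts, it can help to confirm that the dependencies are actually visible to Python and to the shell. The following is a minimal sketch and not part of the repository: the function name missing_dependencies and the two lists are assumptions based on the packages named above. Import names differ from pip names (Pillow is imported as PIL, opencv-python as cv2), and the tool names assume ImageMagick still provides convert and that poppler provides pdfunite and pdftoppm.

```python
import importlib.util
import shutil

# Import names for the pip packages (assumed mapping: Pillow -> PIL,
# opencv-python -> cv2).
PY_MODULES = ["pdf2image", "PIL", "cv2", "pytesseract"]
# Command-line tools the workflow is assumed to call.
CLI_TOOLS = ["convert", "pdfunite", "pdftoppm", "tesseract"]


def missing_dependencies(modules=PY_MODULES, tools=CLI_TOOLS):
    """Return the names of Python modules and CLI tools that cannot be found."""
    missing = [m for m in modules if importlib.util.find_spec(m) is None]
    missing += [t for t in tools if shutil.which(t) is None]
    return missing
```

Calling print(missing_dependencies()) lists whatever still needs installing; an empty list means everything in the two lists was found.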
How to use the scripts
Your scans must look like this for the scripts to perform correctly.
      RIGHT PAGE
—————————————————————
|                   |
|——————————         |
|         |         |
|         |         |
|         |         |
|         |         |
|         |         |
|   01    |         |
|——————————         |
|                   |
—————————————————————
      LEFT PAGE              RIGHT PAGE
—————————————————————   —————————————————————
|                   |   |                   |
|         ——————————|   |——————————         |
|         |         |   |         |         |
|         |         |   |         |         |
|         |         |   |         |         |
|         |         |   |         |         |
|         |         |   |         |         |
|         |   02    |   |   03    |         |
|         ——————————|   |——————————         |
|                   |   |                   |
—————————————————————   —————————————————————
Add your pictures from the book scanner to the folder "/scans"
Make all the files executable:
sudo chmod 777 merge_scans.sh workshop_stream.sh merge_files.sh
To skip any of the scripts, comment out the corresponding line in workshop_stream.sh.
Run ./workshop_stream.sh
Wait :)
Additional information
The workflow runs these scripts, in this order:
Create 3 directories
mkdir split
mkdir ocred
mkdir cropped
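Plain mkdir stops with an error when a directory already exists, which matters if the workflow is run a second time. As a hedged alternative, the sketch below creates the same three directories from Python and leaves existing ones alone. The function name make_work_dirs and the base parameter are illustrative and do not come from the repository.

```python
from pathlib import Path


def make_work_dirs(base=".", names=("split", "ocred", "cropped")):
    """Create the working directories under base; existing ones are left alone."""
    created = []
    for name in names:
        path = Path(base) / name
        # exist_ok=True makes a second run a no-op instead of an error.
        path.mkdir(parents=True, exist_ok=True)
        created.append(path)
    return created
```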
Merge the files in the directory scans
All the scans are appended to a single PDF called out.pdf.
./merge_scans.sh
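The contents of merge_scans.sh are not shown here, so the following is only a sketch of how such a merge could work, assuming the scans are JPEG files and that ImageMagick's convert does the appending. The names natural_key, merge_command and merge_scans are hypothetical. The sort key is there because plain alphabetical order would place scan10.jpg before scan2.jpg and shuffle the pages.

```python
import re
import subprocess
from pathlib import Path


def natural_key(name):
    """Sort key that orders scan2.jpg before scan10.jpg."""
    # re.split with a capture group alternates text and digit runs.
    return [int(part) if part.isdigit() else part.lower()
            for part in re.split(r"(\d+)", name)]


def merge_command(scan_dir="scans", output="out.pdf", pattern="*.jpg"):
    """Build an ImageMagick command that appends every scan to one PDF."""
    pages = sorted(Path(scan_dir).glob(pattern), key=lambda p: natural_key(p.name))
    if not pages:
        raise FileNotFoundError(f"no files matching {pattern} in {scan_dir}")
    return ["convert", *[str(p) for p in pages], output]


def merge_scans(scan_dir="scans", output="out.pdf"):
    """Run the command; assumes ImageMagick's convert is on the PATH."""
    subprocess.run(merge_command(scan_dir, output), check=True)
```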
Burst the pdf in scans
Burst this pdf, renaming all the files so they can be iterated later.
python3 burstpdf.py
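To illustrate what bursting and renaming can look like, here is a minimal sketch built on pdf2image, which is in the dependency list. It is not the code of burstpdf.py: the function names, the page_ prefix, the 300 dpi setting and the split output directory are all assumptions. The zero-padded numbers are what make the files easy to iterate later, since alphabetical order then matches page order.

```python
from pathlib import Path


def page_filename(index, total, prefix="page", ext="jpg"):
    """Zero-padded name so that alphabetical order equals page order."""
    width = max(3, len(str(total)))
    return f"{prefix}_{index:0{width}d}.{ext}"


def burst_pdf(pdf_path="out.pdf", out_dir="split", dpi=300):
    """Write each page of pdf_path to out_dir as a numbered JPEG."""
    # Imported here so page_filename works without pdf2image installed.
    from pdf2image import convert_from_path

    pages = convert_from_path(pdf_path, dpi=dpi)  # list of PIL images
    out = Path(out_dir)
    out.mkdir(exist_ok=True)
    written = []
    for i, page in enumerate(pages, start=1):
        target = out / page_filename(i, len(pages))
        page.save(target, "JPEG")
        written.append(target)
    return written
```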
Cropping the bounding boxes
The pages are now in their original position, but they have a bounding box. This script iterates through them and crops the highest contrast area found.
python3 bounding_box.py
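One common way to find a page on a scanner bed with OpenCV is to threshold the image and crop to the largest contour. The sketch below uses that approach as a stand-in for the highest contrast area described above. It is not the code of bounding_box.py, and it assumes the paper is brighter than the scanner background. The names largest_box and crop_page are hypothetical.

```python
def largest_box(boxes):
    """Pick the (x, y, w, h) rectangle with the largest area."""
    return max(boxes, key=lambda b: b[2] * b[3])


def crop_page(src, dst):
    """Crop src to its largest bright region and write the result to dst."""
    # Imported here so largest_box can be used without OpenCV installed.
    import cv2

    image = cv2.imread(str(src))
    if image is None:
        raise FileNotFoundError(src)
    gray = cv2.cvtColor(image, cv2.COLOR_BGR2GRAY)
    # Otsu's method picks a threshold separating paper from background.
    _, mask = cv2.threshold(gray, 0, 255, cv2.THRESH_BINARY + cv2.THRESH_OTSU)
    # [-2] selects the contour list in both OpenCV 3 and OpenCV 4.
    contours = cv2.findContours(mask, cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_SIMPLE)[-2]
    if not contours:
        raise ValueError(f"no contours found in {src}")
    x, y, w, h = largest_box([cv2.boundingRect(c) for c in contours])
    cv2.imwrite(str(dst), image[y:y + h, x:x + w])
    return x, y, w, h
```

A call such as crop_page("split/page_001.jpg", "cropped/page_001.jpg") would write the cropped page and return the box that was used; both paths are examples only.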
OCR (Optical Character Recognition)
In this step we OCR the JPG images, turning each one into a PDF.
python3 tesseract_ocr.py
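As an illustration of this step, pytesseract can return a one-page PDF with a text layer for each image via image_to_pdf_or_hocr. The sketch below is not the code of tesseract_ocr.py. It assumes the cropped pages are in cropped, that the results go to ocred, and that the book is in English (lang="eng"); the function names are hypothetical.

```python
from pathlib import Path


def ocr_target(image_path, out_dir="ocred"):
    """Map cropped/page_001.jpg to ocred/page_001.pdf."""
    return Path(out_dir) / (Path(image_path).stem + ".pdf")


def ocr_directory(in_dir="cropped", out_dir="ocred", lang="eng"):
    """OCR every JPEG in in_dir into a one-page PDF in out_dir."""
    # Imported here so ocr_target works without Tesseract installed.
    import pytesseract
    from PIL import Image

    Path(out_dir).mkdir(exist_ok=True)
    written = []
    for image_path in sorted(Path(in_dir).glob("*.jpg")):
        with Image.open(image_path) as image:
            # Returns the PDF as bytes, including the recognized text layer.
            pdf_bytes = pytesseract.image_to_pdf_or_hocr(
                image, extension="pdf", lang=lang)
        target = ocr_target(image_path, out_dir)
        target.write_bytes(pdf_bytes)
        written.append(target)
    return written
```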
Merge all the files and create the pdf
The OCRed pages are now joined into the final PDF. Your book is ready :)
./merge_files.sh
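Since pdfunite is among the dependencies, the final merge presumably relies on it, but merge_files.sh itself is not shown here. The sketch below only illustrates the calling convention, in which pdfunite takes the input files in order and the output file as the last argument. The function names and the book.pdf output name are hypothetical.

```python
import subprocess
from pathlib import Path


def unite_command(pdf_dir="ocred", output="book.pdf"):
    """Build a pdfunite command: inputs in sorted order, output last."""
    pages = sorted(str(p) for p in Path(pdf_dir).glob("*.pdf"))
    if not pages:
        raise FileNotFoundError(f"no PDF files in {pdf_dir}")
    return ["pdfunite", *pages, output]


def merge_files(pdf_dir="ocred", output="book.pdf"):
    """Run pdfunite; assumes poppler's command-line tools are on the PATH."""
    subprocess.run(unite_command(pdf_dir, output), check=True)
```

Plain alphabetical sorting is enough here only if the page files carry zero-padded numbers, such as page_001.pdf.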
License
The package is available as open source under the terms of the MIT License: https://opensource.org/licenses/MIT