Flatbed scan a book and process it into a PDF

From Parallel Library Services

Revision as of 11:02, 7 October 2021

https://git.xpub.nl/pedrosaclout/Flatbed_Scanner_Workflow

A flatbed scanner is a commonly found piece of equipment that can be used to scan a book, essentially making a sequence of images from its pages. These scripts, written by Pedro Sá Couto, do the further work of processing the scans to produce a PDF with a selectable text layer.

Dependencies

brew (macOS) or apt-get (Linux)

On macOS, you’ll need the command-line tools for Xcode installed.

xcode-select --install

Afterwards, install Homebrew:

/bin/bash -c "$(curl -fsSL https://raw.githubusercontent.com/Homebrew/install/HEAD/install.sh)"

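Run the following command once you’re done to ensure Homebrew is installed and working properly:

brew doctor

Then install the system packages. pytesseract needs the Tesseract binary as well, so it is included here.

On Linux (pdfunite ships with poppler-utils):

sudo apt-get install python3 python3-pip imagemagick poppler-utils tesseract-ocr

On macOS (pip3 ships with python, and pdfunite with poppler):

brew install python imagemagick poppler tesseract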

pip3

Install all necessary packages with pip3:

sudo pip3 install pdf2image Pillow opencv-python pytesseract
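
Before running the workflow it can help to confirm that everything above is actually reachable. The following is a minimal sketch, not part of the original scripts: it checks that the four Python modules can be imported and that a few command-line tools are on the PATH. The list of commands is an assumption about what the scripts call (convert from ImageMagick, pdfunite and pdftoppm from poppler, tesseract for pytesseract).

```python
import importlib.util
import shutil

# Import names of the pip packages listed above
# (Pillow imports as PIL, opencv-python as cv2).
PYTHON_MODULES = ["pdf2image", "PIL", "cv2", "pytesseract"]

# Assumed command-line tools; adjust to what your copy of the scripts calls.
COMMANDS = ["convert", "pdfunite", "pdftoppm", "tesseract"]


def missing_dependencies(modules=PYTHON_MODULES, commands=COMMANDS):
    """Return (missing_modules, missing_commands) as two lists of names."""
    missing_modules = [m for m in modules if importlib.util.find_spec(m) is None]
    missing_commands = [c for c in commands if shutil.which(c) is None]
    return missing_modules, missing_commands


if __name__ == "__main__":
    modules, commands = missing_dependencies()
    print("Missing Python modules:", ", ".join(modules) or "none")
    print("Missing commands:", ", ".join(commands) or "none")
```

Both lists empty means the dependencies in this section are in place.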

How to use the scripts

Your scans must look like this for the scripts to perform correctly.

                                   RIGHT PAGE
                             —————————————————————
                            |                     |
                            |——————————           |
                            |           |         |
                            |           |         |
                            |           |         |
                            |           |         |
                            |           |         |
                            |        01 |         |
                            |——————————           |
                            |                     |
                             —————————————————————
    
      LEFT PAGE                RIGHT PAGE
     —————————————————————   —————————————————————
    |                     | |                     |
    |           ——————————| |——————————           |
    |         |           | |           |         |
    |         |           | |           |         |
    |         |           | |           |         |
    |         |           | |           |         |
    |         |           | |           |         |
    |         | 02        | |        03 |         |
    |          —————————— | |——————————           |
    |                     | |                     |
     —————————————————————   —————————————————————

Add the images from your scanner to the "scans" folder.

Make all the files executable:

chmod +x merge_scans.sh workshop_stream.sh merge_files.sh

To skip any of the scripts, comment out its line in workshop_stream.sh.

Run ./workshop_stream.sh

Wait :)
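
For readers who prefer to drive the steps from Python, here is a rough sketch of what a runner equivalent to workshop_stream.sh could look like. It is an illustration only: the step names and the skip parameter are hypothetical, and the command list simply mirrors the script names given under "Additional information". The actual contents of workshop_stream.sh may differ.

```python
import subprocess

# Step names are hypothetical labels; the commands mirror the scripts
# listed under "Additional information".
STEPS = [
    ("merge_scans", ["./merge_scans.sh"]),
    ("burst", ["python3", "burstpdf.py"]),
    ("crop", ["python3", "bounding_box.py"]),
    ("ocr", ["python3", "tesseract_ocr.py"]),
    ("merge_files", ["./merge_files.sh"]),
]


def run_pipeline(skip=(), runner=subprocess.run):
    """Run each step in order, skipping the named ones.

    check=True makes subprocess.run raise on a non-zero exit code,
    so a failed step stops the pipeline instead of feeding bad input
    to the next one.
    """
    executed = []
    for name, command in STEPS:
        if name in skip:
            continue
        runner(command, check=True)
        executed.append(name)
    return executed
```

Calling run_pipeline(skip={"crop"}) plays the same role as commenting out the cropping line in the shell script.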

Additional information

The workflow runs these scripts in the following order:

Create 3 directories

mkdir split
mkdir ocred
mkdir cropped
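
Plain mkdir fails if a directory is left over from an earlier run. A small sketch of the same step in Python that can be repeated safely (the function name and the base parameter are made up for this example):

```python
from pathlib import Path

WORK_DIRS = ("split", "ocred", "cropped")


def create_work_dirs(base="."):
    """Create the three working directories under base if they are missing."""
    created = []
    for name in WORK_DIRS:
        path = Path(base) / name
        # exist_ok=True makes a second run a no-op instead of an error.
        path.mkdir(parents=True, exist_ok=True)
        created.append(path)
    return created
```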

Merge the files in the directory scans

All the scans will be appended to one PDF called out.pdf.

./merge_scans.sh
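
Page order depends on the order in which the scans are handed to the merge tool. The sketch below is an assumption about how such a merge could be done, not a copy of merge_scans.sh: it sorts the file names naturally (so scan10 comes after scan2) and builds an ImageMagick convert command. The helper names, the list of suffixes and the location of out.pdf are all hypothetical.

```python
import re
from pathlib import Path

IMAGE_SUFFIXES = {".jpg", ".jpeg", ".png", ".tif", ".tiff"}


def natural_key(name):
    """Split digits from text so that 'scan10' sorts after 'scan2'."""
    return [int(part) if part.isdigit() else part.lower()
            for part in re.split(r"(\d+)", name)]


def build_merge_command(scan_dir="scans", output="out.pdf"):
    """Return the argument list for: convert <sorted scans> <output>."""
    scans = sorted(
        (p for p in Path(scan_dir).iterdir() if p.suffix.lower() in IMAGE_SUFFIXES),
        key=lambda p: natural_key(p.name),
    )
    return ["convert", *map(str, scans), str(output)]
```

The command can then be executed with subprocess.run(build_merge_command(), check=True).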

Burst the PDF in scans

Burst this PDF, renaming the files so they can be iterated over later.

python3 burstpdf.py
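
As an illustration of this step, and not the contents of burstpdf.py itself, bursting can be sketched with pdf2image. The zero-padded file names are one way to make the pages sort correctly when iterated later. The paths, the dpi value and both function names are assumptions.

```python
from pathlib import Path


def page_filename(index, total, suffix=".jpg"):
    """Zero-pad the page number so that file names sort in page order."""
    width = max(3, len(str(total)))
    return f"{index:0{width}d}{suffix}"


def burst_pdf(pdf_path="out.pdf", out_dir="split", dpi=300):
    """Render every page of pdf_path to a numbered JPG in out_dir."""
    # Imported here so the naming helper works without pdf2image installed.
    from pdf2image import convert_from_path

    pages = convert_from_path(str(pdf_path), dpi=dpi)
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    written = []
    for index, page in enumerate(pages, start=1):
        target = out / page_filename(index, len(pages))
        page.save(target, "JPEG")
        written.append(target)
    return written
```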

Cropping the bounding boxes

The pages are now in their original position, but they have a bounding box. This script iterates through them and crops each one to the highest-contrast area found.

python3 bounding_box.py
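
How bounding_box.py finds the area is not described here. One common approach, given purely as a sketch, is Otsu thresholding followed by a crop to the largest bright region. This assumes the page is lighter than the scanner background; the function names and paths are hypothetical.

```python
def largest_box(boxes):
    """Pick the (x, y, w, h) box with the largest area, or None if empty."""
    return max(boxes, key=lambda box: box[2] * box[3], default=None)


def crop_to_page(src, dst):
    """Crop the image at src to its largest bright region and save to dst."""
    # Imported here so largest_box can be used without OpenCV installed.
    import cv2

    image = cv2.imread(str(src))
    if image is None:
        raise FileNotFoundError(src)
    gray = cv2.cvtColor(image, cv2.COLOR_BGR2GRAY)
    gray = cv2.GaussianBlur(gray, (5, 5), 0)
    # Otsu picks the threshold separating light page from dark background.
    _, mask = cv2.threshold(gray, 0, 255, cv2.THRESH_BINARY + cv2.THRESH_OTSU)
    # [-2] selects the contour list in both OpenCV 3 and OpenCV 4.
    contours = cv2.findContours(mask, cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_SIMPLE)[-2]
    box = largest_box([cv2.boundingRect(contour) for contour in contours])
    if box is None:
        return False
    x, y, w, h = box
    cv2.imwrite(str(dst), image[y:y + h, x:x + w])
    return True
```

On scans where the background is lighter than the page, the threshold would need to be inverted (cv2.THRESH_BINARY_INV).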

OCR (Optical Character Recognition)

This step runs OCR on the JPGs, turning each one into a PDF.

python3 tesseract_ocr.py
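
A minimal sketch of this step with pytesseract, assuming the cropped JPGs sit in cropped and the results go to ocred (the directory roles, the function names and the lang default are assumptions, not taken from tesseract_ocr.py):

```python
from pathlib import Path


def ocr_target(jpg_path, out_dir="ocred"):
    """Map e.g. cropped/007.jpg to ocred/007.pdf."""
    return Path(out_dir) / (Path(jpg_path).stem + ".pdf")


def ocr_folder(in_dir="cropped", out_dir="ocred", lang="eng"):
    """OCR every JPG in in_dir into a one-page PDF with a text layer."""
    # Imported here so ocr_target can be used without pytesseract installed.
    import pytesseract

    Path(out_dir).mkdir(parents=True, exist_ok=True)
    written = []
    for jpg in sorted(Path(in_dir).glob("*.jpg")):
        # Returns the PDF as bytes: the page image plus the recognised text.
        pdf_bytes = pytesseract.image_to_pdf_or_hocr(str(jpg), extension="pdf", lang=lang)
        target = ocr_target(jpg, out_dir)
        target.write_bytes(pdf_bytes)
        written.append(target)
    return written
```

For books in other languages, pass the matching Tesseract language code, provided that language data is installed.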

Merge all the files and create the PDF

The OCRed pages are now joined into their final PDF. Your book is ready :)

./merge_files.sh
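
The joining itself can be done with pdfunite, which takes the input PDFs followed by the output file. The sketch below only builds that command. It assumes the per-page PDFs are in ocred with zero-padded names, so a plain sort gives page order; book.pdf and the function name are hypothetical.

```python
from pathlib import Path


def build_unite_command(pdf_dir="ocred", output="book.pdf"):
    """Return the argument list for: pdfunite <sorted pages> <output>."""
    pages = sorted(Path(pdf_dir).glob("*.pdf"))
    if not pages:
        raise ValueError(f"no PDF pages found in {pdf_dir}")
    return ["pdfunite", *map(str, pages), str(output)]
```

Run it with subprocess.run(build_unite_command(), check=True).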

License

The package is available as open source under the terms of the [MIT License](https://opensource.org/licenses/MIT).