Difference between revisions of "Flatbed scan a book and process it into a PDF"
Revision as of 10:36, 6 October 2021
https://git.xpub.nl/pedrosaclout/Flatbed_Scanner_Workflow
Notes
A flatbed scanner is a commonly found device that can be used to scan a book, essentially making a sequence of images from its pages. These scripts do further work of processing the scans to produce a PDF with a selectable text layer.
Getting started
This set of scripts was written for the Text Laundrette workshop. The workshop first took place in the Publication Station, WDkA building, Rotterdam, on 03-02-2020.
This is a workflow to turn the pictures from a Flatbed Scanner into a final OCRed PDF.
About the workshop
DESCRIPTION
We will use a home-made, DIY book scanner, and open-source software to scan, process, and add digital features to printed texts brought by the participants to the workshop. Ultimately, we will include them in the “bootleg library”, a shadow library accessible over a local network.
Shadow libraries operate outside of legal copyright frameworks, in response to decreased open access to knowledge. This workshop aims to extend our research on libraries, their sociability, and methods by which we can add provenance to texts included in public or private, legal or extra-legal collections.
Participants should bring: a printed text, which they’d like to digitize and share.
Dependencies
Install the dependencies with Homebrew (macOS) or apt-get (Linux).
On macOS, you’ll need the command-line tools for Xcode installed.
xcode-select --install
Then install Homebrew. The Ruby-based installer is deprecated; the current installer is a Bash script.
/bin/bash -c "$(curl -fsSL https://raw.githubusercontent.com/Homebrew/install/HEAD/install.sh)"
Run the following command once you’re done to ensure Homebrew is installed and working properly:
brew doctor
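On Linux (Debian or Ubuntu), install the tools with apt-get. pdfunite is part of the poppler-utils package, and pytesseract needs the Tesseract program itself:
sudo apt-get install python3 python3-pip imagemagick poppler-utils tesseract-ocr
On macOS, install them with Homebrew. pdfunite is part of Poppler, and pip3 comes with Homebrew’s Python:
brew install python3 imagemagick poppler tesseract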
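pip3
Install the Python packages. time and logging are part of the Python standard library, so they are not installed with pip:
pip3 install pdf2image Pillow opencv-python pytesseract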
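Before running the workflow, it can help to confirm that everything is installed. The following is a minimal sketch, not part of the repository: the function name and the two lists are assumptions made for illustration. It assumes ImageMagick is reachable as `convert` (newer releases also provide `magick`) and that Poppler provides `pdfunite`.

```python
import importlib.util
import shutil

# Command-line tools the workflow relies on (assumed names):
# "convert" ships with ImageMagick, "pdfunite" with Poppler, and
# pytesseract is a wrapper that needs the "tesseract" binary.
COMMANDS = ["python3", "convert", "pdfunite", "tesseract"]

# Importable module names, which differ from some pip package names:
# Pillow is imported as PIL, opencv-python as cv2.
MODULES = ["pdf2image", "PIL", "cv2", "pytesseract"]


def missing_dependencies(commands=COMMANDS, modules=MODULES):
    """Return (missing_commands, missing_modules) for this machine."""
    missing_commands = [c for c in commands if shutil.which(c) is None]
    missing_modules = [m for m in modules if importlib.util.find_spec(m) is None]
    return missing_commands, missing_modules


if __name__ == "__main__":
    commands, modules = missing_dependencies()
    print("Missing commands:", ", ".join(commands) or "none")
    print("Missing Python modules:", ", ".join(modules) or "none")
```

Anything reported as missing can be installed with the brew, apt-get or pip3 commands above.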
How to use the scripts
Your scans must look like this for the scripts to perform correctly.
RIGHT PAGE
---------------------
|                   |
|----------         |
|         |         |
|         |         |
|         |         |
|         |         |
|         |         |
|   01    |         |
|----------         |
|                   |
---------------------

LEFT PAGE                RIGHT PAGE
---------------------    ---------------------
|                   |    |                   |
|         ----------|    |----------         |
|         |         |    |         |         |
|         |         |    |         |         |
|         |         |    |         |         |
|         |         |    |         |         |
|         |         |    |         |         |
|         |   02    |    |   03    |         |
|         ----------|    |----------         |
|                   |    |                   |
---------------------    ---------------------
Add your pictures from the book scanner to the "scans" folder.
Make all the files executable:
<syntaxhighlight lang="bash">
chmod +x merge_scans.sh workshop_stream.sh merge_files.sh
</syntaxhighlight>
To skip any of the scripts, comment out its line in workshop_stream.sh.
Run ./workshop_stream.sh
Wait :)
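As an illustration of running the steps in order and skipping some of them, here is a hedged Python sketch. It is not the content of workshop_stream.sh, which is not reproduced on this page. The step labels and the `plan` and `run` functions are made up for this example; the commands and their order are the ones listed under Additional information.

```python
import subprocess

# Steps in the order this page lists them. The labels are hypothetical;
# the commands are the ones named in the workflow.
STEPS = [
    ("directories", ["mkdir", "-p", "split", "ocred", "cropped"]),
    ("merge_scans", ["./merge_scans.sh"]),
    ("burst", ["python3", "burstpdf.py"]),
    ("crop", ["python3", "bounding_box.py"]),
    ("ocr", ["python3", "tesseract_ocr.py"]),
    ("merge_files", ["./merge_files.sh"]),
]


def plan(skip=()):
    """Return the commands to run, leaving out any step named in skip."""
    unknown = set(skip) - {name for name, _ in STEPS}
    if unknown:
        raise ValueError(f"unknown step(s): {sorted(unknown)}")
    return [command for name, command in STEPS if name not in skip]


def run(skip=(), runner=subprocess.run):
    """Run the planned commands in order, stopping at the first failure."""
    for command in plan(skip):
        runner(command, check=True)
```

For example, `run(skip=["crop"])` would run every step except the cropping script, which has the same effect as commenting out that line in the shell script.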
Additional information
The workflow follows these scripts, in successive order:
Create 3 directories
<syntaxhighlight lang="bash">
mkdir split
mkdir ocred
mkdir cropped
</syntaxhighlight>
Merge the files in the directory scans
All the scans will be appended to one pdf called out.pdf
<syntaxhighlight lang="bash">
./merge_scans.sh
</syntaxhighlight>
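The same idea can be sketched in Python with Pillow, which can write several images into one PDF. This is an alternative illustration, not what merge_scans.sh does internally. The function names and the list of accepted file extensions are assumptions.

```python
from pathlib import Path

# File types treated as scans (an assumption for this sketch).
IMAGE_SUFFIXES = {".jpg", ".jpeg", ".png", ".tif", ".tiff"}


def list_scans(folder="scans"):
    """Return the image files in folder, sorted by file name."""
    return sorted(
        p for p in Path(folder).iterdir()
        if p.is_file() and p.suffix.lower() in IMAGE_SUFFIXES
    )


def merge_scans(folder="scans", output="out.pdf"):
    """Append every scan to one PDF, one image per page."""
    from PIL import Image  # Pillow; imported here so list_scans works without it

    scans = list_scans(folder)
    if not scans:
        raise FileNotFoundError(f"no scans found in {folder}")
    pages = [Image.open(p).convert("RGB") for p in scans]
    # save_all with append_images writes a multi-page PDF.
    pages[0].save(output, save_all=True, append_images=pages[1:])
```

Because the pages are ordered by file name, scans should be named so that alphabetical order matches page order, for instance with zero-padded numbers.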
Burst the PDF into pages
Burst this PDF into single pages, naming the files so they can be iterated over later.
<syntaxhighlight lang="bash">
python3 burstpdf.py
</syntaxhighlight>
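A minimal sketch of what bursting can look like with pdf2image follows. It is not the code of burstpdf.py. The zero-padded naming scheme, the "split" target folder, the 300 dpi value and the function names are assumptions chosen for illustration.

```python
from pathlib import Path


def page_path(index, folder="split", digits=4):
    """Return a zero-padded path such as split/0007.jpg for page 7."""
    return Path(folder) / f"{index:0{digits}d}.jpg"


def burst_pdf(pdf="out.pdf", folder="split", dpi=300):
    """Render each page of pdf to a numbered JPG and return the paths."""
    from pdf2image import convert_from_path  # needs Poppler installed

    Path(folder).mkdir(exist_ok=True)
    written = []
    # convert_from_path returns one Pillow image per PDF page.
    for index, page in enumerate(convert_from_path(pdf, dpi=dpi), start=1):
        target = page_path(index, folder)
        page.save(target, "JPEG")
        written.append(target)
    return written
```

Zero-padding keeps alphabetical order and page order identical, so later steps can iterate over the files with a plain sort.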
Cropping the bounding boxes
The pages are now in their original position, but they have a bounding box. This script iterates through the pages and crops each one to the highest-contrast area found.
<syntaxhighlight lang="bash">
python3 bounding_box.py
</syntaxhighlight>
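To illustrate the cropping idea, here is a simplified sketch. It assumes the page is brighter than the scanner background and finds the box around all pixels above a brightness threshold. This is a stand-in technique, not the contrast detection used in bounding_box.py; the function names and the threshold value are hypothetical.

```python
def content_box(rows, threshold=128):
    """Return (left, top, right, bottom) around every pixel above threshold.

    rows is a sequence of rows of grayscale values (0 to 255). right and
    bottom are exclusive, matching Pillow's crop box. Returns None if no
    pixel is brighter than threshold.
    """
    ys = [y for y, row in enumerate(rows) if any(v > threshold for v in row)]
    if not ys:
        return None
    xs = [x for row in rows for x, v in enumerate(row) if v > threshold]
    return min(xs), ys[0], max(xs) + 1, ys[-1] + 1


def crop_page(source, target, threshold=128):
    """Crop the image at source to its bright area and save it to target."""
    from PIL import Image  # Pillow; only needed for reading and writing

    image = Image.open(source)
    gray = image.convert("L")
    width, height = gray.size
    data = gray.tobytes()  # one byte per pixel, row by row
    rows = [data[y * width:(y + 1) * width] for y in range(height)]
    box = content_box(rows, threshold)
    (image if box is None else image.crop(box)).save(target)
```

Scanning every pixel in pure Python is slow on high-resolution scans; it is written this way only to make the logic visible. A library such as OpenCV does the same kind of work much faster.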
OCR (Optical Character Recognition)
In this part we run OCR on each JPG, turning it into a PDF with a text layer.
<syntaxhighlight lang="bash">
python3 tesseract_ocr.py
</syntaxhighlight>
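The OCR step can be sketched with pytesseract, whose `image_to_pdf_or_hocr` function returns a searchable PDF as bytes. This is an illustration under assumptions, not the code of tesseract_ocr.py: the "cropped" and "ocred" folder roles, the function names and the default language are guesses based on the directory names above.

```python
from pathlib import Path


def ocr_target(image_path, folder="ocred"):
    """Map an image such as cropped/0007.jpg to ocred/0007.pdf."""
    return Path(folder) / (Path(image_path).stem + ".pdf")


def ocr_pages(source="cropped", folder="ocred", lang="eng"):
    """OCR every JPG in source and write one searchable PDF per page."""
    import pytesseract  # wrapper around the tesseract binary

    Path(folder).mkdir(exist_ok=True)
    written = []
    for image_path in sorted(Path(source).glob("*.jpg")):
        # Returns the page as PDF bytes with an invisible text layer.
        pdf_bytes = pytesseract.image_to_pdf_or_hocr(
            str(image_path), extension="pdf", lang=lang
        )
        target = ocr_target(image_path, folder)
        target.write_bytes(pdf_bytes)
        written.append(target)
    return written
```

For texts in other languages, pass the matching Tesseract language code as `lang`, provided that language data is installed.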
Merge all the files and create the pdf
The OCRed pages are now joined into the final PDF. Your book is ready :)
<syntaxhighlight lang="bash">
./merge_files.sh
</syntaxhighlight>
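Joining the pages can be sketched around Poppler's `pdfunite`, which takes the input files followed by the output file. The sketch below is an assumption about how such a merge could be driven from Python, not the content of merge_files.sh. The output name book.pdf and the function names are hypothetical. It sorts numerically so that page 10 comes after page 2 even without zero-padding.

```python
import re
import subprocess
from pathlib import Path


def natural_key(path):
    """Sort key that orders 2.pdf before 10.pdf."""
    parts = re.split(r"(\d+)", Path(path).name)
    # re.split with a capturing group alternates text and digit runs,
    # so odd positions are always numbers.
    return [int(p) if i % 2 else p.lower() for i, p in enumerate(parts)]


def pdfunite_command(pages, output="book.pdf"):
    """Build the pdfunite call: inputs in page order, then the output."""
    pages = sorted(pages, key=natural_key)
    if not pages:
        raise ValueError("no pages to merge")
    return ["pdfunite", *map(str, pages), str(output)]


def merge_pages(folder="ocred", output="book.pdf"):
    """Join every PDF in folder into output."""
    command = pdfunite_command(Path(folder).glob("*.pdf"), output)
    subprocess.run(command, check=True)
```

Building the command separately from running it makes it easy to print and inspect the page order before anything is written.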
License
The package is available as open source under the terms of the [MIT License](https://opensource.org/licenses/MIT).