Digitising, scanning, processing and republishing

From Parallel Library Services
Location: At Varia (Gouwstraat 3, Rotterdam), and online
Date: December 8th, 2021
Time: 16:00-19:00 CET
Pad: https://pad.simonbrowne.biz/p/pls-meeting-5
Tools: Tesseract, OCRmyPDF, PDFsandwich
Guests: Pedro Sá Couto

Context

Digitising printed matter involves more than scanning: for a file to be searchable, it also needs a text layer. In this workshop, we were joined by guest speaker Pedro Sá Couto, a designer and PhD researcher based in Porto, Portugal, whose work addresses surveillance in the publishing of digital and analog media. Pedro presented projects such as Tactical Watermarks, an online republishing platform that adds user-generated watermarks to uploaded PDFs. His more recent PhD research follows copy shops located near Portuguese academic institutions, which act as “informal libraries”.

Activities

Pertinent to this topic, we explored the process of digitising printed books, from scan to a PDF with an OCR (Optical Character Recognition) layer. The second half of the workshop looked closely at Tesseract, an open-source OCR engine. While Tesseract does a good job of recognising the characters in printed text, other software is needed to compile the PDF. Following some experiments with Tesseract, we trialled OCRmyPDF and PDFsandwich, each of which can run OCR and compile the PDF in one command.
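The two steps described above could be sketched in Python roughly as follows. This is not a record of the commands used in the workshop: the function names, file names and the default language code are assumptions made for illustration. The sketch builds a plain Tesseract call that writes recognised text to a .txt file, and an OCRmyPDF call that adds a text layer to an existing PDF, and it only runs a command if the tool is found on the PATH.

```python
import shutil
import subprocess
from pathlib import Path


def tesseract_command(image, lang="eng"):
    """Build a Tesseract call that writes recognised text to <image name>.txt."""
    image = Path(image)
    # Tesseract takes an output base name and appends ".txt" itself.
    output_base = image.with_suffix("")
    return ["tesseract", str(image), str(output_base), "-l", lang]


def ocrmypdf_command(input_pdf, output_pdf, lang="eng"):
    """Build an OCRmyPDF call that runs OCR and writes a PDF with a text layer."""
    return ["ocrmypdf", "-l", lang, str(input_pdf), str(output_pdf)]


def run_if_installed(command):
    """Run the command only if its program is on the PATH; report otherwise."""
    if shutil.which(command[0]) is None:
        print(f"{command[0]} not found on PATH, skipping")
        return False
    subprocess.run(command, check=True)
    return True
```

With both tools installed, calling run_if_installed(ocrmypdf_command("book.pdf", "book_ocr.pdf")) would be expected to leave the scan untouched and write a searchable copy alongside it. The language code must match an installed Tesseract language pack, for example "por" for Portuguese.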

Add a text layer to a PDF with OCRmyPDF

Make searchable PDFs