Imagining librarianship & experiments with document conversion

From Parallel Library Services
Location: At Varia (Gouwstraat 3, Rotterdam), and online
Date: November 24th, 2021
Time: 16:00-19:00 CET
Pad: https://pad.simonbrowne.biz/p/pls-meeting-4

Context

PDF (Portable Document Format) is a highly popular digital file format for ebooks. In this workshop, we created, queried and embedded metadata in a PDF using tools such as Pandoc, ExifTool and, of course, Calibre, "the Swiss Army knife of document conversion".

Activities

After some catching up on the contexts of our projects, we discussed the plan for today:

  • a tour of Calibre
  • hybrid publishing workflows
  • embedding metadata in PDFs
  • making digital files (EPUB, PDF) with pandoc
  • converting between file formats in Calibre (.docx > .epub)

The first half of the workshop involved taking a close look at Calibre and hybrid publishing workflows using plain text file formats such as HTML and Markdown.

Calibre's main toolbar preferences

We followed a tutorial (originally written by Roel Roscam-Abbing) which shows how to inspect metadata in PDFs using ExifTool, and then embed it with a Calibre plugin. This plugin, like many others that extend Calibre's functionality, can easily be added to Calibre's main toolbar. It was important to note that this is only possible in Calibre itself; Calibre-web does not support this or any other plugin.
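The inspection step of that tutorial might look roughly like the following. This is a minimal sketch rather than the tutorial's own commands: the file name book.pdf is a placeholder, and the script assumes ExifTool is installed and simply skips the step if it is not.

```shell
#!/bin/sh
# Hypothetical file name; substitute a PDF of your own.
PDF="book.pdf"

if command -v exiftool >/dev/null 2>&1 && [ -f "$PDF" ]; then
    # List every metadata tag ExifTool can read from the PDF.
    exiftool "$PDF"
    # Show only the tags most relevant to a library catalogue.
    exiftool -Title -Author -Keywords "$PDF"
    STATUS="inspected"
else
    # Either ExifTool is missing or the placeholder file does not exist.
    STATUS="skipped"
fi
echo "$STATUS"
```

In the workshop the metadata was embedded with a Calibre plugin. ExifTool can also write tags (for example exiftool -Title="my new document" book.pdf), but that is a different route from the one the tutorial describes.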

Pls-workshop-04.png

Our workshop was documented on a pad, using Markdown to create structure. Markdown is a lightweight markup language that can be useful in hybrid publishing, where inputs (plain text) may have many outputs (file formats). From one document it is possible to create a variety of files, including EPUB, PDF, HTML and even Wikitext, the syntax MediaWiki uses.

Pandoc's Markdown supports YAML metadata headers, and pandoc expects a title in the initial metadata block when it produces standalone output such as HTML or EPUB:

---
title: my new document
---

After the metadata block, Markdown uses a simple syntax to make headings, paragraphs, bold and italic text, lists (ordered and unordered), hyperlinks, and many more elements that can easily be converted to multiple file formats. This is part of a Markdown publishing workflow, whereby content is gathered and structured in plain text documents. These are usually a source Markdown document with the file extension .md, and a stylesheet (in CSS, for example) with the file extension .css.
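As an illustration of such a pair of files, the sketch below writes a small Markdown source and a matching stylesheet from the terminal. The file name example.md, the sample text and the CSS rules are all made up for this example and are not the workshop's own files.

```shell
#!/bin/sh
# Write a hypothetical Markdown source with a YAML metadata block.
cat > example.md <<'EOF'
---
title: my new document
---

# A heading

A paragraph with **bold** and *italic* text, and a [hyperlink](https://example.org).

- an unordered item
- another unordered item

1. an ordered item
2. another ordered item
EOF

# Write a hypothetical stylesheet; the values are arbitrary starting points.
cat > stylesheet.css <<'EOF'
@page { size: A5; margin: 2cm; }
body { font-family: serif; line-height: 1.4; }
h1 { font-size: 1.6em; }
EOF
```

Keeping content in the .md file and appearance in the .css file means either can be changed without touching the other.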

A traditional publishing workflow, with linear content creation and intense design activity to produce many formats (image from the Digital Publishing Toolkit, pg 92)
A "single source" publishing workflow, using a markup language such as Markdown to create content and design in parallel, with multiple formats to export to (image from the Digital Publishing Toolkit, pg 97)

We then exported the pad to a plain text format by running curl in a terminal:

curl https://pad.simonbrowne.biz/p/pls-meeting-4/export/txt -o pls-meeting-4.md

This saves the pad as a plain text file. From it, we can use Markdown and CSS to make a PDF with pandoc, using WeasyPrint as the PDF rendering engine:

pandoc --pdf-engine=weasyprint -c stylesheet.css -s pls-meeting-4.md -o pls-meeting-4.pdf
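The same source file can feed the other outputs mentioned earlier. The following is a sketch of how that might be done, assuming pandoc is installed and pls-meeting-4.md has been fetched with the curl command above; the output file names are our own choice.

```shell
#!/bin/sh
SRC="pls-meeting-4.md"   # the file fetched with curl above

if command -v pandoc >/dev/null 2>&1 && [ -f "$SRC" ]; then
    # EPUB: pandoc infers the output format from the .epub extension.
    pandoc -s "$SRC" -o pls-meeting-4.epub
    # Standalone HTML, linked to the same stylesheet used for the PDF.
    pandoc -s -c stylesheet.css "$SRC" -o pls-meeting-4.html
    # Wikitext, using pandoc's MediaWiki writer.
    pandoc "$SRC" -t mediawiki -o pls-meeting-4.wiki
else
    echo "pandoc or $SRC not found; nothing converted" >&2
fi
```

If the source has no title in its metadata block, pandoc will likely print a warning for the HTML and EPUB outputs but still write the files.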

Alongside PDF, EPUB is also a common digital book format. Using Calibre's document conversion features, we tried converting a Microsoft Word document (in .docx format) from

File:Workshop 04.md.pdf