Imagining librarianship & experiments with document conversion
Location: At Varia (Gouwstraat 3, Rotterdam), and online
Date: November 24th, 2021
Time: 16:00-19:00 CET
Tools: Pandoc, ExifTool, Calibre
PDF (Portable Document Format) is a highly popular digital file format for ebooks. In this workshop, we created, queried and embedded metadata in a PDF using tools such as Pandoc, ExifTool and of course Calibre, "the Swiss Army knife of document conversion".
After some catching up on the contexts of our projects, we discussed the plan for today:
- a tour of Calibre
- hybrid publishing workflows
- embedding metadata in PDFs
- making digital files (EPUB, PDF) with pandoc
- converting between file formats in Calibre (.docx > .epub)
The first half of the workshop involved taking a close look at Calibre and hybrid publishing workflows using plain text file formats such as HTML and Markdown.
We followed a tutorial (originally written by Roel Roscam-Abbing) which shows how to inspect metadata in PDFs using ExifTool, and then embed it with a Calibre plugin. This plugin, like the many others that extend Calibre's functionality, can easily be added to Calibre's main toolbar. Note that this is only possible in Calibre itself: Calibre-web does not support this plugin, or any other.
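As a rough sketch of the inspection step, the shell function below wraps two ExifTool calls. The function name `inspect_pdf` and the file name `book.pdf` are assumptions made for illustration and are not taken from the tutorial we followed; the tag names are standard ExifTool tags for PDF files.

```shell
# Sketch: print the metadata of a PDF with ExifTool.
# "inspect_pdf" and "book.pdf" are hypothetical names.
inspect_pdf() {
    pdf="$1"
    if ! command -v exiftool >/dev/null 2>&1; then
        echo "exiftool is not installed" >&2
        return 1
    fi
    if [ ! -f "$pdf" ]; then
        echo "no such file: $pdf" >&2
        return 1
    fi
    # Print every tag ExifTool can read from the file.
    exiftool "$pdf"
    # Print only a few descriptive tags.
    exiftool -Title -Author -Subject -Keywords "$pdf"
}

# Example call; prints a notice if nothing could be inspected.
inspect_pdf "book.pdf" || echo "nothing inspected" >&2
```

Tags that come back empty or missing are the ones worth filling in before the file goes into a library.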
Our workshop was documented on a pad using Markdown to create structure. Markdown is a lightweight markup language that can be useful in hybrid publishing, where inputs (plain text) may have many outputs (file formats). From a single document it is possible to create a variety of files, including EPUB, PDF, HTML and even Wikitext, the syntax MediaWiki uses.
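The one-source, many-outputs idea could be sketched with pandoc roughly as follows. The file names and sample text are assumptions made for illustration; the conversions run in a temporary directory and are skipped if pandoc is not installed.

```shell
# Sketch: one Markdown source, several output formats.
# File names and sample text are illustrative assumptions.
workdir=$(mktemp -d)
src="$workdir/document.md"

# Write a tiny Markdown source so the sketch is self-contained.
printf '%s\n' '# A heading' '' 'Some *emphasised* text.' > "$src"

if command -v pandoc >/dev/null 2>&1; then
    # Standalone HTML page; the title is passed as metadata on the command line.
    pandoc -s --metadata title="my new document" "$src" -o "$workdir/document.html"
    # EPUB; pandoc infers the output format from the file extension.
    pandoc --metadata title="my new document" "$src" -o "$workdir/document.epub"
    # Wikitext, the syntax MediaWiki uses.
    pandoc "$src" -t mediawiki -o "$workdir/document.wiki"
    ls "$workdir"
else
    echo "pandoc is not installed; skipping the conversions" >&2
fi
```

The source file is never modified; each output is regenerated from it, so a correction made once in the Markdown carries through to every format.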
Pandoc's Markdown supports YAML metadata headers, and a title should be set in the initial metadata block (pandoc warns when a standalone HTML document has none):
---
title: my new document
---
After this, it uses a simple syntax to make headings, paragraphs, bold and italic text, lists (ordered and unordered), hyperlinks, and many more elements that can easily be converted to multiple file formats. This is part of a Markdown publishing workflow, whereby content is gathered and structured in plain text documents. These are usually a source Markdown document with the extension .md, and a stylesheet (in CSS, for example) with the file extension .css.
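To make the two files concrete, the sketch below writes a small Markdown source and a matching stylesheet into a temporary directory. The file names, the sample text, the link target and the CSS rules are all assumptions made for illustration, not the files used in the workshop.

```shell
# Sketch: a minimal source document and stylesheet for a Markdown workflow.
# All names and contents here are illustrative assumptions.
workdir=$(mktemp -d)

# The Markdown source: a YAML metadata block, then the content.
cat > "$workdir/document.md" <<'EOF'
---
title: my new document
---

# A heading

A paragraph with **bold** and *italic* text, and a [hyperlink](https://example.org).

- an unordered item
- another unordered item

1. a first step
2. a second step
EOF

# The stylesheet, applied when the document is rendered to HTML or PDF.
cat > "$workdir/stylesheet.css" <<'EOF'
body { font-family: serif; line-height: 1.4; }
h1 { font-size: 1.6em; }
EOF

ls "$workdir"
```

Keeping content and styling in separate files means the same stylesheet can be reused across many documents.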
We then exported the pad to a plain text format by running curl in a terminal:
curl https://pad.simonbrowne.biz/p/pls-meeting-4/export/txt -o pls-meeting-4.md
This saves the pad as a plain text file. From there, we can use Markdown and CSS to make a PDF with pandoc, using WeasyPrint as the PDF rendering engine:
pandoc --pdf-engine=weasyprint -c stylesheet.css -s pls-meeting-4.md -o pls-meeting-4.pdf
Alongside PDF, EPUB is also a common digital book format. Using Calibre's document conversion features, we tried converting a Microsoft Word document (in .docx format) from