Convert a .docx file to EPUB with Calibre

From Parallel Library Services

Revision as of 20:41, 16 November 2021

This recipe for making an ePub from one or more .docx files has been adapted from instructions by Silvio Lorusso, published on the Institute of Network Cultures blog.

Before starting, you'll need to download and install Calibre, along with the demo.docx file made available by the Calibre team:

http://calibre-ebook.com/downloads/demos/demo.docx

Converting demo.docx to demo.epub

You can convert a .docx file directly into an ePub using Calibre. For this, we will use the demo.docx file provided by the Calibre developers. Add this file to your Calibre library either via the "Add books" button or by dragging and dropping the file into the Calibre interface.

Adding docx to Calibre.png

Click on the "Convert books" button, and choose ePub as the destination (output) format.

Converting docx to EPUB Calibre.png

You will now see that an ePub format has been added to Calibre under the same listing. Open the file in an e-reader to inspect it. Because we didn't specify a cover image, Calibre generated a default cover. The features of the .docx are largely preserved in the ePub, apart from some minor glitches:

Demo epub inside.png

Unzipping the ePub

We can look inside the ePub by unzipping it, since an ePub is a zip archive. To do this, find the file demo.epub on your computer and change the file extension from .epub to .zip so that your usual archive tool can open it. Alternatively, on a Mac, open a terminal session and, in the directory that contains demo.epub, run the following command:

unzip demo.epub
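
If you'd rather not rename the file, or have no unzip command at hand, Python's standard zipfile module can do the same job. This is only a minimal sketch: the function name extract_epub and the output folder are made up for illustration.

```python
import zipfile
from pathlib import Path


def extract_epub(epub_path, dest_dir):
    """Extract every file of the ePub (a zip archive) into dest_dir."""
    dest = Path(dest_dir)
    dest.mkdir(parents=True, exist_ok=True)
    with zipfile.ZipFile(epub_path) as archive:
        archive.extractall(dest)
        # Return the member names so the caller can see what was inside.
        return archive.namelist()
```

For instance, extract_epub("demo.epub", "demo") would unpack the book into a folder called demo and return the list of file names found in the archive.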

Anatomy of an ePub

Demo epub unzipped.png

You'll now see that the ePub is made up of a collection of different files, mainly:

  • mimetype
  • fonts folder - contains the document fonts
  • META-INF folder
    • container.xml - tells the reader software where in the zip file to find the book
  • OEBPS folder - the book's content (the folder name can vary)
    • images folder - images (PNG) go here (the folder name can vary)
    • content.opf - lists what's in the zip file
    • toc.ncx - the table of contents
    • xhtml files - the book's contents are in these
    • CSS files
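
Since folder names can vary from one ePub to another, the reliable way to locate the .opf file is to do what reader software does and ask container.xml. The sketch below assumes a standard META-INF/container.xml; the function name find_opf_path is hypothetical.

```python
import zipfile
import xml.etree.ElementTree as ET

# XML namespace used by META-INF/container.xml
CONTAINER_NS = {"c": "urn:oasis:names:tc:opendocument:xmlns:container"}


def find_opf_path(epub_path):
    """Return the path of the .opf file declared in container.xml."""
    with zipfile.ZipFile(epub_path) as archive:
        root = ET.fromstring(archive.read("META-INF/container.xml"))
    rootfile = root.find("c:rootfiles/c:rootfile", CONTAINER_NS)
    if rootfile is None:
        raise ValueError("container.xml declares no rootfile")
    # full-path is relative to the root of the zip archive.
    return rootfile.get("full-path")
```

The returned value is a path inside the archive, for example one ending in content.opf, so it also tells you which folder plays the role of OEBPS in a given book.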

Practical ePub dissection: Table of Contents

The last file that we need to understand for hacking ePubs is the TOC (table of contents). It is usually called toc.ncx and is located in the OEBPS folder of the ePub container. So far we have covered only ePubs whose content sits in a single HTML file, which is why the TOC had only one item/link.

The three important parts of the toc.ncx file are:

  • <head>: make sure you use the same ebook uid as the one you declared in content.opf.
  • <docTitle>: make sure you use the same title as your ebook, or a similar one. It will be displayed as the book title in your TOC.
  • <navMap>: this is where you describe the chapters of your book. The navMap is made of navPoints. Each navPoint tag represents a chapter and where it is located in the container. For example:
    <navPoint id="navpoint-1" playOrder="1">
      <navLabel>
        <text>Book cover</text>
      </navLabel>
      <content src="title.html"/>
    </navPoint>
    • id: in theory this can be anything you want, but to make it easier to remember and for the sake of compatibility, use the same id names as those defined in content.opf.
    • playOrder: the display order of this item in the TOC. It must be an integer, and the values must be consecutive, e.g. 1, 2, 3, 4.
    • <text>: the title of the chapter as it will be displayed in the TOC.
    • <content>: needs to point to a valid HTML content file that is declared in your container.
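
When a TOC grows beyond a few chapters, it is easy to leave a gap in playOrder or to mistype a src. As a rough sketch, the navPoints can be read back with Python's standard library and checked against the rules above. The function names are invented for illustration, and the code assumes every navPoint carries a playOrder attribute.

```python
import xml.etree.ElementTree as ET

# XML namespace used by toc.ncx files
NCX_NS = {"ncx": "http://www.daisy.org/z3986/2005/ncx/"}


def read_nav_points(ncx_text):
    """Return (id, playOrder, title, src) for each navPoint, in document order."""
    root = ET.fromstring(ncx_text)
    entries = []
    for point in root.iterfind(".//ncx:navPoint", NCX_NS):
        title = point.findtext("ncx:navLabel/ncx:text", default="", namespaces=NCX_NS)
        content = point.find("ncx:content", NCX_NS)
        src = content.get("src") if content is not None else None
        entries.append((point.get("id"), int(point.get("playOrder")), title, src))
    return entries


def play_order_is_consecutive(entries):
    """True when the playOrder values run 1, 2, 3, ... without gaps."""
    orders = [order for _, order, _, _ in entries]
    return orders == list(range(1, len(orders) + 1))
```

Each src value returned this way can then be compared with the files actually present in the container.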

Practical ePub dissection: font embedding

Font embedding lets an ebook designer supply and use their own set of fonts. Not every reader fully supports this feature yet, but a reader that lacks support will simply fall back to its own fonts, so there is no reason not to start using it.

For example:

 @font-face {
   font-family: "Linux Libertine";
   font-style: normal;
   font-weight: normal;
   src:url(LinLibertine_Re.ttf);
 }
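
An embedded font only works if the file named in src is actually present in the container. As a hedged sketch, the font paths can be pulled out of a stylesheet with two regular expressions so they can be checked by hand or by script. The name font_files_in_css is hypothetical, and this simple pattern matching is not a full CSS parser.

```python
import re

# Body of each @font-face { ... } rule
FONT_FACE_RE = re.compile(r"@font-face\s*\{(.*?)\}", re.DOTALL | re.IGNORECASE)
# Target of url(...), with or without surrounding quotes
URL_RE = re.compile(r"url\(\s*['\"]?([^'\")]+?)['\"]?\s*\)", re.IGNORECASE)


def font_files_in_css(css_text):
    """Return the file paths referenced by url(...) inside @font-face rules."""
    paths = []
    for block in FONT_FACE_RE.findall(css_text):
        paths.extend(URL_RE.findall(block))
    return paths
```

Run on the rule above, this would report LinLibertine_Re.ttf, which should then exist next to the stylesheet, since the path is relative to the CSS file.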

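ePub repacking

  • Choose an ePub and unzip it (in this example, into /tmp/epub)
  • Modify some files
  • Repack, adding the mimetype file first and uncompressed, then everything else:
cd /tmp/epub
# -0: store without compression, -X: no extra file attributes, -q: quiet
zip -0Xq my.epub mimetype
# -r: recurse, -9: best compression, -D: no directory entries, -x: skip mimetype (already added)
zip -Xr9Dq my.epub * -x mimetype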
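
Where the zip command is not available, the same two-step packing can be approximated with Python's zipfile module. This is a sketch under assumptions rather than a definitive tool: the function name repack_epub is invented, and the source folder is assumed to hold an unzipped ePub with a mimetype file at its top level.

```python
import zipfile
from pathlib import Path


def repack_epub(source_dir, epub_path):
    """Zip source_dir into epub_path, with an uncompressed mimetype entry first."""
    source = Path(source_dir)
    output = Path(epub_path).resolve()
    with zipfile.ZipFile(output, "w") as archive:
        # Step 1: mimetype goes first and is stored without compression.
        archive.write(source / "mimetype", "mimetype", compress_type=zipfile.ZIP_STORED)
        # Step 2: every other file is added with deflate compression.
        for path in sorted(source.rglob("*")):
            if not path.is_file():
                continue  # no separate directory entries
            if path.resolve() == output:
                continue  # never pack the archive into itself
            name = path.relative_to(source).as_posix()
            if name == "mimetype":
                continue  # already written in step 1
            archive.write(path, name, compress_type=zipfile.ZIP_DEFLATED)
```

For instance, repack_epub("/tmp/epub", "/tmp/my.epub") would mirror the commands above while writing the result outside the source folder.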