Bobbie's Blog

Boys be ambitious.

Preserve the Table of Contents When Converting a Book From Djvu to PDF

There are many readily available tools (e.g. DjVu2PDF) for converting a book from .djvu to .pdf format, but they typically do not preserve the table of contents in the output PDF.

Having a table of contents is very handy. For example, when you view a book in Preview, the table of contents works like a multi-level bookmark: click any entry in the sidebar to jump to that chapter or section of the book.

So I Googled and found this question on StackExchange, which asks exactly what I wanted to know. Here is a summary of the accepted answer on how to preserve (or, more precisely, re-create) the table of contents in a PDF converted from Djvu.

1. Preliminary

You will need to install pdftk (part of PDFtk Server) and djvused (part of DjVuLibre).

Note 1: pdftk for Mac OS X 10.11 and above. I found in this answer on Stack Overflow that the developer of PDFtk provides an installer for PDFtk Server on OS X 10.11 and above. Strangely, the official website only provides the installer for OS X up to 10.8. (This older version can be installed, but it won't run: any pdftk command you type in the Terminal hangs indefinitely.)

Note 2: About djvused command line setup on OS X. After installing DjVuLibre, you need to run the following (note the backticks, which make eval act on the output of setpath.sh) before you can use djvused from the command line:

eval `/Applications/DjView.app/Contents/setpath.sh`

If this doesn’t add the correct path, you can also manually add the following line into ~/.bash_profile

PATH="/Applications/DjView.app/Contents/bin:${PATH}"
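Before moving on, it can help to confirm that both tools are actually reachable from the shell. The following is a minimal sketch, not part of the original answer; the function name missing_tools is my own, and it only checks that the executables are on PATH, not that they run correctly.

```python
import shutil


def missing_tools(tools=("pdftk", "djvused")):
    """Return the names in `tools` that cannot be found on the current PATH."""
    return [name for name in tools if shutil.which(name) is None]


if __name__ == "__main__":
    missing = missing_tools()
    if missing:
        print("Not found on PATH:", ", ".join(missing))
    else:
        print("pdftk and djvused are both available.")
```

If djvused is reported as missing, revisit the PATH setup in Note 2 and open a new Terminal window so the change takes effect.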

2. Convert the Table of Contents

(Note: the material in this section closely follows the original answer on StackExchange, except for the very simple Python program I wrote for Step 2.)

Suppose you have already converted book.djvu into book.pdf: the former has a table of contents, but the latter doesn't.

Step 1. extract Djvu outline

Use the following command to extract the table of contents from book.djvu

djvused "book.djvu" -e 'print-outline' > bmarks.out

The output file bmarks.out lists the table of contents as a serialized tree of S-expressions (SEXPR), whose grammar can be summarized as:

file ::= (bookmarks
           <bookmark>*)
bookmark ::= (name
               page
               <bookmark>*)
name ::= "<character>*"
page ::= "#<digit>+"

Notice that in this format a “child bookmark” is nested inside its “parent bookmark”. For example, a bmarks.out may look like this:

(bookmarks
  ("bmark1"
    "#1")
  ("bmark2"
    "#5"
    ("bmark2subbmark1"
      "#6")
    ("bmark2subbmark2"
      "#7"))
  ("bmark3"
    "#9"
    ...))
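To make the nesting concrete, here is a small sketch that reads an outline in this format into a Python tree of (name, page, children) tuples. It is an illustration under my own assumptions rather than the method used in the original answer: parse_outline and TOKEN are hypothetical names, the sample omits the elided part of the example above, escape sequences inside names are left as-is, and malformed input is not handled gracefully.

```python
import re

# Tokens: an opening or closing parenthesis, a double-quoted string, or a bare word.
TOKEN = re.compile(r'\(|\)|"(?:[^"\\]|\\.)*"|[^\s()"]+')


def parse_outline(text):
    """Parse djvused print-outline text into a list of (name, page, children)."""
    tokens = TOKEN.findall(text)
    pos = 0

    def parse_bookmark():
        nonlocal pos
        # A bookmark is: "(" name page bookmark* ")"
        name = tokens[pos + 1][1:-1]              # drop the surrounding quotes
        page = tokens[pos + 2][1:-1].lstrip("#")  # "#5" becomes "5"
        pos += 3
        children = []
        while tokens[pos] == "(":
            children.append(parse_bookmark())
        pos += 1  # skip this bookmark's closing ")"
        return (name, page, children)

    if tokens[:2] != ["(", "bookmarks"]:
        raise ValueError("expected the outline to start with (bookmarks")
    pos = 2
    bookmarks = []
    while tokens[pos] == "(":
        bookmarks.append(parse_bookmark())
    return bookmarks


sample = '''(bookmarks
  ("bmark1"
    "#1")
  ("bmark2"
    "#5"
    ("bmark2subbmark1"
      "#6")
    ("bmark2subbmark2"
      "#7")))'''

tree = parse_outline(sample)
print(tree)
```

Each closing parenthesis ends exactly one bookmark, which is why the depth of a bookmark equals the number of bookmarks still open when its name is read.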

Step 2. translate the Djvu outline to PDF metadata format

Now, Djvu and PDF store the bookmark data in different formats. While djvused prints the Djvu outline as SEXPR, pdftk reads and writes PDF bookmarks as plain-text metadata records, which look like this:

file ::= <entry>*
entry ::= BookmarkBegin
          BookmarkTitle: <title>
          BookmarkLevel: <number>
          BookmarkPageNumber: <number>
title ::= <character>*

The example in Step 1, when translated into PDF metadata, will look like this:

BookmarkBegin
BookmarkTitle: bmark1
BookmarkLevel: 1
BookmarkPageNumber: 1
BookmarkBegin
BookmarkTitle: bmark2
BookmarkLevel: 1
BookmarkPageNumber: 5
BookmarkBegin
BookmarkTitle: bmark2subbmark1
BookmarkLevel: 2
BookmarkPageNumber: 6
BookmarkBegin
BookmarkTitle: bmark2subbmark2
BookmarkLevel: 2
BookmarkPageNumber: 7
BookmarkBegin
BookmarkTitle: bmark3
BookmarkLevel: 1
BookmarkPageNumber: 9
...

It is a fun exercise to work out the correspondence of the two formats.
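One way to see the correspondence: the nesting depth of a bookmark in the SEXPR tree becomes its BookmarkLevel, and the records are emitted in the order a depth-first walk visits them. The sketch below assumes the outline is already held as nested (name, page, children) tuples; to_pdftk is a hypothetical helper name, and the children elided under bmark3 in the example are simply left out.

```python
def to_pdftk(bookmarks, level=1):
    """Flatten nested (name, page, children) tuples into pdftk record lines."""
    lines = []
    for name, page, children in bookmarks:
        lines += [
            "BookmarkBegin",
            f"BookmarkTitle: {name}",
            f"BookmarkLevel: {level}",        # depth in the tree, starting at 1
            f"BookmarkPageNumber: {page}",
        ]
        # Children follow their parent immediately, one level deeper.
        lines += to_pdftk(children, level + 1)
    return lines


outline = [
    ("bmark1", "1", []),
    ("bmark2", "5", [
        ("bmark2subbmark1", "6", []),
        ("bmark2subbmark2", "7", []),
    ]),
    ("bmark3", "9", []),  # elided children from the example are omitted here
]

print("\n".join(to_pdftk(outline)))
```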

Note: I have written a Python program that automatically converts the Djvu SEXPR in bmarks.out into the PDF metadata form and writes the result to bmarks2.txt.

Convert Djvu outline into PDF metadata (bmarkDjvu2pdf.py)
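#!/usr/bin/env python
# Convert a djvused outline (bmarks.out) into pdftk bookmark metadata (bmarks2.txt).
import sys

metadata = ''  # text to be written into the PDF metadata

with open("bmarks.out") as f:  # input file: Djvu outline
    if not f.readline().strip().startswith('(bookmarks'):
        sys.exit("bmarks.out does not start with '(bookmarks'")
    level = 0  # nesting depth of the current bookmark

    while level >= 0:
        line = f.readline()
        if not line:  # end of file: stop instead of looping forever
            break
        line = line.strip()
        if line.startswith('("'):
            # Title line: it opens a bookmark one level deeper.
            level += 1
            metadata += ("BookmarkBegin\nBookmarkTitle: " + line.strip('("')
                         + "\nBookmarkLevel: " + str(level) + '\n')
            # The next line holds the page, followed by one ')' per bookmark closed.
            line = f.readline().strip()
            while line.endswith(')'):
                level -= 1
                line = line[:-1].strip()
            metadata += "BookmarkPageNumber: " + line.strip('"#') + '\n'
        else:
            # Any other line can only close bookmarks that are still open.
            while line.endswith(')'):
                level -= 1
                line = line[:-1].strip()

with open("bmarks2.txt", 'w') as f:  # output file: PDF metadata
    f.write(metadata)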

Step 3. modify PDF metadata to include the bookmark data

Extract PDF metadata with this command:

pdftk "book.pdf" dump_data > pdfmetadata.out

Open the pdfmetadata.out file, find the line that begins with NumberOfPages:, and insert your list of bookmarks (the contents of bmarks2.txt from Step 2) right after this line. Save the new file as pdfmetadata.in. Now run this command:
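If you would rather not edit the file by hand, the insertion can be scripted. This is a minimal sketch under my own assumptions: insert_bookmarks and write_metadata_in are hypothetical names, the file names are the ones used in this post, and the sample metadata is made up for illustration rather than taken from a real pdftk dump.

```python
def insert_bookmarks(metadata, bookmarks):
    """Return `metadata` with `bookmarks` inserted after the NumberOfPages: line."""
    out = []
    inserted = False
    for line in metadata.splitlines(keepends=True):
        out.append(line)
        if not inserted and line.startswith("NumberOfPages:"):
            if not line.endswith("\n"):
                out.append("\n")  # make sure the bookmarks start on a new line
            out.append(bookmarks if bookmarks.endswith("\n") else bookmarks + "\n")
            inserted = True
    if not inserted:
        raise ValueError("no line starting with NumberOfPages: was found")
    return "".join(out)


def write_metadata_in(meta_path="pdfmetadata.out",
                      bmarks_path="bmarks2.txt",
                      out_path="pdfmetadata.in"):
    """File-based wrapper: read both inputs and write the merged metadata."""
    with open(meta_path) as f:
        metadata = f.read()
    with open(bmarks_path) as f:
        bookmarks = f.read()
    with open(out_path, "w") as f:
        f.write(insert_bookmarks(metadata, bookmarks))


# Made-up sample data, only to show where the bookmarks end up.
sample_metadata = "InfoBegin\nInfoKey: Title\nInfoValue: Book\nNumberOfPages: 10\nPageMediaBegin\n"
sample_bookmarks = "BookmarkBegin\nBookmarkTitle: bmark1\nBookmarkLevel: 1\nBookmarkPageNumber: 1\n"
print(insert_bookmarks(sample_metadata, sample_bookmarks))
```

Calling write_metadata_in() in the folder that holds pdfmetadata.out and bmarks2.txt would produce pdfmetadata.in.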

pdftk "book.pdf" update_info "pdfmetadata.in" output newbook.pdf

The output newbook.pdf is your new book.pdf, equipped with a convenient table of contents. Happy reading!
