Friday 17 June 2011

Creating an ebook for the Kindle

I have been reading ebooks for years using the MobiPocket application and one of my reasons for liking it was the ability to create your own ebooks from sources such as Project Gutenberg in addition to buying commercially produced ebooks.

I recently gave in to temptation and bought a Kindle from Amazon. One of the attractions for me was that it also allows you to read ebooks that you have created yourself. At first, I only created simple ebooks but I recently volunteered to help create a Kindle version of the excellent "Architecture of Open Source Applications" http://www.aosabook.org/en/index.html and, in the process, learnt more about creating better quality ebooks. None of this is very difficult if you are comfortable editing HTML and XML so this post is intended to share the techniques that I found useful.

Alternative formats

The Kindle will do its best to display text and PDF files but these do not give a perfect reading experience.

When displaying text files, the Kindle treats each line as separate - it does not run subsequent lines into a paragraph. It will wrap lines to fit the screen based on the current text size so all the text is visible but in some cases the line breaks can be jarring.
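One way round the line-break problem is to convert the text file to HTML before building the ebook. The following is a minimal Python sketch, assuming paragraphs in the source file are separated by blank lines; the function name is made up for illustration and it does nothing clever about headings or indented text.

```python
import html
import re


def text_to_html_paragraphs(text):
    """Join hard-wrapped lines into paragraphs and wrap each one in <p> tags.

    Assumes paragraphs are separated by one or more blank lines.
    """
    # Normalise line endings so Windows-style files split the same way.
    text = text.replace("\r\n", "\n")
    # Split wherever there is at least one blank (or whitespace-only) line.
    blocks = re.split(r"\n\s*\n", text)
    paragraphs = []
    for block in blocks:
        # Collapse the line breaks inside a paragraph into single spaces.
        joined = " ".join(block.split())
        if joined:
            # Escape &, < and > so the text is safe to embed in HTML.
            paragraphs.append("<p>" + html.escape(joined) + "</p>")
    return "\n".join(paragraphs)
```

The result still needs to be wrapped in <html> and <body> tags before it is a complete file.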

PDF files present a different challenge. Because PDF files are intended to preserve the page layout, the actual text flow is lost. The Kindle has to keep that layout, so it cannot change the text size or reflow pages that are wider than the screen. I have a few books in PDF form that I manage to read on the Kindle by switching to landscape format. On the plus side, embedded images work fine.

Neither text nor PDF files allow the Kindle to provide a useful table of contents.

In summary, if you have a small text file or a file that is only available as a PDF then you can read it using the Kindle but it will not give an ideal reading experience.

Native format

In order to get the best reading experience with variable text size and a table of contents, it is necessary to generate an ebook in the 'native' .mobi format.

A simple book can be generated using the open source Calibre tool. This provides a full ebook management service but it includes a format converter that can take a suitable HTML file and generate an ebook.

I used Calibre for relatively simple books but I then found that kindlegen gave more control so I switched to that.

Kindlegen is provided by Amazon and is the officially sanctioned tool for generating ebooks for the Kindle. Google gave me this link, but search for the latest version: http://www.amazon.com/gp/feature.html?ie=UTF8&docId=1000234621 There are versions of kindlegen for Linux, Mac and Windows and the download includes a very useful example. A guide to using kindlegen is separately available from Amazon's page on publishing for Kindle http://www.amazon.com/gp/feature.html?ie=UTF8&docId=1000234621 and is well worth getting.

Kindlegen also supports the creation of an ebook from a single HTML file but if you want to create a larger ebook or to include a table of contents then you need to go to the trouble of creating an .opf file and structuring the book accordingly.

Creating an ebook using .opf, .ncx and other files

The best way to create your first ebook based on .opf is to take the relevant files from the kindlegen example and edit those. The .opf file holds three things: the meta-data, a manifest that lists the component files, and the reading order.

The various meta-data include title, author(s), ISBN number, subject and summary and are all pretty obvious.

The manifest part is a list of files to include. These are mostly HTML files but the cover image (if any) and the .ncx contents file (covered below) are also specified there. When specifying files to include, simply keep them all in the same directory as the .opf file and refer to them by name. You can probably refer to files in relative or absolute directories if you need to.

One piece of good news (if you are lazy like me) is that you do not need to explicitly include image files in the .opf file. If you include an HTML file that includes an image then kindlegen will pick it up automatically.
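To make the three parts of the .opf file concrete, here is a rough Python sketch that writes a minimal one. It is based on the general OPF 2.0 layout rather than copied from the kindlegen example, so treat the details as assumptions: the function name, the item ids, the "BookId" identifier and the media types are all illustrative, and the cover is declared with a meta element named "cover" as Amazon's guidelines describe. Compare the output against the example that ships with kindlegen before relying on it.

```python
import mimetypes
from xml.sax.saxutils import escape, quoteattr

OPF_TEMPLATE = """<?xml version="1.0" encoding="utf-8"?>
<package xmlns="http://www.idpf.org/2007/opf" version="2.0" unique-identifier="BookId">
  <metadata xmlns:dc="http://purl.org/dc/elements/1.1/">
    <dc:title>{title}</dc:title>
    <dc:creator>{author}</dc:creator>
    <dc:language>{language}</dc:language>
    <dc:identifier id="BookId">{book_id}</dc:identifier>
{cover_meta}  </metadata>
  <manifest>
{manifest}
  </manifest>
  <spine toc="ncx">
{spine}
  </spine>
</package>
"""


def build_opf(title, author, book_id, html_files,
              ncx_file="toc.ncx", cover_image=None, language="en"):
    """Return the text of a minimal .opf file.

    html_files lists the content files in reading order. All files are
    assumed to sit in the same directory as the .opf file.
    """
    # The manifest lists every file; the spine gives the reading order.
    manifest = ['    <item id="ncx" href=%s media-type="application/x-dtbncx+xml"/>'
                % quoteattr(ncx_file)]
    spine = []
    for index, name in enumerate(html_files, start=1):
        item_id = "item%d" % index
        manifest.append('    <item id="%s" href=%s media-type="application/xhtml+xml"/>'
                        % (item_id, quoteattr(name)))
        spine.append('    <itemref idref="%s"/>' % item_id)

    cover_meta = ""
    if cover_image:
        # Guess the image type from the extension, falling back to JPEG.
        media_type = mimetypes.guess_type(cover_image)[0] or "image/jpeg"
        manifest.append('    <item id="cover-image" href=%s media-type="%s"/>'
                        % (quoteattr(cover_image), media_type))
        cover_meta = '    <meta name="cover" content="cover-image"/>\n'

    return OPF_TEMPLATE.format(
        title=escape(title), author=escape(author), language=escape(language),
        book_id=escape(book_id), cover_meta=cover_meta,
        manifest="\n".join(manifest), spine="\n".join(spine))
```

Calling build_opf("My Book", "A. Writer", "book-001", ["toc.html", "ch1.html"], cover_image="cover.jpg") returns text that can be saved as, say, book.opf. Embedded images are deliberately left out of the manifest because kindlegen picks them up from the HTML.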

The kindlegen guide includes guidelines on cover images - basically go for 600 by 800 in colour and try not to re-scale because that reduces the quality. If you use large embedded images then the Kindle will scale them nicely for different screen layouts. There is no need to change the image size unless memory usage becomes an issue. Anything wider than 600 or taller than 800 will be wasted but re-scaling it yourself may affect quality.

Once you have all of the files in the manifest, you need to set the reading order and the table of contents.

The table of contents is slightly annoying in that you are advised to create it in two ways. You create an HTML table of contents that lists the chapters (or sub-sections) and has links (more below on links). This is to be found when the reader starts at the beginning of the book. The second table of contents is an XML file with a .ncx extension that associates text with a file and optional location. This table of contents is the one found when the reader selects go to table of contents.

As with the .opf file, the best way to create the table of contents files is to edit the examples provided with kindlegen.
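Because the two tables of contents carry the same information, they can be generated from one list of chapters. The Python sketch below shows the idea under some assumptions: the function names are invented, the NCX layout follows the general NCX 2005-1 format rather than the kindlegen example, and book_id is assumed to match the identifier used in the .opf file.

```python
from xml.sax.saxutils import escape, quoteattr

NCX_TEMPLATE = """<?xml version="1.0" encoding="utf-8"?>
<!DOCTYPE ncx PUBLIC "-//NISO//DTD ncx 2005-1//EN" "http://www.daisy.org/z3986/2005/ncx-2005-1.dtd">
<ncx xmlns="http://www.daisy.org/z3986/2005/ncx/" version="2005-1">
  <head>
    <meta name="dtb:uid" content={book_id}/>
  </head>
  <docTitle><text>{title}</text></docTitle>
  <navMap>
{nav_points}
  </navMap>
</ncx>
"""


def build_html_toc(chapters):
    """Return an HTML contents page.

    chapters is a list of (title, target) pairs such as
    ("Introduction", "intro.html") or ("Tips", "tips.html#start").
    """
    links = ["<p><a href=%s>%s</a></p>" % (quoteattr(target), escape(title))
             for title, target in chapters]
    return ("<html>\n<head><title>Table of Contents</title></head>\n<body>\n"
            "<h1>Table of Contents</h1>\n" + "\n".join(links) +
            "\n</body>\n</html>\n")


def build_ncx(book_title, book_id, chapters):
    """Return the text of a .ncx file built from the same chapter list."""
    nav_points = []
    for order, (title, target) in enumerate(chapters, start=1):
        # Each navPoint associates some text with a file and optional location.
        nav_points.append(
            '    <navPoint id="nav%d" playOrder="%d">\n'
            '      <navLabel><text>%s</text></navLabel>\n'
            '      <content src=%s/>\n'
            '    </navPoint>' % (order, order, escape(title), quoteattr(target)))
    return NCX_TEMPLATE.format(book_id=quoteattr(book_id),
                               title=escape(book_title),
                               nav_points="\n".join(nav_points))
```

Keeping the chapter list in one place means the two tables cannot drift apart when a chapter is renamed or reordered.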

Once you have the .opf, .ncx and various HTML files, you can simply run kindlegen to generate a .mobi file. Copy this to your Kindle (use the USB cable, it is your friend when generating ebooks). You can then check the tables of contents and cover picture as well as the actual content.
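If you rebuild often, it can help to wrap the kindlegen call in a small script. This is only a sketch: it assumes kindlegen is on your PATH, the function names are made up, and the file names in the usage note are placeholders. The -o option sets the name of the output file, which kindlegen writes alongside the input file.

```python
import shutil
import subprocess


def kindlegen_command(opf_path, output_name=None):
    """Build the argument list for a kindlegen run."""
    command = ["kindlegen", opf_path]
    if output_name:
        # -o takes a file name only; the output lands next to the .opf file.
        command += ["-o", output_name]
    return command


def run_kindlegen(opf_path, output_name=None):
    """Run kindlegen and return its exit code."""
    if shutil.which("kindlegen") is None:
        raise FileNotFoundError("kindlegen was not found on the PATH")
    # kindlegen prints its warnings and errors to the console as it runs.
    result = subprocess.run(kindlegen_command(opf_path, output_name))
    return result.returncode
```

For example, run_kindlegen("book.opf", "book.mobi") should leave book.mobi next to book.opf, ready to copy to the Kindle.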

Links within an ebook work simply. If you have a link with the full http:// protocol then following it will open a browser (unsurprisingly). If you leave off http:// and just refer to a file name or an id then the link is internal to the book.

This allows you to build the table of contents or to put links within the book to other sections or to pictures or tables. A link of the form <a href="otherfile.html"> links to another content file in the book. A link of the form <a href="otherfile.html#atag"> links to a tag with id="atag" in the file. If you have a set of content files from the web with links between them then consider using a script to remove the http:// part.
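A script to remove the http:// part can be very short. The Python sketch below assumes that every page of the book lived under one common web address and that the links use double-quoted href attributes; the function name and the example.org address are placeholders. Links to other sites are left alone so they still open in the browser.

```python
import re


def make_links_internal(html_text, site_prefix):
    """Turn links that start with site_prefix into links internal to the book.

    For example, with site_prefix "http://www.example.org/book/" the link
    href="http://www.example.org/book/ch2.html#intro" becomes
    href="ch2.html#intro".
    """
    # Only match hrefs that have something left after the prefix is removed.
    pattern = re.compile(r'href="' + re.escape(site_prefix) + r'([^"]+)"')
    return pattern.sub(r'href="\1"', html_text)
```

Run it over each content file before building, then check a few links on the Kindle itself.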

Formatting tips and tricks

Once you have a book with a structure and a table of contents, you can tune the appearance. You should be aware that it is better to have simple formats rather than overly elaborate ones. The Kindle screen is limited in size and being too fussy about layout can backfire.

One technique that I wish I had learnt earlier is that kindlegen / the Kindle has reasonable support for CSS. I have not tried using external stylesheets but I was able to put class styles in the file headers and they worked perfectly. When I started using kindlegen, I assumed that it would not be this capable so I wrote scripts to edit the formats. Changing class styles turned out to be much quicker and easier to experiment with.
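If the content files come from elsewhere, the same style block can be added to each file header with a script. The sketch below is in Python and the details are assumptions: the class names and the x-small value are placeholders to tune on a real device, and the function expects a lower-case </head> tag.

```python
# Hypothetical class styles; adjust the names and values to suit the book.
STYLE_BLOCK = """<style type="text/css">
p.noindent { text-indent: 0; }
pre.small { font-size: x-small; }
</style>"""


def add_style_block(html_text, style_block=STYLE_BLOCK):
    """Insert style_block just before the closing </head> tag.

    The text is returned unchanged if the block is already present or if
    there is no </head> tag to anchor on.
    """
    if style_block in html_text:
        return html_text
    return html_text.replace("</head>", style_block + "\n</head>", 1)
```

After that, experimenting with the appearance means editing one string and rebuilding, rather than rewriting tags throughout the content.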

I wanted to make some text smaller. This needs restraint: the Kindle allows the user to change the text size, so don't override their choice. However, we found that monospaced text was larger than the default font and looked better when reduced. Both the <font> tag and the font-size style worked but you have to go for the smallest font size to have any effect. In contrast, it is possible to have several larger font sizes but they quickly take over the screen.

Horizontal rule (<hr>) tags work well to divide up text and can be full width or partial.

The various <h1>, <h2>, <h3> tags work and should be used for sections and sub-sections.

If you want whitespace between paragraphs (for example, we wanted space before and after section headings) then <p>&nbsp;</p> provides a gap.

Paragraphs are indented by default (this is normal book layout) but can be unindented with <p style="text-indent:0"> - this technique is described in Amazon's Kindle publishing guidelines.

The Kindle does not handle all Unicode characters. I encountered problems with some vertical bars, some quotation marks and a character to denote a space. In some cases, these are stored as entities in the HTML and can be replaced by similar characters automatically. In other cases they were stored as raw UTF-8 characters and these can also be replaced by a script.

Not all Unicode or entity characters cause problems. I found the best approach was to leave them alone and then manually proof-read the Kindle version, spot invalid characters and then go and replace all occurrences. This risks missing some. Once I found offending characters, I replaced them using a script. If I were creating ebooks more often then I would just build and maintain a script to do this because it would be consistent across all content.
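A replacement script along these lines might look like the Python sketch below. The mapping is purely illustrative: the four characters are guesses at the kind of thing that causes trouble, not a list of what the Kindle actually rejects, so build your own table from proof-reading. The function names are made up. Each character is replaced whether it appears literally or as a decimal, hex or named entity, and a second helper lists the non-ASCII characters left in a file as a proof-reading aid.

```python
import re
from html.entities import codepoint2name

# Hypothetical mapping; replace with the characters you find on your device.
REPLACEMENTS = {
    "\u2502": "|",   # box-drawing vertical line
    "\u201c": '"',   # left double quotation mark
    "\u201d": '"',   # right double quotation mark
    "\u2423": " ",   # open box, sometimes used to show a space
}


def replace_problem_characters(text, replacements=REPLACEMENTS):
    """Replace each mapped character and its entity forms with a substitute."""
    for char, substitute in replacements.items():
        code = ord(char)
        # Literal character, decimal entity and hex entity (either case).
        forms = [re.escape(char), "&#%d;" % code, "&#[xX]0*(?i:%x);" % code]
        name = codepoint2name.get(code)
        if name:
            # Named entity such as &ldquo; where HTML defines one.
            forms.append("&%s;" % name)
        # A function replacement avoids any backslash handling in substitute.
        text = re.sub("|".join(forms), lambda match, s=substitute: s, text)
    return text


def list_non_ascii(text):
    """Return the distinct non-ASCII characters in text, sorted."""
    return sorted({ch for ch in text if ord(ch) > 127})
```

Running list_non_ascii over the content before and after the replacement gives a quick list of characters that still need checking on the device.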

One HTML technique that works really badly on the Kindle is tables. Amazon's publishing guidelines spend a lot of time on this and they are right. A narrow table is OK but a table does not behave well if the table is wider than the screen. Any excess is simply lost. When I was converting a book, I had to manually re-layout all of the tables and I wasn't able to script the conversions.
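Although the re-layout itself is manual, a script can at least find the tables that are likely to need it. The Python sketch below counts the cells in the widest row of each table and reports those over a limit. The limit of three columns is an arbitrary assumption, column count is only a rough stand-in for width on screen, and the class and function names are invented for the example.

```python
from html.parser import HTMLParser


class TableWidthChecker(HTMLParser):
    """Record the widest row (in cells) of each table in an HTML document."""

    def __init__(self):
        super().__init__()
        self.widths = []   # widest row of each table, in document order
        self._open = []    # indexes into widths for tables still open
        self._cells = []   # cell count of the current row, per open table

    def handle_starttag(self, tag, attrs):
        if tag == "table":
            self.widths.append(0)
            self._open.append(len(self.widths) - 1)
            self._cells.append(0)
        elif tag == "tr" and self._open:
            # A new row starts, so restart the cell count for this table.
            self._cells[-1] = 0
        elif tag in ("td", "th") and self._open:
            # A cell with colspan counts as that many columns.
            span = dict(attrs).get("colspan") or "1"
            self._cells[-1] += int(span) if span.isdigit() else 1
            index = self._open[-1]
            self.widths[index] = max(self.widths[index], self._cells[-1])

    def handle_endtag(self, tag):
        if tag == "table" and self._open:
            self._open.pop()
            self._cells.pop()


def wide_tables(html_text, max_columns=3):
    """Return (table number, columns) for each table wider than max_columns."""
    checker = TableWidthChecker()
    checker.feed(html_text)
    return [(number, width)
            for number, width in enumerate(checker.widths, start=1)
            if width > max_columns]
```

The table numbers are counted from one in the order the tables appear in the file, which makes them easy to find when editing by hand.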