Musings of a Software Architect

Friday, 17 June 2011

Creating an ebook for the Kindle

I have been reading ebooks for years using the MobiPocket application, and one of my reasons for liking it was the ability to create your own ebooks from sources such as Project Gutenberg in addition to buying commercially produced ebooks.

I recently gave in to temptation and bought a Kindle from Amazon. One of the attractions for me was that it also allows you to read ebooks that you have created yourself. At first, I only created simple ebooks but I recently volunteered to help create a Kindle version of the excellent "Architecture of Open Source Applications" http://www.aosabook.org/en/index.html and, in the process, learnt more about creating better quality ebooks. None of this is very difficult if you are comfortable editing HTML and XML so this post is intended to share the techniques that I found useful.

Alternative formats

The Kindle will do its best to display text and PDF files but these do not give a perfect reading experience.

When displaying text files, the Kindle treats each line as separate - it does not run subsequent lines into a paragraph. It will wrap lines to fit the screen based on the current text size so all the text is visible but in some cases the line breaks can be jarring.

PDF files present a different challenge. Because the PDF files are intended to preserve the page layout, the actual text flow is lost. The Kindle has to preserve the page layout for a PDF file and cannot change text size or handle pages wider than the screen. I have a few books in PDF form that I manage to read on the Kindle by switching to landscape format but it is not possible to change the text size. On the plus side, embedded images work fine.

Neither text nor PDF files allow the Kindle to provide a useful table of contents.

In summary, if you have a small text file or a file that is only available as a PDF then you can read it using the Kindle but it will not give an ideal reading experience.

Native format

In order to get the best reading experience with variable text size and a table of contents, it is necessary to generate an ebook in the 'native' .mobi format.

A simple book can be generated using the open source Calibre tool. This provides a full ebook management service but it includes a format converter that can take a suitable HTML file and generate an ebook.

I used Calibre for relatively simple books but I then found that kindlegen gave more control so I switched to that.

Kindlegen is provided by Amazon and is the officially sanctioned tool for generating ebooks for the Kindle. Google gave me this link (http://www.amazon.com/gp/feature.html?ie=UTF8&docId=1000234621) but search for the latest version. There are versions of kindlegen for Linux, Mac and Windows and the download includes a very useful example. A guide to using kindlegen is separately available by going to Amazon's page on publishing for Kindle (http://www.amazon.com/gp/feature.html?ie=UTF8&docId=1000234621) and is well worth getting.

Kindlegen also supports the creation of an ebook from a single HTML file but if you want to create a larger ebook or to include a table of contents then you need to go to the trouble of creating an .opf file and structuring the book accordingly.

Creating an ebook using .opf, .ncx and other files

The best way to create your first ebook based on .opf is to take the relevant files from the kindlegen example and edit those. The .opf file contains the meta-data, a manifest that lists the component files, and the reading order.

The various meta-data include title, author(s), ISBN number, subject and summary and are all pretty obvious.

The manifest part is a list of files to include. These are mostly HTML files but the cover image (if any) is specified there and the .ncx contents file (covered below). When specifying files to include, simply keep them all in the same directory as the .opf file and refer to them by name. You can probably refer to files in relative or absolute directories if you need to.

One piece of good news (if you are lazy like me) is that you do not need to explicitly include image files in the .opf file. If you include an HTML file that includes an image then kindlegen will pick it up automatically.

The kindlegen guide includes guidelines on cover images - basically go for 600 by 800 in colour and try not to re-scale because that reduces the quality. If you use large embedded images then the Kindle will scale them nicely for different screen layouts. There is no need to change the image size unless memory usage becomes an issue. Anything larger than 600 wide or 800 high will be wasted but re-scaling it yourself may affect quality.

Once you have all of the files in the manifest, you need to set the reading order and the table of contents.

The table of contents is slightly annoying in that you are advised to create it in two ways. You create an HTML table of contents that lists the chapters (or sub-sections) and has links (more below on links). This is to be found when the reader starts at the beginning of the book. The second table of contents is an XML file with a .ncx extension that associates text with a file and optional location. This table of contents is the one found when the reader selects go to table of contents.

As with the .opf file, the best way to create the table of contents files is to edit the examples provided with kindlegen.
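To show what the second table of contents looks like, here is a minimal sketch in Python that builds the navMap part of an .ncx file from a list of titles and targets. The element names follow the NCX format as I understand it. The function name, chapter titles and file names are made up for illustration, and a real .ncx file also carries a head section with meta-data, so the kindlegen example remains the authority.

```python
import xml.etree.ElementTree as ET

NCX_NS = "http://www.daisy.org/z3986/2005/ncx/"
ET.register_namespace("", NCX_NS)  # serialise NCX as the default namespace


def build_ncx(book_title, entries):
    """Return NCX text for a list of (label, target) pairs.

    target is a file name, optionally followed by #id (hypothetical names).
    """
    ncx = ET.Element("{%s}ncx" % NCX_NS, {"version": "2005-1"})
    doc_title = ET.SubElement(ncx, "{%s}docTitle" % NCX_NS)
    ET.SubElement(doc_title, "{%s}text" % NCX_NS).text = book_title
    nav_map = ET.SubElement(ncx, "{%s}navMap" % NCX_NS)
    for order, (label, target) in enumerate(entries, start=1):
        # Each navPoint associates some text with a file and optional location.
        point = ET.SubElement(nav_map, "{%s}navPoint" % NCX_NS,
                              {"id": "navpoint-%d" % order,
                               "playOrder": str(order)})
        nav_label = ET.SubElement(point, "{%s}navLabel" % NCX_NS)
        ET.SubElement(nav_label, "{%s}text" % NCX_NS).text = label
        ET.SubElement(point, "{%s}content" % NCX_NS, {"src": target})
    return ET.tostring(ncx, encoding="unicode")
```

Calling build_ncx("My Book", [("Chapter 1", "ch1.html"), ("Section 1.1", "ch1.html#s1")]) gives two navPoint entries, the second pointing at an id inside the first file.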

Once you have the .opf, .ncx and various HTML files, you can simply run kindlegen to generate a .mobi file. Copy this to your Kindle (use the USB cable, it is your friend when generating ebooks). You can then check the tables of contents and cover picture as well as the actual content.

Links within an ebook work simply. If you have a link with the full http:// protocol then following it will open a browser (unsurprisingly). If you leave off http:// and just refer to a file name or an id then the link is internal to the book.

This allows you to build the table of contents or to put links within the book to other sections or to pictures or tables. A link of the form <a href="otherfile.html"> links to another content file in the book. A link of the form <a href="otherfile.html#atag"> links to a tag with id="atag" in the file. If you have a set of content files from the web with links between them then consider using a script to remove the http:// part.
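A script to remove the http:// part could look something like this. It is a minimal sketch in Python that assumes all of the content files came from one site prefix (http://example.org/book/ here is invented) and that the links are written as plain href attributes. Links to other sites are left alone so that they still open in the browser.

```python
import re


def localise_links(html, site_prefix):
    """Turn href="<site_prefix>page.html" into href="page.html".

    site_prefix is the (hypothetical) address the files were fetched from.
    """
    pattern = re.compile(r'(href\s*=\s*["\'])' + re.escape(site_prefix),
                         re.IGNORECASE)
    # Keep the href=" part (group 1) and drop the site prefix after it.
    return pattern.sub(r"\1", html)
```

For example, localise_links(text, "http://example.org/book/") turns a link to http://example.org/book/ch2.html#intro into a link to ch2.html#intro, which the Kindle then treats as internal to the book.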

Formatting tips and tricks

Once you have a book with a structure and a table of contents, you can tune the appearance. You should be aware that it is better to have simple formats rather than overly elaborate ones. The Kindle screen is limited in size and being too fussy about layout can backfire.

One technique that I wished I had learnt earlier is that kindlegen / the Kindle has reasonable support for CSS. I have not tried using external stylesheets but I was able to put class styles in the file headers and they worked perfectly. When I started using kindlegen, I assumed that it would not be this capable so I wrote scripts to edit the formats - changing class styles was much quicker and easier to experiment with.

I wanted to make some text smaller. This needs restraint as the Kindle allows the user to change text size so don't override their choice but we found that monospaced text was larger than the default font and looked better when reduced. Both the <font> tag and the font-size style worked but you have to go for the smallest font size to have any effect. In contrast, it is possible to have several larger font sizes but they quickly take over the screen.

Horizontal rules <hr> tags work well to divide up text and can be full width or partial.

The various <h1>, <h2>, <h3> tags work and should be used for sections and sub-sections.

If you want whitespace between paragraphs (for example, we wanted space before and after section headings) then <p> </p> provides a gap.

Paragraphs are indented by default (this is normal book layout) but can be unindented with <p style="text-indent:0"> - this technique is described in Amazon's Kindle publishing guidelines.

The Kindle does not handle all Unicode characters. I encountered problems with some vertical bars, some quotation marks and a character to denote a space. In some cases, these are stored as entities in the HTML and can be replaced by similar characters automatically. In other cases they were stored as UTF-8 characters and these can also be replaced by a script.

Not all Unicode or entity characters cause problems. I found the best approach was to leave them alone and then manually proof-read the Kindle version, spot invalid characters and then go and replace all occurrences. This risks missing some. Once I found offending characters, I replaced them using a script. If I were creating ebooks more often then I would just build and maintain a script to do this because it would be consistent across all content.
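A replacement script of this kind can be very small. The sketch below is in Python and the two tables are only examples: the particular entity and characters in them are my guesses at the sort of thing involved, not a list of what the Kindle actually rejects. The second function lists whatever non-ASCII characters remain, which may help with the proof-reading step.

```python
# Hypothetical replacement tables - extend them with whatever characters
# turn out to display badly when proof-reading on the Kindle.
ENTITY_MAP = {
    "&brvbar;": "|",   # broken vertical bar entity
}
CHAR_MAP = {
    "\u2502": "|",     # box-drawing vertical bar
    "\u2423": " ",     # "open box" character used to denote a space
}


def clean_text(text, entity_map=ENTITY_MAP, char_map=CHAR_MAP):
    """Replace troublesome entities and characters with plain equivalents."""
    for entity, plain in entity_map.items():
        text = text.replace(entity, plain)
    return text.translate(str.maketrans(char_map))


def non_ascii_chars(text):
    """List the distinct non-ASCII characters left in the text, for review."""
    return sorted({ch for ch in text if ord(ch) > 127})
```

Running clean_text() over every content file before calling kindlegen keeps the replacements consistent, and non_ascii_chars() gives a short list of characters still worth checking on the device.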

One HTML technique that works really badly on the Kindle is tables. Amazon's publishing guidelines spend a lot of time on this and they are right. A narrow table is OK but a table does not behave well if the table is wider than the screen. Any excess is simply lost. When I was converting a book, I had to manually re-layout all of the tables and I wasn't able to script the conversions.

Tuesday, 25 August 2009

Dynamically generating SVG in an HTML page

In this post, I want to briefly list the less obvious aspects of using SVG to draw a game board. These are notes from my initial implementation so there may well be more polished alternatives.

For this work I used a recent version of Firefox and I have not yet investigated other browsers in any depth so your mileage with other browsers may vary.

My work was aimed at dynamically creating a page (hence the use of Javascript) but the same techniques could be used to create a static HTML page including SVG.

This project uses XML, SVG and Javascript as well as simple HTML. I have not provided any links to information on these subjects; I recommend finding a good book on XML and Javascript but a simple web search generated enough information on SVG.

Embedding SVG in HTML

I did not want to just have a stand-alone SVG page but an SVG picture embedded in an HTML page. This is because I wanted to make use of HTML elements and I also found HTML better for loading Javascript than stand-alone SVG.

I found that I needed to create the HTML page as an XML page so the start of the page is as follows:


<?xml version="1.0"?>
<html xmlns="http://www.w3.org/1999/xhtml"
      xmlns:svg="http://www.w3.org/2000/svg"
      >
<head>

This provides both a default XML namespace for the HTML (actually XHTML but don't worry about that) and a namespace for use by SVG elements. This is important as the browser needs to be able to tell whether any element in the page is meant to be HTML or SVG. We will come back to namespaces later. If you are not familiar with XML namespaces then it would be worthwhile brushing up on them but the examples below include all the code necessary.

When I served this from Django I had to ensure that the HTTP Response had the correct MIME type as follows:

return HttpResponse(t.render(c), mimetype="text/xml")

If you are not familiar with Django then don't worry about this line; it just returns the page but with the MIME type set to text/xml. (Later Django releases renamed the mimetype argument to content_type.)

Within the HTML body I simply embedded the SVG picture directly. I did not include it in any explicit graphical or other object.


<svg:svg id="board" width="500" height="500">
  <svg:rect x="10" y="10" width="485" height="485" fill="#FF9933" />
</svg:svg>

Note that the elements all use the svg namespace with the shorthand reference defined at the head of the file. We need to give the svg element an id so that we can refer to it from Javascript.

I found that I had to set the width and height of the svg object. I had hoped to have it more dynamically sized but that did not work too well. Although I deliberately did not set the units of the svg element, I found that it ended up using pixels as the units. I was hoping to avoid setting the size in pixels in order to make the page work well with different sized screens and windows but I was unable to make it scale well. I could have investigated further but I chose to pick a plausible size instead. I may return to this at a later date.

In the example above, I included a rectangle. I could have just drawn it dynamically but it helped me during development to ensure that the SVG was drawing properly before I started creating more elements.

If you just put these elements in a HTML page then you should get a page (which you can put any normal text, images etc. as standard HTML) with a picture embedded in it. The next section covers adding more drawing elements.

Drawing SVG elements with Javascript

In this section, I assume that you are already familiar with using Javascript to generate HTML elements and dynamically add them to an existing page. This type of DHTML is covered by many articles online but the basic concept is to create a new element object of the required type, set any attributes and then to append it as a child element under a known location. The code below uses the same technique but with an extra wrinkle for SVG. If your Javascript DHTML knowledge is limited to copying inner HTML then this is the next level but it is not too difficult.

For my purposes, I knew that I was going to create a lot of SVG elements so I wrote a function to create an arbitrary SVG element.


function createSVGElement( eltType, properties) { // Create an arbitrary SVG element
    var tgt = document.getElementById("board");
    // createElementNS() puts the new element in the SVG namespace
    var elt = document.createElementNS('http://www.w3.org/2000/svg', eltType);
    for ( var prop in properties ) { // 'var' keeps prop out of the global scope
        elt.setAttribute(prop, properties[prop]);
        }
    tgt.appendChild(elt);
    }

This code has three parts:

1) The createElementNS() function is used to create a new element. If you have worked only in HTML then you may have used the createElement() function. This version adds a namespace argument which we need in order to specify an SVG element. Without this, the browser will not recognise the element, even if you create an element with a name that is not valid in HTML.

2) The attributes of the new element need to be set. These attributes will include geometric attributes for location and size as well as fill color etc. I passed the attributes in as an object (used as an associative array) simply because I wanted a re-usable function. If I was using inline code to create the elements then I could use setAttribute() as above or use the dot notation. If you use the dot notation to set attributes then be aware that some SVG attribute names are not legal Javascript names (they include a dash). For these attributes, the setAttribute() method is necessary as it takes the attribute name as a string.

3) Finally, the new element is appended as a child to the top-level svg element. This is just the same technique as is used for dynamically adding HTML elements to an HTML page. In my example, I added every element directly under the root svg element but there is nothing to stop you from creating a less flat element structure. This might be particularly useful if you want to try changing properties of grouping elements for dynamic effects.

Given this function (and I can already see at least one way of optimizing it), here is a chunk of code that will draw a simple line.


var lineProperties = {'stroke': 'black'};
lineProperties.x1 = 10;
lineProperties.x2 = 10;
lineProperties.y1 = 10;
lineProperties.y2 = 490;
createSVGElement( 'line', lineProperties);

This chunk is slightly clumsy because it has been taken from a more dynamic function. The lineProperties object is initialized with the line colour and then properties are added for the x and y of each end.

Note that I have used both the string way of setting a property and the dot notation (which is fine for these properties). In this case, I could have just created the properties in one statement.

Given the set of properties, we simply call createSVGElement() with the name of the element and the set of properties.

Full details of the properties of SVG elements can be found in the SVG specification.
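As mentioned at the start of this post, the same elements could be generated for a static page rather than created in the browser. The sketch below does this in Python with xml.etree.ElementTree. It is not the code from my project: the function name, the 19-line grid and the spacing are assumptions chosen to match the 500 by 500 board and the line from (10,10) to (10,490) used above.

```python
import xml.etree.ElementTree as ET

SVG_NS = "http://www.w3.org/2000/svg"
ET.register_namespace("svg", SVG_NS)  # serialise elements as <svg:...>


def board_svg(lines=19, margin=10, size=500):
    """Return an <svg:svg> element, as text, holding the grid lines of a board."""
    svg = ET.Element("{%s}svg" % SVG_NS,
                     {"id": "board", "width": str(size), "height": str(size)})
    step = (size - 2 * margin) / (lines - 1)
    near = "%.1f" % margin
    far = "%.1f" % (size - margin)
    for i in range(lines):
        pos = "%.1f" % (margin + i * step)
        # One vertical and one horizontal line per grid position.
        ET.SubElement(svg, "{%s}line" % SVG_NS,
                      {"x1": pos, "y1": near, "x2": pos, "y2": far,
                       "stroke": "black"})
        ET.SubElement(svg, "{%s}line" % SVG_NS,
                      {"x1": near, "y1": pos, "x2": far, "y2": pos,
                       "stroke": "black"})
    return ET.tostring(svg, encoding="unicode")
```

The returned text declares the svg namespace prefix on its root element, so it can be placed inside the body of an XHTML page served as text/xml.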

Associating events with SVG elements

The code above shows the techniques required to draw an SVG picture within an HTML page. In my case, I wanted to draw a Go board and I was simply able to draw the lines of the board with a for loop. However, in order to make this part of a game, I wanted a way to capture input events. SVG elements support an onclick() attribute which is a reference to a Javascript function. This is set in the same way as any other attribute.

While using onclick(), I used some other techniques. The first is to remember to set the pointer-events attribute (remember that I said that some SVG properties have a dash - this one cannot be set using the dot notation).

The second is that it is possible to have a hidden element that can still capture pointer events.

The third is that I set the id property for each pickable element and I built the id into the onclick() event as an argument. This means that each pickable element can identify itself when selected.


var circProperties = {'visibility': 'hidden', 'pointer-events': 'all', 'r': '10'};
circProperties.id = 'circ_10_10';
circProperties.cx = 10;
circProperties.cy = 10;
circProperties.onclick =  'playMove("'+circProperties.id+'");';
createSVGElement( 'circle', circProperties);

In my real use of this example, the cx and cy and id properties are set from loop variables.

Unfortunately, SVG elements do not have drag events. A simple click event is fine for Go but less than ideal for some other games. For example, in order to implement a game of Chess where the obvious user action would be to select a piece and drag it to its new position, I would need to select a piece and then select a new position. This requires some state in the Javascript and is less obvious for users. One thought is that, when a piece is selected, I could highlight the legal positions that it could move to. This requires the game logic to be embedded in the Javascript rather than at the server side but it would not be too hard.

Alternative Approaches

As I mentioned above, I did this work in Firefox. I like Firefox but I found that SVG support in IE and Konqueror is very limited (i.e. I couldn't make any of this work but I didn't try for more than a few minutes). However, a comment to my last post from Brad Neuberg pointed out that the SVGWeb project (http://code.google.com/p/svgweb) should add SVG support to IE and other browsers. I have not had time to try this but I will definitely give it a go.

However, I have also considered other approaches. One of these is to generate a picture file and lay it over an image map in straight HTML. This requires generating the complete picture at the server side and then sending the picture file to the browser. This picture file will definitely be larger than fragments of XML containing move information and the server will actually have to generate the picture. I have dabbled with generating a PNG file from Python and it can be optimized quite heavily but it is still less efficient than using SVG. However, for compatibility reasons and to see just how difficult it is, I will probably try it out at some time.

Monday, 10 August 2009

Thoughts on a Browser-based board-game

One of my spare-time projects over the last few weeks has been a browser-based, pure HTML and SVG implementation of the Go board game. This has reached a usable state (although definitely still pre-alpha) and is worth some reflection on the architectural implications.

My starting point was that I wanted to build the game using a Django / Python back-end and HTML / AJAX / SVG front-end. The project came out of my experimenting with Django and I am considering other types of game but Go was a good starting point.

A more commercially realistic approach would be to build a Java or Flash client but I wanted to see how much could be achieved using SVG. Also, I dislike Flash (due to its over-use) and I am tired of problems with Java installers.

Client Implementation

The client worked fine once I read up slightly on Javascript. Directly creating SVG objects has a namespace quirk but otherwise works very easily. I chose Go because the graphics are easy and the game-play just involves selecting points. My experiments have shown that it would be straightforward to use graphics files for the board and pieces if I wanted a more attractive appearance.

The two drawbacks that I have found with this approach are browser support and the lack of drag and drop.

My early experiments with IE and Konqueror show poor SVG support so I just worked in Firefox for now. If I persist with the project longer term then an alternative would be to draw the board and pieces server-side into a PNG file and send that combined with a client-side image map. This would be more work on the server side and would require more data transfer but I would quite enjoy creating a basic PNG library in Python.

SVG includes onClick events so I was able to put hidden items on the points to be selected. This works fine for Go but would not work in, for example, chess where a drag and drop action would be desirable. It would be possible to click once to select the piece and then again to select the destination but I will have to try it out to see if it feels too clumsy.

Server Interface

I was deliberately working with a Django back-end intended for ease of deployment so I wanted no non-standard server features. This proved interesting as what I initially considered to be a clumsy architecture turns out to have some attractive aspects.

Dedicated server versus HTTP transactions

If I had not been basing the design on Django then I would have started with a server that kept each active game in memory and applied moves as they were received. As I was using Django, each browser action was totally separate and so the POST request to play a move requires the game to be loaded from the database, the move applied and a response sent to the browser with captured stones etc. and the game saved in an updated form.

In a quick-moving game, having to load and save the game for each move is an overhead but when a longer-term view is taken the Django approach (really, the generic web-server approach) has robustness advantages. With a server running direct connections to clients I would have to work out what to do if a connection went down - how could a client re-connect and resume their game? Also, what if players want to pause a game overnight? Basing the whole approach on an atomic move basis automatically handles these aspects and also provides a re-play feature almost for free.

Server Push

Another aspect that was less convenient was trying to implement server-push. I want to be able to update player A's browser when player B makes a move (I also included a chat pane in the game which has the same requirement). In a dedicated server, I would use a long-standing read from the client and would provide a practically instant response. This is not really workable using an HTTP server.

In theory, I could use a dedicated HTTP GET request to fetch updates and have the server delay replying until there is event data to return. On the browser side this should work fine. Unfortunately, it has two problems on the server side:

Firstly, it means that each game would involve two dedicated requests outstanding at all times in addition to the transient requests to fetch game state, post moves etc. With conventional HTTP servers this would likely cause a scalability problem as each connection requires resources and servers are configured to limit connections.

Secondly, HTTP servers running Django or other CGI frameworks keep each request separate (for very good reasons). Therefore, if I had a pending request for player A and a separate POST from player B it would be tricky to communicate between those requests. This might not be insuperable (and would be trivial in a dedicated server) but I did not investigate further given the resource issues.

This means that my event handling is based on poll-based requests that mostly return no new event data. This is fine in terms of server resources but it does consume network bandwidth and means that the responses can be delayed for a few seconds. This type of problem has been met by a lot of other people (look up Comet in Wikipedia for example) and I expect that I will return to it to investigate further at some time. For example, at what point does the server load from repeated requests with no events outweigh the resource cost from outstanding requests and how can I pass events from one request to another (maybe the new Django signals will help)?

Implementation Efficiency

One lesson from this project so far is the ease of use of Django. Because Django takes care of all of the plumbing, I have 200-300 lines of Python to implement the rules of Go, about 100 lines of plumbing and chat code in Python and about 200 lines of Javascript to draw the game and handle moves. The whole project took less than a week of evenings so far and that included a chunk of learning. Adding more error checking and more attractive admin interfaces will probably double the size but the whole project is easily understandable and manageable.

Wednesday, 17 June 2009

Why I think Opera Unite is missing the point

I watched the PR fuss over Opera Unite with some interest but I do think that they have missed the point. This posting by Chris Messina raises some valid points about the use of Opera as a proxy but I think that this covers only some of the weaknesses.

Over the years, I have seen proposals for servers running on individual personal computers and on mobile phones and they consistently look to me like a solution looking for a problem.

Technically, it is easy to create or configure an HTTP server on most devices. On a PC, a server like Apache is not hard (for a techie) to install and configure although I don't see average users trying it. It is not hard to port Apache to smaller platforms but I have also seen (or created) custom HTTP servers developed for more embedded platforms. Creating a basic HTTP server in C/C++ or Python (or other languages) is easy. All of the clever options that 'real' web servers support are more challenging but probably not required for smaller devices.
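To give an idea of how little is needed, here is a minimal sketch of a basic HTTP server using only the Python standard library. The handler name and the fixed status text are invented for illustration, and none of the clever options of a 'real' web server are present.

```python
from http.server import BaseHTTPRequestHandler, HTTPServer


class StatusHandler(BaseHTTPRequestHandler):
    """Answer every GET request with a short plain-text status line."""

    def do_GET(self):
        body = b"device status: ok\n"
        self.send_response(200)
        self.send_header("Content-Type", "text/plain; charset=utf-8")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, format, *args):
        pass  # stay quiet instead of logging each request to stderr


def make_server(port=0):
    """Create a server on localhost; port 0 lets the OS pick a free port."""
    return HTTPServer(("127.0.0.1", port), StatusHandler)
```

Calling make_server(8000).serve_forever() and browsing to http://127.0.0.1:8000/ should show the status line. Making that address reachable from outside is the hard part, as discussed below.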

What is clearly more challenging is deployment and configuration for average users, but this ties into the use cases. Given the assumption that it is practical to create an HTTP server on a range of devices, the question arises of what to use it for and how the address is made accessible. I think that both of these are real problems. I think that Unite addresses the deployment and configuration challenges but still does not come up with good use cases.

Opera try to handle publishing addresses by means of their own proxies. The comments in the blog above show some of the problems with this but I don't see a good alternative. Unlike standard web servers that have relatively stable IP addresses and that are hooked into the DNS system to make URLs usable, personal computers and mobile devices normally do not have fixed IP addresses. They commonly use DHCP to get a transient IP address. This means that it is difficult for a remote client to know what address to connect to. Also, PCs and mobile devices are not necessarily on-line the whole time (but more of this below). In practice, a personal or mobile HTTP server would need to get an address and then push its address to some central addressing server so that clients could find the address and tell if the server is online.

When we come to consider what should be published, I think that Opera are just wrong in their assumptions. I think that it is daft to try to publish static content such as images or other media files from a PC or mobile device compared to using a hosting service. If a user wants to self-publish then they have to consider issues such as backups and access restrictions. I seriously doubt that most users will want to do this. Why not just push the content to Flickr, Facebook or one of the groups services? Even if you don't fully trust the hosting services' backups they will provide basic access control and the content will be available approximately 24/7. If you want to retain control over the content then set up a cheap hosted web site. If you want to exercise access control then consider email or set up a proper site! If these files are made available from your home PC then you will have to effectively upload them whenever they are accessed - most home accounts are not great at uploading so this looks like a bad idea to me.

If you do want to publish any static content then surely you want the content to be available most of the time. I don't know about you but my laptop and smartphone are not connected 24/7. They are offline or turned off for substantial parts of each day.

The Fridge and Lounge services in Unite look a bit me-too to me. I could knock up such a service on a hosted site very quickly and they would be available when my home PC was turned off. I haven't looked but I would be very surprised if this type of service was not cheaply available with various hosting services. I guess that Unite makes it available for free and with less work than setting up a hosting service - you trade increased convenience for the price of going through Opera's servers. As an alternative, try setting up a closed Yahoo group and you can get many of the benefits.

I do see value in a local HTTP server but for more specialised purposes:

If a device (maybe embedded or mobile) wants to make its status or other dynamic information available then HTTP is a good enough protocol to use as it is well understood. I could see the value in a service whereby such devices ran a server and then informed a central service that they were available for information retrieval. This is a pretty closed use case.
Specialised HTTP servers such as synchronisation servers are useful but they will tend to run when required and have specialised methods for making contact with a client (you can tell that I used to work on OMA Datasync).

So, although I like the idea of running a local HTTP server on a PC or mobile device (it is quite fun to consider the porting issues), I still think it is a solution looking for a problem. The Opera site mentions future developments and the ability for developers to work with it but, heck, if I want to create web services, I will play with a hosted service. I can use Django (just one of many) to build something and all that Unite gives me is some convenience at the cost of being locked into Unite. I would love to be wrong but I don't think I am.

P.S.
Having mentioned Chris Messina's post above, I subsequently came across a response from Lawrence Eng here.

I should add that I am not criticising Opera for not going open source and I am not commenting on the technical implementation - I have not spent the time to look at that so I am prepared to assume that it is fine. My view is that the whole concept is a solution looking for a problem. Some of the blog comments make the point that it is easy for naive users to use and this may turn out to be sufficient. Alternatively, somebody may come up with the killer use case that I have overlooked. Until then, I will remain interested but sceptical.

Sunday, 17 May 2009

Further thoughts on implementing non DBMS storage

Since my last post I spent a couple of evenings putting together a pure Python storage class to store arbitrary chunks of data in a way inspired by Haystack. This proved quite easy in Python (unsurprisingly). It would also be quite easy to implement it in C/C++ for performance but I am not convinced that is necessary and a pure Python implementation has the virtue of being simpler to deploy.
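The general idea can be sketched in a few lines. This is not my actual class, just a minimal illustration of one common reading of the Haystack approach: all chunks are appended to one large file and a small index records where each one starts and how long it is. The class and method names are invented, and the index here lives only in memory whereas a real version would need to persist or rebuild it.

```python
import os


class ChunkStore:
    """Append-only store: one data file plus an index of key -> (offset, length)."""

    def __init__(self, path):
        self.path = path
        self.index = {}
        open(path, "ab").close()  # create the data file if it is missing

    def put(self, key, data):
        """Append the bytes to the data file and remember where they went."""
        with open(self.path, "ab") as f:
            f.seek(0, os.SEEK_END)
            offset = f.tell()
            f.write(data)
        self.index[key] = (offset, len(data))

    def get(self, key):
        """Read a chunk back with a single seek and read."""
        offset, length = self.index[key]
        with open(self.path, "rb") as f:
            f.seek(offset)
            return f.read(length)
```

Storing a new value under an existing key simply appends again and updates the index, so old data is never overwritten in place.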

The next stage is to integrate it into Django so that it can be used as a model field and to extend the admin application and generic views to handle it. This should be almost as easy so I am now reading the nice tutorial on creating new model fields along with the FileField source code.

Interestingly, I came across this story via Reddit about drawbacks of CouchDB.
http://blog.woobling.org/2009/05/why-i-dont-use-couchdb.html
I had already decided that I wasn't keen to play with CouchDB at present because I think some things (such as authentication and real regular data) are more efficiently done in a normal SQL DBMS.

I am now brooding about creating a CMS with a combination of my storage classes and search but I am trying to work out if it is any different to a Wiki with search facilities.

Monday, 11 May 2009

Some thoughts on a text storage system for Django inspired by Facebook's Haystack

I have been looking at Django lately as I wanted a project to play with and I thought that I might extend it in some way. In some ways, Django has been a disappointment as it is quite mature and very powerful so the opportunities for tinkering that I wanted are not really present. This makes it great for real web developers but less great for me personally ;-)

However, one aspect that kept nagging at me was the use of DBMS fields for large quantities of text and for searching (this is not only a feature of Django, of course). When I used real DBMS (back in the nineties) we had to be very careful about optimising column sizes and we would not have dreamt of storing whole blog entries in a database, let alone larger bodies of text. I know that hardware has gotten cheaper and DBMS have improved but this still feels to me like using the wrong tool - as if the designers used a DBMS for everything because it was the tool that they knew best.

I was browsing for search software and found examples such as Xapian and Sphinx. Again, I had used a similar tool in the late nineties when building an in-house knowledge database but the world has moved on significantly since then. I installed Xapian on my Linux laptop and was shocked by how easy and quick it was to feed in text and search on it. I fed in a Robert E Howard novel using the Python binding and didn't notice it go in. A search was also extremely fast. This provoked the thought of extending Django with Models that include searchable fields. It should be possible to simply tag model fields as searchable and have a Django extension automatically index them with Xapian (or another search tool - the choice would be transparent) and then extend the generic views to include searches. This feels like a project that would be technically interesting and that would extend Django in a style that is consistent with Django.

One issue that occurred to me when thinking about Xapian was the sequencing of index updates. I am wary of indexing as part of the data creation in a web server. If multiple users are creating data simultaneously then Xapian would not like having too many simultaneous updates. As Django is deployed with a range of web servers (or certainly with a range of interfaces to web servers) I would be reluctant to try to implement locking across multiple requests. My current proposed solution is to create an internal message whenever an item is added that needs indexing and then to pick them up and deal with them in a batch from a specialised request. Django provides such a messaging system so it is not necessary to add another table for the purpose and the indexing requests could be kicked off by a cron job or manually by a sysadmin. The use of a single batched indexing system allows more efficient indexing sessions.
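The shape of that idea can be sketched in a few lines of plain Python. This is only an illustration under several assumptions: the names PendingIndexQueue and ToyIndex are made up for the example, a small SQLite table stands in for whatever message store is actually used, and a trivial in-memory inverted index stands in for Xapian so that no Xapian calls have to be guessed at.

```python
import re
import sqlite3
from collections import defaultdict


class PendingIndexQueue:
    """Hypothetical queue of items awaiting indexing (stand-in for the internal messages)."""

    def __init__(self, connection):
        self.db = connection
        self.db.execute(
            "CREATE TABLE IF NOT EXISTS pending_index ("
            "id INTEGER PRIMARY KEY AUTOINCREMENT, doc_id TEXT, body TEXT)"
        )

    def enqueue(self, doc_id, body):
        # Called from the request that creates the data: a cheap insert, no indexing.
        with self.db:
            self.db.execute(
                "INSERT INTO pending_index (doc_id, body) VALUES (?, ?)",
                (doc_id, body),
            )

    def drain(self, index_document, batch_size=100):
        """Index up to batch_size pending items in one pass, then remove them."""
        rows = self.db.execute(
            "SELECT id, doc_id, body FROM pending_index ORDER BY id LIMIT ?",
            (batch_size,),
        ).fetchall()
        for _, doc_id, body in rows:
            index_document(doc_id, body)
        with self.db:
            self.db.executemany(
                "DELETE FROM pending_index WHERE id = ?",
                [(row[0],) for row in rows],
            )
        return len(rows)


class ToyIndex:
    """Trivial in-memory inverted index; an assumption standing in for Xapian."""

    def __init__(self):
        self.postings = defaultdict(set)

    def add(self, doc_id, body):
        for word in re.findall(r"[a-z0-9]+", body.lower()):
            self.postings[word].add(doc_id)

    def search(self, word):
        return sorted(self.postings.get(word.lower(), set()))
```

The web requests only ever call enqueue, while drain would be called from the single specialised request (or a cron job), so only one writer touches the index at a time.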

While considering non relational DBMS storage systems (I browsed through articles on BigTable and CouchDB), I came across another story that I noticed on Reddit about Haystack - the storage system used by Facebook to store photographs. Apparently, Facebook uses MySQL extensively but the sheer volume of photo data made it impractical for photo storage. Haystack is used to store photos in a very efficient manner - the Haystacks contain meta data pointing to the photo data; all data is appended to files so multiple reads can take place while appending takes place; the data is supposed to be structured so that it can be retrieved with minimal disk seeks.

Putting these ideas together, I thought of a text version of Haystack. Each text object can be appended to a filing set and the database can just contain its meta data (file name, offset and size). This makes the relational database smaller and so more efficient. The text can be indexed for searching as it gets saved and it should be possible to extend the generic views to handle this transparently. Other ideas from Haystack and CouchDB that can be applied include versioning and never deleting content. I would probably also have a configurable maximum size for a storage file and have the system just keep extending the system with new storage files - I have a bias against unreasonably large individual files for backup and management reasons.
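As a rough sketch of the storage side of this idea (not a finished design), the following pure Python class appends each text object to a storage file and returns only the meta data that the database would need to keep. The names TextStore and TextRef, the file naming scheme and the default size limit are all assumptions made for the example, and it assumes a single writer, so there is no locking.

```python
import os
import tempfile
from collections import namedtuple

# The only fields the relational database would need to hold per text object.
TextRef = namedtuple("TextRef", ["file_name", "offset", "size"])


class TextStore:
    """Append-only storage for text, spread across size-limited files."""

    def __init__(self, directory, max_file_size=64 * 1024 * 1024):
        self.directory = directory
        self.max_file_size = max_file_size  # configurable limit per storage file
        self.file_number = 0
        os.makedirs(directory, exist_ok=True)

    def _path(self, file_name):
        return os.path.join(self.directory, file_name)

    def _current_file_name(self):
        return "store-%06d.dat" % self.file_number

    def append(self, text):
        """Append text to a storage file and return its meta data."""
        data = text.encode("utf-8")
        while True:
            path = self._path(self._current_file_name())
            current_size = os.path.getsize(path) if os.path.exists(path) else 0
            # Use this file if it is empty or the new data still fits in it.
            if current_size == 0 or current_size + len(data) <= self.max_file_size:
                break
            self.file_number += 1  # otherwise move on to the next storage file
        with open(path, "ab") as handle:
            handle.write(data)
        return TextRef(self._current_file_name(), current_size, len(data))

    def read(self, ref):
        """Fetch one text object with a single seek and read."""
        with open(self._path(ref.file_name), "rb") as handle:
            handle.seek(ref.offset)
            return handle.read(ref.size).decode("utf-8")


if __name__ == "__main__":
    store = TextStore(tempfile.mkdtemp())
    ref = store.append("Know, O prince...")
    print(ref, store.read(ref))
```

Nothing is ever overwritten or deleted, so versioning would amount to storing a new TextRef alongside the old one rather than replacing it.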

If I get a few hours, I should be able to prototype this and feed in a range of literature from project Gutenberg as a test set.

Sunday, 3 May 2009

Use of the Browser as a (cross-platform) UI

I was considering some mobile application development and was debating which UI library to use. Some thinking came up with the idea of using the browser as the UI with a local HTTP server. This is really suitable for creating PC local applications rather than mobile apps at this time but more powerful mobile phones may change this.

An obvious attraction is that it helps the separation of content from presentation (as long as you do not over-indulge in JavaScript, let alone do anything daft such as using Java Applets or Flash).

Browser as UI

With modern browsers and straightforward applications, a perfectly good UI can be built. It is not really possible to create a very complex UI but I claim that most applications do not need a very complex UI - just how dynamic and immersive does a PIM application need to be?

Applications such as Gmail and online office packages demonstrate what can be done.

HTTP server accessing local functions

Using an HTTP server for the 'back-end' functionality means that any language can be used - there really is no need for anything in common with the UI. My personal preference is Python so I am going to experiment with Django.

Using HTTP as the interface can be seen as overkill or inefficient. Rather than reading data directly from the UI and modifying it directly, everything has to be serialised and transferred over an internal socket. This is true but the overhead is probably negligible compared with the actual application functionality. The real question is whether it provides a convenient programming environment.

Most web development is concerned with preventing insecure access but if the local HTTP server is the application engine then it needs full access to the local machine - not really a problem.

The real issue is deployment of the server - no normal user is going to install and configure Apache and MySQL just for one application. The good news is that the server does not need to be scalable as it will only be serving one user. This means that lighter-weight servers such as Django's development server or one custom-built from Python's simple HTTP server classes can be adequate.
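To give a feel for how little is needed, here is a minimal sketch of such a custom-built server using the standard library's http.server module (the Python 3 home of the simple HTTP server classes). The handler name, the /status path and the JSON payload are invented for the illustration. The server binds to the loopback address only, so it is reachable just from the local machine.

```python
import json
import threading
from http.client import HTTPConnection
from http.server import BaseHTTPRequestHandler, HTTPServer


class AppHandler(BaseHTTPRequestHandler):
    """Hypothetical 'back-end' of a single-user local application."""

    def do_GET(self):
        if self.path == "/status":
            # Data for the UI is serialised and sent over the local socket.
            body = json.dumps({"status": "ok"}).encode("utf-8")
            content_type = "application/json"
        else:
            body = b"<html><body><h1>Local application</h1></body></html>"
            content_type = "text/html; charset=utf-8"
        self.send_response(200)
        self.send_header("Content-Type", content_type)
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, format, *args):
        pass  # keep the console quiet for a desktop application


def start_local_server(port=0):
    """Start the server on the loopback interface; port 0 picks a free port."""
    server = HTTPServer(("127.0.0.1", port), AppHandler)
    thread = threading.Thread(target=server.serve_forever, daemon=True)
    thread.start()
    return server


def fetch_status(server):
    """Do what the browser side would do: request data over HTTP."""
    host, port = server.server_address
    connection = HTTPConnection(host, port, timeout=5)
    try:
        connection.request("GET", "/status")
        return json.loads(connection.getresponse().read().decode("utf-8"))
    finally:
        connection.close()
```

An application would call start_local_server and then point the user's browser at the resulting address; a single-threaded server like this is adequate precisely because it only has one user to serve.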