Monday 11 May 2009

Some thoughts on a text storage system for Django inspired by Facebook's Haystack

I have been looking at Django lately as I wanted a project to play with and I thought that I might extend it in some way. In some ways, Django has been a disappointment as it is quite mature and very powerful so the opportunities for tinkering that I wanted are not really present. This makes it great for real web developers but less great for me personally ;-)

However, one aspect that kept nagging at me was the use of DBMS fields for large quantities of text and for searching (this is not only a feature of Django, of course). When I used real DBMS (back in the nineties) we had to be very careful about optimising column sizes and we would not have dreamt of storing whole blog entries in a database, let alone larger bodies of text. I know that hardware has gotten cheaper and DBMS have improved but this still feels to me like using the wrong tool - as if the designers used a DBMS for everything because it was the tool that they knew best.

I was browsing for search software and found examples such as Xapian and Sphinx. Again, I had used a similar tool in the late nineties when building an in-house knowledge database but the world has moved on significantly since then. I installed Xapian on my Linux laptop and was shocked by how easy and quick it was to feed in text and search on it. I fed in a Robert E Howard novel using the Python binding and didn't notice it go in. A search was also extremely fast. This provoked the thought of extending Django with Models that include searchable fields. It should be possible to simply tag model fields as searchable and have a Django extension automatically index them with Xapian (or another search tool - the choice would be transparent) and then extend the generic views to include searches. This feels like a project that would be technically interesting and that would extend Django in a style that is consistent with Django.
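To make the "tag fields as searchable" idea concrete, here is a minimal sketch in plain Python. Everything in it is hypothetical: `Story` is a stand-in for a Django model, `searchable_fields` is an invented attribute name, and `InMemoryIndex` is a toy inverted index standing in for Xapian or another backend. It only shows the shape of the idea, not how a real extension would hook into Django.

```python
import re
from collections import defaultdict


class InMemoryIndex:
    """Toy inverted index; a stand-in for a real backend such as Xapian."""

    def __init__(self):
        self._postings = defaultdict(set)  # term -> set of object ids

    def add(self, obj_id, text):
        for term in re.findall(r"\w+", text.lower()):
            self._postings[term].add(obj_id)

    def search(self, query):
        terms = re.findall(r"\w+", query.lower())
        if not terms:
            return set()
        # An object matches only if it contains every query term.
        result = set(self._postings.get(terms[0], set()))
        for term in terms[1:]:
            result &= self._postings.get(term, set())
        return result


class Story:
    """Hypothetical stand-in for a Django model."""

    # Assumed convention: only these fields are fed to the search index.
    searchable_fields = ("title", "body")

    def __init__(self, pk, title, body, author):
        self.pk = pk
        self.title = title
        self.body = body
        self.author = author


def index_object(index, obj):
    """Index only the fields that the class has tagged as searchable."""
    for name in obj.searchable_fields:
        index.add(obj.pk, getattr(obj, name))


index = InMemoryIndex()
index_object(index, Story(1, "The Hour of the Dragon", "Conan loses his throne", "Howard"))
index_object(index, Story(2, "Red Nails", "Conan in a lost city", "Howard"))
```

Because the calling code only sees `add` and `search`, swapping the toy index for a real search tool would not change `index_object`, which is the transparency I am after.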

One issue that occurred to me when thinking about Xapian was the sequencing of index updates. I am wary of indexing as part of the data creation in a web server. If multiple users are creating data simultaneously then Xapian would not like having too many simultaneous updates. As Django is deployed with a range of web servers (or certainly with a range of interfaces to web servers) I would be reluctant to try to implement locking across multiple requests. My current proposed solution is to create an internal message whenever an item is added that needs indexing and then to pick them up and deal with them in a batch from a specialised request. Django provides such a messaging system so it is not necessary to add another table for the purpose and the indexing requests could be kicked off by a cron job or manually by a sysadmin. Using a single batched indexing process also allows more efficient indexing sessions.
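The queue-then-batch approach can be sketched as follows. This is an illustration only: `PendingIndexQueue` and `run_index_batch` are invented names, and the in-process `deque` stands in for whatever persistent message store would really carry the requests between web requests and the batch job.

```python
from collections import deque


class PendingIndexQueue:
    """Hypothetical queue of object ids that are waiting to be indexed."""

    def __init__(self):
        self._pending = deque()

    def enqueue(self, object_id):
        # Called when an item is saved; no indexing happens here.
        self._pending.append(object_id)

    def drain(self, limit=100):
        # Remove and return up to `limit` ids, oldest first.
        batch = []
        while self._pending and len(batch) < limit:
            batch.append(self._pending.popleft())
        return batch


def run_index_batch(queue, index_one, limit=100):
    """Single writer: index one batch and report how many items it handled."""
    batch = queue.drain(limit)
    for object_id in batch:
        index_one(object_id)
    return len(batch)


queue = PendingIndexQueue()
for pk in (1, 2, 3):
    queue.enqueue(pk)

indexed = []  # stands in for the real index
first_run = run_index_batch(queue, indexed.append, limit=2)
second_run = run_index_batch(queue, indexed.append, limit=2)
```

Only `run_index_batch` ever touches the index, so there is a single writer however many requests are adding items.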

While considering non-relational DBMS storage systems (I browsed through articles on BigTable and CouchDB), I came across another story that I noticed on Reddit about Haystack - the storage system used by Facebook to store photographs. Apparently, Facebook uses MySQL extensively but the sheer volume of photo data made it impractical for photo storage. Haystack is used to store photos in a very efficient manner - the Haystacks contain metadata pointing to the photo data; all data is appended to files so multiple reads can take place while appending takes place; the data is supposed to be structured so that it can be retrieved with minimal disk seeks.

Putting these ideas together, I thought of a text version of Haystack. Each text object can be appended to a filing set and the database can just contain its metadata (file name, offset and size). This makes the relational database smaller and so more efficient. The text can be indexed for searching as it gets saved and it should be possible to extend the generic views to handle this transparently. Other ideas from Haystack and CouchDB that can be applied include versioning and never deleting content. I would probably also have a configurable maximum size for a storage file and have the system just keep extending the store with new storage files - I have a bias against unreasonably large individual files for backup and management reasons.
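A rough sketch of the storage side, using only the standard library. The names (`TextStore`, `TextRef`, the `store-000000.dat` file naming) are my own inventions for illustration, and the sketch assumes a single writer and ignores versioning; the returned tuple is what the database row would hold in place of the text itself.

```python
import os
import tempfile
from collections import namedtuple

# The metadata the relational database would keep for each text object.
TextRef = namedtuple("TextRef", "file_name offset size")


class TextStore:
    """Hypothetical append-only text store that rolls over to new files."""

    def __init__(self, directory, max_file_size=1024 * 1024):
        self.directory = directory
        self.max_file_size = max_file_size
        self._file_number = 0

    def _current_name(self):
        return "store-%06d.dat" % self._file_number

    def append(self, text):
        data = text.encode("utf-8")
        path = os.path.join(self.directory, self._current_name())
        offset = os.path.getsize(path) if os.path.exists(path) else 0
        # Start a new storage file rather than exceed the configured maximum.
        if offset > 0 and offset + len(data) > self.max_file_size:
            self._file_number += 1
            path = os.path.join(self.directory, self._current_name())
            offset = 0
        with open(path, "ab") as handle:
            handle.write(data)
        return TextRef(self._current_name(), offset, len(data))

    def read(self, ref):
        # One seek and one read, using only the stored metadata.
        with open(os.path.join(self.directory, ref.file_name), "rb") as handle:
            handle.seek(ref.offset)
            return handle.read(ref.size).decode("utf-8")


store = TextStore(tempfile.mkdtemp(), max_file_size=30)
first = store.append("hello world")
second = store.append("a longer second entry")
third = store.append("tail")
```

Existing bytes are never rewritten, so an earlier `TextRef` stays valid after later appends. A text larger than the maximum still gets a file of its own rather than being split.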

If I get a few hours, I should be able to prototype this and feed in a range of literature from Project Gutenberg as a test set.
