Since my last post I spent a couple of evenings putting together a pure Python storage class to store arbitrary chunks of data in a way inspired by Haystack. This proved quite easy in Python (unsurprisingly). It would also be quite easy to implement in C/C++ for performance but I am not convinced that is necessary, and a pure Python implementation has the virtue of being simpler to deploy.
The next stage is to integrate it into Django so that it can be used as a model field, and to extend the admin application and generic views to handle it. This should be almost as easy, so I am now reading the nice tutorial on creating new model fields along with the FileField source code.
Interestingly, I came across this story via Reddit about drawbacks of CouchDB.
http://blog.woobling.org/2009/05/why-i-dont-use-couchdb.html
I had already decided that I wasn't keen to play with CouchDB at present because I think some things (such as authentication and real regular data) are more efficiently done in a normal SQL DBMS.
I am now brooding about creating a CMS with a combination of my storage classes and search but I am trying to work out if it is any different to a Wiki with search facilities.
Sunday, 17 May 2009
Monday, 11 May 2009
Some thoughts on a text storage system for Django inspired by Facebook's Haystack
I have been looking at Django lately as I wanted a project to play with and I thought that I might extend it in some way. In some ways, Django has been a disappointment as it is quite mature and very powerful so the opportunities for tinkering that I wanted are not really present. This makes it great for real web developers but less great for me personally ;-)
However, one aspect that kept nagging at me was the use of DBMS fields for large quantities of text and for searching (this is not only a feature of Django, of course). When I used real DBMS (back in the nineties) we had to be very careful about optimising column sizes and we would not have dreamt of storing whole blog entries in a database, let alone larger bodies of text. I know that hardware has gotten cheaper and DBMS have improved but this still feels to me like using the wrong tool - as if the designers used a DBMS for everything because it was the tool that they knew best.
I was browsing for search software and found examples such as Xapian and Sphinx. Again, I had used a similar tool in the late nineties when building an in-house knowledge database but the world has moved on significantly since then. I installed Xapian on my Linux laptop and was shocked by how easy and quick it was to feed in text and search on it. I fed in a Robert E Howard novel using the Python binding and didn't notice it go in. A search was also extremely fast. This provoked the thought of extending Django with Models that include searchable fields. It should be possible to simply tag model fields as searchable and have a Django extension automatically index them with Xapian (or another search tool - the choice would be transparent) and then extend the generic views to include searches. This feels like a project that would be technically interesting and that would extend Django in a style that is consistent with Django.
One issue that occurred to me when thinking about Xapian was the sequencing of index updates. I am wary of indexing as part of the data creation in a web server. If multiple users are creating data simultaneously then Xapian would not like having too many simultaneous updates. As Django is deployed with a range of web servers (or certainly with a range of interfaces to web servers) I would be reluctant to try to implement locking across multiple requests. My current proposed solution is to create an internal message whenever an item is added that needs indexing and then to pick these messages up and deal with them in a batch from a specialised request. Django provides such a messaging system so it is not necessary to add another table for the purpose, and the indexing requests could be kicked off by a cron job or manually by a sysadmin. Using a single batched indexing system also allows more efficient indexing sessions.
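The batching idea can be sketched in a few lines of plain Python 3. This is only an illustration under stated assumptions: PendingIndexQueue and run_index_batch are hypothetical names, the in-memory list stands in for Django's messaging system, and index_document stands in for whatever call would actually add a document to Xapian.

```python
import threading


class PendingIndexQueue:
    """Collects ids of items awaiting indexing (stand-in for a message store)."""

    def __init__(self):
        self._lock = threading.Lock()
        self._pending = []

    def add(self, item_id):
        # Called when an item is saved; cheap, and never touches the search index.
        with self._lock:
            self._pending.append(item_id)

    def drain(self):
        # Atomically take everything queued so far and leave the queue empty.
        with self._lock:
            batch, self._pending = self._pending, []
        return batch


def run_index_batch(queue, load_text, index_document):
    """Index every pending item in one session; return how many were indexed.

    load_text(item_id) and index_document(item_id, text) are supplied by the
    caller, so the only writer to the search index is this one batch job.
    """
    batch = queue.drain()
    for item_id in batch:
        index_document(item_id, load_text(item_id))
    return len(batch)
```

The point of the design is that requests only ever call add(), while the single specialised request (or cron job) is the only caller of run_index_batch(), so no locking across web requests is needed on the index itself.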
While considering non-relational storage systems (I browsed through articles on BigTable and CouchDB), I came across another story on Reddit about Haystack - the storage system used by Facebook to store photographs. Apparently, Facebook uses MySQL extensively but the sheer volume of photo data made it impractical for photo storage. Haystack is used to store photos in a very efficient manner - the Haystacks contain metadata pointing to the photo data; all data is appended to files so multiple reads can take place while appending takes place; the data is supposed to be structured so that it can be retrieved with minimal disk seeks.
Putting these ideas together, I thought of a text version of Haystack. Each text object can be appended to a filing set and the database can just contain its metadata (file name, offset and size). This makes the relational database smaller and so more efficient. The text can be indexed for searching as it gets saved and it should be possible to extend the generic views to handle this transparently. Other ideas from Haystack and CouchDB that can be applied include versioning and never deleting content. I would probably also have a configurable maximum size for a storage file and have the system just keep adding new storage files as each one fills up - I have a bias against unreasonably large individual files for backup and management reasons.
If I get a few hours, I should be able to prototype this and feed in a range of literature from Project Gutenberg as a test set.
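As a rough sketch of that design (not the actual storage class, which is not shown here), the following Python 3 code appends each text to the current storage file and returns the (file name, offset, size) triple that would be kept in the database. The names TextStore and Location, the store-NNNNNN.dat file naming and the default size limit are all assumptions made for illustration.

```python
import os
from collections import namedtuple

# The metadata that would live in the relational database (hypothetical layout).
Location = namedtuple("Location", ["file_name", "offset", "size"])


class TextStore:
    """Append-only text store that starts a new file once one is full."""

    def __init__(self, directory, max_file_size=64 * 1024 * 1024):
        self.directory = directory
        self.max_file_size = max_file_size
        os.makedirs(directory, exist_ok=True)
        self._index = self._latest_index()

    def _file_name(self, index):
        return "store-%06d.dat" % index

    def _latest_index(self):
        # Carry on with the highest-numbered existing file, or start at 0.
        names = [n for n in os.listdir(self.directory)
                 if n.startswith("store-") and n.endswith(".dat")]
        return max((int(n[6:-4]) for n in names), default=0)

    def _size(self, index):
        path = os.path.join(self.directory, self._file_name(index))
        return os.path.getsize(path) if os.path.exists(path) else 0

    def append(self, text):
        data = text.encode("utf-8")
        # Roll over if this chunk would push a non-empty file past the limit.
        if 0 < self._size(self._index) and \
                self._size(self._index) + len(data) > self.max_file_size:
            self._index += 1
        offset = self._size(self._index)
        name = self._file_name(self._index)
        # Existing bytes are never rewritten; new data only goes on the end.
        with open(os.path.join(self.directory, name), "ab") as f:
            f.write(data)
        return Location(name, offset, len(data))

    def read(self, location):
        path = os.path.join(self.directory, location.file_name)
        with open(path, "rb") as f:
            f.seek(location.offset)
            return f.read(location.size).decode("utf-8")
```

Reading a text back needs one open, one seek and one read, and nothing already written is ever modified, which is what makes versioning and never deleting content natural additions. This sketch assumes a single writer; concurrent appends would need extra care.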
Sunday, 3 May 2009
Use of the Browser as a (cross-platform) UI
I was considering some mobile application development and was debating which UI library to use. Some thinking came up with the idea of using the browser as the UI with a local HTTP server. This is really suitable for creating PC local applications rather than mobile apps at this time but more powerful mobile phones may change this.
An obvious attraction is that it helps the separation of content from presentation (as long as you do not over-indulge in JavaScript, let alone do anything daft such as using Java Applets or Flash).
Browser as UI
With modern browsers and straightforward applications, a perfectly good UI can be built. It is not really possible to create a very complex UI but I claim that most applications do not need a very complex UI - just how dynamic and immersive does a PIM application need to be?
Applications such as Gmail and online office packages demonstrate what can be done.
HTTP server accessing local functions
Using an HTTP server for the 'back-end' functionality means that any language can be used - there really is no need for anything in common with the UI. My personal preference is Python so I am going to experiment with Django.
Using HTTP as the interface can be seen as overkill or inefficient. Rather than reading data directly from the UI and modifying it directly, everything has to be serialised and transferred over an internal socket. This is true but the overhead is probably negligible compared with the actual application functionality. The real question is whether it provides a convenient programming environment.
Most web development is concerned with preventing insecure access but if the local HTTP server is the application engine then it needs full access to the local machine - not really a problem.
The real issue is deployment of the server - no normal user is going to install and configure Apache and MySQL just for one application. The good news is that the server does not need to be scaleable as it will only be serving one user. This means that lighter-weight servers such as Django's development server or one custom-built from Python's simple HTTP server classes can be adequate.
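To give an idea of how little is needed, here is a minimal sketch of a single-user local server built on the HTTP server classes in Python 3's standard library (the http.server module). AppHandler and make_server are hypothetical names, and the fixed HTML page is just a placeholder for real application output.

```python
from http.server import BaseHTTPRequestHandler, HTTPServer


class AppHandler(BaseHTTPRequestHandler):
    """Answers every GET with one placeholder page."""

    def do_GET(self):
        body = b"<html><body><h1>Local application</h1></body></html>"
        self.send_response(200)
        self.send_header("Content-Type", "text/html; charset=utf-8")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, format, *args):
        # Suppress the default per-request logging to stderr.
        pass


def make_server(port=0):
    # Bind to the loopback address only, so other machines cannot connect.
    # Port 0 asks the operating system for any free port.
    return HTTPServer(("127.0.0.1", port), AppHandler)
```

Calling make_server(8000).serve_forever() and pointing a browser at http://127.0.0.1:8000/ would then show the page. Binding to 127.0.0.1 matters here because the server has full access to the local machine.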