Monday, 21 September 2009

Search and Indexing

I've had a lot of exposure to full text indexers since I started working at HedTek Ltd. Having gone from Lucene to Solr and now even Sphinx, I feel it's time to write up some of my experiences.

Probably the best known of the indexers I've now encountered, Lucene is an Apache project that aimed (and succeeded) at implementing an efficient, simple and useful full text indexer in Java. It is a great library for creating your own search indexes and performing quick searches across them with a familiar syntax. It's also very flexible, allowing you to plug in your own functionality to index just about any kind of document.

With all that, you'd wonder why anyone would use anything else. Well, it's not all rose gardens with Lucene. Firstly, it is a low-level API designed to be the heart of an index and search engine, not a complete solution ready for use straight out of the box. Secondly, in the project where I had my initial exposure to it, the version in use was the Zend PHP implementation of Lucene. While this is an excellent idea (it allows Lucene to be used from PHP directly, with no messing around with Java interfaces), there was one key problem: performance. With the index size in use (24 million records), searches that would take under a second with the Java library would take more than 10 seconds with the Zend library. This was clearly unacceptable, so other options were required.

Solr is one option that removes the need to interface with Java from your desired language while still retaining the Java Lucene implementation. Solr calls itself an 'Enterprise search server' built on Lucene and fills in one of the gaps I mentioned earlier: Solr is a working search engine right out of the box. It manages this feat by packaging the Lucene library up as a Java web application (runnable in any servlet container, e.g. Jetty or Tomcat) and providing an HTTP interface for searching. Results can be returned as XML or JSON straight out of the box, and there are a whole host of other useful features on top of this that help an ailing developer create a fully fledged search engine easily. One of the main ones is the ability to define 'schemas' that tell Solr how your records will look, adding a type system to the index and allowing malformed data to be picked up much more easily.
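To give a feel for that HTTP interface, here is a minimal Python sketch of building a search URL and unpacking a JSON result. The host, port, field names and sample response are assumptions made up for illustration (they are not from the project described here), and the sketch parses a canned response rather than contacting a real server.

```python
import json
from urllib.parse import urlencode


def build_select_url(base_url, query, rows=10):
    """Build a Solr select URL that asks for a JSON response."""
    params = urlencode({"q": query, "wt": "json", "rows": rows})
    return f"{base_url}/select?{params}"


def extract_docs(response_text):
    """Return (total_hits, docs) from the body of a Solr JSON response."""
    body = json.loads(response_text)["response"]
    return body["numFound"], body["docs"]


# Hypothetical server location and query; adjust to your own install.
url = build_select_url("http://localhost:8983/solr", "title:indexing")

# Canned response in the shape Solr returns for wt=json (made-up record).
sample = json.dumps({
    "responseHeader": {"status": 0},
    "response": {
        "numFound": 1,
        "start": 0,
        "docs": [{"id": "42", "title": "Search and Indexing"}],
    },
})

total, docs = extract_docs(sample)
print(url)
print(total, docs[0]["title"])
```

In real use you would fetch the URL with whatever HTTP client you prefer and pass the response body to extract_docs. The point is that any language that can speak HTTP and parse JSON can search the index, with no Java on the client side.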

Of course, for Solr you need to have a Java server set up. This isn't always the easiest task, and there are some subtleties involved that can make it a daunting prospect (I certainly found it so and still do, as I'm not a Java server expert... I'm barely a novice). The Solr schema is also a requirement, so in order to set up your server you need to create a schema for your data. Not a huge imposition, but the schema is defined using an XML language that is a bit opaque to Solr newbies.

Sphinx is the last of the three I have tried, and so far I've only tried it on much smaller indexes. It is another alternative in the full text indexing marketplace, and doesn't rely on Lucene. It functions as a search server (making it more comparable to Solr than to Lucene) and has several advantages over Solr:
  • It is much easier to set up. Where Solr took me over a day to install and get running in just a basic configuration, Sphinx took me a bit over an hour to install and configure with a connection directly to a MySQL database.
  • It doesn't need a Java server. Sphinx runs as a Unix daemon, listening on a local port. This makes it much easier to set up and feels less clunky (at least to me).
  • It is very easy to set up multiple indexes. This is possible in Lucene and Solr, but Sphinx makes it very easy: you have a config file and just define lots of indexes. You can even use the same DB connection for them, allowing you to have indexes that are optimisations of a basic one. This is not as easy in Solr (it may be a simple process, but I haven't come across it yet, so at the very least it takes more effort to find with Solr than with Sphinx).
Sphinx does have disadvantages as well, though. The search results it returns are less useful, as they contain just the document ID, whereas Lucene results return stored fields (which can be all you need in certain circumstances, and avoid hitting the database after a search). It also seems more geared towards indexing databases, whereas Solr and Lucene are more general purpose. This makes Sphinx great when you are indexing a database, but no good if you are indexing a large collection of XML files on disk or crawling a website.
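The extra database hit that ID-only results imply looks roughly like the Python sketch below. It is only an illustration under stated assumptions: the list of IDs stands in for what a search would return, the table and column names are hypothetical, and an in-memory SQLite database is used in place of MySQL so the example is self-contained.

```python
import sqlite3


def fetch_in_rank_order(conn, ids):
    """Fetch the rows for the matched IDs, keeping the search ranking order."""
    if not ids:
        return []
    placeholders = ",".join("?" for _ in ids)
    rows = conn.execute(
        f"SELECT id, title FROM documents WHERE id IN ({placeholders})", ids
    ).fetchall()
    # The database returns rows in its own order, so re-sort by search rank.
    by_id = {row[0]: row for row in rows}
    return [by_id[i] for i in ids if i in by_id]


# Hypothetical table standing in for the indexed database.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE documents (id INTEGER PRIMARY KEY, title TEXT)")
conn.executemany(
    "INSERT INTO documents VALUES (?, ?)",
    [(1, "Lucene notes"), (2, "Solr setup"), (3, "Sphinx config")],
)

# Stand-in for the document IDs a search returned, best match first.
matched_ids = [3, 1]
results = fetch_in_rank_order(conn, matched_ids)
print(results)
```

With stored fields, as in Lucene or Solr, this second round trip can be skipped whenever the stored values are all the page needs.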

So, I haven't come across an absolute winner in the full text indexing arena, but I have come across several alternatives and all of them are suitable for different purposes. If you need something indexed quickly and in Java, use Lucene. If you need a more robust server for general purpose indexing and searching, definitely check out Solr. And if you are searching databases specifically, then Sphinx should definitely be in your list of options.