Friday, 24 April 2009

SAX Parsing

After some time playing with minidom in Python, I finally looked at alternatives and discovered SAX. For my usual requirements (pulling data out of XML files into various internal data structures) it is such an improvement over minidom that I wish I'd come to it first, rather than dealing with the mess that is minidom for reading XML. I'm still learning the ins and outs of SAX, but as a brief summary for those who know it even less than I do, I'll explain the basics.

SAX is an event-based API for parsing XML. The parser reads the XML and generates events as it goes: the start of an element, the end of an element, a run of character data, and so on. Your application receives these events and responds to them as it sees fit. This can mean a small amount of extra book-keeping in your application to know where you are in the XML tree (assuming that's important for your app), but it is so much easier to deal with than the multiple levels of looping through child nodes that are required to pull data out of an entire DOM tree.
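To make the event model concrete, here is a minimal sketch using Python's standard xml.sax module. The class name PathTracker, the helper extract_titles and the sample document are made up for illustration. It is written for Python 3, and the list of open element names is the "extra book-keeping" mentioned above.

```python
import xml.sax


class PathTracker(xml.sax.ContentHandler):
    """Collects the text of every <library>/<book>/<title> element."""

    def __init__(self):
        super().__init__()
        self.path = []      # names of the elements currently open
        self.titles = []    # results gathered so far
        self._buffer = []   # character data seen since the last tag

    def startElement(self, name, attrs):
        self.path.append(name)
        self._buffer = []

    def characters(self, content):
        # The parser may deliver text in several chunks, so accumulate it.
        self._buffer.append(content)

    def endElement(self, name):
        if self.path == ["library", "book", "title"]:
            self.titles.append("".join(self._buffer).strip())
        self.path.pop()
        self._buffer = []


# Hypothetical sample document, not taken from any real data set.
SAMPLE = (b"<library><book><title>Dune</title></book>"
          b"<book><title>Emma</title></book></library>")


def extract_titles(xml_bytes):
    handler = PathTracker()
    xml.sax.parseString(xml_bytes, handler)
    return handler.titles


print(extract_titles(SAMPLE))
```

There is no tree to walk here: the handler only reacts to what the parser tells it, and the path list is all it needs to know where it is in the document.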

My first problem with SAX (at least from what I know of it so far) is that with a complicated XML file you can end up with an extremely large and non-cohesive handler. To get around this I created a SAX content handler that maintains a 'processor stack', which lets you split your processing into smaller, more cohesive units. It isn't perfect: transferring data between processors is currently a bit ugly, and the decision to add a new processor is hard-coded into each processor. Still, it makes for a nice system to me. I can split up my processing to any level I like, and the processor stack acts as that extra book-keeping I mentioned, transparently. It also gives more flexibility to change the structure and to reuse components.
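A rough sketch of the processor stack idea follows. This is not my actual code, just one possible shape for it, and every name in it (Processor, StackHandler, LibraryProcessor, BookProcessor, parse_books) is made up for the example. It assumes a processor asks for a child processor by returning one from start(), and that the handler pops the child again when the element that triggered it closes.

```python
import xml.sax


class Processor:
    """Base class: handles the events for one part of the document."""

    def start(self, name, attrs):
        return None  # return a Processor to hand the subtree to it

    def text(self, content):
        pass

    def end(self, name):
        pass

    def finish(self):
        pass  # called when this processor is popped off the stack


class BookProcessor(Processor):
    """Builds one dict per <book>, keyed by child element name."""

    def __init__(self, books):
        self.books = books
        self.record = {}
        self.field = None

    def start(self, name, attrs):
        self.field = name
        self.record[name] = ""

    def text(self, content):
        if self.field is not None:
            self.record[self.field] += content

    def end(self, name):
        self.field = None

    def finish(self):
        self.books.append(self.record)


class LibraryProcessor(Processor):
    def __init__(self):
        self.books = []

    def start(self, name, attrs):
        # The hard-coded part: this processor decides who handles <book>.
        if name == "book":
            return BookProcessor(self.books)


class StackHandler(xml.sax.ContentHandler):
    """Forwards SAX events to whichever processor is on top of the stack."""

    def __init__(self, root):
        super().__init__()
        self.depth = 0
        self.stack = [(root, 0)]  # (processor, depth it was pushed at)

    def startElement(self, name, attrs):
        self.depth += 1
        child = self.stack[-1][0].start(name, attrs)
        if child is not None:
            self.stack.append((child, self.depth))

    def characters(self, content):
        self.stack[-1][0].text(content)

    def endElement(self, name):
        processor, pushed_at = self.stack[-1]
        if pushed_at == self.depth:
            # The element that created this processor is closing.
            self.stack.pop()
            processor.finish()
        else:
            processor.end(name)
        self.depth -= 1


# Hypothetical sample document for the example.
LIBRARY_XML = (b"<library>"
               b"<book><title>Dune</title><author>Herbert</author></book>"
               b"<book><title>Emma</title><author>Austen</author></book>"
               b"</library>")


def parse_books(xml_bytes):
    root = LibraryProcessor()
    xml.sax.parseString(xml_bytes, StackHandler(root))
    return root.books


print(parse_books(LIBRARY_XML))
```

Passing the shared books list into BookProcessor is the kind of data transfer between processors that I described as a bit ugly. It works, but each processor has to know something about its parent.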

All in all, I quite like SAX and I'll probably like it even more if I spend some time on my content processing to make it a more generic solution :)