Saturday 17 April 2010

Features that make you green

For a while now, I've been practising one of the current 'crazes' of the ruby/rails development ecosystem - automated acceptance testing using Cucumber.

Acceptance testing is something that every software engineering project (hell, any project of any form) needs, and in some sense already has: it's performed by the person who commissioned the project when they determine whether it fulfills their criteria and expectations, be that formally or informally. Automated acceptance testing is the practice of writing these tests in a form a computer can run. This allows an iterative approach to acceptance: the client gets tests for the system that can be run easily (although not necessarily quickly - many acceptance tests are slow because they exercise the entire stack), can accept parts of the project as they are completed, and gains a guard against regression. Once a feature is finished, its acceptance tests pass; if a future feature breaks those tests, the code has regressed and is no longer acceptable.

Cucumber itself is a tool for writing these tests. It uses a syntax called 'Gherkin' and breaks your acceptance tests down into features, each made up of several scenarios. A scenario is written in the form 'Given <x>, When <y>, Then <z>'; x, y and z are called steps, and each step has a definition that 'wires' the scenario up to the functionality required to pass the test. The idea is that these features can be written within the traditional TDD process of 'Red, green, refactor', although the overall loop is longer than the tight loop of TDD and will encompass a lot of smaller 'Red, green, refactor' cycles as the functionality is built.
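For the uninitiated, here's a made-up sketch of what this looks like in practice; the feature text, the User model and the Webrat-style browser helpers are all my assumptions rather than anything from a real project. A scenario:

Feature: Signing in
  Scenario: Successful sign in
    Given a user "alice" with password "secret"
    When I sign in as "alice" with password "secret"
    Then I should see "Welcome, alice"

and the step definitions that wire it up:

# features/step_definitions/session_steps.rb
Given /^a user "([^"]*)" with password "([^"]*)"$/ do |login, password|
  User.create!(:login => login, :password => password)
end

When /^I sign in as "([^"]*)" with password "([^"]*)"$/ do |login, password|
  visit '/session/new'
  fill_in 'Login',    :with => login
  fill_in 'Password', :with => password
  click_button 'Sign in'
end

Then /^I should see "([^"]*)"$/ do |text|
  response.should contain(text)
end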

My experiences with the system have been mostly positive. It has helped me work out what I need to do in my projects, giving me starting points and a clear description of what I need to support, and it helps clarify the routes through your system that support specific tasks. However, it adds overhead and requires the extra skill of writing good scenarios. The 'advised' method of getting customers to write these doesn't seem to be working for me, and besides, the projects I've been working on haven't had a customer as such - just myself working out what I think I need.

I'm not going to abandon the practice in the near future as I do think it adds worthwhile value to a project. However, it isn't a silver bullet for project success, and it takes a fair bit of effort to write good, maintainable features.

References:
Cucumber: http://cukes.info/

Sunday 7 March 2010

The case against software patents just got another pillar

Admittedly, it's a bit delayed in coming to light, but I've just helped push this article around the twittersphere: http://juixe.com/techknow/index.php/2010/03/04/us-patent-linked-list/

For those too busy to follow links, the gist of that article is that one Ming-Jen Wang of LSI Logic Corp patented the Linked List. Now, this would probably be seen as ok if the patent had been granted back in the early days of computing, when the Linked List was first being used... it would have set the development of the field back a couple of decades, but it would have made sense back then. However, this patent was granted in 2006! I learnt about Linked Lists in my introductory Algorithms course back in 2005, using books that were probably about 10 years old then. In fact, checking the wikipedia page, it seems that the Linked List is pretty much celebrating its 55th birthday this year. That's not just older than me, that's as old as my boss!
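For anyone who hasn't met one, here's the basic structure in its entirety, sketched in Ruby (a singly linked variant; my code, obviously, not the patent's):

# A node holds a value and a reference to the next node. That's it.
class Node
  attr_accessor :value, :next_node

  def initialize(value, next_node = nil)
    @value     = value
    @next_node = next_node
  end
end

# Build the list 1 -> 2 -> 3 and walk it.
node = Node.new(1, Node.new(2, Node.new(3)))
while node
  puts node.value
  node = node.next_node
end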

I'm guessing the requirement to check for prior art has been discontinued? I can't think of any other reason why someone could patent a data structure so old that, even if it had been patented originally, the patent would have expired before I was born.

Monday 7 December 2009

NWRUG Code Kwoon, run by Ashley Moran of PatchSpace

I recently made my first foray into the deep, geeky depths of local user groups and went to a session being run by NWRUG (the North West Ruby User Group). The session was a 'Code Kwoon' designed to introduce people to the wonders of RSpec and Behaviour Driven Development and was run by Ashley Moran of the company PatchSpace.

Now, if you've been involved in Ruby at all in recent years, you will probably have at least heard of RSpec and BDD, but if (like me) you were living like a hermit crab and only occasionally sticking your head out from under a rock, then you probably won't have gone any further than that. I personally had abandoned my rock a month or so previously and had started delving into RSpec and related BDD tools in a fairly serious manner, but was on the lookout for anything that would improve my knowledge of the area. This Code Kwoon seemed like a good opportunity.

Unfortunately, the Kwoon was pitched at an even more introductory level than mine, being aimed at people who had just arrived on the BDD planet and were blinking, stepping into the RSpec sun. But before I go into more detail on that, I should step back and give a brief (and probably wrong) explanation of RSpec, BDD and a 'Code Kwoon'.

So, RSpec. Where to start? Well, it's difficult to describe RSpec without mentioning BDD, so I should probably introduce that first... so, BDD. Where to start? Well, BDD is a philosophy rising out of the more mechanical process of Test Driven Development (TDD), and it is currently gaining a lot of ground in the Ruby and RoR community. The driving principle, at least to me, is that while TDD says what to do (e.g. write your tests first), it doesn't say what to test or how to test it, making it a mechanical process that is still a bit lacking. That's not to say TDD doesn't work (just look at all the TDD frameworks, books and 'best practices' that have arisen), but one key thing about these is that each is its own distinctive version of TDD. There is some overlap, but mainly at the mechanical level of writing your tests before the rest of your code.

What BDD brings is along the same lines, in that it provides its own version of TDD; however, BDD recognises this and names itself differently in order to make the distinction. TDD is the process you are following; BDD is the set of principles guiding what you test and how you test it. Having gone through all that detail, the 'meat' of BDD is deceptively simple... you test 'behaviour'. This is done in a variety of ways, and it encompasses everything from traditional unit-testing granularity through integration and functional testing, all the way up to acceptance-level testing. Now, I could go on and on about this and end up going around in circles (or possibly circling a drain), so I'll leave it there, except to mention that BDD tends to lean heavily on mocking software objects (I suggest you google this, as many others have explained mocks far more eloquently than I could manage).
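Still, a tiny invented example gives the flavour (the Order class here is mine, made trivially small so the spec can run):

# A trivial Order that delegates payment to a collaborator.
class Order
  def initialize(gateway, total)
    @gateway, @total = gateway, total
  end

  def settle
    @gateway.charge(@total)
  end
end

describe Order do
  it "should charge the gateway for its total when settled" do
    # The mock stands in for a real payment gateway, and the message
    # expectation *is* the test: settle must send charge(100).
    gateway = mock('payment gateway')
    gateway.should_receive(:charge).with(100)

    Order.new(gateway, 100).settle
  end
end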

So, back to RSpec, which is now also deceptively simple to explain. Basically, RSpec is a framework (written in Ruby, for testing Ruby code) that implements a lot of the philosophy of BDD. It changes the language of tests from simple assertions into statements about what the code should be doing. While the two are almost exactly the same in terms of physical implementation, the change in how you understand the tests is remarkable. No longer are you mechanically calling a function and ensuring it has an expected result; instead you are saying that this function should do this, or that. In rails testing it really comes into its own, as you can easily write RSpec tests (called specs) saying that a particular action should be a success and should do this and that, as opposed to traditional testing, where you have a function call and then a series of fairly dry assertions about the result.
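To see the shift in language, compare a traditional assertion with its spec equivalent (a deliberately trivial example of my own):

# Where a traditional test would read:
#
#   def test_upcase
#     assert_equal "HELLO", "hello".upcase
#   end
#
# the spec reads as a statement about behaviour:
describe String do
  it "should upcase its contents" do
    "hello".upcase.should == "HELLO"
  end
end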

Now that I've thoroughly confused everyone regarding RSpec and BDD, it's time to muddy the waters with the 'Code Kwoon' aspect of the evening. Born from the (possibly feverish ;) ) imagination of Ashley as a 'Good Idea', a Code Kwoon is similar to the 'coding kata' concept, except named in a different language (Chinese rather than Japanese, I believe Ashley said). It's a way to practise coding skills on a specific problem, and in this case it was done in a pair-driven fashion with hot-seat pairs.

So, the evening... I mentioned that the level was a bit more introductory than I originally anticipated, mainly because many of the attendees were making their very first forays into RSpec and the BDD arena. The problem on the table was a 'Poker hand recogniser' that was to take in a series of Poker hands (potentially from different variations of Poker, such as Texas Hold 'em or 5 card stud) and determine a winner. Given that the time allotted to the session was about 80 minutes, this could be seen as a trifle ambitious ;) However, it worked in the sense that it wasn't a trivial problem, so it illustrated the process much better.

To me the evening also showed what I'd describe as a 'clash' of methodologies. As I said, a lot of people there were making their first foray into RSpec land, alongside some attendees well established in the ways of RSpec and BDD. After an initial chunk of development by the established RSpecers, the hot-seating took full force, some newcomers to the scene took over development, and the change couldn't have been more obvious. Where the RSpecers are used to getting the tests to pass and then refactoring (so that you know you have a solution that passes the tests, and your refactoring is only neatening things up), the newcomers were what I'd call 'traditional' developers: faced with a failing test, they tried to build in new abstractions without fixing it. Development basically stopped at this point, as the hot-seat format meant people were swapping out and then spending their slot changing the previous person's abstraction to match how they thought about the problem. It wasn't until close to the end of the second session that the failing test was finally fixed (by going back to basics) and more tests were added.
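For flavour, here's a guess at the kind of spec the evening was aiming to grow; the real code from the night is lost to me, and PokerHand and its API are entirely my invention:

# A toy PokerHand that only knows about flushes, to keep the sketch runnable.
class PokerHand
  def initialize(cards)
    @cards = cards
  end

  def flush?
    # Cards are strings like "2H"; the last character is the suit.
    @cards.map { |card| card[-1, 1] }.uniq.length == 1
  end
end

describe PokerHand do
  it "should recognise a flush" do
    PokerHand.new(%w[2H 9H JH QH KH]).should be_flush
  end

  it "should not call a mixed-suit hand a flush" do
    PokerHand.new(%w[2H 9S JH QH KH]).should_not be_flush
  end
end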

Thus, the evening ended up not showing as much about RSpec as I expected (and probably not as much as the organisers intended), but it was an educational experience anyway. It showed how much trouble can occur when differing development styles clash (much exaggerated by the quickness of the hot-seat - 7 minute slots), and how more traditional developers try to solve a problem by redefining it, rather than fixing it first and then redefining and improving the code from a solid foundation.

Thursday 19 November 2009

Enterprise Rails - Review

I've been reading this on the way to and from work this week and I have to say it's definitely a book that exceeded my initial expectations.

My initial thoughts about the book's contents (probably in line with most people's expectations of a book called 'Enterprise Rails') were that it would be filled with details on XML, SOAP and SOA, and have very little regard for the 'Railisms' that keep the elegance of the Rails framework. I figured there might be some nuggets of information about scalable architecture that would prove useful in the long run, which is why I acquired a copy.

As it turns out, I was wrong on almost every front. The book keeps 'Railisms' very much intact, concentrating instead on the areas that are ignored or under-treated in other rails books. It starts with chapters on code organisation using plugins and modules (and I immediately adopted the module organisation for one of my projects, before it was too late). It then moves on to several chapters targeted entirely at the database. Now, I don't consider myself bad at DB design and implementation; I can fairly easily produce a data layout that conforms to 3NF, but I normally stopped there. The author doesn't. He pushes well beyond that point into Domain Key Normal Form, shows how to base ActiveRecord models on views, how to ensure referential integrity at the database layer instead of in the application (where it is surprisingly easy to bypass, even keeping within the ActiveRecord API), and generally restores the database to a solid, working part of your application. In direct contrast, most rails books treat the database as a secondary concern, completely abstracted away by ActiveRecord and migrations. The author acknowledges this viewpoint but points out that it is driven by applications that haven't reached the complexity of even a 'simple' enterprise application; moreover, it is encouraged in many ways by MySQL, which lacks the features of commercial-quality databases and leads developers to think those features are unimportant. The author makes the valid point that by the time these features become important (because your application has become hugely popular and is dying under the load), it is often too late to retrofit them completely. So the chosen route is to engineer in all the constraints from the start and to choose PostgreSQL as the database - an open-source offering that DOES offer most of the features of a commercial one.
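To give a flavour of the approach (my sketch of the general idea, not the author's exact code), here is referential integrity pushed into the database from a plain rails migration:

# Rails 2.x-era migration: a real foreign key, enforced by PostgreSQL
# itself, so bad data is rejected even if someone bypasses the models.
class AddOrdersUserForeignKey < ActiveRecord::Migration
  def self.up
    execute <<-SQL
      ALTER TABLE orders
        ADD CONSTRAINT fk_orders_user_id
        FOREIGN KEY (user_id) REFERENCES users (id)
    SQL
  end

  def self.down
    execute "ALTER TABLE orders DROP CONSTRAINT fk_orders_user_id"
  end
end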

That encompasses the first half of the book (which is just over 300 pages long), and it is a testament to the author that he has fitted so much into that space without skimping on the quality of explanations or code samples. After just under a week of reading I have gone through the database chapters and just started on the sections on SOA... and so far, they are as good as the database sections! Rather than the hazy descriptions of services you frequently get before someone launches into complicated SOAP envelopes and overwhelms the reader with XML, we instead get a picture of SOA as an architecture that does to monolithic web applications what OO design did to procedural coding.

Unfortunately, that is as far as I've read, but based on the first 2/3s of the book alone, I would say it is a must for any serious rails developer. It is well written and sensible; it keeps all the lovely rails conventions we developers love, but fills in the gaps where Rails doesn't quite cover the ground completely. You may think you don't need the advice and code from this book, and you may be right. But if you plan on creating the Next Big Thing (tm) and building a site that WILL scale, then this book is a definite read.

Thursday 29 October 2009

Ruby XML Builder prefixes

This is a topic I've come across a few times now, and it turned up again recently when someone asked a question in an IRC channel (it was either #ruby or #rails on irc.freenode.net, I can't quite remember which). The basic problem was that they wanted to output an XML namespace prefix on the tags generated with Builder. Anyone who has used Builder will be familiar with its syntax:
xml.someTag do
  xml.anotherTag "tag content"
end
and can see that this doesn't work for a tag with an xml namespace prefix, since that includes a : (a special character in ruby). So what is the solution? The person asking the question was actually told by several people that it wasn't possible, but that didn't seem right. And it turns out that it is perfectly possible; it just requires a slightly more verbose syntax. Instead of
xml.someTag
you need to do
xml.tag! "somePrefix:someTag"
where the tag! function on an XML Builder object takes a string representing the entire tag name and outputs it as-is.

It turns out the consideration of the Builder creators didn't stop there; they have functions that allow the full range of standard XML to be created. Need a CDATA section? Use xml.cdata!. Need to add comments? Use xml.comment!. Need to create a node with mixed text and child nodes? Use xml.text! like so:
xml.myNode do
  xml.myChildNode "awesome"
  xml.text! "More awesome"
end
All of these are supported by functions ending in !, which keeps them nicely separate from the tags you would typically create.

And to really prove the point that these things were considered by the Builder creators, there is now an even simpler form for using prefixes. You simply separate the prefix from the tag name, passing the tag as a symbol, to create an expression like:
xml.myPrefix :myElement, "look at this!"

or, in more usual ruby terms: if you pass a symbol as the first argument when creating a tag, Builder treats the method name as the namespace prefix and the symbol as the actual tag name.
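Putting both forms together in a self-contained script (the dc prefix here is just an example):

require 'rubygems'
require 'builder'

xml = Builder::XmlMarkup.new(:indent => 2)
xml.instruct!                                  # the <?xml ...?> declaration
xml.feed do
  xml.tag! "dc:title", "Prefixes with Builder" # explicit string form
  xml.dc :creator, "yours truly"               # symbol form: <dc:creator>
  xml.comment! "both forms produce a prefixed tag"
end
puts xml.target!                               # the generated document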

So yes, it is possible to add prefixes with Builder, and more than possible, it's simple! There's no excuse for saying it can't be done.

Monday 21 September 2009

Search and Indexing

I've had a lot of exposure to full text indexers since I started working at HedTek Ltd. From Lucene to Solr and now even Sphinx, I feel it's time to write up some of my experiences.

Lucene
Probably the best known of the indexers I've encountered, Lucene is an Apache project that aimed at (and succeeded in) implementing an efficient, simple and useful full text indexer in Java. It's a great library for creating your own search indexes and performing quick searches across them with a familiar syntax. It's also very flexible, allowing you to plug in your own functionality to index just about any kind of document.

With all that, you'd wonder why anyone would use anything else? Well, it's not all rose gardens with Lucene. Firstly, it is a low-level API designed to be the heart of an index and search engine, not a complete solution ready for use straight out of the box. Secondly, in the project where I had my initial exposure to it, the version of Lucene in use was the Zend PHP implementation. While this is an excellent idea (it allows Lucene to be used from PHP directly, with no messing around with Java interfaces), there was one key problem - performance. With the index size in use (24 million records), searches that would take under a second with the Java library took more than 10 seconds with the Zend library. This is clearly very undesirable, so other options were required.

Solr
Solr is one option for removing the need to interface with Java from your language of choice while still retaining the Java Lucene implementation. Solr calls itself an 'Enterprise search server' built on Lucene, and it fills one of the gaps I mentioned earlier - Solr is a working search engine right out of the box. It manages this feat by packaging the Lucene library as a Java web application (runnable in any servlet container, e.g. Jetty or Tomcat) and providing an HTTP interface for searching. Results can be returned as XML or JSON straight out of the box, and a whole host of other features on top of this help an ailing developer create a fully fledged search engine easily. One of the main ones is the ability to define 'schemas' that tell Solr how your records will look, adding a type system to the index and allowing malformed data to be picked up much more easily.
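As a quick sketch of how simple that HTTP interface is (assuming the example Jetty setup on its default port, with some documents already indexed):

require 'net/http'
require 'uri'

# Ask Solr for documents matching 'ruby', with results as JSON.
url = URI.parse("http://localhost:8983/solr/select?q=ruby&wt=json")
puts Net::HTTP.get_response(url).body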

Of course, for Solr you need a java server set up. This isn't always the easiest task, and there are subtleties involved that can make it a daunting prospect (I certainly found it so, and still do - I'm not a java server expert... I'm barely a novice). The schema is also a requirement, so in order to set up your server you need to describe your data in one. Not a huge imposition, but schemas are defined in an XML language that is a bit opaque to Solr newbies.

Sphinx
Sphinx is the last of the 3 I have tried, and so far I've only tried it on much smaller indexes. It is another alternative in the full text indexing marketplace and doesn't rely on Lucene. It functions as a search server (making it more comparable to Solr than to Lucene) and has several advantages over Solr:
  • It is much easier to set up. Where Solr took me over a day to figure out how to install and configure in even a basic way, Sphinx took me a bit over an hour to install and configure with a connection directly to a MySQL database.
  • It doesn't need a java server. Sphinx runs as a unix daemon, listening on a local port, which makes it much easier to set up and feels less clunky (at least to me).
  • It is very easy to set up multiple indexes. This is possible in Lucene and Solr, but Sphinx makes it trivial: you have a config file and simply define lots of indexes, which can even share the same DB connection, letting you build indexes that are optimisations of a basic one (see the config sketch after this list). That may be a simple process in Solr too, but I haven't come across it yet, making it more effort to find there than in Sphinx at the very least.
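To illustrate that last point, here's a stripped-down sphinx.conf sketch (names, paths and credentials all invented), with a second index inheriting the first source's connection settings and overriding only the query:

source posts
{
  type      = mysql
  sql_host  = localhost
  sql_user  = app
  sql_pass  = secret
  sql_db    = blog
  sql_query = SELECT id, title, body FROM posts
}

source post_titles : posts
{
  sql_query = SELECT id, title FROM posts
}

index posts
{
  source = posts
  path   = /var/data/sphinx/posts
}

index post_titles
{
  source = post_titles
  path   = /var/data/sphinx/post_titles
}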
Sphinx does have disadvantages as well, though. Its search results are less useful, containing just the document ID, whereas Lucene results can return stored fields (which can be all you need in certain circumstances, avoiding a database hit after a search). It also seems more geared towards indexing databases, whereas Solr and Lucene are more general purpose. This makes Sphinx great when you are indexing a database, but no good if you are indexing a large collection of XML files on disk or crawling web pages.


So, I haven't come across an absolute winner in the full text indexing arena, but I have come across several alternatives and all of them are suitable for different purposes. If you need something indexed quickly and in Java, use Lucene. If you need a more robust server for general purpose indexing and searching, definitely check out Solr. And if you are searching databases specifically, then Sphinx should definitely be in your list of options.

Thursday 30 July 2009

TDD: The door analogy

Recently, I was explaining what test driven development was to my wife and used a description involving the creation of a door, and I realised this may be a very good way to explain what TDD is, how it's meant to function and why it produces superior results. I've thought a bit more about the analogy and fleshed it out some, so here goes:

Peter has asked you to create a door so you go away and start writing some tests for what the door should do based on his statement of what he wants. You start initially with:
1) The door should have a handle
2) If you turn the handle and push then the door should open

So you go away and create a door that fulfills these tests. You present this to Peter, he opens the door and it does this by falling over. So you go back to the tests and add some tests you missed:
3) When the door is open, you should be able to pull on the handle and it will close
4) The door should stay upright when both open and closed

You then create this door (realising with these extra tests that you needed hinges) and present it again. Peter is happier with this new door, but then notices that if he pushes the door without turning the handle it still opens. This is another missed test so you add it to your tests:
5) If you push on the door without turning the handle the door should stay closed

You then create a door, adding a latch that retracts when you turn the handle, and test again. When running the tests you notice that, half the time, tests 3 and 5 are failing, and you realise it's because of the construction of the latch: if the door opens in one direction, the latch won't retract automatically when pulling the door closed. You go back to Peter, say you need to clarify what he wants to happen, and present him with the following alternatives:
1) The door can only open in one direction so you need to push the door from one side and pull the door from the other
or
2) In order to close the door you must turn the handle in order to manually retract the latch and close the door fully

Peter considers this and says he wants option one. This then causes a rewrite of the test cases to the following:
1) The door should have a handle
2) The door should only open in one direction
3) To open the door twist the handle and either push or pull. Only one of these should work depending on which side of the door you are on
4) When the door is open, you can close the door by performing the opposite action to the one used to open it
5) The door should stay upright when both open and closed
6) If you attempt to open the door without turning the handle the door should stay closed

You rewrite the tests and then run the previously constructed door through them to see where the problems are. This time, tests 2, 3 and 6 fail. Looking at the first of these, you see that the door opens in both directions, which is now a violation of the tests, so you add a ridge to the door frame that stops it opening one way and re-run the tests. This time no tests fail; you present your door to Peter one last time, and he is happy with it and installs the door all over his house.
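Out of the analogy and into code: here's roughly how some of those tests might look in RSpec, against a deliberately silly Door class I've invented to make the example runnable:

class Door
  def initialize
    @open          = false
    @handle_turned = false
  end

  def turn_handle
    @handle_turned = true
  end

  # The door only opens one way (test 2): a push from the inside, with
  # the handle turned (test 6), swings it open; a push from the outside
  # hits the ridge on the frame.
  def push(from)
    @open = true if @handle_turned && from == :inside
    @handle_turned = false
  end

  # Closing is the opposite action to opening (test 4).
  def pull(from)
    @open = false if from == :inside
  end

  def open?
    @open
  end

  def closed?
    !@open
  end
end

describe Door do
  before(:each) { @door = Door.new }

  it "should stay closed when pushed without turning the handle" do
    @door.push(:inside)
    @door.should be_closed
  end

  it "should open when the handle is turned and it is pushed from the inside" do
    @door.turn_handle
    @door.push(:inside)
    @door.should be_open
  end

  it "should not open from the outside" do
    @door.turn_handle
    @door.push(:outside)
    @door.should be_closed
  end

  it "should close again when pulled from the inside" do
    @door.turn_handle
    @door.push(:inside)
    @door.pull(:inside)
    @door.should be_closed
  end
end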

So with this process you have several iterations, and each one improves the door. More importantly, each iteration adds tests which show that the door has improved. Now, most people would think 'but a door is obvious, the final version is how I'd have created it initially', but consider: what if Peter had wanted a door that opened in both directions and required you to turn the handle to shut it? You would have created a door that didn't do that and never identified the point where it was required. You would have just given Peter his door, and he would have gone away less satisfied.

Also, this is a high level example. Consider what you'd do if you didn't know what a door was. You'd look at some doors and create something that looked similar, with no way of knowing whether it was correct. It might work initially, but after some improvements it suddenly starts falling over. Now you are stuck, with no clue about why it's falling over or what it was meant to do in the first place. So you go 'Right, it shouldn't fall over' and just prop it up so it won't fall over... but then the door doesn't open, and you have an annoyed customer. If you had your tests, you could point to what it was meant to do (e.g. test 5, 'stay upright'), prop it up, and then, when retesting, spot that other tests now fail. So you look at the problem some more and come up with the solution that the door needs a third hinge to reinforce it and stop it falling over after some use. You also add to your test cases:
test 7 - The door should be able to be opened and shut multiple times without falling over
and present the new door to your customer, who is now delighted that you've solved the problem properly.

The analogy is probably a bit strained by now, but the principle still holds... the tests are there for more than just 'testing' the system. They are a verification, a safety net, your specification, and your guide in an area you may not know much about. If you have something that isn't working correctly (due to a lack of understanding, for example), you should identify which test(s) are testing things incorrectly and modify them to test for the correct behaviour. You then re-run these tests *without* changing any code (even if you 'know' what's wrong), so that you verify first that the new tests are failing. If you modify your tests and they still pass, then your tests are still wrong (as the program still has incorrect behaviour); and if you modify your tests and your code together and the new tests pass, you can't tell whether your new code works perfectly or your modified tests are incorrect.

Now, I know this is all standard fare to avid TDD people, but I'm still getting up to speed on the methodology, and the reasons I'd avoided it are:
1) Difficulty of testing - Big things are hard to test, but the little things are trivial and seem like they don't need testing... test them anyway. You never know if you will find a bug there, and by testing the little things you can then break your big things down into smaller tests *you've already written*, plus a test for the small bit of new behaviour.
2) Benefits - Until you really think about the process and break down where the tests come in, TDD seems like a silly reversal. Why would it have advantages? Of course, with the description above, the advantage is that you identify problems more quickly (in the third iteration, you immediately spot the test failures when adding the latch and ask the customer what he wants once you realise the test cases are mutually exclusive). If you have no tests, or your tests are incidental things written after you finished creating your door the way you thought it should work, you lose this benefit. There are other benefits that I'm getting clearer on, but they are left as an exercise for the reader ;)
3) 'But some things can't be tested' - This is a common concern, and it is false. Some people see the UI as untestable, but there are now plenty of tools that let you test the UI in its entirety. And before you get to testing the full UI, you have a lot of components building it up, and these CAN be tested. You can test that they change state as expected when called with fake input; you can get a component to draw to a bitmap and check it against a pixel-perfect bitmap of how it *should* look. So you can verify every step of the way and build bigger tests from well-tested components, making this exactly the same as reason 1.

Those are the big reasons for me, and they are very much false reasons. I'm starting to get on board the TDD bandwagon, and in future I intend to have much better tests and to write them before writing my code :) Of course, if I sometimes fail it's not the end of the world, but I'll know what to blame when things start going wrong.