ruby-xml-smart in Rm

I’m not done yet, but my benchmark that used to take 7.3 seconds now takes 5.2 seconds. That is only about 40% faster, though the figure does include the time spent loading all of the libraries. I’m partway through removing all of the REXML from my formatting code in Rm and replacing it with ruby-xml-smart. The nice part is that I can turn any existing REXML generation code that I haven’t gotten to yet into a string and then parse it back on the ruby-xml-smart side. So I expect I’ll get even better speed out of it soon enough. This is only a few hours of effort so far.
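Here is a minimal sketch of that string hand-off. The names `legacy_fragment`, `hand_off` and `stand_in_parser` are made up for illustration and are not from Rm. Since I don't want to guess at ruby-xml-smart's exact calls here, the "fast side" is a plain lambda that just re-parses with REXML; in practice you would pass in whatever string-parsing entry point the other library provides.

```ruby
require "rexml/document"

# Hypothetical legacy code that still builds its output with REXML.
def legacy_fragment(title)
  doc = REXML::Document.new
  root = doc.add_element("section")
  root.add_element("title").text = title
  doc
end

# Hypothetical hand-off: serialize the REXML tree to a string and give
# that string to whichever parser the rest of the pipeline uses.
def hand_off(rexml_doc, parser)
  parser.call(rexml_doc.to_s)
end

# Stand-in for the new library's "parse from string" call (an assumption;
# here it simply re-parses with REXML so the sketch runs on its own).
stand_in_parser = ->(xml) { REXML::Document.new(xml) }

reparsed = hand_off(legacy_fragment("Hello"), stand_in_parser)
puts reparsed.root.elements["title"].text
```

The point is only that the boundary between old and new code is a string, so unconverted REXML code can stay as it is until you get around to it.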

See, I started out using libxml-ruby for the formatting system. But they had neglected to add certain features to the library at the beginning (like namespaces) and I found that various things were just not implemented… and when I tried to hack the functionality in, it started having memory problems. So I went to REXML which I figured might be slow because it’s pure ruby… but at least it would work.

It turns out that REXML is, in fact, astonishingly slow, which I covered for by adding a caching layer.

The part that really sucks is that, to get even reasonable performance, it forces you to write code in a fairly baroque fashion. You end up iterating over pieces of XML yourself, which tends to make XML handling functions a little big. Now, if you were writing a script to parse config files or whatnot, I don’t think this would be a huge problem. But for generating pages that need to be returned to the user fairly quickly, this becomes a problem, and it really isn’t what REXML is intended for, nice as it can be.
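To show what I mean by iterating yourself, here is a small sketch using only REXML. The sample document and variable names are invented for the example. Both versions collect the same text; the second is the hand-rolled walk you tend to end up with when you are trying to avoid XPath calls.

```ruby
require "rexml/document"

page_xml = <<~PAGE
  <page>
    <para class="intro">First</para>
    <para>Second</para>
    <para class="intro">Third</para>
  </page>
PAGE

doc = REXML::Document.new(page_xml)

# The short version: let XPath find the nodes.
via_xpath = REXML::XPath.match(doc, "//para[@class='intro']").map(&:text)

# The hand-rolled version: walk the children and filter them yourself.
via_walk = []
doc.root.each_element do |el|
  via_walk << el.text if el.name == "para" && el.attributes["class"] == "intro"
end

puts via_xpath.inspect
puts via_walk.inspect
```

With one filter the difference is small, but every extra condition or level of nesting adds another loop to the manual version.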

So now I’m switching over to ruby-xml-smart, which is a newer and more ruby-tastic attempt to wrap libxml for Ruby. And it works. I’ve got some issues with it, at the moment. I wrote two emails to the author of it before I realized I had answered my own questions and then deleted the messages unsent.

It needs some polish. The documentation’s not there yet, so I spent a lot of time reading the source. But the idea seems sound, there are no mysterious crashes, and most of the functionality is already there.

I have to remind myself that things that are fairly expensive operations with REXML (like parsing an XML document from a string or following an XPath expression) are not nearly as bad when optimized C code is in use instead.
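If you want to see what those two operations cost on your own machine, a rough sketch with the standard `benchmark` library looks like this. The document size and the XPath expression are arbitrary choices for the example, and I'm deliberately not quoting any numbers, since they depend heavily on the Ruby version and the input.

```ruby
require "benchmark"
require "rexml/document"

# Build a moderately sized document as a string (size chosen arbitrarily).
items = (1..200).map { |i| "<item id=\"#{i}\">value #{i}</item>" }.join
xml = "<list>#{items}</list>"

doc = nil
parse_time = Benchmark.realtime { doc = REXML::Document.new(xml) }

matches = nil
xpath_time = Benchmark.realtime do
  matches = REXML::XPath.match(doc, "//item[@id='150']")
end

puts format("parse: %.4fs, xpath: %.4fs", parse_time, xpath_time)
```

Running the same two measurements against a libxml-backed parser is the comparison that matters here.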

My gripes so far are:

  • There’s no good way to create a document solely through the API. Most of the time, it’s probably just as fast to make an XML document by concatenating strings. Sometimes, it’s easier to generate things programmatically, probably through a builder interface. (I may have missed exactly how to do this, however.)
  • Not everything is properly duck-typed. I found myself calling has_attr? instead of has_key?, and some objects have an each{} iterator and can be indexed by integer but have no each_index{}.
  • There’s no way to take a node object and tell it to delete itself. You can tell a parent to delete an object from the tree, but you can’t tell an object to delete itself.
  • The error messages in REXML are more helpful when you toss it clearly invalid XML. ruby-xml-smart will just tell you that you passed it bogus XML. REXML will tell you exactly where.
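For comparison, here is roughly what the last two points look like on the REXML side. This is a sketch of REXML behavior only, with invented sample documents; it says nothing about what ruby-xml-smart's own calls look like.

```ruby
require "rexml/document"

# Deleting nodes: REXML lets you go through the parent or ask the node itself.
doc = REXML::Document.new("<list><item>a</item><item>b</item></list>")
first = doc.root.elements[1]          # REXML element indexes start at 1
doc.root.delete_element(first)        # parent removes the child
doc.root.elements[1].remove           # the node removes itself
remaining = doc.root.elements.size

# Error reporting: a mismatched tag raises REXML::ParseException, and the
# exception carries the position of the problem.
error = nil
begin
  REXML::Document.new("<a>\n  <b>\n</a>")
rescue REXML::ParseException => e
  error = e
end

puts "elements left: #{remaining}"
puts error.message
puts "reported line: #{error.line}"
```

Being able to print that line number is what makes the REXML messages easier to act on.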

None of these flaws are really showstoppers. Some of the code is just a little clunkier than I’d like. On the other hand, the overall clunkiness of the code has gone down once you take into account all of the REXML workarounds…