
New web content engine: Part 4

It's done!

If you can read this, the new engine is live. One of my ex-cow-orkers used to joke that this would take me over a year... and it did, although mostly because I stopped working on it for a while.

Last weekend, it reached the point where I could view parts of the site and do useful things, with only a few bits left to get working. All of my requirements for it were met -- I could take a bare machine, run the setup script, create pages through the interface, and edit them later, and everything I needed to reproduce the existing functionality was in place.

So, what is it? It's:

Buzzwords ending in "programming" that this uses:

The general model is that every node in the site's structure is a single record in a SQL table, except that I'm using PostgreSQL's table inheritance so that node types can inherit properties from one another.

It uses naked objects instead of MVC.

There's still a lot of room for it to grow. It's missing several necessary-but-not-critical subsystems, so it hasn't replaced all of the tools I want it to, like my protected personal wiki, nor does it do the new and impressive things I want it to do.

However, it does fulfill one of my major goals, which is that it's a totally useful tool for updating my site more often. I've been looking through the site using the control panel and tweaking pages here and there. Ideally, before it hits beta, it'll be able to track its own bugs and feature requests. :)

XML Library

The one problem with Ruby, at the moment, is that libraries that aren't part of the Rails experience aren't getting much support. I have one big example of this. My engine does not use embedded Ruby templates. It might in the future, but at the moment it doesn't. I made the reasonable assumption, when I got started, that if the formatting system was written to use a lot of XML, the performance of the XML parsing library would be important.

So I used the Ruby bindings for libxml. libxml is a fast bit of C code that tends to blow away the performance of any pure-Ruby XML library. Whoever first made a go at the bindings ignored a bunch of issues, like namespaces, so they don't quite match the native C API and don't quite work in general. They're not quite ready for prime time, so as I was finishing things up, I ended up having to rip out the libxml code and use REXML instead.

In my mind, this is a tragedy. Ruby makes interfacing with C really easy. One major and unappreciated reason why you should use Ruby is that you can write the complete application in Ruby and then replace bits of it with C as needed. I have seen people who tried to do that with other languages and it just isn't pretty.

The problem is that Ruby still needs to grow outside of just the bounds of Rails. It's a beautiful tool for many other applications!

But it gets worse

I had been pointedly ignoring performance because I didn't want to optimize things until all of the pieces were in place. But now all of the pieces were in place...

Overall, the code is "fast enough", even the parts that gave me pause. Except for one fairly nasty thing... the XML interpolation engine.

It's one function that walks over the entire XML tree of the template, looks for specific node types, processes them, and replaces them with the result. It wasn't too bad when I was using libxml. But now that I'm using REXML, it's dog-slow.
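As a rough illustration of that kind of walker (not the engine's actual code), here is a minimal sketch using REXML. The element names `site-title` and `upcase` and the `HANDLERS` table are made up for the example; the real engine's node types aren't listed here.

```ruby
require 'rexml/document'

# Hypothetical handlers keyed by element name. Each one receives the
# matched element and returns the text that should replace it.
HANDLERS = {
  'site-title' => ->(_el) { 'WireWorld' },
  'upcase'     => ->(el)  { el.text.to_s.upcase }
}.freeze

# Walk the tree depth-first. Elements with a handler are replaced by a
# text node holding the handler's result; everything else is descended into.
def interpolate!(parent)
  # Iterate over a copy of the child list, because the tree is mutated
  # while we walk it.
  parent.children.dup.each do |child|
    next unless child.is_a?(REXML::Element)

    handler = HANDLERS[child.name]
    if handler
      parent.replace_child(child, REXML::Text.new(handler.call(child)))
    else
      interpolate!(child)
    end
  end
  parent
end

template = '<page><h1><site-title/></h1><p><upcase>hello</upcase> world</p></page>'
doc = REXML::Document.new(template)
interpolate!(doc.root)
puts doc.root
```

Every request pays for a full pass over the template like this one, which is why the speed of the underlying XML library matters so much.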

I discovered two things about using REXML for fairly sophisticated XML manipulation.

First, XPath expressions are fairly slow. I found that I could make it 50% faster by replacing all of the XPath expressions with loops that iterate over the child nodes. I'm pretty sure that this would be reversed were I using a native-code XML library.
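To make the two styles concrete, here is a small sketch of the same lookup written both ways. The `widget` element and the method names are invented for the example, and it makes no claim about timings; those depend on the template and the REXML version.

```ruby
require 'rexml/document'
require 'rexml/xpath'

# XPath version: short, but REXML parses and evaluates the expression
# in pure Ruby on every call.
def widget_ids_via_xpath(root)
  REXML::XPath.match(root, 'widget').map { |el| el.attributes['id'] }
end

# Child-iteration version: `children` is a plain array of nodes, so this
# is an ordinary Ruby loop with no XPath machinery involved.
def widget_ids_via_children(root)
  root.children
      .select { |node| node.is_a?(REXML::Element) && node.name == 'widget' }
      .map { |el| el.attributes['id'] }
end

doc = REXML::Document.new('<page><widget id="a"/><p>text</p><widget id="b"/></page>')
p widget_ids_via_xpath(doc.root)
p widget_ids_via_children(doc.root)
```

Both methods look only at direct children, so they return the same ids in document order.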

Second, REXML's parsing is slow enough that fine-grained caching doesn't help. I tried a few caching algorithms that worked on pieces of the XML document, and it turns out that any speed advantage from caching is wiped out by the need to re-parse those pieces in order to feed them back into REXML.

The end result is that the system has a page cache on the resulting documents and that's about it, at least until I have the time to find a good native-code XML library and splice it in. I've been wanting to use ruby-xml-smart for a while now, but I wanted to get everything else working before I screw with it.

Oh, and I posted the code for the cache, so others can use it.
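For readers who just want the shape of the idea, here is a minimal sketch of a whole-page cache. This is not the posted code; the `PageCache` class and its method names are invented for illustration. It stores the finished document per path, so nothing ever has to be re-parsed on a hit.

```ruby
# Hypothetical page cache: keeps the final rendered document for each path.
class PageCache
  def initialize
    @pages = {}
  end

  # Return the cached page for +path+. On a miss, run the block to render
  # the page, store the result, and return it.
  def fetch(path)
    @pages.fetch(path) { @pages[path] = yield }
  end

  # Drop a page (for example after it is edited) so the next request
  # renders it again.
  def expire(path)
    @pages.delete(path)
  end
end

cache = PageCache.new
page = cache.fetch('/hacks') { '<html>expensive render</html>' }
same = cache.fetch('/hacks') { raise 'not reached: served from cache' }
puts page.equal?(same)
```

Caching the output string rather than pieces of the XML tree sidesteps the re-parse cost described above, at the price of having to expire a page whenever anything it depends on changes.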

Why so much XML?

Now, one might ask why I'm using so much XML. Well, note that I'm coding this for my own intellectual curiosity, so I have free rein to do things differently. Primarily, I wanted to make generating XML formats feel easy. When I started, it seemed like the future of the web was that you would generate your pages as semantically meaningful XML and let XSLT format them for the user, so that one day you could turn off the server-side XSLT and rely on the client's XSLT.

I actually wrote a complete XSLT-based system that would accept a limited HTML-like vocabulary and recreate it, adding drop-shadows and the like, all generated through a build script. And the page would be semantically enabled, so you'd talk in terms of a picture that would have properties you could generate and different sizes, instead of just loading images.

The first cut at the new system took that XSLT-based model and added a few bits here and there so that it would work more dynamically. Partway through, I realized that a few bits could be coded more neatly if I scanned through the XML document and replaced some elements... and then I realized that it would work better if I did that some more.

Then I put the code aside for months. When I got back to it, I realized that it wasn't going to work very well.

So now I'm in a halfway step. I ripped out the XSLT and changed the existing tool that scanned the document for specific elements into a templating engine.

My thought is that I'll be able to abstract some fairly sophisticated operations that I'd like to do this way. See, the part of the code that generates the "what's changed" list on the front page uses the same processing path as the generation of the Atom feed.
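A sketch of what sharing a processing path can look like, under heavy assumptions: the `Change` struct, the method names, and the five-item limit are all invented, and the Atom output is only a fragment (a valid feed also needs its own id, title, and updated elements). The point is just that both views start from the same selection step.

```ruby
require 'rexml/document'

# Hypothetical record standing in for a row from the node table.
# +updated+ is an ISO 8601 string, so plain string sorting orders by time.
Change = Struct.new(:title, :path, :updated)

# Shared step: newest changes first, capped at +limit+.
def recent_changes(changes, limit = 5)
  changes.sort_by(&:updated).reverse.first(limit)
end

# Front-page view: an HTML list of links.
def changes_as_html(changes)
  list = REXML::Element.new('ul')
  recent_changes(changes).each do |c|
    list.add_element('li').add_element('a', 'href' => c.path).text = c.title
  end
  list
end

# Feed view: Atom entries built from the same selection.
def changes_as_atom(changes)
  feed = REXML::Element.new('feed')
  feed.add_namespace('http://www.w3.org/2005/Atom')
  recent_changes(changes).each do |c|
    entry = feed.add_element('entry')
    entry.add_element('title').text = c.title
    entry.add_element('link', 'href' => c.path)
    entry.add_element('updated').text = c.updated
  end
  feed
end

changes = [
  Change.new('Old page', '/old', '2007-01-01T00:00:00Z'),
  Change.new('New page', '/new', '2007-02-01T00:00:00Z')
]
puts changes_as_html(changes)
puts changes_as_atom(changes)
```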
