A practical use of the RDF model

Every time I hear about an RDF-based web idea, I always check it out to see if it makes intuitive sense to a casual observer. And, so far, I haven’t found any. The problem is that RDF is this huge abstract thing, and there’s a lot of AI resear…ehrm… semantic web work associated with it.

If you remember from the first article in this series, RDF and the related “semantic web effort” is a fairly verbose XML standard, with several textual alternates, to encode a single thing:

Subject -- Verb --> object

and then be able to “reason” around a collection of those sets electronically. Because there are three things here, they call each one a “triple”.

The problem is, most of these things are sufficiently vague that your average disinterested casual observer isn’t going to care. How can I fill these pieces with values that actually make somebody able to do something useful?

I tend to believe that various simplifications of RDF-like models can work well. For example, take flickr. The subject is the image, the verb is “has the tag”, and the object is whatever tag. Write code around that notion, shake, mix well, and you have a flower, kitten, and boobie picture publishing behemoth. The blessing and curse of the flickr model is that it works just well enough to be a popular and useful idea, but you end up with tricky holes in the model.

It’s pretty clear that if our human language lacked the ability to add words between the subject and object, a lot of ideas would be hard to express; it’s very good that I can differentiate between a person who has a donkey and a person who is an ass. The trick, of course, is that you need to make it just comfortable enough that people use it.

The closest implementation I’ve seen so far didn’t actually use any RDF. It just gave you a section of “metadata” that was of the form “Verb:Subject = Object”, the same idea as Machine Tags.

I’ve been thinking about this, because I’d like to add a lot of fairly practical pieces of information to the engine in a way that is generalizable for future functionality, yet not too much of an obstruction to getting things done for a recreational project.

My first thought is that it’s actually a useful abstraction to add some notion of verbs, largely because this means I can share code. For example, let’s take a made-up blog posting and its metainformation:

Title: Things I found in my junk drawer that scare me
Author: Me
Posted: 2008-08-08 08:08:08
Modified: 2008-08-08 08:08:08
Tags: junkdrawer, funnystuff
Access control: I can modify it, anybody can view it, authenticated users can comment
Links: www.xkcd.com, another blog post

I could tell you about the time that I was chatting with a buddy on the phone and he realized that he was storing napalm under his bed, but I won’t. Instead, let’s break things down into RDF-like data:

Page - has title -> Things I found in my junk drawer that scare me
Page - has author -> Me
Page - was posted -> 2008-08-08 08:08:08
Page - was modified -> 2008-08-08 08:08:08
Page - has tag -> junkdrawer
Page - has tag -> funnystuff
Page - can be modified by -> me
Page - can be viewed by -> anybody
Page - can be commented on by -> authenticated users
Page - links to -> www.xkcd.com
Page - links to -> another blog post
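Here’s a rough sketch of how that breakdown might look as plain data, in Python. The `Triple` name, the `objects_of` helper, and the `page:junk-drawer` identifier are all made up for illustration; nothing here is the engine’s actual code.

```python
from collections import namedtuple

# Hypothetical container: one (subject, verb, object) statement.
Triple = namedtuple("Triple", ["subject", "verb", "obj"])

PAGE = "page:junk-drawer"  # made-up identifier for the example post

triples = [
    Triple(PAGE, "has title", "Things I found in my junk drawer that scare me"),
    Triple(PAGE, "has author", "Me"),
    Triple(PAGE, "was posted", "2008-08-08 08:08:08"),
    Triple(PAGE, "has tag", "junkdrawer"),
    Triple(PAGE, "has tag", "funnystuff"),
    Triple(PAGE, "links to", "www.xkcd.com"),
]

def objects_of(triples, subject, verb):
    """Return every object stated for a given subject and verb, in order."""
    return [t.obj for t in triples if t.subject == subject and t.verb == verb]

tags = objects_of(triples, PAGE, "has tag")
```

The same verb can show up more than once for a subject (two tags, two links), which is why the helper returns a list rather than a single value.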

The most singularly useful feature that we want to enable is making faceted queries easier. For example, on a blog with tags, you might be able to search for articles with the tag “junkdrawer” or for articles posted in Aug 2008, but not every blog will let you search for articles with the tag “junkdrawer” that were also posted in Aug 2008. Most sites with tags force you to search on only one thing at a time. So, assuming that my arguments don’t fall apart in the end, let’s call the ability to store these “triples” and build queries on them a useful feature.
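To make the “junkdrawer and Aug 2008” query concrete, here’s a minimal sketch in Python. It assumes the triples are simple tuples and that dates have already been turned into `datetime` values; the post identifiers and the `faceted` helper are invented for the example.

```python
from datetime import datetime

# Made-up sample data: (subject, verb, object) tuples for three posts.
triples = [
    ("post:1", "has tag", "junkdrawer"),
    ("post:1", "was posted", datetime(2008, 8, 8, 8, 8, 8)),
    ("post:2", "has tag", "junkdrawer"),
    ("post:2", "was posted", datetime(2008, 9, 1, 12, 0, 0)),
    ("post:3", "has tag", "funnystuff"),
    ("post:3", "was posted", datetime(2008, 8, 20, 9, 30, 0)),
]

def subjects_where(triples, verb, predicate):
    """Subjects with at least one `verb` triple whose object passes `predicate`."""
    return {s for s, v, o in triples if v == verb and predicate(o)}

def faceted(triples, *facets):
    """Intersect several (verb, predicate) facets; a subject must match all."""
    results = [subjects_where(triples, verb, pred) for verb, pred in facets]
    return set.intersection(*results) if results else set()

hits = faceted(
    triples,
    ("has tag", lambda tag: tag == "junkdrawer"),
    ("was posted", lambda d: (d.year, d.month) == (2008, 8)),
)
```

Each facet is answered on its own and the results are intersected, so adding a third condition (say, an author) is just one more argument.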

Now, this is not trivial. You need to figure out how to encode the “verbs” and “objects” in a useful way so that the date is more than just a string.
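One way this could go, sketched in Python: let each verb declare how its object should be parsed, so “was posted” always holds a real date. The `VERB_TYPES` registry and `make_triple` function are hypothetical names, and the date format is simply the one from the example post.

```python
from datetime import datetime

DATE_FORMAT = "%Y-%m-%d %H:%M:%S"  # the format used in the example post

# Hypothetical registry: each known verb names a parser for its object.
VERB_TYPES = {
    "has title": str,
    "has tag": lambda raw: raw.strip().lower(),
    "was posted": lambda raw: datetime.strptime(raw, DATE_FORMAT),
    "was modified": lambda raw: datetime.strptime(raw, DATE_FORMAT),
}

def make_triple(subject, verb, raw_object):
    """Build a triple, converting the raw string with the verb's parser."""
    if verb not in VERB_TYPES:
        raise KeyError(f"unknown verb: {verb!r}")
    return (subject, verb, VERB_TYPES[verb](raw_object))

posted = make_triple("post:1", "was posted", "2008-08-08 08:08:08")
tag = make_triple("post:1", "has tag", " JunkDrawer ")
```

Rejecting unknown verbs is a design choice, not a requirement; it trades flexibility for catching typos like “has tags” early.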

There’s more. Let’s say I write up a page that lets me type in the content of the blog post in one box and then add any metainformation to the other box. Clearly, this will not work, because I’m probably not going to bother going through my post after I’m done to find the links. I just want to type in my post, maybe add a title and some tags, and get on with my life.

So, clearly, if you are going to use the notion of triples, you need some way to automatically generate them in many cases and properly maintain them. This has to be bulletproof. If I edit the article and remove the XKCD link, the stored triples need to reflect the dropped link. If I delete the article, I don’t want the extra triples hanging around.
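A minimal sketch of that bookkeeping, in Python, might regenerate the “links to” triples from the post body on every save and drop everything on delete. The `TripleStore` class and its method names are invented here, and the regular expression is a deliberately crude stand-in for a real markup parser.

```python
import re

# Crude URL pattern, for illustration only; a real engine would parse the markup.
LINK_RE = re.compile(r"https?://[^\s<>\"')]+")

class TripleStore:
    """Hypothetical in-memory store of (subject, verb, object) tuples."""

    def __init__(self):
        self.triples = set()

    def sync_links(self, page, body):
        """Replace the page's derived 'links to' triples with what the body says now."""
        self.triples = {
            t for t in self.triples
            if not (t[0] == page and t[1] == "links to")
        }
        for url in LINK_RE.findall(body):
            self.triples.add((page, "links to", url))

    def delete_page(self, page):
        """Remove every triple whose subject is the deleted page."""
        self.triples = {t for t in self.triples if t[0] != page}

store = TripleStore()
store.triples.add(("post:1", "has tag", "junkdrawer"))
store.sync_links("post:1", "See http://www.xkcd.com and http://example.org/other for more")
after_first_save = set(store.triples)
store.sync_links("post:1", "Only http://example.org/other is left")
after_edit = set(store.triples)
store.delete_page("post:1")
```

Throwing away and rebuilding the derived triples on each save is simpler than diffing old against new, and it leaves hand-entered triples like tags alone.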

Oh, and let’s talk about the access control. Do you really want to deal with tagging each individual page on the site with the access rights? Some of the metainformation will apply to a whole group of pages rather than being stated page by page.
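One possible shape for this, sketched in Python: state the access rules once on a group, and let pages fall back to their group unless they say otherwise. The “is in” verb, the `section:blog` identifier, and the `effective` helper are assumptions made up for this example, and the sketch assumes the groups never form a cycle.

```python
# Hypothetical data: rules live on a section, and pages point at the section.
triples = [
    ("section:blog", "can be viewed by", "anybody"),
    ("section:blog", "can be commented on by", "authenticated users"),
    ("section:blog", "can be modified by", "me"),
    ("post:1", "is in", "section:blog"),
    ("post:2", "is in", "section:blog"),
    ("post:2", "can be viewed by", "me"),  # page-level override
]

def lookup(triples, subject, verb):
    """Objects stated directly on the subject for this verb."""
    return [o for s, v, o in triples if s == subject and v == verb]

def effective(triples, subject, verb):
    """Statements on the subject win; otherwise inherit from its groups."""
    own = lookup(triples, subject, verb)
    if own:
        return own
    inherited = []
    for parent in lookup(triples, subject, "is in"):
        inherited.extend(effective(triples, parent, verb))
    return inherited

viewers_1 = effective(triples, "post:1", "can be viewed by")
viewers_2 = effective(triples, "post:2", "can be viewed by")
commenters_2 = effective(triples, "post:2", "can be commented on by")
```

Whether a page-level statement should replace the group’s rule or add to it is a policy decision; this sketch picks “replace” only because it is the simplest to show.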

So, we just reached the important question: is all of this crap really worth it?

Well, clearly even if I implement the access control system without using a universal concept of triples, I’m going to have to deal with that case. Likewise, if I want to track links, I’m going to end up needing to cache those values. So most of these sticky problems need to be solved anyway.

I’m carefully avoiding any semblance of RDF or even any of the non-XML ways to represent RDF. There’s a bunch of extra stuff necessary to describe all of the parameters in a totally unambiguous way. RDF/OWL/etc. will be able to express a bunch of stuff that my model won’t, and it’ll be able to reason around them in funky ways. I don’t care! If you want to hook this up to a triple store, you should be able to generate RDF out of the less-flexible structure with not much extra effort.
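As a hedged illustration of that last claim, here’s a small Python sketch that dumps simple tuples as N-Triples, one of the plain-text RDF formats. The `http://example.org/` namespace is a placeholder, the function names are made up, and every object is written as a plain string literal, which a real export would want to refine (links should probably become IRIs, dates typed literals).

```python
from urllib.parse import quote

BASE = "http://example.org/"  # placeholder namespace, an assumption for this sketch

def iri(name):
    """Turn a local name like 'has tag' into an IRI under the placeholder namespace."""
    return "<" + BASE + quote(name.replace(" ", "_"), safe="") + ">"

def literal(value):
    """Write a value as a plain N-Triples string literal, escaping special characters."""
    escaped = (
        str(value)
        .replace("\\", "\\\\")
        .replace('"', '\\"')
        .replace("\n", "\\n")
        .replace("\r", "\\r")
    )
    return '"' + escaped + '"'

def to_ntriples(triples):
    """One 'subject predicate object .' statement per line."""
    return "\n".join(f"{iri(s)} {iri(v)} {literal(o)} ." for s, v, o in triples)

output = to_ntriples([
    ("post:1", "has tag", "junkdrawer"),
    ("post:1", "has title", 'Things I "found"'),
])
```

The simple structure loses nothing by staying simple; the extra precision only has to be added at the point where somebody actually wants RDF out of it.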