
The big problem with RDF

So, in the previous article, I defined some terms and explained the basics of RDF and the Semantic Web. Now I'd like to dive into what didn't happen with RDF and the Semantic Web.

Every now and again, people get this big idea that we need every last bit of possible information machine-encoded, so that computers can perform various interesting yet completely vacuously defined tasks. The first time around they called it "AI," but that died a nasty death in the eighties, so it keeps needing to come back under a new name. And it always does come back, because the basic idea of asking a computer a question... any question... and getting a useful answer has a powerful appeal.

The latest big and notable attempt at this is the Semantic Web effort, which gave us RDF, OWL, and a bunch of other related technologies, all oriented towards building a complete artificial intelligence suite defined in XML. The goal is to talk the people who are putting content on the web (because, face it, if you need a document corpus of any sort, the Internet is the easiest place to gather one) into encoding their information in a useful fashion so that you can do interesting things with it.

Two ways to format a document

Originally, XML was supposed to be that universal system. It never quite got there, but it did do some amount of good. The existence of HTML made people quite comfortable with the concept of angle-bracket-based tags for representing data. Folks recast a lot of SGML formats into XML, and a lot of folks started storing their data in XML instead of in a text format that was slightly different for each program -- I've seen quite a few different implementations of the Windows INI file over time and most of them suck. So XML suddenly had loads of momentum, because you could check off a buzzword and make your life a little easier at the same time.

Then along came the Semantic Web effort, where a bunch of people wanted to bring a lot of AI ideas into the mix. Between the perceived power of XML and the need to generate triples, they cooked up RDF, which can be serialized as XML but otherwise has nothing in common with it. You need another layer on top of the XML parser to actually do anything useful with the RDF data. If anything, trying to represent RDF as XML made things harder. If you rewrite RDF in the "Turtle" notation instead of in XML, it actually starts to make sense in a way that the XML serialization doesn't.
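To make that concrete, here's a single statement -- "this article has the title 'The big problem with RDF'" -- in both serializations. The subject URI is made up for illustration; the Dublin Core vocabulary is real. First, the RDF/XML version:

```xml
<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
         xmlns:dc="http://purl.org/dc/elements/1.1/">
  <rdf:Description rdf:about="http://example.org/big-problem-with-rdf">
    <dc:title>The big problem with RDF</dc:title>
  </rdf:Description>
</rdf:RDF>
```

And the same triple in Turtle:

```turtle
@prefix dc: <http://purl.org/dc/elements/1.1/> .
<http://example.org/big-problem-with-rdf> dc:title "The big problem with RDF" .
```

The Turtle version reads like what it is -- subject, predicate, object -- while the XML version buries the triple inside machinery that an XML parser alone can't interpret.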

Meanwhile, on the practical end of things, changes have happened. Useful changes, with practical applications. We've moved from badly-formed HTML with browser-specific behavior to a rather impressive amount of well-formed XML data available if you look carefully. We still don't have ontologies, but we do have roughly defined folksonomies. And none of this uses RDF -- in fact, most new standards I've watched people cook up have specifically rejected suggestions to use RDF.

There's a window of time in which you can convince somebody to do something before they start ignoring you. Microformats delivered useful items within two years. RSS was a little unusual in that it plodded along for a few years with no big patron before it became totally integrated into our lifestyle. If you can't deliver something useful within that window, you will be replaced by a technology that's more limited but immediately useful.

For example, take Technorati. At the core, it's a tag-based folksonomy for blog entries. It could have been implemented as a set of RDF statements in a page, of the form {blog entry} -- has tag --> {tag}. It could then have become a generalized tool for simple RDF triple queries, while still presenting an easy tag-based interface with the addition of trust and relevance metrics.

But instead, they made it work without any notion of RDF or the Semantic Web -- just the XML RSS and Atom formats and microformats -- because they could convince people to add tags in a way they never could have convinced them to add RDF.
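The mechanism they settled on is the rel-tag microformat: an ordinary, visible link whose rel attribute marks it as a tag. The URL below is illustrative, but the shape is the real thing:

```html
<!-- A human-readable link that doubles as machine-readable tag
     metadata via rel="tag". The last path segment is the tag. -->
<a href="http://technorati.com/tag/semantic-web" rel="tag">semantic-web</a>
```

That's the whole encoding -- no triples, no extra document, nothing a blog author wouldn't already write by hand.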

Conversion

I keep an eye on these things, waiting for somebody to make an application or a demo that makes everything make perfect sense. But I haven't seen a truly useful fully-semantic application yet. The closest thing I've seen is a generalized tool that converts scavenged scraps of semantically-tagged information in other formats into RDF.

This can be made easier or harder, depending on the site's designer. I'll return to this subject a little later in this post.

Google Killers

Google is not quashing any research into the problem, but they are more of an obstruction than you'd think.

See, Google and Yahoo and many other companies are really building the tools that one might think of as the "Semantic Web," but the exact model is hidden from you. The reason Google is able to get useful results is not the ability to spider the web; it's the ability to convert the spidered pages into a useful set of metainformation and then convert your free-text query into a query on that metainformation -- and to do all of this in such a way that people can't corrupt the results too much.

So our problems are twofold. First, you can't say "Gee, I wish I had a tool to give me useful information if I tell the search engine that I want to look for the integrated circuit 74HC14," because you can just type 74HC14 into a search engine and get what you want fast enough. In other words, these hidden metainformation models are too good.

Second, it is not in Google's best interest to actually let you plug into their metainformation model in any useful way because all you'd do is find ways to either subvert their existing results or help build a world where there's no need for Google.

There are semantic web search tools that spider the web for RDF information and let you do queries on the collected data, which leads me to believe that the problem is not that nobody's made a good RDF search engine, but that nobody's made one that does anything commonly needed and useful better than the existing engines within reasonable CPU limits.

An XML bag on the side

Even though RSS was hijacked at points to try and make it an RDF-based format, it ended up being a vaguely standardized XML-ish format. If you go to most pages these days, there are a few link tags in the header pointing to XML files maintained as separate "sidecar" documents that can be downloaded optionally.
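Concretely, the "sidecar" pointer is usually a couple of link elements in the page head (the paths here are illustrative; the MIME types are the standard ones):

```html
<link rel="alternate" type="application/rss+xml"
      title="RSS feed" href="/feed.rss">
<link rel="alternate" type="application/atom+xml"
      title="Atom feed" href="/feed.atom">
```

A feed reader fetches the page, spots these links, and then ignores the page itself in favor of the sidecar.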

This sort of thing works great for some cases, but it doesn't scale well, nor does it work well in all cases.

It works well when a single sidecar document can be shared between many pages on the site, when there's a fairly persuasive application that involves ignoring the main content of the page and going straight to the sidecar, and when you have only a fairly short list of extra files to maintain.

This doesn't work out when the sidecar document is different for each page, or when your tools don't want to support them.

When it works, it's great. I can just publish an RSS or Atom file that everybody can easily consume. If I want to do more specialized sharing, I can just add a few custom tags using the RSS and Atom extension frameworks.
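A sketch of that extension mechanism: RSS 2.0 lets you mix in elements from any XML namespace, so you can bolt specialized data onto an ordinary feed. Here I'm using the real W3C geo vocabulary; the feed contents and coordinates are made up:

```xml
<rss version="2.0"
     xmlns:geo="http://www.w3.org/2003/01/geo/wgs84_pos#">
  <channel>
    <title>Example feed</title>
    <link>http://example.org/</link>
    <description>Illustrative feed with a namespaced extension</description>
    <item>
      <title>A geotagged entry</title>
      <!-- Extension elements: ordinary RSS readers skip these,
           geo-aware tools can use them. -->
      <geo:lat>37.77</geo:lat>
      <geo:long>-122.42</geo:long>
    </item>
  </channel>
</rss>
```

Readers that don't understand the geo namespace just ignore those elements, which is exactly the kind of graceful degradation RDF never managed to make easy.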

Microformats

I was ambivalent about microformats when they first came out, but I've since come to the conclusion that they are great.

See, the nice part about a microformat is that you can apply a much lighter-weight layer of metainformation annotation to a document and maintain it in parallel with the document. It's almost the same idea as what RDF would like you to do, except it's "street legal" and works the way the web works.

There are other potential ideas -- either inserting RDF statements directly into HTML documents, or adding a generic RDF feed document the way you add RSS feeds -- but the microformat system works better. It forces you to actually display the metainformation you are outputting, which encourages you to be slightly more truthful, and it forces you to update the metainformation and the displayed information at the same time.
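Because a microformat lives in the visible HTML, extracting it takes almost nothing. Here's a minimal sketch using only the Python standard library; the class name and sample markup are mine, not from any spec implementation:

```python
# Minimal sketch: pull rel="tag" microformat links out of HTML using
# only the standard library. Illustrative only; a real implementation
# would split the rel attribute on whitespace and resolve relative URLs.
from html.parser import HTMLParser

class TagExtractor(HTMLParser):
    """Collects the href of every <a rel="tag"> link it sees."""
    def __init__(self):
        super().__init__()
        self.tags = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "a" and attrs.get("rel") == "tag":
            self.tags.append(attrs.get("href"))

sample = ('<p>Posted under <a rel="tag" href="/tag/rdf">rdf</a> and '
          '<a rel="tag" href="/tag/semweb">semweb</a>.</p>')
parser = TagExtractor()
parser.feed(sample)
print(parser.tags)  # ['/tag/rdf', '/tag/semweb']
```

Contrast that with RDF, where you'd need a triple store and a query layer before you could answer even this trivial question.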

So if RDF isn't the way forward, what is?

The problem with the world is that you cannot make revolution happen. It might happen on its own, but you can't rely on it.

For example, one might argue that if we took every gas-powered car and truck off the road tomorrow and replaced them with electric cars and high-speed railroads and ships and whatnot, we'd be in a greenhouse-gas-free perfect world with no global warming. And culture would evolve around the need to charge the car and the inability to take one's car across the country without stopping to charge it.

But you can't just say "Look! Here's a brand new high speed train and some electric cars and charging stations! Get rid of your car!" because there's too much effort involved right away for people to care. So your electric cars sit unsold. But if you make a hybrid car that can kinda work as an electric vehicle but doesn't take anything away, people start clamoring for the ability to plug it in and use it even more as an electric car.

Really, this all devolves down to a single big question: Can a program, aided by additional semantic information, do a better job of getting results than a program without this information?

Remember that this is not a cut-and-dried question, because of the metacrap problem. Keywords were great for making early search engines work -- until people started keyword stuffing.

The way I see things:

Tweaking the design of existing things?

The problem with the semantic web people is that they are all ivory tower folks. They made a very generalized model for things that's too generalized to be understood in a day. And most of them haven't come down to the level of your average programmer to suggest ways that the rest of the world can contribute.

My next two articles are going to talk about how I'm progressively enhancing Rm to be more semantic without losing sight of having an immediately useful application.
