justin timberlake is a web data ninja

Someone made the mistake of asking, and I couldn’t find the old email I wrote on the topic, so I’m going to inflict upon you how I think the whole dealing-with-web-data thing should probably break down in terms of information flow. My thinking here is heavily influenced by a very popular treatise on problem deconstruction, of course.

Step the first: you cut a hole in the page.

To work with page data, we need to find the data. We can do that using heuristics (like various webmail systems do to identify dates for calendar integration, or the auto-linkifying of URLs that is so common), using explicit metadata like microformats, or even letting the user select something on which to focus our data-detection powers.
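To make the heuristic flavour of that concrete, here is a minimal sketch in Python. It assumes we already have the page as plain text, and it only looks for URLs and ISO-style dates. The names `Detection` and `find_data` and both patterns are invented for illustration; real detectors (webmail date-spotting, linkifiers) cope with far messier input.

```python
import re
from typing import List, NamedTuple


class Detection(NamedTuple):
    kind: str   # hypothetical label: "url" or "date"
    text: str   # the matched substring
    start: int  # character offset into the page text


# Deliberately naive patterns, assumed for this sketch only.
URL_RE = re.compile(r"https?://[^\s<>\"']+")
ISO_DATE_RE = re.compile(r"\b\d{4}-\d{2}-\d{2}\b")


def find_data(page_text: str) -> List[Detection]:
    """Scan plain text for substrings that look like URLs or ISO dates."""
    found = []
    for kind, pattern in (("url", URL_RE), ("date", ISO_DATE_RE)):
        for match in pattern.finditer(page_text):
            found.append(Detection(kind, match.group(), match.start()))
    # Return detections in the order they appear on the page.
    return sorted(found, key=lambda d: d.start)
```

Calling `find_data` on a sentence that mentions a date and then a link should give back one "date" detection followed by one "url" detection. Explicit metadata or a user selection would feed the same kind of list, just with less guessing.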

Step the second: you put the data in the box.

Once we’ve found and analyzed the data of the moment, we probably want to bin it (in the statistics sense, not the adorable British accent sense) into a broad classification like “date”, “place”, “person”, “photo”, “event”, etc.
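A rough sketch of the binning step, again in Python. The `Bin` enum and the `LABEL_TO_BIN` table are assumptions of mine, not an existing API. The keys `vcard`, `vevent`, `geo` and `adr` are root class names used by microformats; `iso-date` and `img` are made-up labels standing in for whatever a heuristic detector might emit.

```python
from enum import Enum


class Bin(Enum):
    DATE = "date"
    PLACE = "place"
    PERSON = "person"
    PHOTO = "photo"
    EVENT = "event"
    UNKNOWN = "unknown"


# Hypothetical mapping from detector-specific labels to broad bins.
# Treating every hCard ("vcard") as a person is a simplification;
# an hCard can also describe an organization.
LABEL_TO_BIN = {
    "vcard": Bin.PERSON,
    "vevent": Bin.EVENT,
    "geo": Bin.PLACE,
    "adr": Bin.PLACE,
    "iso-date": Bin.DATE,  # made-up label for a heuristic date match
    "img": Bin.PHOTO,      # made-up label for an image element
}


def classify(label: str) -> Bin:
    """Map a detector label to a broad bin, defaulting to UNKNOWN."""
    return LABEL_TO_BIN.get(label.lower(), Bin.UNKNOWN)
```

The point of the table is that everything downstream only needs to know about a handful of bins, not about every detector that might exist.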

Step the third: you give him or her the box.

Once we’ve determined the type and value of the datum in question (I like to use words like “datum” to cover up my insecurity about a lack of academic credentials), we can then present it to the user so that they can send it to a web service, poke a helper app, turn it into HTML on the clipboard to paste into their blog, or annotate the page in their Places store with the juicy tidbits. The work on improved content handling in Firefox 3 will give us some useful primitives here, I have reason to hope.
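One way to picture the hand-off is a small registry of actions keyed by the datum's type. This is a sketch under my own assumptions, not how Firefox does or will do it: `register`, `actions_for` and `run` are invented names, and `example.com` is a placeholder rather than a real mapping service.

```python
import html
import urllib.parse
from typing import Callable, Dict, List, Tuple

# kind -> list of (menu label, handler); all names here are hypothetical.
ACTIONS: Dict[str, List[Tuple[str, Callable[[str], str]]]] = {}


def register(kind: str, label: str):
    """Decorator that offers a handler for every datum of the given kind."""
    def decorator(fn: Callable[[str], str]) -> Callable[[str], str]:
        ACTIONS.setdefault(kind, []).append((label, fn))
        return fn
    return decorator


@register("place", "Copy as HTML")
def place_as_html(value: str) -> str:
    # Escape the value so it is safe to paste into a blog post.
    return '<span class="adr">{}</span>'.format(html.escape(value))


@register("place", "Look up on a map")
def place_map_url(value: str) -> str:
    # Placeholder URL; a real handler would target an actual web service.
    return "https://example.com/map?q=" + urllib.parse.quote_plus(value)


def actions_for(kind: str) -> List[str]:
    """List the menu labels to show the user for this kind of datum."""
    return [label for label, _ in ACTIONS.get(kind, [])]


def run(kind: str, label: str, value: str) -> str:
    """Run the handler the user picked on the detected value."""
    for name, fn in ACTIONS.get(kind, []):
        if name == label:
            return fn(value)
    raise KeyError(label)
```

With this shape, adding a new thing the user can do with a "place" is one more decorated function, and the detection and binning steps never need to hear about it.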

Epilogue

Every now and then, someone asks me “what are microformats? why do people use them? do they smell nice?” My first instinct is to say “well, Google can tell you that”, but it turns out that it’s really pretty likely that you’ll end up on microformats.org, where they will tell you this:

Designed for humans first and machines second, microformats are a set of simple, open data formats built upon existing and widely adopted standards.

There is then a link to “learn more” about microformats, by which I think a reasonable person might assume that they mean “learn anything”, because that description is sort of equivalent to describing Firefox as “a piece of software that is built from C++ and JavaScript”.

But then, I don’t think that talking about microformats specifically is really the right way forward, and I think that the microformats dudes would agree that microformats are a means to an end.