2006-06-20

Wikipedia's semantic cow paths

I've been quite silent on this blog for two months, and have meanwhile resumed a bit of my Wikipedian activity. Although I was an enthusiastic early adopter of Wikipedia back in 2001, my editing history so far has been poor and episodic. But every time I have come back to the editor dashboard after months or years of inactivity, I have been amazed by the tremendous qualitative growth of the toolkit made available to users. Wikipedia's growth has been stressed again and again in terms of quantity and quality of articles, languages, editors, popularity as a reference resource, etc. But what has not been stressed enough is the parallel growth in features supporting better search and editing. Many of those features in fact add a quality to the information which makes it ready for semantic parsing and easy rewriting as RDF. The basis of it all is a sound use of URIs, names and namespaces.
For example, http://en.wikipedia.org/wiki/Volcano identifies without ambiguity the unique page dedicated to this geological feature, while http://en.wikipedia.org/wiki/Volcano_(disambiguation) is a hub to potential homonyms such as http://en.wikipedia.org/wiki/Volcano_(film). Links are provided to similar resources in other languages, such as http://fr.wikipedia.org/wiki/Volcan. One can reasonably use any of those URIs to identify the concept "Volcano". But what about a class Volcano? Here come Wikipedia categories. http://en.wikipedia.org/wiki/Category:Active_volcanoes is ready-made for a class, and parsing the page will easily give you the list of instances, each with a link to its full description page. This description itself is formatted using templates such as "infoboxes", so that a page in the category "Active volcanoes" will yield standard properties in a standard format. It is easy to turn this into a database and, if one interprets the infobox elements as so many properties, into an RDF description, with the infobox structure itself becoming an RDFS or OWL description of the matching class.
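To make the infobox-to-RDF idea a bit more concrete, here is a minimal sketch in Python. It assumes the infobox has already been extracted into a plain dictionary; the page title "Example Volcano", its field values, and the property namespace http://example.org/infobox/ are all hypothetical placeholders invented for illustration, not real Wikipedia data or an existing vocabulary. The page URI serves as the subject, the category URI as the class, and each infobox field becomes one property.

```python
from urllib.parse import quote

WIKI = "http://en.wikipedia.org/wiki/"
RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"
# Hypothetical property namespace, invented for this sketch only.
PROP = "http://example.org/infobox/"


def page_uri(title):
    """Build a Wikipedia page URI from a page title (spaces become underscores)."""
    return WIKI + quote(title.replace(" ", "_"), safe="():_")


def infobox_to_triples(title, category, infobox):
    """Return N-Triples lines: one rdf:type triple, then one per infobox field."""
    subject = page_uri(title)
    class_uri = page_uri("Category:" + category)
    lines = ["<%s> <%s> <%s> ." % (subject, RDF_TYPE, class_uri)]
    for key, value in infobox.items():
        predicate = PROP + quote(key.replace(" ", "_"))
        # Escape backslashes and double quotes inside the literal.
        literal = value.replace("\\", "\\\\").replace('"', '\\"')
        lines.append('<%s> <%s> "%s" .' % (subject, predicate, literal))
    return lines


if __name__ == "__main__":
    # Placeholder values, not real data about any volcano.
    example = {"elevation": "1234 m", "last eruption": "unknown"}
    for line in infobox_to_triples("Example Volcano", "Active volcanoes", example):
        print(line)
```

All values are kept as plain string literals here; a real extractor would also have to parse the wiki markup of the template and decide on datatypes and units, which this sketch leaves out.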
What should we learn from that? That out of a collaborative and mostly non-directed process, cow paths are emerging which look more and more like semantic markup, ready to be spidered by smart parsers and tools, either to improve Wikipedia content itself (there are already a bunch of bots and agents doing that), or to extract from this amazing knowledge base any kind of structured data in whatever format, including the implicit ontology: the structure of categories, the attributes used, and the like. And my hunch is that this process will quickly prove much more effective for building the Semantic Web than many costly academic ontologies nobody will ever use.

[2010-04-08]: What is described here is exactly what DBpedia started in 2007.