Wednesday, November 16, 2005

Rookie Category Mistakes

Just because, as George Lakoff points out, every human being thinks in categories, doesn't mean that every human being is good at categorizing for other people. I've come into a few situations where the ontology/category structure/whatever is already in place, and it's my job to fix it or extend it. Every time I train a new person or look over these legacy systems, I see similar mistakes.

For the purposes of this little rant, let's assume that you're building a category structure that will serve as a user interface to data and an organizing system for data management, rather than as a back-end set of rules and axioms that will add meaning to data.

1. Everyone Has an Area of Expertise. This is Dangerous.

The problem with smart people is that they know a lot about how things are Rightly Done and classified. That's fine if you're playing Trivial Pursuit or working on the Wikipedia.* It's not fine if you're trying to organize something at a level that people who don't know a lot about the topic can access. One very specific mistake that I've run across more than once is the urge that people who know linguistics have to organize languages by family. Great idea, right? Sure. It's a great idea, it's correct, and it can be used to impart meaning up and down the hierarchy, but it doesn't help someone who doesn't know the first thing about what these families mean. A linguist or student of language knows what these families mean, and might be annoyed to see the languages arranged alphabetically. However, their needs are outweighed in this case by those of the person who needs to find their language quickly. I've seen this language-by-family classification in crazy places, like in a drop-down menu for a job application.

There's what's Right and there's what Works. Usually the two are compatible, sometimes they are not. If you're using a taxonomy as a user interface, it has to be usable above all things. If someone who knows nothing about the topic can browse to a simple goal (in the example above: Find the English Language), you've done it right.

This particular mistake leads into the second one:

2. Just because you know about it, doesn't mean it needs to be called out

Let's say you're putting together a category structure for "European Languages" to be used for a travel site that is compiling some commonly used phrases in multiple languages.** You already know that you shouldn't go too crazy with the language family parents, because your users are business travellers who need to find the bathroom, not linguists. Here are some categories you DO NOT need: Romany. Basque. Corsican. Faroese. C'mon, stop showing off.

Time and again I've seen exhaustive, over-granular category structures in specific places while other areas suffer from too-little differentiation. I worked on one taxonomy that had gone to the level of detail of calling out every chip that Intel had ever worked on, but didn't even mention AMD. You need breadth before depth. You need a plan before you dive in. Just because you can build something doesn't mean you should.

*Even then, have you ever stumbled across a 46-paragraph cross-referenced article on some bit of pop culture and then been unable to find a single thing about, say, Boise, Idaho? That's a minor weakness of asking a community to work on a project. The project will reflect what the community is interested in, rather than what the community needs more information on.
**I love convoluted examples.

Thursday, November 10, 2005

Truisms about ontologies

What the heck do you use ontologies for? Clay Shirky says they're overrated and useless for organizing content on the Web. This is quite true: at this time you can't use an ontology to organize the entire web. (I will probably do a point-by-point reaction to this article at some point. He gets many things right and a few things half-right in distressing ways.) If you can trust content creators and the web community (which, by the way, you can't) you can build an ontology out of their tags and metadata. It still misses those people who don't understand how to create metadata or who can't be bothered to add it to their pages.

Structured data has several practical uses, and things go totally petwang when you try to mix them without acknowledging or understanding that you're mixing them. Here's how I've used ontologies and taxonomies:

A hierarchical taxonomy allows you to present content in an even, logical distribution for people to browse. The resulting structure is often not accurate, but it is understandable and easy to move through.

It can be used to explain semantic meaning to a computer.

It can hold synonyms and relationships between concepts.

It can describe a particular domain and help surface missing content.

One particular friction I've seen pretty much everywhere I've worked is the clash between the need to organize content and the need to have an accurate ontology. Sometimes you have to create classes and categories that are spurious just to present content in a way that is meaningful to a browser and that does not waste their time with needless clicking. For instance: http://dmoz.org/Arts/Performing_Arts/Acting/Actors_and_Actresses/ The letters at the top act like categories, but don't actually impart meaning to the child categories. You can see what letter the person's name starts with just by looking at it. However, if you throw thousands of categories up on a page, you are going to overwhelm your audience. It doesn't make for accurate ontology, but it makes for a good user interface.
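The letter-bucket trick is easy to sketch. What follows is only a minimal illustration in Python, not how dmoz actually builds its pages; the function name `bucket_by_initial` and the sample names are made up for the example. It groups a flat list of names under their first letter, which adds no meaning to the data but spares the reader one enormous page.

```python
from collections import defaultdict


def bucket_by_initial(names):
    """Group names under the uppercase first letter of each name.

    Names that are empty or do not start with a letter go under "#".
    The buckets carry no semantic meaning; they only shorten the page.
    """
    buckets = defaultdict(list)
    for name in names:
        key = name[0].upper() if name and name[0].isalpha() else "#"
        buckets[key].append(name)
    # Sort the letters and the names inside each letter for predictable browsing.
    return {letter: sorted(members) for letter, members in sorted(buckets.items())}


# Hypothetical sample data, standing in for thousands of real categories.
actors = ["Bacall, Lauren", "Astaire, Fred", "Brando, Marlon", "Chan, Jackie"]
letter_pages = bucket_by_initial(actors)
```

Here `letter_pages["B"]` holds the two names filed under B, and each letter could become its own page of links.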

Monday, November 07, 2005

Using the word Ontology in the way that metadata managers and programmers have been using it is pretty controversial. It always annoys people when words that have a specific meaning are co-opted by another group for another purpose. Think of the controversy over "gay". The jury is still out as to whether the expansion of the word "ontology" is a pejoration or a semantic shift.

When I got my first real ontology gig, it was, of course, for an up-and-coming dotcom. I helped my boss place an ad in a few places, and because everything was fast and loose and the dotcom hadn't gotten around to hiring an HR department, we used my email address as the return address for prospective candidates. Some grad student from a philosophy department sent me a long rant about how I was personally responsible for the dumbing down of America (which, considering what the Web has done to our attention spans, may have some merit) and that overall we were evil people. He then went on to ask, rhetorically, "What IS an ontologist, anyway?"

I am afraid I emailed him back a rather snotty reply: "An ontologist is a librarian with stock options. Are you applying for the position?" I probably would take that email back if I could. On the other hand, if that's the snottiest I got during the dotcom boom, maybe I'm ahead of most.

Everyone except me has very specific ideas about what "Ontology" is. This guy gets squicky over the use of the indefinite article with "ontology". He says:

"If you want to really irritate me, refer to a plain taxonomic categorization as "an ontology". It's like calling a case of canned alphabet soup "a literature".

This is part of a general social trend that appropriates the most obvious outward manifestation of something they don't understand and devalues the original concept."

Whoa, hey, back off man. First off, I agree that taxonomy is only a probable component of ontology. I would disagree that taxonomy is "plain." A good taxonomy can be very complex and have multiple relationships within the hierarchy. The hierarchy itself imparts meaning to the concepts. (He goes on to make some very excellent points about the limitations of taxonomy and category structures as a representation of knowledge, so run off and read his post when you get the chance and can overcome his rather condescending writing style.)

Another comparison of the definitions of taxonomy and ontology occurs in the book Ontological Engineering : with examples from the areas of Knowledge Management, e-Commerce and the Semantic Web by Asuncion Gomez-Perez, et al. They say, "Sometimes the notion of ontology is diluted, in the sense that taxonomies are considered full ontologies...the ontology community distinguishes ontologies that are mainly taxonomies from ontologies that model the domain in a deeper way and provide _more restrictions_ on domain semantics. The community calls them lightweight ontologies and heavyweight ontologies respectively. On the one hand, lightweight ontologies [taxonomies] include concepts, concept taxonomies, relationships between concepts, and properties that describe concepts. On the other hand, heavyweight ontologies add axioms and constraints to lightweight ontologies. Axioms and constraints clarify the intended meaning of the terms gathered in the ontology."

I prefer the more relaxed approach of Heavyweight and Lightweight ontologies as opposed to the Philosophy community's stance that the word is being abused and the position of some others that ontologies and taxonomies have nothing to do with each other and anyone who uses the terms interchangeably is automatically an idiot trying to annoy ontological purists.

Friday, November 04, 2005

The Trouble With Folksonomy


Folksonomies are structured data, usually in the form of tags. These tags are set by the author of the content, and that author is usually an amateur who is just trying to get himself heard, and has a somewhat vague idea of the scope of the broader site. Flickr, Livejournal, Dailykos, and many other sites are currently using folksonomy-type tags. Some of these sites allow any user to register to add additional tags to an entry.


Folksonomies at their best solve one of the biggest problems with metadata: one person or team can't possibly know all the words that might be used to get at a particular piece or type of content. When you open up that pipeline and allow anyone who stops by or contributes to tack on the words he thinks are most relevant, you get so many more entry points to the data. The downside is that when people, particularly people who pride themselves on being outside of the mainstream, are faced with a field in which they are expected to submit something that is relatively standardized and guessable, they will try to be unique. Today on Dailykos, kos complained to his users about editorializing in tags: "...the tags are meant to be used as a categorization tool. They're not supposed to be used as a place for editorial comments. ...If you are creating a new tag, make sure it's a legitimate tag ... The success of the tagging feature depends on proper categorization." To some degree, this is true. Deliberate misuse of tagging results in categorization noise that will never be used to return results. I have noticed that many people in online communities get satisfaction from trying new or unique ways to communicate. Livejournal users will put content in their "current music", "current mood", and tag fields that relates to the post but not to the way people are likely to browse for the content in the post. Deliberate misspellings, random phrases tangentially related to the topic, and editorial commentary abound.


Is this behavior a problem? It can be, but only if the content is intended to be found and consumed by "outsiders." Many online diarists will use non-standard tags that help them group together their own thoughts within their own journal. These non-standard tags are prevalent on Flickr, too. The Flickr user striatic has an excellent post about how and why to use personal tags, group tags, and public tags.


Impenetrable tagging is not the only issue with folksonomies. When a user uploads a photo, she may just tag it as she sees it: it's a photo of San Francisco, therefore it gets that tag. Many people will miss the obvious entry points and get too general. They will tag a photo with "San Francisco" when the photo is actually of a crowd at the Folsom Street Fair.


These notes on folksonomy are not intended as a condemnation of the idea. I think that a folksonomy that is gently guided by a manager who hooks together similar concepts with a controlled vocabulary (for instance, managing stemming in tags, so that "republican" and "republicans" access the same set of posts) could be a solution to many problems of organization within online communities and content repositories.
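To make the "republican"/"republicans" example concrete, here is a rough sketch in Python of what managing stemming in tags might look like. Everything in it is an assumption made for illustration: the function names, the post IDs, and especially the stemming rule, which just lowercases a tag and drops a single trailing "s". A real site would want a proper stemmer plus a hand-maintained list of exceptions.

```python
from collections import defaultdict


def normalize_tag(tag):
    """Lowercase a tag and strip one trailing 's' (a deliberately naive stemmer)."""
    tag = tag.strip().lower()
    # Leave short words and words ending in "ss" (such as "congress") alone.
    if len(tag) > 3 and tag.endswith("s") and not tag.endswith("ss"):
        tag = tag[:-1]
    return tag


def build_tag_index(posts):
    """Map each normalized tag to the set of post IDs that carry it."""
    index = defaultdict(set)
    for post_id, tags in posts.items():
        for tag in tags:
            index[normalize_tag(tag)].add(post_id)
    return index


# Hypothetical posts with tags exactly as their authors typed them.
posts = {
    "post-1": ["Republican", "election"],
    "post-2": ["republicans", "congress"],
}
index = build_tag_index(posts)
```

Looking up `index[normalize_tag("Republicans")]` now returns both posts, while the authors still see the tags they typed. The manager only maintains the normalization rule, not every individual tag.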

Thursday, November 03, 2005

Defining My Terms

`Now you talk like a reasonable child,' said Humpty Dumpty, looking very much pleased. `I meant by "impenetrability" that we've had enough of that subject, and it would be just as well if you'd mention what you mean to do next, as I suppose you don't mean to stop here all the rest of your life.'
`That's a great deal to make one word mean,' Alice said in a thoughtful tone.
`When I make a word do a lot of work like that,' said Humpty Dumpty, `I always pay it extra.'

--Lewis Carroll, Through the Looking Glass

It seems that most language used in describing Internet concepts, especially around ontology and other structured data, is repurposed from other disciplines. (As words get repurposed, so do ideas. The only difference is that we pretend that these are unique and pioneering ideas, rather than acknowledge that we are standing on the shoulders of giants.)

The following are the terms I use a lot, and how I currently define them to myself. As this experimental self-education continues, I plan to update these definitions as needed.

Ontology - An ontology is a set of concepts and their relationships to each other. The most important aspect of an ontology is those relationships. They give it meaning. Ontologists come in two stripes: those who create ontologies for practical use (that's me!) and those who spend time with the theoretical aspects of ontology. These ontologists wrestle with the problem of defining a semantic relationship to a computer. To do that, your own mind must be very clear on what that relationship is. What is a synonym? What is a related concept, and how does that relationship differ from an alternate parent concept?

Taxonomy - Often used interchangeably with Ontology, a taxonomy is a hierarchical category structure in which concepts have one true place. It is more authoritarian and less flexible than an ontology. The Yahoo! directory category structure is a taxonomy by this definition.

Controlled Vocabulary or Thesaurus - (not Roget's thesaurus) These are synonym lists. They point out alternate spellings or words for the same concept. These are of great use in allowing users to perform conceptual searches without exhaustive Boolean logic. Google's misspelling helper is an example of a synonym list, as is the Library of Congress Subject Headings.

Faceted Classification - Every item, idea, or person has multiple attributes that are not mutually exclusive. Things have color and size, for instance. If you gather information (metadata) on these various facets, you can allow searchers to choose the facets they find most important. eBay uses facets in its Product Finders, which can be combined with category browsing. This allows a very targeted search using seller-provided metadata.
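
Two of these definitions, the synonym list and faceted classification, are simple enough to sketch in a few lines of Python. This is only my illustration of the ideas, with invented data and invented names (`SYNONYM_RINGS`, `preferred_term`, `filter_by_facets`); it is not how Google, the Library of Congress, or eBay implement anything.

```python
# A tiny controlled vocabulary: each entry maps variant words to one preferred term.
# The terms are made-up sample data.
SYNONYM_RINGS = [
    {"preferred": "film", "variants": ["movie", "motion picture", "cinema"]},
    {"preferred": "color", "variants": ["colour"]},
]


def build_lookup(rings):
    """Flatten the synonym rings into a variant -> preferred-term dictionary."""
    lookup = {}
    for ring in rings:
        lookup[ring["preferred"]] = ring["preferred"]
        for variant in ring["variants"]:
            lookup[variant] = ring["preferred"]
    return lookup


LOOKUP = build_lookup(SYNONYM_RINGS)


def preferred_term(query):
    """Return the preferred term for a query, or the cleaned query if unknown."""
    key = query.strip().lower()
    return LOOKUP.get(key, key)


def filter_by_facets(items, **wanted):
    """Return the items whose metadata matches every requested facet value."""
    return [
        item for item in items
        if all(item.get(facet) == value for facet, value in wanted.items())
    ]


# Hypothetical seller-provided metadata with two facets, color and size.
listings = [
    {"title": "Red wagon", "color": "red", "size": "large"},
    {"title": "Red marble", "color": "red", "size": "small"},
    {"title": "Blue marble", "color": "blue", "size": "small"},
]
small_red = filter_by_facets(listings, color="red", size="small")
```

With the synonym list, a search for "Movie" and a search for "film" land on the same concept without the user writing any Boolean logic. With the facets, the searcher picks whichever attributes matter (here, color and size) in any order, rather than walking down one fixed hierarchy.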