Thursday, January 12, 2006

Creating a New Ontology

I have been shocked to get more than one request for me to update this thing. That must represent more than 100% of my readers. So, shoutouts to CF and KC - this ramble is for you. And CF - it's your turn now!

At my relatively new job, I have been charged with the enviable task of creating a brand-new, from-scratch ontology. You might think this would be fairly simple. It should be. But I've been working with sub-optimal ontologies, both in actual content and in required field structure, for so long that I'm feeling quite a bit of option paralysis when faced with the opportunity to do it right for a change. I didn't want to screw it up, so for a while I tried to get more info from the people who are going to use the thing. There was no joy there: they want to see it first before they will give feedback. That's cool. I can work with that.

The ontology tool this company has been working with is an open source tool called Protege, specifically Protege 2000. I understand that there are newer versions out. I haven't downloaded one yet because I'm worried about breaking something. I will probably switch over to a newer version as soon as I master this one. The first thing I did when I saw the tool and read the documentation for it was join the mailing list. This list is fun: people from around the world use the software and they are always emailing to ask how you set up a field to do this or that, and sometimes the actual developers will answer. People seem to spend an inordinate amount of time configuring their fields and relationships and then, invariably, they will ask, "Are there any pre-existing ontologies for SubjectX I can just plug in?"

This doesn't make much sense to me. On the one hand, I've always referred to pre-existing ontologies to confirm my own sense of right or to ensure that I'm not missing something big. For ontologies that will be used as a public-facing interface, it's important to understand what people are used to seeing. If you plan to change that setup, you'd better have a darn good reason. For instance, maybe you're referring to the Dewey Decimal Classification, and decide that parapsychology and librarianship should not be in the same part of the same category tree. You are probably right, although some of the lame articles I read in library school had worse scholarship than the better parapsychology journals.

On the other hand, everyone needs an ontology for different reasons. A pre-made ontology may be worse than useless. Even adding new clients for an ontology within a company can create a need for customization and simplification.

So this is me, with a list of concepts and their facets in one hand and a copy of Protege in the other. Or rather, on the laptop. Everywhere else I've worked has had homegrown category/taxonomy/ontology software to work with. And everywhere else I've worked I've been the resident software critic, wondering why it's not good enough to do this or that or the other. Well, after working with Protege, which is as good as it should be considering the multiple ways people use or even define ontology, I apologize to all of the software I've used before. Having something customized for your company's needs is wonderful.

How am I doing at creating this ontology? It's hard to say. I refuse to make the mistake of creating too many fields and relationships. I also don't want to spend too much time entering data that might not be needed. I currently have four fields and over 300 instances in about 30 classes. I believe it's shaping up nicely, but I haven't shown it to the guy who asked for it yet.

I have a friend who knits. I'm more of a crocheter, myself. This friend will learn a pattern by knitting it until she realizes it's not working. Then she'll unravel and start again. She'll do this as many times as she needs to in order to create a perfect product. This is how I ontologize, and I think that I am very different from the other ontologists whose work I've read. They tend to spend a lot of time planning out where everything is going to go and then have someone else do the data entry. I'd rather gather the data and then just start typing it into the system.

The disadvantage here is that I allow the ontology tool and its limitations to shape the scope of what I'm doing. The advantage is that it takes less time in the long run, in my opinion. I also think that I would put The Fear into any unsuspecting ontologist who had to work with me.