Working 9 (Hours) to 5 (Minutes): Tuning the Pellet Classifier

by Mike Smith

The NCI Thesaurus is an ontology of cancer, diseases, and related terminology within and outside biomedicine. It is really large: the latest release has about 58,000 classes. From our perspective, as maintainers of Pellet, large ontologies present opportunities. The folks at NCI agreed, and they’ve funded us to improve Pellet’s classification service to the point that it can be used with the Thesaurus.

In a short month of work, we’ve brought the time to classify the Thesaurus from infinite to 9 hours to 5 minutes and, though we’re shifting focus at the moment, we’re confident there are more improvements to be realized. This has been an excellent example of why working on Pellet is rewarding, why software engineering matters, and how funding Pellet’s development can make a difference.

If you were paying attention to the NCI Thesaurus back when some of us worked on it at the Mindswap lab: that (older) version now takes about 15 seconds to classify. Yeah, that’s right: 15 seconds. When we started this work, it took 50 minutes.

Expect to see these classification improvements in the 1.5 release, and if you’ve got other big problems for which re-engineering Pellet may help, let us know.
