Archive for November, 2007

(Slowly) Open Sesame

Monday, November 26th, 2007 · Michael Grove

OpenRDF/Aduna recently announced Sesame 2 is finally in the release candidate stage (see the note from 11/12 on their homepage). We’ve been using Sesame since about version 1.1 as the primary backend for jSpace with reasonable success. Sesame 1.x has proven to be a solid backend as development has progressed in jSpace. Unfortunately, jSpace’s UI is very query driven, and its responsiveness relies greatly on the performance of the backend it’s talking to; a slow database means a long time between selecting something in a column and seeing results in the subsequent column.

Until recently, Sesame 1.x has handled this rather well. I had to do some work on the auto-generated queries, some pre-optimization, to squeeze a little better performance out of them, but for the most part, the query response time has been adequate for development and testing. But now we’re trying to test jSpace against non-trivially sized data sets, including my favorite, our scrape of the Retrosheet.org baseball data, which is about 7.5M triples. We’ve been using the in-memory Sesame repositories because they give us the best query performance, but we’re coming up on the point where it’s not going to be fast enough for larger data sets. I’ve been tracking Sesame 2 for most of the time it’s been in development, about two-odd years now I guess, and I was happy to hear it finally made release candidate status. That meant to me that it was worth finally giving it a test drive.

I downloaded the latest version (RC1) and tore into it like a kid at Christmas. I set up a very simple bit of profiling code, which basically just took sample queries dumped from a session of me using jSpace against the baseball data, posed them against the repository, and tracked the query time. I was dismayed when I saw the initial results; they were not what I expected. The out-of-the-box configuration of a Sesame 2 in-memory repository was being crushed by a copy of Sesame 1.2.7 built from their CVS trunk about a month ago. We’re talking anywhere from two or three times slower for some queries to two or three orders of magnitude slower for others. Out of 13 test queries, Sesame 2 outperformed its predecessor on only one, a rather simple query which grabbed all the rdfs:label triples from the kb. I posted the results on the Sesame forums and got two suggestions: one, try SPARQL queries rather than SeRQL; and two, there’s a dead simple query optimization that has not been included in the query optimizer yet, so maybe if I do that optimization by hand, I’ll see results more like what I expected.
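For illustration, a harness of the kind described above might look like the following minimal Python sketch. It is not the code used for these measurements and it does not call the Sesame API: each backend is assumed to be wrapped in a plain callable that evaluates a query string, and the names `time_query`, `compare_backends`, and `slowdown` are hypothetical.

```python
import time
from statistics import median


def time_query(run_query, query, repeats=3):
    """Return the median wall-clock seconds over `repeats` runs of one query."""
    samples = []
    for _ in range(repeats):
        start = time.perf_counter()
        run_query(query)  # the result is discarded; only elapsed time matters
        samples.append(time.perf_counter() - start)
    return median(samples)


def compare_backends(backends, queries, repeats=3):
    """Time every query on every backend.

    `backends` maps a label to a callable that evaluates a query string
    (an assumption: how the callable talks to the repository is up to you).
    Returns {query: {label: seconds}}.
    """
    return {
        q: {label: time_query(run, q, repeats) for label, run in backends.items()}
        for q in queries
    }


def slowdown(timings, baseline, candidate):
    """Per-query ratio candidate/baseline; values above 1 mean the candidate is slower."""
    return {q: t[candidate] / t[baseline] for q, t in timings.items()}
```

Taking the median of a few repeats is one way to dampen warm-up and garbage-collection noise; a single run per query would also work for coarse comparisons like "two times slower" versus "two orders of magnitude slower".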

So the next test was with SPARQL queries, but not surprisingly, there was no appreciable speed-up. The queries are parsed into the same query model which is executed by the engine, so this is what I expected. However, the hand-optimization did yield a significant improvement in performance. The worst-case difference was reduced to only an order of magnitude, and for the most part, queries were only a couple of times slower with Sesame 2. And now there was a second query in which Sesame 2 was outperforming Sesame 1.x.

This cheered me up; there still seems to be hope for Sesame 2, but in a later release candidate. James, one of the fellows who responded to my post on the Sesame forums, did point out that Sesame 2’s performance may never reach the level of Sesame 1.x because of the added complexity of the new quad-based format over Sesame 1.x’s triple-based architecture. He makes a good point, but I’ve got my fingers crossed anyway. I’ve enjoyed using Sesame in the past, and I hope they can streamline the query engine some before the final release so we can continue using it.

For those interested in my post on the Sesame forums, you can see it here, and you can download the raw profiling results in .xls format.

Choosing a Syntax for User Defined Datatypes in OWL

Tuesday, November 20th, 2007 · Mike Smith

I’ve been representing C&P in the W3C OWL working group, and my recent focus has been on methods for defining and reusing user-defined unary datatypes in an ontology. The OWL 1.1 documents, which were taken as a working group input, allow users to define restrictions on built-in datatypes in-line using a custom syntax. The following snippet, used in the description of Child in the family.owl example, illustrates this:


<owl:Restriction>
  <owl:onProperty rdf:resource="#hasAge"/>
  <owl:allValuesFrom>
    <owl:DataRange>
      <owl11:onDataRange rdf:resource="xsd:nonNegativeInteger"/>
      <owl11:maxExclusive>10</owl11:maxExclusive>
    </owl:DataRange>
  </owl:allValuesFrom>
</owl:Restriction>

Note the use of the owl11 namespace and the new vocabulary onDataRange and maxExclusive. This recycles some of XML Schema, notably constraining facets, without re-using the XML Schema syntax. An alternative approach, which embeds the XML Schema syntax directly into RDF/XML, might yield the following revision to the example:


<owl:Restriction>
  <owl:onProperty rdf:resource="#hasAge"/>
  <owl:allValuesFrom>
    <owl:DataRange rdf:parseType="Literal">
      <xsd:simpleType>
        <xsd:restriction base="&xsd;nonNegativeInteger">
          <xsd:maxExclusive value="10"/>
        </xsd:restriction>
      </xsd:simpleType>
    </owl:DataRange>
  </owl:allValuesFrom>
</owl:Restriction>

Such reuse has warts in some cases, but it’s believed these can be worked around. It’s notable too that these warts only appear in OWL/RDF, not in OWL/XML.
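Whichever syntax is chosen, both snippets above are meant to describe the same set of values: non-negative integers strictly below 10. As a rough sketch of that value space (not of either syntax, and not of any reasoner's implementation), here is a small Python model. The `IntegerRange` class and the constant names are hypothetical, introduced only for illustration.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class IntegerRange:
    """A toy integer datatype described by two constraining facets."""
    min_inclusive: Optional[int] = None
    max_exclusive: Optional[int] = None

    def contains(self, value):
        # Only true integers are in the value space (bool is excluded on purpose).
        if isinstance(value, bool) or not isinstance(value, int):
            return False
        if self.min_inclusive is not None and value < self.min_inclusive:
            return False
        if self.max_exclusive is not None and value >= self.max_exclusive:
            return False
        return True

    def restrict(self, max_exclusive):
        """Return a narrower range, mirroring a restriction on a base datatype."""
        if self.max_exclusive is not None:
            max_exclusive = min(max_exclusive, self.max_exclusive)
        return IntegerRange(self.min_inclusive, max_exclusive)


# Stand-in for xsd:nonNegativeInteger, then the restriction used for hasAge.
NON_NEGATIVE_INTEGER = IntegerRange(min_inclusive=0)
CHILD_AGE = NON_NEGATIVE_INTEGER.restrict(max_exclusive=10)
```

With these definitions, `CHILD_AGE.contains(9)` holds while `CHILD_AGE.contains(10)` and `CHILD_AGE.contains(-1)` do not.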

It’s too early to tell which datatype syntax will be used in the specifications. The custom syntax is more accessible to RDF tools and makes adding constraining facets in the future easy because they’re just URIs. On the other hand, the benefit of embedding the XML Schema syntax is reuse of that community’s tools, which, due to time and industry support, are more mature than the RDF alternatives. I suspect that additional pros and cons will come out soon.

If you’re interested in this part of OWL, please provide thoughts on the alternatives in comments here or on the pellet-users or public-owl-dev mailing lists. I’m happy to relay them into the WG discussion.

Using Taxonomies for SPARQL-DL optimizations

Friday, November 9th, 2007 · Petr Kremen

I spent the last post showing directions for optimizing SPARQL-DL in general. Now I would like to touch on a more particular case of SPARQL-DL queries, namely those that contain class/property variables in the ABox part of the query.

I call a class/property variable ?x down-monotonic if it occurs in either a Type(•,?x) or a PropertyValue(•,?x,•) atom. For these variables we can use class/property hierarchies to prune the search. Let’s take an atom Type(i, ?x) (i.e., i is either a binding for variable •, or constant •). If i is not of type C, we can safely avoid considering subclasses of C as bindings for ?x, because any instance of a subclass of C is also an instance of C (and analogously for PropertyValue(i, ?x, j) and the property hierarchy).

However, the reverse implication doesn’t hold, because of possible interaction with other TBox atoms. Having found a binding C for a (down-monotonic) ?x, we can’t take all superclasses/superproperties of C as valid bindings. Consider the query Type(i,?x), ComplementOf(?x, not(C)). If C is a subclass of D, it might happen that C is a valid binding for ?x, while D is not.
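The pruning idea can be sketched in a few lines of Python. This is only an illustration under simplifying assumptions, not the actual implementation: the instance check is a plain callable (in a real system it would be a reasoner call), the hierarchy is a dictionary from a class to its direct subclasses, and the function name and the small hierarchy below are hypothetical.

```python
def bindings_for_type_atom(is_instance_of, subclasses, root):
    """Find bindings for ?x in Type(i, ?x), searching the hierarchy below `root`.

    Walks top-down. When the check fails for a class, its subclasses are
    never pushed, so the whole subtree below it is pruned.
    Returns (bindings, number_of_instance_checks).
    """
    bindings, checks = [], 0
    seen = set()
    stack = [root]
    while stack:
        cls = stack.pop()
        if cls in seen:  # a class may be reachable through several parents
            continue
        seen.add(cls)
        checks += 1
        if is_instance_of(cls):
            bindings.append(cls)
            stack.extend(subclasses.get(cls, ()))
        # On failure nothing is pushed: subclasses of cls cannot be bindings.
    return bindings, checks


# Illustrative hierarchy and individual (class names are made up for the example).
SUBCLASSES = {
    "Person": ["Student", "Employee"],
    "Student": ["GraduateStudent", "UndergraduateStudent"],
    "Employee": ["Faculty"],
    "Faculty": ["Professor"],
}
TYPES_OF_I = {"Person", "Student", "GraduateStudent"}

found, n_checks = bindings_for_type_atom(TYPES_OF_I.__contains__, SUBCLASSES, "Person")
```

Here the failed check on Employee means Faculty and Professor are never examined, so only 5 of the 7 classes are checked. Note that the sketch only handles the Type atom in isolation; as explained above, other atoms in the query can still reject a class that passes this check.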

Of course, the optimization performs best when the search fails for a down-monotonic variable binding that is the root of a deep and complex sub/super hierarchy. Let’s take two sample queries run against LUBM:

  • “Give me all people (?X) together with their type (?A) that are advisors of themselves.”
    Abstract syntax: Type(?X, ?A), SubClassOf(?A, ub:Person), PropertyValue(?X, ub:advisor, ?X)

  • “Give me all people (?X) together with their type (?A) that are teaching assistants of some course (?Y).”
    Abstract syntax: SubClassOf(?A, ub:Person), Type(?X, ?A), PropertyValue(?X, ub:teachingAssistantOf, ?Y), Type(?Y, ub:Course)

The first query does not return any results, so the optimization prunes all subclasses of Person immediately, reducing execution time by an order of magnitude (from 3 seconds to 0.3 seconds). On the other hand, the second query returns a nonempty result set, so the pruning gain is significantly smaller (from 3 seconds to 1.5 seconds).

Annotation System Proposal

Wednesday, November 7th, 2007 · Bijan Parsia

In spite of incredibly sore hands and a tendency to go into shock every hour or so (let’s just say that the corticosteroid injection into the wrist did not go well; sufficiently so that we didn’t even try the knuckle joints), I managed to flesh out a proposal for a new Annotation System for OWL.

Currently, in OWL DL, annotations only occur on “entities” (i.e., classes, properties) via a specially declared set of AnnotationPropertys [sic!]. OWL 1.1 also allows annotations on axioms, so not only can you say that “The class Person was created by Bijan”, but you can also say, “person subClass Cheetah was created by notBijan yesterday”. Obviously, axiom annotations are critically important for any serious ontology engineering project, esp. if it is developed by a team.

In both cases, however, the annotations are “semantics free”. Now, obviously they have semantics in the sense that editors can react to them and so can applications and definitely so can people. However, from the reasoner’s point of view, they are nothing more than comments and can be thrown away without harm. But, as a result, the reasoner can’t help either! So if you use Dublin Core qualified elements (e.g., date-modified) in your annotations, you cannot use a statement like (date-modified subPropertyOf date) in the way you might hope and expect. In fact, if you have a date-modified annotation statement and add that TBox statement, you end up in OWL Full.
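To make the “comments” point concrete, here is a toy Python sketch. It is not how any OWL reasoner is implemented; it only models a reasoner that applies subPropertyOf entailment to domain statements and never looks at annotations. All function names are hypothetical, and statements are simplified to (subject, property, value) tuples.

```python
def entail_subproperties(statements, sub_property_of):
    """Apply (p subPropertyOf q) to (subject, property, value) statements until nothing new appears."""
    inferred = set(statements)
    changed = True
    while changed:
        changed = False
        for (s, p, o) in list(inferred):
            for q in sub_property_of.get(p, ()):
                if (s, q, o) not in inferred:
                    inferred.add((s, q, o))
                    changed = True
    return inferred


def reason(domain_statements, annotations, sub_property_of):
    """A reasoner that treats annotations as comments: they never reach the entailment step."""
    del annotations  # intentionally ignored, as if they had been thrown away
    return entail_subproperties(domain_statements, sub_property_of)


annotation = ("Person", "date-modified", "2007-11-07")
hierarchy = {"date-modified": ["date"]}

as_annotation = reason(set(), {annotation}, hierarchy)
as_domain_statement = reason({annotation}, set(), hierarchy)
```

In this sketch `as_annotation` is empty, so the hoped-for (Person, date, 2007-11-07) statement is never derived; it only shows up in `as_domain_statement`, where the same triple is treated as an ordinary domain statement.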

OWL Full’s solution, roughly, is to be pretty smushy so that AnnotationPropertys [sic] can be ObjectPropertys [sic] and thus the reasoner is sensitive to them. In OWL 1.1, you can get much the same effect via punning: Instead of making your statements about entities using annotations, just use regular properties and let punning make that class an instance.

There are several problems with this approach. The most significant to my mind is that it seriously mixes annotations with domain statements. Most of the time, this is just wrong: As an ontology author, I don’t want to have to consider whether owl:Class is an instance of Person, esp. if all I did was put an annotation on Person.

Also, you might want the logic of your annotations to be fairly different than the logic of your domain. For example, you might want the logic of your annotations to be very simple. Or you might want your annotations to have closed world assumption semantics (e.g., so you can validate the annotations, rather than inferring stuff about them). Currently, the only way to get this in OWL is with a non-standard extension or by separating out your annotations into a separate document.

My proposal builds this separability into OWL, so you can annotate more or less as normal, but get the effect of isolating your annotations from your domain. I also introduce the notion of “mustUnderstand” annotations, that is, annotations which can affect the domain. One might use this for language extensions like Pronto or constraints, but also for more specific interventions. For example, you might want to have some production rules—or some Javascript!—that hack your domain axioms in a variety of ways. The new annotation system would allow for such hacks but, perhaps, in a more controlled way.

Anyway, it’s still in flux, so feedback is welcome. I imagine that you’ll be able to access the various annotation spaces via SPARQL easily enough (similar to what Boris proposed in his metaviews paper). Such annotations can also be used for metamodelling extensions, though that is a tricky area, for sure.

The new annotation system involves some extra complexity, alas. I tried to make it “dumb down” in a useful way, so that just a simple extra statement in your OWL 1.0 or OWL 1.1 documents will get you some nice new behavior. It’s still early days, of course.

Feedback welcome.

Waterboarding is torture; really really bad torture. People who refuse to acknowledge this should be barred from any government position wherein they have oversight over interrogation. At the very least!

Daylight Saving — or evil plot?

Tuesday, November 6th, 2007 · Bijan Parsia

It turns out that people don’t adjust to daylight saving time very well at all. Our circadian rhythms are, apparently, very grumpy and delicate dears which are inclined to throw hissy fits lasting up to 10 weeks.

Of course, I have this problem licked—I just never sleep.