Archive for the 'SPARQL the QL' Category

OwlSight v.52 — Keeping up with the Jones’

Monday, May 5th, 2008 · Michael Grove

Following closely on the heels of the recent release of Pellet 1.5.2, we’ve updated OwlSight. Since OwlSight runs on raw Pellet power, the new version, .52, updates the back-end to take advantage of the recent Pellet release.

If you have not already taken a look at OwlSight, cruise on over to the OwlSight page and take it for a spin. For those not already in the know, OwlSight is a lightweight browser-based ontology browser utilizing both GWT (and GWT-Ext) and Pellet as its core technologies. Until next time, stay classy cyberspace.

Spread the word: These icons link to social bookmarking sites where readers can share and discover new web pages.
  • Reddit
  • Digg
  • del.icio.us
  • TwitThis
  • Technorati

How Distinguished is Your Variable?

Thursday, December 20th, 2007 · Evren Sirin

In a recent post, Petr talked about optimizations for queries involving undistinguished variables. There is a close relationship between undistinguished variables and bnodes in RDF whose semantics is being discussed in the OWL working group for OWL 1.1. In this post, I’d like to discuss how they relate and explain how using (un)distinguished variables can change the results you get from a reasoner. First off, the term undistinguished variable is used interchangeably with the term nondistinguished variable so don’t be confused. In the literature you might also see the term existential variable being used for the same purpose.

Let’s start with reviewing bnodes in RDF and then we’ll get to bnodes in SPARQL. Contrary to common belief, bnodes in RDF are not just unnamed individuals but they are existential variables (see the bnode definition from RDF semantics). Treating bnodes as existential variables does not cause a difference in ontology consistency results or entailments regarding named individuals. The difference is visible when you are checking entailments where bnodes are used in the consequences. For example, consider the following simple ontology which I will call ontology A:

:Parent owl:equivalentClass [
       a owl:Restriction ;
       owl:onProperty :hasChild ;
       owl:someValuesFrom owl:Thing ] .
:Alice a :Parent .
:Bob :hasChild :Charlie .

Suppose we have another ontology B as follows:

:Bob :hasChild _:x .

According to the bnode semantics of RDF, ontology B is entailed by ontology A. This is because the bnode _:x in ontology B plays like a wildcard role and matches the individual Charlie from A. Suppose you have yet another ontology C:

:Bob :hasChild :Somebody .

Ontology C is not entailed by ontology A because there is no evidence that the named individual Somebody is same as the individual Charlie. If bnodes were just unnamed individuals then ontology B and C would be treated equivalently because we would just assign an arbitrary name to the bnode <em>:x and ontology B would not be entailed.

In contrast to the formal semantics of RDF, thinking of bnodes as unnamed individuals is common practice for people using RDF/OWL. The situation is similar for tool support. I’m not aware of any reasoner that treats bnodes in the consequences as existential variables (Pellet supports entailment checking but it does not support bnodes in consequences). These are some reasons why OWL working group is considering to adopt the unnamed individual semantics (a.k.a. skolemization semantics) for bnodes in OWL 1.1 that is in agreement with the current practice and users’ expectations. This is an ongoing discussion and there is no resolution on this topic yet.

Now let’s move to the semantics of bnodes in SPARQL queries which is defined similar to the RDF semantics. The SPARQL specification says:

Blank nodes in graph patterns act as non-distinguished variables, not as references to specific blank nodes in the data being queried.

A nondistinguished variable in the query traditionally means that the variable can be bound not only to asserted individuals but also to individuals whose existence is inferred. For example, in ontology A, since we know Alice is a parent we can infer the existence of her child even though there are no explicit assertions regarding that child in the ontology. This means the following SPARQL query:

SELECT *
WHERE { ?parent :hasChild _:a }

would return two results where variable parent is bound to Alice and Bob. That is, if you have a query engine that respects the semantics of bnodes. Pellet query engine supports bnodes in SPARQL queries with this semantics. As far as Pellet is concerned, the previous query is exactly same as the following query:

SELECT * 
WHERE { ?parent a [
       a owl:Restriction ;
       owl:onProperty :hasChild ;
       owl:someValuesFrom owl:Thing ] . }

However, if you change the query to use a distinguished variable in place of the undistinguished </em>:a as in:

SELECT * 
WHERE { ?parent :hasChild ?child }

then the query would have a single result (Bob as the parent and Charlie as the child). The difference in the results might seem surprising because two queries look very similar. But you can see the clear analogy between the first and last query and ontologies B and C. In both cases, replacing bnodes with named individuals/variables change the results we get.

Now let’s consider one last tricky case. Let’s change ontology A to contain one extra assertion as follows:

:Alice a :Parent .
:Bob :hasChild :Charlie .
:Dudley :hasChild _:c .

As expected, the first two queries (both using existential variables) will return three results (Alice, Bob, Dudley). But what do we expect for the last query which uses the distinguished variable child? If we are treating bnodes in the ontology as existential variables (as defined in the RDF semantics) then there are no differences between Alice and Dudley since owl:someValuesFrom is an existential restriction. This suggests we should still get a single answer for the last query (Bob). However, in practice, if you try to run this query against an OWL reasoner (I tried Pellet and Jena rule reasoners) you will get two answers (Bob and Dudley) where the child of Dudley is bound to a newly generated bnode identifier like _:b0.

This is just another example of how tools treat bnodes in the source ontology more like unnamed individuals not as pure existentials. And this is why I believe OWL 1.1 should adopt the skolemization semantics for bnodes and leave the existential variables to SPARQL. After all, existential variables only make a difference in query answering (note that you can think of ontology entailment as a SPARQL ASK query). It makes sense to have explicit existentials in the query language but not in the ontology language (which already supports existentials through restrictions like owl:someValuesFrom anyway).

Spread the word: These icons link to social bookmarking sites where readers can share and discover new web pages.
  • Reddit
  • Digg
  • del.icio.us
  • TwitThis
  • Technorati

On Queries with Undistinguished Variables

Friday, December 7th, 2007 · Petr Kremen

After a bit of delay, I want to discuss some new optimizations for queries with undistinguished variables that are being implemented in the new Pellet query engine. For simplicity, I consider just connected conjunctive ABox queries with undistinguished variables.

The current Pellet query engine evaluates the queries with undistinguished variables as follows:

  • the query is rolled-up to each distinguished variable ?x resulting in a concept rollup(?x), instances of which are taken as candidate bindings for ?x
  • Two strategies are provided at this point : (i) optimized (for q. with at-least two distinguished variables)—the query is partially grounded by all bindings for some two distinguished variables and for each such grounding an instance check is performed. In each of the next steps the current variable binding is extended by all possible bindings for one of the remaining variables and an instance check is performed; (ii) simple—for each query grounding with a possible candidate binding for all variables an instance check is performed.

The problem with this approach is that rolling-up an edge PropertyValue(?x,p,?y) to rollup(?x)= ∃ p.T and retrieving instances of this concept in many cases ends-up finding also a valid binding for ?y, which is then thrown away. In the subsequent execution another instance check is often required to check that some binding for ?x and ?y is valid. Thus, it is preferable to evaluate a property value atom with no undistinguished variables using the query engine for distinguished variables instead.

To narrow the gap between evaluation of queries with only distinguished variables and those with also some undistinguished variables I introduce cores (see figure below) of undistinguished variables as the largest connected subqueries induced by non-distinguished variables. Cores are then treated as normal query atoms with signature consisting of neighboring distinguished variables and constants. There are different strategies to evaluate a core (to be described in another post). For now just consider the simple strategy (above), mentioned at the beginning of this post.

Cores example

I stick to the tradition of two sample queries in my posts :

  • “For all people (?X) leading a department tell me at which institution (?Z) they have obtained their degrees.”
    PropertyValue(?X, ub:headOf, _:a), Type( _:a, ub:Department), PropertyValue(?X, ub:degreeFrom, ?Z)
  • “Give me all teaching graduate students (?X) that take a course (?Y) taught by their advisor.”
    Type(?X, ub:GraduateStudent), PropertyValue(?X, ub:takesCourse, ?Y), PropertyValue( _:a, ub:teacherOf, ?Y), PropertyValue(?X, ub:advisor, _:a), PropertyValue(?X, ub:teachingAssistantOf, _:b).

Even for the simple strategy mentioned in the first paragraph we get significant performance improvement. For the first query this evaluation method produces results within 3 s. with just 2 consistency checks, comparing to 30 s. and 21 consistency checks in case of the original Pellet query engine. For the second query we get 23 results within 4 s. with 2 consistency checks. The original Pellet engine didn’t finish in 5 min., having performed by this time hundreds of consistency checks.

Spread the word: These icons link to social bookmarking sites where readers can share and discover new web pages.
  • Reddit
  • Digg
  • del.icio.us
  • TwitThis
  • Technorati

Using Taxonomies for SPARQL-DL optimizations

Friday, November 9th, 2007 · Petr Kremen

Last post I spent showing directions for optimizing SPARQL-DL in general. Now I would like to touch a more particular case of SPARQL-DL queries, namely those that contain class/property variables in ABox part of the query.

I call down-monotonic all the class/property variables ?x that occur in either Type(•,?x), or PropertyValue(•,?x,•) atom. For these variables we can use class/property hierarchies to prune the search. Let’s take an atom Type(i, ?x) (i.e., i is either a binding for variable •, or constant •). If i is not of type C we can safely avoid considering subclasses of C as bindings for ?x ( and analogically for PropertyValue(i, ?x, j) and property hierarchy ).

However, reverse implication doesn’t hold, because of possible interaction with other TBox atoms. Finding a binding C for (down-monotonic) ?x we can’t take all superclasses/superporperties of C as valid bindings. Consider the query Type(i,?x), ComplementOf(?x, not(C)). If C is a subclass of D, it might happen that C is a valid binding for ?x, while D is not.

Of course, the best performance of the optimization is achieved when the search fails for a down-monotonic variables binding that is a root of a deep and complex sub/super hierarchy. Let’s take two sample queries run against LUBM :

  • “Give me all people (?X) together with their type (?A) that are advisors of themselves.”
    Abstract syntax:Type(?X, ?A), SubClassOf(?A, ub:Person), PropertyValue(?X, ub:advisor, ?X)

  • “Give me all people (?X) together with their type (?A) that are teaching assistants of some course (?Y)”
    Abstract syntax:SubClassOf(?A, ub:Person), Type(?X, ?A), PropertyValue(?X,ub:teachingAssistantOf,?Y), Type(?Y, ub:Course).

The first query does not return any result, thus the optimization prunes all subclasses of Person immediately, resulting in reduction of execution time by an order of magnitude (from 3 second to 0.3 seconds). On the other hand, the second query returns nonempty result set, resulting in the significant decrease of the pruning gain (from 3 seconds to 1.5 seconds).

Spread the word: These icons link to social bookmarking sites where readers can share and discover new web pages.
  • Reddit
  • Digg
  • del.icio.us
  • TwitThis
  • Technorati

Optimizing SPARQL-DL

Friday, November 2nd, 2007 · Petr Kremen

To continue with the SPARQL-DL series, I will briefly discuss two implemented cost-based query reordering methods.

The current Pellet engine for conjunctive ABox queries with distinguished variables first explores all query orderings, and then it executes the best one (according to the cost function defined here). This reordering method (let’s call it static) of SPARQL-DL queries suffers from several problems, mainly :


  • As it evaluates each query ordering, even queries consisting of as few as 10 atoms or more are effectively uncomputable.
  • To keep cost evaluation of each query ordering computable in linear time w.r.t. the query length, the cost estimates have to be based on rough, and thus imprecise, estimates of various knowledge base statistics (like the number of instances in given class). Some of these statistics are computationally expensive as they, in general case, may require classification (like estimate for number of subclasses of given class) or realization (like estimate for number of named classes that given instance belongs to) of the knowledge base.

These problems led me to experiment with reordering queries dynamically during their execution. Even the first implemented method—greedy-like search—seems to be promising (contrary to the use of greedy search in most other applications). The algorithm starts with the the list L of all query atoms and in each step the atom with best score in L is evaluated and removed from L. The scoring function I’m using is simple: for a given atom it sums up an estimate of the number of consistency checks that are required to evaluate this atom and the number of new variable bindings to be explored in next steps. This heuristic is linear in the number of atoms and thus cheap to compute.

On the other hand, this scoring function does not overcome the second objection above. It still provides only a rough estimate, and does not reflect the ordering of remaining query atoms at all. However, since costs of relevant atoms are evaluated after each iteration, this issue is less critical than in the static ordering case. My initial experiments prove that the proposed dynamic ordering method is comparable to static ordering on short queries and significantly more scalable on longer ones.

As an example, let’s take a query consisting of 9 atoms :

  • “Give me all graduate students (?X) that takes some course (?Z) together with the teacher of the course (?A) and his/her official position (?B). Filter out those teachers who do not work for the department (?W) the student belongs to.”

    Abstract syntax:Type(?X, ub:GraduateStudent]), PropertyValue(?X, ub:takesCourse, ?Z), Type(?Z, ub:Course), PropertyValue(?X, ub:memberOf, ?W), PropertyValue(?A, ?P, ?W), Type(?A, ?B]), PropertyValue(?A, ub:teacherOf, ?Z), SubPropertyOf(?P, ub:worksFor), SubClassOf(?B, ub:Faculty)

Execution of this query against LUBM (0,1) takes approximately 3 seconds using the dynamic reordering, comparing to almost 30 seconds required in the static ordering case (plus additional approx. 4 seconds for KB consistency check in both cases). Both approaches have comparable execution times on the original LUBM queries. Detailed evaluation and comparison will be shown in a future post.

Spread the word: These icons link to social bookmarking sites where readers can share and discover new web pages.
  • Reddit
  • Digg
  • del.icio.us
  • TwitThis
  • Technorati