Archive for the 'Dealing with Data' Category

Owlgres 0.1: First Release

Wednesday, May 7th, 2008 · Markus Stocker

We are proud to announce the first release, version 0.1 (alpha), of Owlgres, a very scalable OWL reasoner that uses Postgres. It implements DL-Lite, a tractable profile of the upcoming OWL 2 standard. Owlgres supports consistency checking and conjunctive query reasoning services—the latter via SPARQL-DL.

Downloads and documentation can be found at the Owlgres site. For bug reports, feel free to open a ticket on our issue tracking site for Owlgres, which also summarizes the first steps with Owlgres on the Wiki page. There’s a mailing list for discussion and support.

Owlgres is dual-licensed; for open source projects, it’s available under the AGPL v.3. For commercial projects, commercial support licenses are available.

We’d love feedback on Owlgres and encourage people to try it out, play with it, and report bugs, issues, and ideas.

Spread the word: These icons link to social bookmarking sites where readers can share and discover new web pages.
  • Reddit
  • Digg
  • del.icio.us
  • TwitThis
  • Technorati

Owlgres, DL-Lite and Selectivity Optimizations

Monday, April 21st, 2008 · Markus Stocker

I have been blogging about Owlgres, our scalable OWL reasoner for DL-Lite. Recently, we have been playing with DBpedia, in particular the dataset with the Wikipedia Infoboxes. I have previously talked about the features of DL-Lite: querying a large persistent ABox and complete result sets w.r.t. a TBox.

In this post I’d like to underline the usefulness of the selectivity optimizer using a short and concrete example from DBpedia. We query over roughly 24 million triples (in Owlgres metrics 2,198,649 concept assertions and 21,503,343 data role assertions). The query asks for instances of Book and looks like this:


PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
PREFIX yago: <http://dbpedia.org/class/yago/>
PREFIX dbpedia: <http://dbpedia.org/property/>

SELECT ?name WHERE {
?s rdf:type yago:Book106410904 .
?s dbpedia:name ?name .
}

Without the optimizer, the query runs in 61,743 ms and with the optimizer in 14,336 ms.

Why? The TBox contains totally 58,108 concepts and 53,898 of them have no asserted instances (that’s around 93%). This isn’t very surprising since typically individuals are asserted in the leaf concepts of taxonomies.

But it makes a lot of sense to consider this during the query reformulation process in DL-Lite. In fact, without the selectivity optimization, the reformulated set size is 83, i.e. the algorithm reformulates the original conjunctive query into a set of 83 conjunctive queries.

By using some basic statistics (I’m not talking about complicated maths here!) and a trivial procedure to eliminate conjunctive queries that have a known zero selectivity, we can reduce the reformulated set of conjunctive queries from 83 to 16. This makes everybody happy, including Postgres, which returns the result set 4.3 times faster (not huge, yes, but worth the effort). Note that there is no caching here. If caching is involved, the query takes approximately the same time (with and without selectivity optimization, it takes roughly 3 seconds).

I’ll be at WWW2008 this week and I’m happy to chat about DL-Lite with interested people and demo Owlgres on DBpedia. I’ll have a talk on SPARQL Basic Graph Pattern Optimization Using Selectivity Estimation on Thursday, probably the easiest way to spot me. The talk is about some previous work that started in 2006 at the University of Zurich as a diploma thesis and was refined in 2007 at HP Labs with the implementation of the query optimizer for ARQ, the Jena SPARQL query engine.

Spread the word: These icons link to social bookmarking sites where readers can share and discover new web pages.
  • Reddit
  • Digg
  • del.icio.us
  • TwitThis
  • Technorati

Owlgres: A Scalable OWL Reasoner

Sunday, March 23rd, 2008 · Markus Stocker

Hi, I’m Markus Stocker, the new intern at C&P and enjoying time here in DC working on a first project, i.e. Owlgres a DL-Lite implementation for PostgreSQL. DL-Lite is a fragment of expressive DLs with some interesting properties. First, together with the standard reasoning services (e.g. subsumption) it supports conjunctive query answering over an ABox maintained in secondary storage (typically a RDBMS). Second, DL-Lite has some nice computational properties, i.e. the standard reasoning tasks are polynomial in the size of the TBox and query answering is polynomial in data complexity. Third, DL-Lite allows us to expand ABox queries with TBox knowledge and translate the expanded ABox query into SQL queries that are subsequently evaluated over the RDBMS.

More practically, this means for instance that (after tweaking a bit the data) one can load the DBpedia Wikipedia infoboxes and their classification into PostgreSQL as ABox and TBox respectively and then query along the YAGO class hierarchy. For example, DBpedia explicitly states “The Lord of the Rings” as an instance of Book and a corresponding query returns the correct result set. But although Book is a subclass of Publication the result set of a query that asks for a Publication with name “The Lord of the Rings” is empty. Not in OwlGres. Somehow sweet, isn’t it?

The main parts of Owlgres were already implemented before I joined C&P in February. Though, Owlgres suffered from some query performance problems. So my task was to look at how we could possibly optimize the overall query performance.

In a nutshell, DL-Lite query answering includes a reformulation step that encodes the TBox knowledge into the DL-Lite ABox query. This reformulation typically expands a query into a set of (reformulated) queries that are subsequently translated into SQL queries. The overall result set is the UNION of the intermediate result sets for each of the queries in the reformulated set. A good starting point for optimization is, hence, to look at the size of the reformulated set. Is it minimal?

Let’s take a look at a real world case. For instance the LUBM query 9. The query can be seen as the following DL-Lite ABox query

{ Student(?X), Faculty(?Y), Course(?Z), advisor(?X, ?Y), teacherOf(?Y, ?Z), takesCourse(?X, ?Z) }

The PerfectRef reformulation algorithm described by Calvanese et al. returns a reformulated set of queries of size 726 for the LUBM query 9 and, hence, Owlgres executes 726 SQL queries on PostgreSQL to get the result set. The question is, can this set be reduced? The answer? Yes.

One thing that we quickly notice is that a significant number of queries have an empty result set. Further, many queries have result sets that are subsets of other queries. For instance, the result set of the second query in the following example is a subset of the first:

{{ advisor(?X,?Y), teacherOf(?Y,?Z) }, { advisor(?X,?Y), teacherOf(?Y,?Z), Faculty(?Y) }}

Thus, we can drop the second query as it doesn’t add any value to the overall result set. We found that a domain/range simplification practically eliminates such subset queries. Essentially, for each reformulated query we drop query atoms that are redundant constraints. For instance, in the second query above, the Faculty(?Y) query atom is redundant as the domain of the object property teacherOf is specified to be a class of type Faculty in the LUBM TBox. By removing the Faculty(?Y) query atom in the query above the two queries are equal and, hence, we reduce the set by one. So, domain/range simplification is the first optimization.

A DL-Lite ABox query is a set of query atoms which is executed as a conjunctive query. If one of the atoms has zero selectivity, the entire query has zero selectivity. It is, hence, useful to know if query atoms have a zero selectivity as this information allows us to safely drop queries from the reformulated set. This is our second optimization. During the ABox loading process, we gather statistics about the frequency of TBox names (concepts, roles). Whenever a TBox name with zero selectivity is used in a (reformulated) query we drop the query from the resulting set.

To come back to our example LUBM query 9, by applying the optimizations described above we reduce the initial reformulated set of 726 queries to just 2 queries. Pretty amazing, isn’t it? Needless to say, PostgreSQL is happy about this too. Finally, we create a single SQL query of the optimized reformulated set by creating an SQL UNION of the queries in the set. Thus, instead of 726 single queries we execute just one.

It’s time for a couple figures. We expect faster queries, but how much faster are they? We tested the 14 LUBM queries on the University0 dataset (i.e. roughly 100k triples). The timings below are in milliseconds. We first show the timing for the non optimized reformulated set and then the timing for the optimized set.

For instance, the LUBM query 9 takes 18 seconds to be executed without the optimizations and 0.1 seconds with the optimizations. The average difference between the optimized and unoptimized queries is 1,914.43 milliseconds.

We’re going to be demoing Owlgres publicly for the first time at OWLED 2008 in DC in two weeks; if you’re attending OWLED, we’d love to show Owlgres to you. We’ll be saying more in future posts about the kinds of reasoning problems Owlgres and DL-Lite are especially suited for.

Spread the word: These icons link to social bookmarking sites where readers can share and discover new web pages.
  • Reddit
  • Digg
  • del.icio.us
  • TwitThis
  • Technorati

BIANCA, 2007 Semantic Summer, and OWLED

Friday, May 18th, 2007 · Kendall Clark

We’ve been busy with new internal projects, as well as some milestones for some customer work:

First, we’re getting very close to putting an RDF-powered data integration app into production at NASA HQ. There have been Semantic Web projects at NASA, but not very many have been put into production alongside ordinay business apps. If we’re not the first (and we probably aren’t), we’re at least early.

At the Semantic Technology conference next week our customer, Andy Schain will be giving a talk about this app, which is called BIANCA, and another one, called POPS, we’ve been working on for NASA.

BIANCA provides a single, integrated view of information about, including relationships between, applications, servers, network services, networks, and change items for NASA HQ. The integration is over four different data sources, and with this integrated view we’re able to provide some novel analysis services, including the ability to do disaster and other outage planning scenarios based on building dependency graphs of the relationships between BIANCA nodes. From these graphs we can generate outage repair plans (though these are not yet optimal plans, that’s coming in the next version) as well as productivity estimates per hour of downtime.

BIANCA is a Pylons web app (and RESTful web service) in front of an RDF database. (The next version of BIANCA will include live querying of DNS, SNMP, IDS, and other “network fabric” services. I suspect we’ll do something in Sesame by building a new Sail for some of these data services.)

POPS (People Organizations Projects and Skills) is the other app, which will go into user pilot soon: it’s an expertise locator service for NASA civil servants and contractors (all 80,000 of them), which also integrates disparate data sources (this time 6 or 7 of them) using RDF. (For more details, check out the 2006 XTech talk about POPS.)

The interesting bit about POPS is the client user interface, called JSpace, which started off as our clone of mspace, but has since diverged in some non-trivial ways. JSpace translates user input into RDF queries against a data aggregation accessible via HTTP.

JSpace is an example of what the cool kids are calling these days a linked data browser, though we haven’t yet done a good enough job talking about it publicly, so no one really knows anything about it at all. One project for the summer is to get more demo data sets up on our site so people can play around with the webstart version of JSpace.

Second, our first internship program, which we’re calling 2007 Semantic Summer, is already an unexpected success. Honestly, I didn’t think we’d get a single applicant, since we’re new and the program is even newer. But we’ve gotten about 10 so far, several of them from very strong candidates, mostly people working on a PhD in computer science and pursuing a diss topic in Semantic Web.

We’ve accepted four applicants for the summer, and there’s another who may intern with us in the fall. This is all very exciting: they’ll be working on a range of projects, including new stuff for Pellet, the next version of BIANCA, and some of our internal projects.

Third, the 2007 OWLED (the OWL Experience and Directions) Workshop—which we’re proud to sponsor—is coming up very soon, the first week of June, right after ESWC in Innsbruck. I wish more of us were able to attend, but it’s not a cheap or easy trip from DC and we’re swamped with engaging work. We did get two papers accepted; they’ll be presented by our European R&D, i.e., Bijan.

We’ve very excited to see the registration numbers looking good, as well as a very cool program of talks and papers. If you’re into OWL DL, OWLED is the conference.

Finally, watch this space on or around 7 June: we’ll have a couple of announcements to coincide with OWLED which will be worth hearing, especially if you’re into OWL.

Spread the word: These icons link to social bookmarking sites where readers can share and discover new web pages.
  • Reddit
  • Digg
  • del.icio.us
  • TwitThis
  • Technorati

Managing NASA’s Data

Tuesday, April 24th, 2007 · Kendall Clark

NASA has an inconceivably annoying amount and diversity of data (these are good problems to have; but, damn, they’re hard); and the Agency is likely to have more management obligations that it has to satisfy with less budget. But at least it’s only nearly everyone in the world who’s interested in NASA’s core data!

Ouch. Ouch!

This week I’m spending most of my time at a NASA Enterprise Architecture meeting out at UMD; imagine: lots of smart people in a room trying to figure out how to do more—in some cases, a lot more—with less, sometimes a lot less. That sucks in the small, but it’s exciting in the large.

SemWeb mavens will be happy to know that ontologies are playing a growing role in efforts to manage NASA’s information, particularly around the Data Reference Model part of the Federal Enterprise Architecture. That’s a lot of syllables that come down to how the Agency manages its data across its core lines of business. C&P cares about all of this; but especially the integration and policy management pieces.

There are other technologies in use, of course, it’s not an OWL clean sweep; but it should be happy making to see OWL being used for real problems in the federal space that have everything to do with cool stuff like science (aeronautics, cosmology, physics).

Spread the word: These icons link to social bookmarking sites where readers can share and discover new web pages.
  • Reddit
  • Digg
  • del.icio.us
  • TwitThis
  • Technorati