Billions and billions…
by Bijan Parsia
First Franz got the distribution rights to Racer, now they are taking on RDF stores (PDF) with the new version of AllegroCache. Hey, they’ve got a Prolog in there, too.
The “white paper” entitled Processing Billions of RDF Knowledge Triples Made Possible with AllegroCache had me giggling. (It seems somehow associated with Dr. Dobb’s.) I should go through it point by point, but let me just hit two:
However, as shown previously, the number of RDF triples can easily grow into millions or even billions for real-world applications, making it difficult to process efficiently with traditional means. Small wonder there is scarcely any deployment of practical RDF applications.
Yes, they are suggesting that unless you can scale to millions or even billions, you can’t deploy a practical RDF application. Sigh. Perhaps this is an indictment of the proliferation that being restricted to triples entails…who knows!
Plus, I’d like to see which other disk-based systems are known to be an order of magnitude slower than their 10,000 triples/sec (for less than 200 million triples) and 4,000 triples/sec (for over 200 million triples). These are quite nice times, but it’s still taking, what, 5 hours or so? Everything else is an order of magnitude worse?
Please post the study.
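For what it’s worth, here is a back-of-the-envelope check of that “5 hours or so”. It is only a sketch: the white paper quotes two rates, and I am assuming (my assumption, not theirs) that the rate switches exactly at 200 million triples and stays flat on either side. The function name and parameters are made up for illustration.

```python
def load_seconds(total_triples,
                 fast_rate=10_000,         # triples/sec quoted below 200M
                 slow_rate=4_000,          # triples/sec quoted above 200M
                 threshold=200_000_000):   # assumed switch-over point
    """Estimate load time under a piecewise-constant rate (an assumption)."""
    fast_part = min(total_triples, threshold)
    slow_part = max(total_triples - threshold, 0)
    return fast_part / fast_rate + slow_part / slow_rate


for n in (200_000_000, 1_000_000_000):
    hours = load_seconds(n) / 3600
    print(f"{n:,} triples: about {hours:.1f} hours")
```

Under those assumptions, 200 million triples come out at roughly five and a half hours, and a full billion at well over two days, so “billions” is not an afternoon’s work even at the quoted rates.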
(Oh, ok…one more:
It includes an expressive query language, RDF Prolog, particularly suited for graph search and graph matching over an RDF network. It can find semantic relations between RDF nodes automatically, using complex Prolog clauses as needed without speed degradation.
So you are going to layer Horn clauses on top of billions of triples and show NO performance impact?
Oh, they seem to have a perfect index:
This remarkable result is achieved with 8 indices, which adds a disk storage overhead of about 300 bytes per triple but offers unparalleled performance.
Like, Kowari and YARS and the GOM store. Yum.)
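To put the “about 300 bytes per triple” of index overhead in perspective, here is a small sketch of what it adds up to. The only input taken from the white paper is the 300-byte figure; the function name is hypothetical, and I am counting decimal gigabytes and ignoring the size of the triples themselves.

```python
def index_overhead_gb(triples, bytes_per_triple=300):
    """Index overhead in decimal GB, using the quoted ~300 bytes/triple."""
    return triples * bytes_per_triple / 1e9


for n in (200_000_000, 1_000_000_000):
    print(f"{n:,} triples: about {index_overhead_gb(n):.0f} GB of index overhead")
```

So a billion triples would carry on the order of 300 GB of index overhead alone, if the quoted per-triple figure holds at that scale.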
February 22nd, 2006 at 4:04 am
Based on the white paper, I would say this is an impressive piece of work. The white paper is a bit low on details though, and leaves me with many questions. Does the reported storage time include any inferencing? Did they use fully ACID-compliant transactions (apparently, this is optional)? Considering that the Lehigh University Benchmark was used for scalability testing, what are the results of the queries included with this benchmark?
Further, the reported 4000 triples/sec for data sets larger than 200M seems a bit arbitrary. This figure will likely go down as data sets grow. Also, performance depends on lots of factors, including the specifics of the data set that was used (see also: Pitfalls in Benchmarking Triple Stores).
The low-level API that is mentioned surprises me a bit: it doesn’t include any operations for removing triples! Does this mean that you can’t remove anything from their database?
Anyway, an interesting read, but it lacks a lot of details. Dr. Dobb’s probably constrains the length of the article too much to be able to include such details. This white paper gives some more info on the non-RDF parts of AllegroCache.