

Heart: a planet-scale RDF data store - helwr
http://wiki.apache.org/incubator/HeartProposal

======
jerf
I feel like this page is using "map-reduce" as magic juice. The problem
described is enormous, and saying you're going to use map-reduce in this case
is somewhat akin to saying you're going to be using CPUs. Yes, but so what?
That doesn't magically move you to "feasible".

I also get the sense that, far more than usual, this is technology without a
use case. I'm not saying it's useless or that the idea is bad, just that I
followed a couple of links and didn't find an even cursory discussion of what
this is for or why it is better than other things, existing or otherwise.

These aren't good signs.

~~~
blasdel
It's a proposed original ASF project -- the point is to churn some
bureaucracy, not produce something people want.

------
evgen
Ouch. A directed graph in a columnar store like HBase? I guess there are worse
approaches, but I would like to see a justification for this other than "we
have this hammer called Hadoop..."

~~~
blasdel
Got any better solutions? I don't. At least it ain't trying to do it
relationally!

Tuplestores are one of the only kinds of data store I can think of that don't
actively fight you when you try to query EAV data or traverse directed graphs.

~~~
epochwolf
If it's a directed graph, why not use a graph database?

~~~
blasdel
Tuplestores are one of the most common backend formats for storing graphs —
you only need triples, or quads if you want timestamps. Everything's stored as
(Entity, Attribute, Value) in the same table, with all of the Es and As being
references, and with the Vs as references when the tuple is a graph edge but
as atoms when it's a graph node.

Pretty much every existing RDF database is a tuplestore underneath, or abuses
a relational database as a crappy tuplestore that won't get you fired for
installing it.
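
To make the (Entity, Attribute, Value) layout described above concrete, here is a minimal Python sketch. It only illustrates the idea and is not how Heart or any particular RDF store is implemented; the `TripleStore` class, its method names, and the sample identifiers are all hypothetical.

```python
from collections import defaultdict


class TripleStore:
    """Toy in-memory tuplestore: every fact is one (entity, attribute, value) row."""

    def __init__(self):
        self.triples = set()
        # Index on (entity, attribute) so following an edge is a dict lookup.
        self.by_ea = defaultdict(set)

    def add(self, entity, attribute, value):
        self.triples.add((entity, attribute, value))
        self.by_ea[(entity, attribute)].add(value)

    def values(self, entity, attribute):
        """Return all values stored for one entity/attribute pair."""
        return set(self.by_ea.get((entity, attribute), ()))

    def reachable(self, start, attribute):
        """Follow `attribute` edges transitively from `start` (graph traversal)."""
        seen, stack = set(), [start]
        while stack:
            node = stack.pop()
            for nxt in self.by_ea.get((node, attribute), ()):
                if nxt not in seen:
                    seen.add(nxt)
                    stack.append(nxt)
        return seen


store = TripleStore()
# The value is a reference to another entity, so these rows are graph edges.
store.add("book:1", "written_by", "author:1")
store.add("author:1", "influenced_by", "author:2")
store.add("author:2", "influenced_by", "author:3")
# The value is an atom (a plain literal) rather than a reference.
store.add("book:1", "title", "Some Title")

print(store.reachable("author:1", "influenced_by"))
```

Everything lives in one table-like set of rows, and traversing the graph is just repeated lookups on the same index, which is the property the comment above is pointing at.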

------
chime
There is so much useful and well-organized information in RDF format out there
that is going completely unused. Check out <http://wiki.dbpedia.org/Datasets>,
which includes almost the entire Wikipedia in a nice, organized way. Using
SPARQL/SNORQL ( <http://dbpedia.org/snorql/> ) you can actually find
everything that Bart Simpson wrote on the chalkboard during the opening
sequence in season 12:
<http://www.snee.com/bobdc.blog/2007/11/querying-dbpedia.html>

However, searching through RDF is a pain and not a small task. The tools
available to manage 50GB worth of RDF data are very basic and inefficient. I
used a MySQL-based RDF store and even on my extremely powerful server, doing a
simple query choked the machine. I'm hoping with a better data store, we can
now manage RDF data better and hopefully search it better. Think of how much
useful information there is on Wikipedia, Powerset.com etc. I don't mean just
the text, but rather the extremely well-organized human-edited information
boxes on the right that contain categories, hierarchical data, and
specifications. Using RDF, all of that could be instantly queryable.

I would love to write an app that queries a Google RDF store for local
listings or map information. APIs are flat and tabular. I can only query
Amazon for specific search words in items. Right now, I can't say with one
query: "show me all ratings for every book that was written by the authors who
currently have a book in the top 100 books list." RDF makes that possible. The
problem right now is that RDF stores suck and searching through them isn't
easy for the average developer. Here's hoping Apache makes it possible and
scalable.

~~~
elblanco
> "show me all ratings for every book that was written by the authors who
> currently have a book in the top 100 books list."

Why not? That seems like a pretty basic SQL query to me, with one nested
query. I think we even did something almost exactly like that in my undergrad
database management class in the first few weeks.

Something like (sorry, I'm the world's worst SQL guy):

    SELECT title, ratings FROM books WHERE author IN (SELECT author FROM
    top100books);

If I just had one table with the sales figures, titles and ratings in it, I
could even derive the answer in one query also.

I'm sure my syntax is all wrong and messed up, but that's not the point. The
point is most of the notional "only RDF could make this possible!" examples
I've heard are pretty much completely doable in one or two queries in SQL, and
not even particularly complicated ones at that.

~~~
epochwolf
_> "show me all ratings for every book that was written by the authors who
currently have a book in the top 100 books list."

Why not? That seems like a pretty basic SQL query to me, with one nested
query. I think we even did something almost exactly like that in my undergrad
database management class in the first few weeks._

That's more like (sorry, I'm bad at SQL too):

    
    
      SELECT ratings.*, books.name, authors.name
        FROM ratings
        JOIN books ON ratings.book_id = books.id
        JOIN authors ON books.author_id = authors.id
        WHERE authors.id
          IN (
           SELECT books.author_id
           FROM books
           JOIN ratings ON books.id = ratings.book_id
           ORDER BY ratings.score DESC LIMIT 100
          )
    

You better hope that sucker uses indexes or it's going to be a few minutes.
And you better cache that because you don't want that running every time
someone hits the recommended reads.

SQL can be pretty scary for this stuff. It would be even worse if you allowed
a many-to-many relationship between books and authors.
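
For anyone who wants to actually run a query of this shape, here is a hedged sketch using Python's built-in `sqlite3` module. The schema (`authors`, `books`, `ratings`), the sample rows, and `TOP_N = 2` standing in for "top 100" are all assumptions made up for illustration. "Top list" is approximated here as the highest-scored rating rows, and not every SQL engine accepts `LIMIT` inside an `IN` subquery the way SQLite does.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE authors (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE books (id INTEGER PRIMARY KEY, name TEXT,
                        author_id INTEGER REFERENCES authors(id));
    CREATE TABLE ratings (id INTEGER PRIMARY KEY,
                          book_id INTEGER REFERENCES books(id),
                          score INTEGER);
    -- Indexes on the join columns, so the joins do not need full scans.
    CREATE INDEX idx_books_author ON books(author_id);
    CREATE INDEX idx_ratings_book ON ratings(book_id);

    INSERT INTO authors VALUES (1, 'Ann'), (2, 'Bob'), (3, 'Cy');
    INSERT INTO books VALUES (1, 'A1', 1), (2, 'A2', 1), (3, 'B1', 2), (4, 'C1', 3);
    INSERT INTO ratings VALUES (1, 1, 5), (2, 2, 2), (3, 3, 4), (4, 4, 1);
""")

TOP_N = 2  # hypothetical stand-in for the thread's "top 100"

rows = conn.execute("""
    SELECT books.name, authors.name, ratings.score
      FROM ratings
      JOIN books   ON ratings.book_id = books.id
      JOIN authors ON books.author_id = authors.id
     WHERE authors.id IN (
           -- Authors of the TOP_N highest-scored ratings.
           SELECT b.author_id
             FROM books AS b
             JOIN ratings AS r ON b.id = r.book_id
            ORDER BY r.score DESC
            LIMIT ?
     )
     ORDER BY books.name
""", (TOP_N,)).fetchall()

for row in rows:
    print(row)
```

With the sample data, the two best-rated books belong to Ann and Bob, so the result lists every rating for all of their books (including Ann's poorly rated second book) and none of Cy's.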

~~~
elblanco
Like I said, I'm the world's worst SQL guy. But the good news is that I
haven't invalidated my point by my sloppy syntax.

So this raises two questions...

1) Who doesn't use indexes these days? Even light solutions like SQLite
support them.

2) The only real remaining advantage to RDF is that the data resources can sit
on the end of URLs, if I remember correctly. I think this is mainly because
nobody has put much effort into dangling indexed relational tables at the end
of a URL. There's nothing that says I couldn't have:

    
    
      SELECT ratings.*, books.name, authors.name
          FROM http://someplace.com/booklists.ratings AS ratings
          JOIN http://someplaceelse.net/booksales.books AS books ON ratings.book_id = books.id
          JOIN http://yetsomeotherplace.org/publisherdata.authors AS authors ON books.author_id = authors.id
          WHERE authors.id
            IN (
             SELECT books.author_id
             FROM books
             JOIN ratings ON books.id = ratings.book_id
             ORDER BY ratings.score DESC LIMIT 100
            )
    

Or some such...sure it's sloppy and it's slowish, but aren't those the same
problems as with RDF? And yet we're still using SQL to do this.

------
mattyb
Excited mark_l_watson?

~~~
mark_l_watson
No. I look at this site occasionally, and nothing seems to get done.

Also (pardon me if I take your good joke seriously :-) I don't see why we need
planet-scale RDF data stores. Instead, we need a very large number of small
SPARQL endpoints with some form of trust mechanism and a discovery mechanism.

~~~
mattyb
:-)

------
epochwolf
So... someone is making a massive XML database out of some other components?

~~~
blasdel
RDF is _absolutely not_ XML.

It's a directed graph, not a tree. It died primarily because the dimwits at
the W3C saddled it with an atrocious XML serialization, but the semantic web
bullshit didn't help, and neither did the poor use of it by Mozilla and others
that led to it being used ineptly for RSS.

~~~
epochwolf
Thanks, I wasn't clear on what RDF was. I skimmed the Wikipedia article and I
thought it was just a variant of XML.

------
Raphael
Skynet.

