Stonebraker Explains Oracle's Obsolescence, Facebook's Challenge

nl · on June 2, 2015

Look at Facebook, it is one giant social graph, with the problem of how to find the average distance from anyone to anyone. You can simulate a graph as an edge matrix, and a connectivity matrix in an array-based system, and you model graphs in a table system, or you build a special-purpose engine to implement the graph directly. All three are being prototyped and commercialized, and the jury is out whether there is room for a new graph engine or if one of the other technologies would be good enough.

This.

So I'm dealing with this problem. There is nothing out there.

Neo4J doesn't really do in-graph processing[1]. BlazeFB/OrientDB/RDF Stores all are similar to Neo4J

Pregel/GraphX/Giraph are graph processing engines, but lack property stores.

I want a single system that does both. I want to run PageRank (etc) and query-by-property on the same system.

Titan was promising, but they stopped working on it when they were bought.

I'm surprised no one is fixing this.

[1] http://neo4j.com/blog/categorical-pagerank-using-neo4j-apach... note this bit: "I can scale each Apache Spark node to perform parallel PageRank jobs on independent and isolated processes all consuming a Hadoop HDFS file system where the Neo4j subgraphs are exported to." (ie, Spark runs against HDFS, not Neo4j)

herewego · on June 2, 2015

Shameless self promotion: My company is working on an open source distributed hypergraph database that tackles this problem, called PatternSpace. Written in C++14, uses Paxos, Cap'n Proto, lots of mechanical sympathy, and is queryable/traversable via subgraph isomorphism among other methods.

dunkelheit · on June 7, 2015

Could you provide some pointers to your work? I googled a bit and searched on github and could not find anything.

mooreds · on June 2, 2015

Does this get you closer to where you need to be? http://www.kennybastani.com/2015/03/spark-neo4j-tutorial-doc...

nl · on June 2, 2015

No. My reference above is about that technique.

jerven · on June 2, 2015

Have a look at blazegraph it can do both sparql and gremlin. Has support for runninh on clustered gpu.

Was selcted by wikidata when titan was bought.

nl · on June 2, 2015

BlazeFB (wtf?) above was supposed to be BlazeGraph.

mrweasel · on June 2, 2015

Isn't the fact that Facebook still manages to make MySQL work an indication that unless you doing something special, current SQL server will support most users for a very long time?

Sure Facebook might have preferred a better solution, but most of us aren't Facebook, and the solutions that would work for Facebook might not be ideal for small companies.

exelius · on June 2, 2015

I think that it's more an indication that running SQL is a business requirement, not a technical one. To run your business, you need to be able to query your data in a number of ways. SQL offers the most flexibility and is essentially compatible with every reporting tool out there. Your awesome NoSQL database isn't as awesome if every software package you need to run your business requires a custom integration layer.

In other words, just because our data sets are exponentially larger doesn't mean we still don't need to be able to perform complex joins or other relational operations. The data sets are more complex, but the queries we're running on them are even more complex. While some other data stores may be optimal for niche use cases, SQL offers the best mix of speed and flexibility while still meeting the business requirements.

techdragon · on June 2, 2015

No, because both PostgreSQL and MySQL both make excellent key value stores and as a consequence Facebook are sticking with the data storage tools they are used to.

vgt · on June 2, 2015

I think the point of the article is that Facebook is making MySQL work, and that's exceptional. Facebook understands that they would rather not.

bane · on June 2, 2015

The linked profile on MarkLogic is quite good. [1]

I remember them from ages ago as basically an XML focused database. I'd see them at trade shows with a small booth. They had interesting technology but had very engineer-y marketing and not a lot of customers. My company had been on the lookout for such a technology and IIR our engineering team checked them out for a bit, but they weren't a great match for our product and we ended up using dtSearch instead. [2]

In the last year or two, I've started hearing MarkLogic show up again all over the place and wondered what was going on. Turns out they got new leadership (Gary Bloom) and have been making a big push to grow. It's funny how that happens, I wonder how many other serviceable companies with decent tech are hiding out just waiting for the right CEO to come along and push them into the spotlight (I'd also add that 'XML' is no longer one of their marketing keywords).

1 - http://blogs.barrons.com/techtraderdaily/2015/02/13/oracles-...

2 - http://www.dtsearch.com/

elgenie · on June 2, 2015

Only four years ago, it was a 'fate worse than death' rather than a mere challenge:

https://gigaom.com/2011/07/07/facebook-trapped-in-mysql-fate...

threeseed · on June 2, 2015

I agree about Oracle's lack of a strategy that incorporates Hadoop is definitely going to hurt them. With the addition of Spark the Hadoop platform is looking like being the first choice for analysing data sets whether small or big. And whilst I am sure SQL has and will continue to be a major part of that it won't be the only approach. There will be R, PMML, Python, Scala and a whole lot more.

That aside Facebook having a buy versus build decision to make seems pretty strange. What would they even buy given that Cassandra and HBase which they created are two of the most scalable databases right now. Strange observation.

Joeri · on June 2, 2015

Oracle partners with Cloudera for their hadoop servers: https://www.oracle.com/engineered-systems/big-data-appliance...

Hadoop is not what the article is talking about though, nor is Cassandra. The issue is not about whether it scales out or not, it's about whether it is memory-based or disk-based. If queries have to pass through a block API at any point, they can never perform as well as a database where all queries run out of main memory, like Hana or VoltDB. On block-based systems you can't run some categories of analytical queries interactively, they always end up being too slow.

He's saying oracle has a problem because they don't have a good in-memory story ready. Right now they're aiming at this problem from two directions: Oracle NoSQL (a distributed KV store comparable to Riak), and Oracle 12c In-Memory (in-memory engine bolted onto the oracle db). Neither are particularly convincing to me, and definitely not a match for Hana or VoltDB.

eitally · on June 2, 2015

I think your comment explains the problem clearly and should be at the top. To add more bluntness, Oracle's problem is that their typical enterprise customers who may be interested in, or require, in-memory databases have no practical option but to abandon Oracle. And if they move away from Oracle for analytics, what's stopping them from considering moving away for transactional work as the next step?

arjunrc · on June 2, 2015

According to the article, there are three strategies - Traditional Row based DBMS, Column based ones (Vertica, VoltDB) and MapReduce/Hadoop.

My current employer is making the switch from Row based to Hadoop - which I feel is because of hype & not justified technically, given the size of our cluster. The goal is to reduce speed of data delivery to clients, but I believe a column-based DBMS with an optimized ETL, would be the way to go.

Wonder how it'll look like in 5 years & if others companies' IT are buying into hype too.

jbergens · on June 2, 2015

I assume you mean "enhance speed of data delivery to clients". Otherwise you are trying to make the system slower.

orsenthil · on June 2, 2015

What is meant by array-based system that he is referring to ? Can someone point me to some other basic literature that explains array-based database systems?

Context from the interview: Sooner or later, the business intelligence world will move to the data science world, using things like regression analysis, Bayesian analysis — these are lots of big words, but all of these techniques, if you look at them, it’s an array-based, not a table-based calculation.

sjg007 · on June 2, 2015

You stream the arrays (column vectors) to the OLAP engine or data processor rather than constructing a big join.

DonHopkins · on June 2, 2015

No SQL => Not Only SQL => Not Yet SQL

mooneater · on June 2, 2015

Can anyone tell me what is fundamentally different between array and table processing ie. For data science?

baking · on June 2, 2015

Do you understand the differences and applications of "row stores" versus "column stores"? If SQL databases are optimized for the retrieval of entire rows or records, column stores are optimized for the storage and analysis of entire columns in a database (similar data types allows for greater compression and bringing much more data into memory.)

I don't know if anyone has implemented a database optimized for array processing, but the benefit for predictive modeling should be obvious. I'm sure optimization for GPU processing wouldn't hurt either.

CWuestefeld · on June 2, 2015

I'm wondering the same thing. I'm just not seeing how a table is different from a 2D array. I assume that I'm getting stuck in the conventional programming language definition of "array", and there's some finer point as applied to database theory.

applecore · on June 2, 2015

http://en.wikipedia.org/wiki/Relational_algebra

CWuestefeld · on June 2, 2015

I'm not sure how this answers the question. Personally, I'm reasonably conversant with the ideas of relational algebra, and quite good with relational databases.

But the article at the link you suggest doesn't even contain the word "array", so I'm still no closer to understanding how this concept differs from traditional tables.

ExpiredLink · on June 2, 2015

Stonebraker and his predictions - legendary!