
Stonebraker Explains Oracle's Obsolescence, Facebook's Challenge - mooreds
http://blogs.barrons.com/techtraderdaily/2015/03/30/michael-stonebraker-describes-oracles-obsolescence-facebooks-enormous-challenge/
======
nl
_Look at Facebook, it is one giant social graph, with the problem of how to
find the average distance from anyone to anyone. You can simulate a graph as
an edge matrix, and a connectivity matrix in an array-based system, and you
model graphs in a table system, or you build a special-purpose engine to
implement the graph directly. All three are being prototyped and
commercialized, and the jury is out whether there is room for a new graph
engine or if one of the other technologies would be good enough._

This.

So I'm dealing with this problem. There is _nothing_ out there.

Neo4J doesn't really do in-graph processing[1]. BlazeFB/OrientDB/RDF Stores
are all similar to Neo4J.

Pregel/GraphX/Giraph are graph processing engines, but lack property stores.

I want a single system that does both. I want to run PageRank (etc) and query-
by-property on the same system.
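
To make the ask concrete, here is a toy sketch of the two workloads sharing one in-memory structure. Everything in it is hypothetical (the `TinyPropertyGraph` class, its method names, the sample data); it is not how any of the systems named here work, and it ignores distribution, persistence and indexing entirely.

```python
# Toy illustration only: one structure serving both a property lookup
# and an iterative graph algorithm. All names here are made up.

class TinyPropertyGraph:
    def __init__(self):
        self.props = {}  # node id -> dict of properties
        self.out = {}    # node id -> set of successor node ids

    def add_node(self, node, **props):
        self.props.setdefault(node, {}).update(props)
        self.out.setdefault(node, set())

    def add_edge(self, src, dst):
        self.add_node(src)
        self.add_node(dst)
        self.out[src].add(dst)

    def where(self, **conds):
        """Query-by-property: ids of nodes whose properties match all conds."""
        return sorted(
            n for n, p in self.props.items()
            if all(p.get(k) == v for k, v in conds.items())
        )

    def pagerank(self, damping=0.85, iters=50):
        """Plain power-iteration PageRank over the same adjacency data."""
        nodes = list(self.props)
        if not nodes:
            return {}
        n = len(nodes)
        rank = {v: 1.0 / n for v in nodes}
        for _ in range(iters):
            # Rank held by nodes without out-edges is spread evenly.
            dangling = sum(rank[v] for v in nodes if not self.out[v])
            nxt = {v: (1 - damping) / n + damping * dangling / n for v in nodes}
            for v in nodes:
                if self.out[v]:
                    share = damping * rank[v] / len(self.out[v])
                    for w in self.out[v]:
                        nxt[w] += share
            rank = nxt
        return rank


g = TinyPropertyGraph()
g.add_node("alice", city="Oslo")
g.add_node("bob", city="Oslo")
g.add_node("carol", city="Lima")
g.add_edge("alice", "carol")
g.add_edge("bob", "carol")
g.add_edge("carol", "alice")

ranks = g.pagerank()
oslo = g.where(city="Oslo")              # property query
top_oslo = max(oslo, key=ranks.get)      # combined with the graph result
```

The last two lines are the part I can't get from existing tools without exporting data from one system into another.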

Titan was promising, but they stopped working on it when they were bought.

I'm surprised no one is fixing this.

[1] [http://neo4j.com/blog/categorical-pagerank-using-
neo4j-apach...](http://neo4j.com/blog/categorical-pagerank-using-neo4j-apache-
spark/) note this bit: "I can scale each Apache Spark node to perform parallel
PageRank jobs on independent and isolated processes all consuming a _Hadoop
HDFS file system where the Neo4j subgraphs are exported to_." (ie, Spark runs
against HDFS, not Neo4j)

~~~
herewego
Shameless self promotion: My company is working on an open source distributed
hypergraph database that tackles this problem, called PatternSpace. Written in
C++14, uses Paxos, Cap'n Proto, lots of mechanical sympathy, and is
queryable/traversable via subgraph isomorphism among other methods.

~~~
dunkelheit
Could you provide some pointers to your work? I googled a bit and searched on
github and could not find anything.

------
mrweasel
Isn't the fact that Facebook still manages to make MySQL work an indication
that unless you're doing something special, current SQL servers will support
most users for a very long time?

Sure Facebook might have preferred a better solution, but most of us aren't
Facebook, and the solutions that would work for Facebook might not be ideal
for small companies.

~~~
exelius
I think that it's more an indication that running SQL is a business
requirement, not a technical one. To run your business, you need to be able to
query your data in a number of ways. SQL offers the most flexibility and is
essentially compatible with every reporting tool out there. Your awesome NoSQL
database isn't as awesome if every software package you need to run your
business requires a custom integration layer.

In other words, just because our data sets are exponentially larger doesn't
mean we no longer need to be able to perform complex joins or other
relational operations. The data sets are more complex, but the queries we're
running on them are even more complex. While some other data stores may be
optimal for niche use cases, SQL offers the best mix of speed and flexibility
while still meeting the business requirements.

------
bane
The linked profile on MarkLogic is quite good. [1]

I remember them from _ages_ ago as basically an XML focused database. I'd see
them at trade shows with a small booth. They had interesting technology but
had very engineer-y marketing and not a lot of customers. My company had been
on the lookout for such a technology and IIRC our engineering team checked them
out for a bit, but they weren't a great match for our product and we ended up
using dtSearch instead. [2]

In the last year or two, I've started hearing MarkLogic show up again all over
the place and wondered what was going on. Turns out they got new leadership
(Gary Bloom) and have been making a big push to grow. It's funny how that
happens, I wonder how many other serviceable companies with decent tech are
hiding out just waiting for the right CEO to come along and push them into the
spotlight (I'd also add that 'XML' is no longer one of their marketing
keywords).

1 -
[http://blogs.barrons.com/techtraderdaily/2015/02/13/oracles-...](http://blogs.barrons.com/techtraderdaily/2015/02/13/oracles-
challenge-in-bloom-marklogic-redefines-the-database/)

2 - [http://www.dtsearch.com/](http://www.dtsearch.com/)

------
elgenie
Only four years ago, it was a 'fate worse than death' rather than a mere
challenge:

[https://gigaom.com/2011/07/07/facebook-trapped-in-mysql-
fate...](https://gigaom.com/2011/07/07/facebook-trapped-in-mysql-fate-worse-
than-death/)

------
threeseed
I agree that Oracle's lack of a strategy that incorporates Hadoop is
definitely going to hurt them. With the addition of Spark the Hadoop platform
is looking like being the first choice for analysing data sets whether small
or big. And whilst I am sure SQL has and will continue to be a major part of
that it won't be the only approach. There will be R, PMML, Python, Scala and a
whole lot more.

That aside, Facebook having a buy versus build decision to make seems pretty
strange. What would they even buy, given that Cassandra and HBase, which they
created, are two of the most scalable databases right now? Strange observation.

~~~
Joeri
Oracle partners with Cloudera for their hadoop servers:
[https://www.oracle.com/engineered-systems/big-data-
appliance...](https://www.oracle.com/engineered-systems/big-data-
appliance/index.html)

Hadoop is not what the article is talking about though, nor is Cassandra. The
issue is not about whether it scales out or not, it's about whether it is
memory-based or disk-based. If queries have to pass through a block API at any
point, they can never perform as well as a database where all queries run out
of main memory, like Hana or VoltDB. On block-based systems you can't run some
categories of analytical queries interactively, they always end up being too
slow.

He's saying oracle has a problem because they don't have a good in-memory
story ready. Right now they're aiming at this problem from two directions:
Oracle NoSQL (a distributed KV store comparable to Riak), and Oracle 12c In-
Memory (in-memory engine bolted onto the oracle db). Neither is particularly
convincing to me, and definitely not a match for Hana or VoltDB.

~~~
eitally
I think your comment explains the problem clearly and should be at the top. To
add more bluntness, Oracle's problem is that their typical enterprise
customers who may be interested in, or require, in-memory databases have no
practical option but to abandon Oracle. And if they move away from Oracle for
analytics, what's stopping them from considering moving away for transactional
work as the next step?

------
arjunrc
According to the article, there are three strategies - Traditional Row based
DBMS, Column based ones (Vertica, VoltDB) and MapReduce/Hadoop.

My current employer is making the switch from Row based to Hadoop - which I
feel is because of hype & not justified technically, given the size of our
cluster. The goal is to reduce speed of data delivery to clients, but I
believe a column-based DBMS with an optimized ETL, would be the way to go.

Wonder what it'll look like in 5 years & if other companies' IT departments
are buying into the hype too.

~~~
jbergens
I assume you mean "enhance speed of data delivery to clients". Otherwise you
are trying to make the system slower.

------
orsenthil
What is meant by the _array-based_ system that he is referring to? Can someone
point me to some other basic literature that explains array-based database
systems?

Context from the interview: _Sooner or later, the business intelligence world
will move to the data science world, using things like regression analysis,
Bayesian analysis — these are lots of big words, but all of these techniques,
if you look at them, it’s an array-based, not a table-based calculation._

~~~
sjg007
You stream the arrays (column vectors) to the OLAP engine or data processor
rather than constructing a big join.
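
A rough sketch of what that looks like, in plain Python with made-up numbers. The `rows` data and the `fit_line` helper are assumptions for illustration, not anything from the interview: the point is only that a regression consumes whole columns as vectors, not one joined row at a time.

```python
# Table view: a list of rows, as a row store would hand them back.
# The values are invented for the example.
rows = [
    {"id": 1, "x": 1.0, "y": 3.0},
    {"id": 2, "x": 2.0, "y": 5.0},
    {"id": 3, "x": 3.0, "y": 7.0},
    {"id": 4, "x": 4.0, "y": 9.0},
]

# Array view: each column pulled out once as its own vector.
xs = [r["x"] for r in rows]
ys = [r["y"] for r in rows]


def fit_line(xs, ys):
    """Ordinary least squares for y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


# The computation only ever touches the two column vectors.
slope, intercept = fit_line(xs, ys)
```

In an array or column store the `xs` and `ys` vectors already exist in that shape, so the extraction step at the top goes away.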

------
DonHopkins
No SQL => Not Only SQL => Not Yet SQL

------
mooneater
Can anyone tell me what is fundamentally different between array and table
processing, i.e. for data science?

~~~
applecore
[http://en.wikipedia.org/wiki/Relational_algebra](http://en.wikipedia.org/wiki/Relational_algebra)

~~~
CWuestefeld
I'm not sure how this answers the question. Personally, I'm reasonably
conversant with the ideas of relational algebra, and quite good with
relational databases.

But the article at the link you suggest doesn't even contain the word "array",
so I'm still no closer to understanding how this concept differs from
traditional tables.

------
ExpiredLink
Stonebraker and his predictions - legendary!

