
I Dreamed of a Perfect Database - marcopolis
https://newrepublic.com/article/124425/dreamed-perfect-database
======
jandrewrogers
Some perspective on this subject:

\- A graph database is essentially a relational database plus join recursion.
Many modern relational databases support join recursion, so support for graph
models is ubiquitous; those systems just don't call themselves "graph
databases". There are other reasons graph-like computation is relatively rare.
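For instance, a plain relational engine can express a graph traversal as a
recursive join. A minimal sketch in Python with SQLite (the table and data are
invented for illustration):

```python
import sqlite3

# A graph stored as an ordinary relational edge table.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE edges (src TEXT, dst TEXT)")
conn.executemany("INSERT INTO edges VALUES (?, ?)",
                 [("a", "b"), ("b", "c"), ("c", "d")])

# "Join recursion": a recursive common table expression repeatedly
# self-joins the edge table to find everything reachable from 'a'.
rows = conn.execute("""
    WITH RECURSIVE reach(node) AS (
        SELECT 'a'
        UNION
        SELECT e.dst FROM edges e JOIN reach r ON e.src = r.node
    )
    SELECT node FROM reach
""").fetchall()

print(sorted(n for (n,) in rows))  # ['a', 'b', 'c', 'd']
```

No dedicated graph database is involved; the traversal is just a recursive
join over a relation.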

\- There are some important types of relationships that are effectively not
representable in graph data models. Relationships with a topological nature,
such as negative constraints or spatiotemporal relationships, are notoriously
problematic.

\- General graph-like operations have terrible scalability and performance
characteristics as commonly implemented. Consequently, people almost never
organize their data models this way unless it is absolutely unavoidable. Even
Facebook materializes common traversals as non-graphs so that they do not have
to execute them dynamically. The heavy reliance on secondary indexing for
performance in most graph databases ensures they will be marginal for large-
scale data models.

\- Semantic Web fails at scale because no two people map reality to
definitions in the same way. This is particularly obvious at global scales
because of cultural influences on how we interpret the world. If you've ever
done global data model integrations, and I have, you quickly realize that the
only values that are approximately consistent globally are physics-based
measurements e.g. "650nm wavelength" (instead of "red"). The nodes have no
common definition in practice, just commonly overlapping parts of countless
subjective interpretations of the definition, which leads to pollution of the
data model in real systems.

~~~
espeed
Hi Andrew - Have you explored holographic algorithms/encodings that use shape
to represent elements such that you can succinctly and uniquely define an
object/relation in terms of the composition of its constituent parts in a way
that enables fast decomposition back into its constituent parts?

~~~
wolfgke
> Have you explored holographic algorithms/encodings[...]

As the term is used in mainstream science, holographic algorithms are
something entirely different; see
[https://en.wikipedia.org/wiki/Holographic_algorithm](https://en.wikipedia.org/wiki/Holographic_algorithm).

~~~
espeed
Vladimir Kornyak touches on some of these ideas in these papers:

1\. On Compatibility of Discrete Relations (2005) [http://arxiv.org/pdf/math-
ph/0504048.pdf](http://arxiv.org/pdf/math-ph/0504048.pdf)

2\. Structural and Symmetry Analysis of Discrete Dynamical Systems (2010)
[http://arxiv.org/pdf/1006.1754.pdf](http://arxiv.org/pdf/1006.1754.pdf)

3\. Discrete Dynamical Models: Combinatorics, Statistics and Continuum
Approximations (2015)
[http://mmg.tversu.ru/images/publications/2015-vol3-n1/Kornya...](http://mmg.tversu.ru/images/publications/2015-vol3-n1/Kornyak-2015-01-05.pdf)

On a somewhat related note -- applying quantum principles to graph database
processing -- see this new paper by Marko Rodriguez (the creator of the
Gremlin graph programming language):

4\. Quantum Walks with Gremlin (2015)
[http://arxiv.org/pdf/1511.06278v1.pdf](http://arxiv.org/pdf/1511.06278v1.pdf)

------
iamsohungry
Graph databases are exciting, but I'm far more interested in the potential of
append-only stores. Rather than recording current state, you store events
(item added, item deleted, etc.).

This allows auditability and for you to look back at the state of the data at
any time, but the largest benefit, in my opinion, is that it decouples data
from its data structure. This allows you to treat data structures like
"caches" that are efficiently structured for how they will be used. If you
want, you don't have to choose between relational databases or graph databases
or anything else: you can play the same set of events into different
structures and query the appropriately structured database for the kind of
query you're doing. It also allows you to implement security at the data
storage level in a very simple and granular way: you can reject events based
on predicates that update themselves as permissions change, and distribute
filtered streams of events to users based on what they are allowed to see.
Overall, this approach is very powerful.
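A minimal sketch of the idea (event shapes and projection names here are
invented): the same append-only log replayed into two differently shaped
"caches".

```python
# An append-only event log: facts about what happened, not current state.
events = [
    {"type": "added",   "id": 1, "name": "apple"},
    {"type": "added",   "id": 2, "name": "pear"},
    {"type": "deleted", "id": 1},
]

def project_as_state(log):
    """Replay events into a key/value view of current state."""
    state = {}
    for e in log:
        if e["type"] == "added":
            state[e["id"]] = e["name"]
        elif e["type"] == "deleted":
            state.pop(e["id"], None)
    return state

def project_as_audit(log):
    """Replay the same events into an audit trail instead."""
    return [(e["type"], e["id"]) for e in log]

print(project_as_state(events))  # {2: 'pear'}
print(project_as_audit(events))  # [('added', 1), ('added', 2), ('deleted', 1)]
```

Either projection can be thrown away and rebuilt from the log, which is why
the comment treats data structures as disposable caches.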

~~~
shotgun
Datomic much? :)

~~~
natrius
For folks who are interested, some other examples are the Kafka/Samza
ecosystem and blockchains like Ethereum.

------
lgas

      All of these tables relate to one another
    
      Where all this gets particularly interesting is wherever,
      by traversing the relations between table...
    
      These are the merits of the “relational” database. 
    

That's not what "relational" in "relational databases" means.

He even links to Codd's paper which defines a "relation" explicitly:

    
    
      The term relation is used here in its accepted mathematical sense.
      Given sets S1, S2, ..., Sn (not necessarily distinct), R is
      a relation on these n sets if it is a set of n-tuples each of which
      has its first element from S1, its second element from S2, and so on.
      ...
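Codd's definition translates almost directly into code; a tiny sketch (the
domains and tuples here are invented for illustration):

```python
# A relation in Codd's sense: a set of n-tuples whose i-th elements
# are drawn from domain Si. Here n = 2.
S1 = {"alice", "bob"}   # domain of names
S2 = {"eng", "sales"}   # domain of departments

R = {("alice", "eng"), ("bob", "sales")}  # a relation on S1 x S2

# The defining property: each tuple's first element is from S1,
# its second element from S2.
assert all(name in S1 and dept in S2 for (name, dept) in R)
```

A table in a relational database is just such a set of tuples; the
"relations" are the tables themselves, not links between them.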

------
marknadal
Sorry, this is a terrible article. It reads more as a stream of consciousness
about his own mistakes and sad regrets, and does not add any new or
interesting insight into anything regarding databases.

There are plenty of new cool things we can do now - and the graph database
we're building ( [http://github.com/amark/gun](http://github.com/amark/gun) )
is doing several of those things. Like realtime by default, like what Firebase
and RethinkDB are starting to push. Fully decentralized and fault tolerant,
like what Riak and Cassandra tried to do. Graph (and not just triples) so you
can have _relational documents_ as ArangoDB and Neo4j allow. And absolutely
totally offline-first, like what Pouch/CouchDB wanted to pioneer.

The truth is, the world of databases is only getting better and more exciting
- in large part due to new algorithms like CRDTs that push what we can do. GUN
is one player in that space, along with others. But this article? Just
depressing, and it overlooks all the new opportunities.

~~~
lobster_johnson
I often see you plug your project, Gun, whenever someone posts about
distributed databases. You always write about how great it is. But every time
I look at your repo, it's still the same few hundred lines of, I'm sorry,
quite shoddily written JavaScript, with huge chunks clearly being temporary
placeholders or even commented out. You even have scratch test data in there,
intermingled with your main code. Your web site also makes some grandiose
claims about how it solves certain synchronization problems, but I can't find
any technical explanation of exactly how it is supposed to work.

The lack of code or documentation means it's very hard to take your comments
seriously. But please correct me if I'm mistaken.

~~~
ahachete
I attended @marknadal's talk at Distributed Matters Berlin
([https://2015.distributed-
matters.org/ber/speakers/#284925727...](https://2015.distributed-
matters.org/ber/speakers/#2849257271562)). It promised a lot.

After the talk, I was left with many unclear messages, lots of promises, few
detailed explanations, IMVHO many flawed statements, and only one clear
outcome: gun is a "script" (I wouldn't call it a database) that offers
"conflict resolution" based on the data's lexicographical order (!!!).

Probably I just understood everything wrong, or I'm simply too inexperienced
to appreciate gun. My apologies in advance.

~~~
marknadal
I'm sad to hear I did a poor job explaining the concepts, thank you for this
honest review. Next time around, I'll see if I can be more clear.

One minor detail - while lexical sort is used in the conflict resolution
algorithm, it is only part of the equation (see
[https://github.com/amark/gun/wiki/Conflict-Resolution-with-G...](https://github.com/amark/gun/wiki/Conflict-Resolution-with-Guns) ).
As I mentioned in my talk, the lexical ordering is intentionally naive because
I rely upon deterministic behavior ( _not_ PAXOS, RAFT, consensus, or gossip
protocols), which guarantees convergence.
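This is not gun's actual algorithm, but a toy sketch of the general idea: a
deterministic merge rule (latest timestamp wins, lexical order as tiebreak)
that every replica can apply independently, in any order, and still agree.

```python
# Toy sketch, NOT gun's real conflict-resolution algorithm.
# Writes are (timestamp, value) pairs; Python tuple comparison means
# the later timestamp wins, with lexical order of the value as a
# deterministic tiebreak for concurrent writes.
def merge(a, b):
    return max(a, b)

w1 = (5, "blue")
w2 = (5, "green")   # concurrent write at the same logical time

# Every replica applies the same pure rule and converges without any
# consensus round:
assert merge(w1, w2) == merge(w2, w1) == (5, "green")
```

The point is that the rule needs no communication between peers to decide a
winner; determinism substitutes for coordination.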

~~~
ahachete
Maybe it was just me. However, if my feedback helps, I'd say that there was a
lack of technical detail about the things mentioned. They were covered only
from a superficial point of view, and many different complex issues were
lumped together and treated lightly.

By the way, why do you so readily discard consensus protocols in favor of
conflict resolution?

~~~
marknadal
What is the best medium for presenting more technical details? In my talk I
attempted to use stories (a very layman's approach) to explain algorithms -
this may have been too high level. So instead, what is easiest for you to
digest and verify? Actual code and working samples/demos that prove the
behavior? Mathematics? Case studies by large customers?

\-------

I dislike consensus protocols because they are difficult to scale;
deterministic algorithms are not. Why? Consensus requires communication, and
communication takes time. As you add more peers, you have to do more
communication in order to maintain consensus. But as more communication
occurs, things bottleneck and get even slower. However, _some problems require
consensus_ (finances, traditionally), and for those GUN would not be a good
choice.

Deterministic algorithms only need the same inputs and they then spit out the
same output as any other machine (running that algorithm) anywhere in the
universe. This is why "immutable data structures" are all the rage lately.
This is incredibly scalable because as long as the inputs are received (which
might have been sent slowly over the network) then all the databases will
converge to the same result (for the same inputs) regardless of whatever
current state those databases might be in. And because a database maintains
state, this type of guarantee is really important for databases.

WARNING NOTE: Missing inputs cause databases to be temporarily out of sync
because their inputs differ; however, when the inputs are finally received
(via retries or the like), the databases will sync up _regardless_ of the
ordering. Making sure the inputs can be sent in any order, or retried, is the
idea behind idempotency, and it is explored further with CRDTs and
commutativity. This means GUN is "Eventually Consistent" rather than Strongly
Consistent, so don't use GUN where strong consistency is required.
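A minimal illustration of that order-independence, using a grow-only set in
the style of the simplest CRDT (the class and data here are invented):

```python
# A grow-only set: add is idempotent and commutative, merge is set
# union, so replicas converge no matter how inputs are ordered,
# duplicated (retried), or delayed.
class GSet:
    def __init__(self):
        self.items = set()

    def add(self, x):
        self.items.add(x)          # re-adding is a harmless no-op

    def merge(self, other):
        self.items |= other.items  # union is order-independent

a, b = GSet(), GSet()
for x in ["x", "y", "z"]:          # one delivery order
    a.add(x)
for x in ["z", "x", "x", "y"]:     # different order, with a retry
    b.add(x)

assert a.items == b.items == {"x", "y", "z"}
```

Once any missing inputs arrive, both replicas hold the identical set; nothing
about their intermediate states matters.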

------
jeffdavis
"without the motive power of capital, things move slowly"

 _slowly_?! We've amassed huge collections of detailed knowledge in mere
decades, and it's all searchable and discoverable.

"not one big pool of knowledge"

The real fantasy is centralization and taxonomies of everything. Distributed
isn't a lazy thing we've settled on; it's the best approach. Each island grows
about as far as it makes sense, and connects to other islands where it makes
sense.

Sure, it may be imperfect and some datasets are hidden in corporations, but
open data is certainly a healthy and growing ecosystem in no danger of
extinction.

~~~
anon4
> The real fantasy is centralization and taxonomies of everything

Precisely. Currently there are plenty of graph-based databases to choose from
- GraphDB, ArangoDB, Neo4j, Allegro, etc etc. They are all pretty good, and
some also support rule-based inferencing, i.e. you can put a rule in the
database like "if X is a person and belongs to team Y and the manager for team
Y is Z, then Z is X's manager".

Systems like that have a few disadvantages:

1\. They implicitly trust all data you put in them. If, in the above example,
you add Z as a member of team Y, then the system will infer that Z is Z's own
manager.

2\. You cannot assert negatives, nor can you search based on negative
predicates.

The first is the absolute killer for such technologies if you try to deploy
them internet-wide and gather "data" from any random source publishing RDF
documents.
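A toy version of that inference rule and its failure mode (all names and data
here are invented):

```python
# Rule from the comment: if X belongs to team Y and Z manages team Y,
# then Z is X's manager. The rule trusts every fact unconditionally.
members = {("x", "team_y")}    # (person, team) facts
managers = {("team_y", "z")}   # (team, manager) facts

def infer_managers(members, managers):
    mgr = dict(managers)
    return {(person, mgr[team])
            for (person, team) in members if team in mgr}

assert infer_managers(members, managers) == {("x", "z")}

# Add Z as a member of their own team -- the rule, trusting the new
# fact, happily concludes that Z manages Z:
members.add(("z", "team_y"))
assert ("z", "z") in infer_managers(members, managers)
```

With data harvested from arbitrary internet sources, every bogus fact feeds
straight into such inferences, which is the "absolute killer" above.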

~~~
alex_hirner
ad 1: why not attach a credibility score to relations (probably aggregated
from multiple sources) and query against those too?

~~~
wolfgke
What inference rules for unreliable data do you suggest?

~~~
ZenPsycho
is there something wrong with inductive logic?
[https://en.wikipedia.org/wiki/Inductive_reasoning](https://en.wikipedia.org/wiki/Inductive_reasoning)

~~~
nl
Inductive logic needs an implementation. These are things like Bayes Nets,
Inference using Gibbs sampling, PSL, etc.

You can sorta, mostly get them to work if you have a clean graph, and are
prepared to spend a long time debugging.

The problem is that no one has managed to get them to work well enough to be
useful on dirty, web-scale knowledge graphs.

------
hcarvalhoalves
> There is as yet no absolute challenger to the relational model. When people
> think database, they still think SQL. But if there is a true challenger, it
> is in the graph model. Because graph data structures power social networks,
> and social networks are the dominant technological organism of the era.

You can use a fact database model (like Datomic's), and then you don't have
to choose between row, column, document, and graph databases; those are all
projections on top of some index.

If entities are universally unique and attributes are namespaced, you can get
a "database of everything". Otherwise you need at least a mapping between
entities from different vendors. That's the hard part :)
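A rough sketch of the fact-store idea (the attribute names and projection
helpers are made up, loosely Datomic-flavored, not Datomic's actual API):

```python
# Facts as (entity, attribute, value) tuples with namespaced attributes.
facts = [
    ("e1", "person/name",  "Ada"),
    ("e1", "person/knows", "e2"),
    ("e2", "person/name",  "Alan"),
]

def as_rows(facts):
    """Project facts into a row/document-shaped view keyed by entity."""
    rows = {}
    for e, a, v in facts:
        rows.setdefault(e, {})[a] = v
    return rows

def as_graph(facts):
    """Project the same facts into an edge list (a graph-shaped view)."""
    return [(e, a, v) for e, a, v in facts if a == "person/knows"]

assert as_rows(facts)["e1"]["person/name"] == "Ada"
assert as_graph(facts) == [("e1", "person/knows", "e2")]
```

Both views are derived from one set of facts, so "row vs. graph" becomes a
question of which projection you query, not how you stored the data.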

------
smacktoward
I'm old enough to remember being excited about FOAF too.

The idea came out of the world of blogs; if you have a blog, so the thinking
went, you basically have an online identity endpoint. So if there was a simple
way to describe relationships between those endpoints, you could write
software to follow those relationships, and then... _voila!_ A completely open
social network.

The dream collided unpleasantly with reality in two places. First was the
problem that running a blog implies a commitment to continual writing that
most people are never going to make, so when your first step is "set up a blog
somewhere" people think you're asking them to make a huge commitment and run
away. And second, FOAF was built on RDF; RDF makes my brain hurt just thinking
about it, and I'm a professional nerd! (The refrain was always "it doesn't
matter how complicated the format is, users will only interact with it via
tools that make it simple." But every time I hear this about a format it
reminds me of all the other formats I've heard that about in the past, all of
which are deader than doornails now.)

------
jacques_chester
Perhaps I'm stereotyping, but whenever I read a love letter to graph
databases, it usually begins with taking MySQL as the limits of the relational
world.

------
perlgeek
The real power of relational databases is:

* it's fantastic for many kinds of queries you didn't have in mind when you created the data model

* it's good enough for most use cases

* it's been around long enough that safety and correctness features like transactions, foreign keys, UNIQUE and NOT NULL constraints, etc. are well implemented and understood.

Any DB model that wants to challenge the ubiquity of RDBMS must address these
points, and offer something on top. So far I haven't seen anything that would
come close. Sure, there are lots of specialized use cases and specialized
tools that are great in some areas (like geospatial data, graphs, time series,
...), but nothing that I'd use as a default choice for a project with vague
requirements.

------
dankohn1
I'm in love with Paul Ford's writing about technology. His What Is Code? cover
story for Business Week was amazing:
[http://www.bloomberg.com/graphics/2015-paul-ford-what-is-
cod...](http://www.bloomberg.com/graphics/2015-paul-ford-what-is-code/)
[https://news.ycombinator.com/item?id=9698870](https://news.ycombinator.com/item?id=9698870)

This is just a short piece, but I think it is so interesting to try to
introduce non-technical readers to some of the knotty problems that everyone
on HN takes for granted. And, the fact that the world's largest graph database
(Facebook's) is built on top of a SQL database really is ironic, and it's
great that he succeeds at explaining all the concepts well enough to get that
across.

For dinosaurs like myself who still use an RSS reader (I'm a big fan of
Inoreader), here is a full-text feed of his New Republic writing, made with
Feedity and Five Filters:
[http://ftr.fivefilters.org/makefulltextfeed.php?url=https%3A...](http://ftr.fivefilters.org/makefulltextfeed.php?url=https%3A%2F%2Ffeedity.com%2Fnewrepublic-
com%2FVFRWVFJa.rss&max=3)

------
ClayFerguson
I think the future "may" be the JCR standard (Java Content Repository), or at
least something built 'on top of' JCR. The only real contender IMO is Apache
Jackrabbit Oak. In general, what this article describes is a Content
Repository, but a real world-wide semantic web has not yet been built with it.
My little project meta64.com is a mobile front end for some of this kind of
data processing/storage.

------
mrwilliamchang
"There is as yet no absolute challenger to the relational model. When people
think database, they still think SQL. But if there is a true challenger, it is
in the graph model."

This article is quite biased towards graph databases with regards to the SQL
versus NoSQL tension. This video presents a much more balanced view of SQL
versus NoSQL, in my opinion.
[https://www.youtube.com/watch?v=qI_g07C_Q5I](https://www.youtube.com/watch?v=qI_g07C_Q5I)

------
peter303
This article doesn't address what is useful as we migrate into the cloud.
First, the data may be larger than any disk image in the cloud. The data could
be distributed across hundreds of disks in disparate geographical locations.
Plus there may be hundreds of processes or services reading and modifying the
data. Certain NoSQL architectures seem to work better there.

------
gaius
A guy writing about "graph databases" without once mentioning IMS, hmm
[http://www-01.ibm.com/software/data/ims/](http://www-01.ibm.com/software/data/ims/)

------
badwolf93
I woke up and found Riak; I hope they turn search on by default, but one could
argue that perfection is not a thing.

------
david927
Check out Brodlist [http://brodlist.com](http://brodlist.com)

