

Rich Hickey on Datomic, CAP and ACID - sethev
http://www.infoq.com/interviews/hickey-datomic-cap

======
drostie
There's a bit of a barrier to entry for Datomic because it really doesn't look
quite like anything you're familiar with if you're just used to SQL and NoSQL.
The basic starting point is one big many-to-many relation -- what in SQL might
be one big table of (db_id, subject_id, verb_id, object_id, timestamp) rows.
These "facts" are (subject verb object) tuples, and queries over them can be
limited by transaction timestamp. I'd say more about it, but honestly, because
it's closed-source I've had real trouble figuring out what exactly is going on
in the backend; a lot of the organization is done with namespaced keywords, so
I'm not sure exactly how the datatypes are organized, but you can query by
subject, verb, or object quite rapidly. As Rich Hickey says, one instrumental
property is that these tuples are only ever inserted -- never updated or
deleted. This enables a "many readers, one updater" architecture without
violating consistency during updates: nobody reads past their recorded
timestamp until the model (in the MVC sense) tells that view it's ready to
update. There is also a mechanism for naming new entities within a transaction
so that you can insert a bunch of related facts at once.
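
For a taste, here's a minimal sketch using the peer API (it assumes a
connected peer `conn`; the :person/name attribute is made up for
illustration):

        (require '[datomic.api :as d])

        ;; queries run against an immutable database value, which can be
        ;; rewound to any past point in time with as-of:
        (def db-now  (d/db conn))
        (def db-2013 (d/as-of db-now #inst "2013-01-01"))

        ;; the same Datalog query, run against the historical value:
        (d/q '[:find ?name
               :where [?e :person/name ?name]]
             db-2013)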

It's also a little difficult because it is substantially _meta_ -- instead of
a special syntax ("CREATE TABLE" etc.) for handling structure, the information
about how the database is organized is stored in the same "fact" tuples that
make up the rest of the database, just with some automatic verbs that come
bundled in. I'm always a fan of self-expressing systems, but that might throw
some newcomers for a loop.
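
To make that concrete: defining an attribute is itself just a transaction of
ordinary facts against the built-in :db/* attributes. A sketch (the
:person/name attribute is illustrative):

        ;; the schema is asserted as plain data, not special DDL syntax:
        @(d/transact conn
           [{:db/id                 (d/tempid :db.part/db)
             :db/ident              :person/name
             :db/valueType          :db.type/string
             :db/cardinality        :db.cardinality/one
             :db/doc                "A person's name"
             :db.install/_attribute :db.part/db}])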

I guess my main comment about Datomic is, "I wish there were an open-source
version out there." I'm interested in seeing what the plumbing of such a
system looks like and perhaps learning something from it -- and I'd like to
peek at their Datalog engine and so forth. Unfortunately, judging from the
contents of the JAR, it appears one might need a good understanding of Google
Guava, Apache Ant, the Jetty server, and a bunch of other things to be able to
read the source, which could make that prohibitive. I'm more interested in
learning from the system than I am in using it.

~~~
JPKab
The description you have for Datomic is pretty much spot-on for how graph
databases work: schema and data both stored as data, subject-verb-object
tuples, etc.

It's sad that this is a new concept for so many people in the software
industry, considering that graphs are one of the core data structures taught
in CS.

~~~
kinleyd
Like most technologies, Datomic builds on ideas and concepts that already
exist, and would find it hard to set itself apart by simply latching on to a
single idea or concept. So if it were "just another" graph database and
nothing more, I'd agree with you.

However, Datomic puts together a whole host of other concepts: data
immutability and all its resulting benefits; separation of query, storage,
write, and coordination (with carefully considered trade-offs); Datalog-based
querying in place of SQL; and in-app access to data that solves the
string-concatenation hell we've all gone through juggling SQL -- and the list
goes on.
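
On that last point: because a Datalog query is just a Clojure data structure,
you compose queries with ordinary data manipulation instead of splicing
strings. A rough sketch (the attributes are made up; `d/q` is Datomic's query
entry point):

        ;; a query is plain data (map form here), so no string concatenation:
        (def base-query
          '{:find  [?name]
            :where [[?e :person/name ?name]]})

        (defn with-min-age
          "Returns base-query narrowed by an age constraint -- pure data."
          [query min-age]
          (update query :where into
                  [['?e :person/age '?age]
                   [(list '>= '?age min-age)]]))

        ;; (d/q (with-min-age base-query 21) db) => names of people 21 or older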

I don't think any of these individually make Datomic special, but
collectively? IMHO, outstanding!

Edit: For those interested, we have a new Datomic Community on G+ where we
have collected a number of links to Datomic resources and videos:
<https://plus.google.com/communities/109115177403359845949>

------
dustingetz
datomic claims to solve the O/R impedance mismatch -- the "vietnam of computer
science"[0] -- which is why datomic matters to application developers.

"... The facts the Storage Service returns never change, so Peers do extensive
caching ... Once a Peer's working set is cached, there is little or no network
traffic for reads ..." [1, paragraph 4]

this basically means that your reads are VERY fast -- in-memory fast -- which
means you can program as if your data wasn't in a database.
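
concretely, once the working set is cached, a read is just a local function
call. a rough sketch with the peer API (`conn` and the attributes are made up
for illustration):

        (require '[datomic.api :as d])

        (def db (d/db conn))  ;; immutable snapshot backed by the local cache

        ;; find an entity id, then navigate it like a plain map:
        (def eid   (ffirst (d/q '[:find ?e
                                  :where [?e :person/name "Alice"]] db)))
        (def alice (d/entity db eid))

        (:person/email alice) ;; reads hit cached index segments, not the network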

[0] [http://www.codinghorror.com/blog/2006/06/object-
relational-m...](http://www.codinghorror.com/blog/2006/06/object-relational-
mapping-is-the-vietnam-of-computer-science.html) [1]
<http://docs.datomic.com/architecture.html>

ps video is only 14 minutes long, worth a watch

------
deweller
From <http://www.datomic.com/>:

    
    
        Datomic is a database of flexible, time-based facts, supporting
        queries and joins, with elastic scalability, and ACID transactions.

        Datomic can leverage highly-available, distributed storage services,
        and puts declarative power into the hands of application developers.

------
systems
The more I see people talk about databases this way, the more I learn to
appreciate Chris Date.

Almost all the explanations I've seen of Datomic describe the software
architecture or components (see, for example,
<http://www.datomic.com/overview.html>).

The more important aspect of a DB is its conceptual data model -- the
abstraction you as a developer will use to describe, create, and query data
and information.

The relational model is not about the components of a DBMS; it's not about
indexes and optimizers. It's about describing your data as relations, using
relational operators, and even more fundamentally it's about data integrity.

If Datomic is not an implementation of the relational model, what model is it
trying to implement, and why should we use that data model?

~~~
martinced
_"the more important aspect of a db is ..."_

No. To _you_ it's that.

To me, the ability to easily recreate _any_ state the DB was in at any time
is much more important.

A big part of my job consists of asking DBAs for dumps of the production DB
to restore to PREPROD or DEV environments so that I can try to recreate the
state at which the shit hit the fan. And it's more than painful. And it's not
my fault: I'm inheriting apps that I have to maintain/bugfix/enhance. And it's
hell. Mutability hell.

One day CRUD DBs will be regarded as dumbly as languages in which any variable
can be globally modified from any thread. We'll look back at these CRUD DBs
and wonder how stupid we were not to listen earlier to the ones advocating a
saner world.

You _really_ should listen to several of Rich Hickey's talks, because his
ideas are more than sound and come from a lot of immense Real-World [TM] pain
and suffering.

Every time he says "Have you ever ...? Not fun!", he's 100% spot on. His
language and DB are the most pragmatic things ever.

It's amazing.

~~~
freework
You can implement an immutable database system with SQL databases today. For
instance, just design your application to only add new records, and not to
issue any updates.

Why is an entirely new database system needed?

~~~
chongli
Datomic does a lot more than just avoiding updates/deletes. It splits the
database into multiple parts. The transactor doesn't do any data storage. The
clients handle all their own queries locally. This means that the transactor
(the only bottleneck in the system) does not have to handle any queries or
storage and can spend 100% of its time handling write transactions. All other
components of the system are trivially cacheable and scalable.

The other big advantage is the concept of "the database as a value". As a
client, you can easily obtain the database as an immutable value right inside
your application. This allows you to do all the queries you want without
affecting anyone else.
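
To make "database as a value" concrete, here's a small sketch with the peer
API (`conn` is assumed; :post/title is made up). You can even layer
speculative writes onto your local value without touching the shared database:

        (require '[datomic.api :as d])

        (def db (d/db conn))  ;; an immutable snapshot

        ;; d/with applies tx-data speculatively, returning a new local value:
        (def what-if
          (:db-after (d/with db [{:db/id      (d/tempid :db.part/user)
                                  :post/title "Draft post"}])))

        ;; query the speculative value; nobody else ever sees it:
        (d/q '[:find ?t :where [_ :post/title ?t]] what-if)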

------
capkutay
I have immense respect for and look up to anybody who can do something as
complex and low level as DB kernel development (mem management,
persistence/file system, caching), especially with the long list of features
you have to support to make the DB desirable/useful (ACID transactions,
connectors).

I also hope DB implementers know how to get acquired by big companies, because
I don't see how they can really compete with the big legacy companies that
have been developing their core DBs for decades and have an army of
support/sales to back them up. Not to mention, where's the safety and
accountability for mission-critical data with a new product versus, say, DB2
or Oracle? Maybe the new DB is a better implementation/more fault tolerant,
but there's no big corporation to blame if something goes wrong, just a small
company you gambled on.

------
astangl
One thing I haven't heard discussed about the "save data forever" model is
that in some scenarios you WANT to purge old data and not have it recoverable,
for legal reasons.

I guess Datomic is not suited to these situations?

~~~
kinleyd
This seems to be a frequently asked question. From what I've seen in the
videos and forums, the Datomic team is well aware of situations in which this
may be legally necessary and is working on providing ways to handle it.

------
martinced
I think most people don't know that Datomic can be seen as a layer between
your code/data and nearly any storage backend (including SQL: for example,
you can use PostgreSQL "behind" Datomic).

"CRA" (Create Read Append) _is_ the future for 99% of all the application out
there IMNSHO. I know most people don't believe it but you're simply not coming
back once you understand how easy it is to "recreate the state" using such a
tool. In addition to CRA, the fact that queries are forever cacheable is huge.
The (lifelong) problem of "cache invalidation" simply doesn't exist anymore.

I'm not saying that Clojure is going to replace Java/C# or that Datomic is
going to be used everywhere. But the concepts at the heart of these (like,
say, lockless concurrency in Clojure and the "grow-only" property of Datomic)
are what the future's gonna be made of.

~~~
est
> "CRA" (Create Read Append) is the future for 99% of all the application out
> there IMNSHO

How about counters e.g. number of upvotes ?

It's typical upsert operations but needs powerful aggression for sorting.

~~~
grayrest
Nathan Marz has been promoting this as the Lambda Architecture; he has a
couple of presentations/blog posts about the idea and is writing a book. While
other people probably have similar ideas, I'm unaware of anybody else
attempting to teach them in a similarly cohesive manner. I don't know what I'm
doing, but I'm working my way through understanding this area, so I'll work
through the queries as an exercise.

For your counter, you'd record the individual votes as tuples of (timestamp,
voter, vote, subject). These get dumped into a distributed data store. From
there, they get batch-processed into the equivalent of a database view. The
Cascalog query for your "posts" view would be something like:

    
    
        ;; assumes (use 'cascalog.api) and (require '[cascalog.ops :as c])
        (defn up-flag   [v] (if (pos? v) 1 0))  ;; 1 for an upvote, else 0
        (defn down-flag [v] (if (neg? v) 1 0))  ;; 1 for a downvote, else 0

        (defn post-view [post-source vote-source output]
          (?<- output [?timestamp ?postid ?upvotes ?downvotes ?title ?body]
               (post-source ?timestamp ?postid ?title ?body)
               (vote-source _ _ ?vote ?postid)
               ;; flag each vote, then sum the flags per post:
               (up-flag ?vote :> ?up)     (c/sum ?up :> ?upvotes)
               (down-flag ?vote :> ?down) (c/sum ?down :> ?downvotes)))
    

This produces a set of (timestamp, id, upvotes, downvotes, title, body)
tuples. You can then do another pass over it to run your sort function:

    
    
        (defn post-score
          "Net votes decayed by age in minutes."
          [now-ms time-ms upvotes downvotes]
          (let [mins-since (/ (- now-ms time-ms) (* 60 1000))]
            (/ (- upvotes downvotes) (+ mins-since 1))))

        (defn post-rank [post-view-source output]
          (?<- output [?postid ?votes ?title]
               (post-view-source ?timestamp ?postid ?upvotes ?downvotes ?title _)
               (- ?upvotes ?downvotes :> ?votes)
               (post-score (System/currentTimeMillis) ?timestamp ?upvotes ?downvotes :> ?score)
               ;; order the output by score, highest first:
               (:sort ?score) (:reverse true)))
    

This would give (id, vote total, title) tuples that you can send to your
templating engine (assuming the URL is based entirely on the id) to make the
front page.

The neat thing about the approach is that you can change everything later. If,
for example, you wanted to do weighted upvotes/downvotes, you could produce a
scoring function for the up/down votes like the one used here for scoring
posts and aggregate the votes using sum instead of count.

~~~
est
That's cool thanks!

But for a counter in the millions, counting vote actions one by one seems
inefficient. I think we still need some sort of `update` mechanism in a pure
CRA DB?

~~~
grayrest
It's not particularly efficient, but that's not a design goal of the system.
The tradeoffs are robustness (you can't lose or corrupt data as long as the
master dataset is safe) and flexibility (you can generate whatever views you
like on the data whenever you choose). The design comes from Twitter's
analytics system, which was running this sort of thing over a 27 TB raw
dataset using Hadoop, so apparently it scales if you throw more hardware at
it.

There's a second layer using Storm (not written up in the book yet, so I don't
know the details) that handles all data newer than the most recent batch run,
and you somehow merge that new data with the old (also not written up yet). I
don't need this sort of system implemented immediately, so I'm content to sit
around and wait for new book chapters rather than try to muddle through.
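
Conceptually, though, the query-time merge is just combining the two layers'
answers. A toy sketch (both accessors are hypothetical stand-ins for the batch
view and the Storm-maintained realtime view):

        ;; hypothetical accessors -- not a real API:
        (declare batch-upvotes realtime-upvotes)

        (defn current-upvotes
          "Merge the precomputed batch view with increments since that run."
          [post-id]
          (+ (batch-upvotes post-id)       ;; from the latest Hadoop batch run
             (realtime-upvotes post-id)))  ;; from the Storm speed layer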

