
Datomic by Rich Hickey [video] - sethev
http://g33ktalk.com/datomic-by-rich-hickey/
======
chimeracoder
I haven't used Datomic at all myself, though I was at this event and I found
it very appealing.

The issue of having an easy audit trail is absolutely _critical_ for many
businesses, and is ill-served by most existing databases.

This is one thing I liked about CouchDB. CouchDB doesn't enforce this
strictly the way Datomic does, but the model encourages "append to edit"
rather than in-place modification. I haven't had the time yet, but I'm
excited to check out Datomic specifically for this reason.
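
From what I've read, the history view of the database makes the audit trail
directly queryable, something like this (attribute names, conn, and
account-id are invented for illustration):

    (require '[datomic.api :as d])

    ;; Every change ever made to :account/balance for one entity, with
    ;; whether the value was asserted or retracted, and when.
    (d/q '[:find ?balance ?added ?when
           :in $ ?acct
           :where
           [?acct :account/balance ?balance ?tx ?added]
           [?tx :db/txInstant ?when]]
         (d/history (d/db conn))
         account-id)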

------
juliangamble
From watching the screencasts Hickey has produced for Clojure: he talks
about working on a 'Real Time Broadcasting System' (national election data
streaming). It sounds like he spent a lot of time thinking about how to
guarantee what a real-time system reports when taking snapshots of data as
it streams through. The ideas in Clojure and Datomic seem to be the result
of that thinking process. (E.g. in Datomic you have the ability to go back
and see what the state of the system was at a given point in time, and in
Clojure the STM/agent model gives you the ability to observe a consistent
value even in the middle of a transaction.)

------
nabla9
I think this might be a good time to mention the HiBase and Shades projects
and the related research on data structures and functional programming
environments.

[http://www.cs.hut.fi/~siva/hibase/](http://www.cs.hut.fi/~siva/hibase/)

HiBase was a project funded by Nokia. The goal was to develop a persistent
functional programming environment for telecom systems.

The reason it was functional was the observation that in a functional
programming environment you can model data as a graph whose references
point only backwards in time. In other words, if you just constantly write
your data into the database and have a smart GC, you can recover very
quickly from crashes, which is exactly what a telecom environment wants.
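
A toy sketch of the idea in Clojure (nothing HiBase-specific): if every new
state only points back at older, immutable values, then persistence is just
an append-only log of states, and recovery is reading the last entry.

    ;; Each committed state shares structure with the previous one and
    ;; never mutates it; the log of states is the whole database.
    (def db-log (atom [{}]))

    (defn commit! [f & args]
      (swap! db-log (fn [log] (conj log (apply f (peek log) args)))))

    (commit! assoc :sub-42 {:plan :prepaid})
    (commit! assoc-in [:sub-42 :plan] :postpaid)

    (peek @db-log)  ; current state
    (nth @db-log 1) ; state after the first commit, still intact
    ;; crash recovery = re-reading the last persisted state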

The project was canceled, some say for reasons of internal company
politics, but it showed huge promise. It was already 10x faster than
competing commercial databases, even ones their developers had tweaked to
perform well in the testing.

------
coolsunglasses
Happily using Datomic at work; I recently released a Datomic migration
library based on that work.
([https://github.com/bitemyapp/brambling/](https://github.com/bitemyapp/brambling/))

If you have any questions about pros/cons, what it's like to use, operational
constraints/tradeoffs, please ask.

~~~
iampims
I’d be curious to hear how Datomic scales, ops wise and application wise. Is
adding a “peer” all there is to do to handle more reads?

~~~
coolsunglasses
The more specific you are about your concerns, the better.

>Is adding a “peer” all there is to do to handle more reads?

Yes, modulo storage-backend throughput capacity. It uses storage backends
in a fairly dumb and efficient manner, so in all likelihood the peers will
be the bottleneck.
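
Concretely, "adding a peer" is just starting another JVM that connects to
the same storage; queries run in-process against the peer's own cache.
Roughly (connection URI and attribute made up):

    (require '[datomic.api :as d])

    ;; Any number of processes can do this. Reads go to storage and the
    ;; local cache; only writes go through the transactor.
    (def conn
      (d/connect "datomic:sql://app?jdbc:postgresql://db:5432/datomic?user=datomic&password=secret"))

    ;; This query executes locally on this peer.
    (d/q '[:find (count ?e) .
           :where [?e :order/id]]
         (d/db conn))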

>ops wise

PostgreSQL and DynamoDB are pretty popular as storage backends, and the ops
scenario for the latter is pretty simple: you just scale the provisioned
IOPS to match the throughput you need. Pretty sure DynamoDB can exceed
anything you'll ever need in that regard. A sufficiently beefy PostgreSQL
machine (SSDs, etc.) can go ham too.

The main shift in mentality with Datomic, ops- and application-wise, is
that you prioritize vertical scaling before horizontal scaling. I actually
think that's the smarter thing to do from a cost and complexity standpoint
anyway. I've worked in environments where horizontal scaling was deployed
prematurely and we were managing 100 machines where ~3-5 would've sufficed.

Part of this emphasis on vertical scaling is that you're incentivized to
make the most of each peer. If you have processes that make only
"lightweight" use of Datomic/your data but you need many of them running
(for compute, say), have them call into a REST peer or a custom REST
interface to read/write the data. You implement sprocs through database
functions, which are just Clojure code; super nice.
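
To make that concrete, here's roughly what installing and calling a
database function looks like (entity names and attributes invented):

    (require '[datomic.api :as d])

    ;; Install a transaction function. Its body is plain Clojure that
    ;; runs on the transactor, atomically, against the db it's handed.
    @(d/transact conn
       [{:db/id    (d/tempid :db.part/user)
         :db/ident :account/debit
         :db/fn    (d/function
                     '{:lang   "clojure"
                       :params [db acct amount]
                       :code   (let [balance (:account/balance
                                              (datomic.api/entity db acct))]
                                 (when (< balance amount)
                                   (throw (ex-info "insufficient funds"
                                                   {:acct acct})))
                                 [[:db/add acct :account/balance
                                   (- balance amount)]])})}])

    ;; Calling it is just another element of a transaction.
    @(d/transact conn [[:account/debit account-id 100]])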

Main scaling choke-point is just the transactor itself.

>Application wise

The two X-factors for Datomic in my experience are datalog and the deeply
"historical" nature of the database and querying. Being able to arbitrarily
time-slice queries with complicated relationships without having to change
them in order to accommodate the "time-slicing" is amazing.
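
E.g. the same query runs unchanged against the current db, an as-of
snapshot, or the full history; you only change the db value you pass in
(attributes invented, conn assumed from d/connect):

    (def orders-for
      '[:find ?order ?total
        :in $ ?email
        :where
        [?cust  :customer/email ?email]
        [?order :order/customer ?cust]
        [?order :order/total    ?total]])

    (d/q orders-for (d/db conn) "a@example.com")          ; right now
    (d/q orders-for (d/as-of (d/db conn) #inst "2013-01-01")
         "a@example.com")                                 ; back then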

There are separate interfaces for querying and transacting, which is nice.

I would advise avoiding more migrations than necessary, though, since that
means you'll have to use my library Brambling :)

The Datomic guys are working on implementing trivial schema changes (like
adding indexes), but for the more complicated migrations, you'll have to use
Brambling or something comparable.

Any other queries?

~~~
dustingetz
Last week in the official Cognitect training they said that the transactor can
handle 10k writes/sec with appropriate hardware, which is enough to handle all
the credit card transactions in the world.
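(For scale: 10k/sec works out to 10,000 × 86,400 ≈ 864 million writes a
day, or about 315 billion a year.)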

[This next part is me, not them] So despite being theoretically
write-bottlenecked, it still does better than Postgres, because the
transactor does less work (though they're in the same ballpark). What do
you think?

I hope I'm quoting them right, from memory.

~~~
coolsunglasses
I could construct scenarios where identical hardware for PostgreSQL-alone and
Datomic+PostgreSQL were alternately beating each other in transactions per
second.

The main limitations: if you exceed the number of datoms the b-tree indexes
can handle, you'll need to partition-walk or shard; and if you exceed the
transactions/second that a single, fully vertically scaled Datomic
transactor can handle, you'll need to evaluate sharding strategies similar
to what PostgreSQL deployments use.

I don't think a Datomic transactor in and of itself is more efficient in
$/performance for transactions per second than the exact same machine just
running PostgreSQL. That's supposition; I don't have data conclusively
proving it, even though we directly compared some workloads on a realistic
PostgreSQL schema vs. Datomic using PostgreSQL as the backend. The design
does make it hypothetically possible for Datomic to be more efficient, I
just don't think it's _yet_ the case. I could easily be proven wrong,
though.

I don't think any difference between the two is meaningful, apart from
this: if you created separate sharded transactor instances, the caveat is
that you can no longer get a whole-database atomic view of what's going on.
That's the same thing that happens when you have multiple sharded writers
in PostgreSQL; it's just that the atomic view is explicitly part of the
point of Datomic. But if you only need an atomic view of shardable graphs
of relationships in the historical data, rather than of the whole database,
then the sky is hypothetically the limit.

It'd still make me a little uncomfortable, but it's better than implementing
your own event sourcing on Cassandra, haha.

It's worth keeping in mind that if you just want to throw transactions over
the wall from your client, you can do that with transact-async. HornetQ has no
problem just holding onto the txes as they get worked through ;)
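
I.e. something like:

    ;; transact-async enqueues the tx and returns a future immediately,
    ;; without waiting for the transactor to acknowledge it.
    (def pending
      (d/transact-async conn [[:db/add order-id :order/status :shipped]]))

    ;; ...keep working; deref only if/when you need the result.
    @pending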

The main thing I would keep in mind is that while you can get away with some
modest OLAP'ish stuff, PostgreSQL is a lot better at OLAP (presently) than
Datomic due to the way peers work. There's not a nice way currently to shard
reads, even though that's the only kind of sharding that's really possible at
present.

You could dispatch range queries against peers that become responsible for
their sections of the sharded ring, letting the LRU caching do the
"population" per shard for you (with minimal penalties for exceeding the
memory capacity for each shard). I could see getting pretty far with that
alone. But again, Datomic is kinda like Lisp right now in the sense that
it's a solid building material, but there's no Staples easy-button magic
like, say, Riak offers at present.

To do something like Redshift with Datomic would require query sharding
(which could be a client-side library) against a buncha warmed-up peers.
You wouldn't want to churn all the data through a single peer doing all the
work.
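
A purely hypothetical sketch of that client-side dispatch (the peer
endpoints and the /query/orders route are invented; uses clj-http):

    (require '[clj-http.client :as http])

    ;; N peer processes each expose the same query over HTTP (a REST
    ;; peer or a thin custom service). Hashing the shard key means the
    ;; same key always lands on the same peer, keeping its cache warm
    ;; for that slice of the data.
    (def peer-urls ["http://peer-0:8080" "http://peer-1:8080" "http://peer-2:8080"])

    (defn peer-for [shard-key]
      (nth peer-urls (mod (hash shard-key) (count peer-urls))))

    (defn orders-for-customer [email]
      (:body (http/get (str (peer-for email) "/query/orders")
                       {:query-params {"email" email}
                        :as :json})))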

I'm perfectly happy to use Datomic as my default OLTP store and lazily
replicate the data into something designed for OLAP workloads.

Sorry for babbling, I just got back from a type theory meetup that I got a
couple of fellow Clojurians to tag along to, and we ended up having some
insanely fun conversations with other FPers. I'm pumped tonight :)

Apologies if I babbled so much that I didn't really answer your question, I
hope I encompassed the general subject of write-throughput.

tl;dr

s/PostgreSQL/Datomic/g for OLTP workloads, despite having been a PGSQL nut for
years.

Possible biases:

I hate MangoDB because I've actually used it in non-trivial production
deployments.

I really liked PGSQL for many years despite using it in non-trivial production
deployments.

I'm sort of a maintainer of Korma (a SQL library for Clojure).

I'm an unapologetic Clojure and Haskell fan.

I technically worked on a library for RethinkDB. I happen to think it's a nice
document store.

------
Estragon
Anything new here for people who've seen him talk on Datomic before?

~~~
brandonbloom
I think this is the same talk he gave at another NYC meetup a few months ago.
The beginning of the talk is an interesting take on thinking about "processes"
and "machines", in true Rich Hickey style. The later half of the talk involves
some REPL exploration, which gives a good flavor of actually using the thing.

~~~
Estragon
Thanks, Brandon.

