I'm becoming more and more convinced that your canonical data store should be append-only whenever possible (see eg  for detailed arguments). It's nice to see first class support for this.
EDIT: Just read through the whitepaper. Looks like the indexes / storage engine form an MVCC (http://en.wikipedia.org/wiki/Multiversion_concurrency_contro...) key-value store, similar to Clojure's STM. Peers cache data and run datalog queries locally.
This could be either an available or consistent system, depending on how cache invalidation in peers works. In the available, eventually-consistent case you have the added benefit that all queries see a consistent snapshot of the system, even if that snapshot is not totally current.
Like most of Hickey's work, the whole thing seems really obvious in hindsight. It also bears a lot of similarity to Nathan Marz' recommendations for data processing and schema design.
1. The use of the term "whitepaper". It's very "enterprisey"
2. It took me a bit of perusing to figure out what the product IS. I think the lead paragraph may need some tweaking
In all, the landing page makes the product feel intimidating. Contrast to Parse's landing page (https://www.parse.com/) where it feels like I'm free to jump right in and tinker with it, but I also get the impression that it will scale up if I need it to. (Yes, I know the two services aren't offering the same thing).
But judging from that opening description, I gather that Datomic puts the data and analysis in the same application. As a description of what the thing IS, it's about as informative as saying, "This new language allows you to take control of your computer by allowing you to give it coded instructions!" or "Our storage solution allows you to persistently store data!"
On the other hand, it's very new so maybe they'll add more developer-friendly pages soon. Or maybe it's only meant for "enterprise" environments? Time will tell.
Here's the shortest what and why I could come up with:
Many relational databases today operate based on assumptions that were true in the 1970s but are no longer true. Newer solutions such as key-value stores ("NoSQL") make unnecessary compromises in the ability to perform queries or make consistency guarantees. Datomic reconsiders the database in light of current computer set-ups: millions of times larger and faster disks and RAM, and distributed architectures connected over the internet.
Instead of using table-based storage with explicit schemas, Datomic uses a simpler model wherein the database is made up of a large collection of "datoms" or facts. Each datom has 4 parts: an entity, an attribute, a value, and a time (denoted by the transaction number that added it to the database). Example:
John, :street, "23 Swift St.", T27
Like Clojure, Datomic incorporates an explicit model of time. All data is associated with a time and new data does not replace old data, but is added to it. Returning to our previous example, if John later changes his address, a new datom would be added to the database, e.g.
John, :street, "17 Maple St.", T43
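To make the accretion-of-facts idea concrete, here is a toy sketch in Python. This is purely illustrative and not Datomic's API; the names `Datom`, `FactStore`, `assert_fact`, `current`, and `as_of` are all hypothetical. It models the key property described above: new facts are appended rather than replacing old ones, and the "current" value is simply the newest datom, while older values stay queryable.

```python
# Toy model of Datomic-style datoms: (entity, attribute, value, tx).
# Illustrative only -- not Datomic's actual API.
from collections import namedtuple

Datom = namedtuple("Datom", ["entity", "attribute", "value", "tx"])

class FactStore:
    def __init__(self):
        self.datoms = []      # facts are only ever appended, never mutated
        self.next_tx = 0

    def assert_fact(self, entity, attribute, value):
        self.next_tx += 1
        self.datoms.append(Datom(entity, attribute, value, self.next_tx))
        return self.next_tx

    def current(self, entity, attribute):
        """The newest datom wins; older facts remain in the store."""
        matches = [d for d in self.datoms
                   if d.entity == entity and d.attribute == attribute]
        return max(matches, key=lambda d: d.tx).value if matches else None

    def as_of(self, entity, attribute, tx):
        """Query the database as it looked at transaction tx."""
        matches = [d for d in self.datoms
                   if d.entity == entity and d.attribute == attribute
                   and d.tx <= tx]
        return max(matches, key=lambda d: d.tx).value if matches else None

store = FactStore()
t1 = store.assert_fact("John", ":street", "23 Swift St.")
t2 = store.assert_fact("John", ":street", "17 Maple St.")
print(store.current("John", ":street"))    # 17 Maple St.
print(store.as_of("John", ":street", t1))  # 23 Swift St.
```

Note that the old address is not lost: asking the store "as of" an earlier transaction recovers it, which is the explicit model of time the comment describes.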
Move Data and Data Processing to Peers
Traditionally, databases use a client-server model where clients send queries and commands to a central database. This database holds all the data, performs all data processing, and manages data storage and synchronization. Clients may only access the data through the interface the server provides - typically SQL strings, which may include a (relatively small) set of functions provided by the database.
Datomic breaks this system apart. The only centralized component is data storage. Peers access the data storage through a new distributed component called a transactor. Finally, the most important part, data processing, now happens in the clients, which, considering their importance, have been renamed "peers".
Queries are made in a declarative language called Datalog, which is similar to but better than SQL. It's better because it more closely matches the model of the data itself (rather than thinking in terms of the implementation of tables in a database). Additionally, it's not restricted like SQL: it lets you use your full programming language. You can write reusable rules that can then be composed in queries, and you can call any of your own functions. This is a big step up in power, and it's made practical by the distribution. If you ran your query on a central server, you'd have to worry about tying up a scarce resource with a long-running query. When processing locally, that's not a concern.
When a query is performed, the data is loaded from central storage and placed into RAM (if it will fit). Later queries can use this locally cached data for fast results.
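The caching behavior just described amounts to a read-through cache in each peer. Here is a minimal sketch, assuming a toy `Peer` class and a dict standing in for central storage (neither is Datomic's API): the first access for a segment goes to storage, and every later access is served from local RAM.

```python
# Illustrative sketch of a peer's read-through cache over central storage.
# Not Datomic's API; `Peer`, `storage`, and `fetch` are hypothetical names.
class Peer:
    def __init__(self, storage):
        self.storage = storage   # central storage (here just a dict)
        self.cache = {}          # local in-memory copy of fetched segments

    def fetch(self, segment):
        if segment not in self.cache:          # first query: go remote
            self.cache[segment] = self.storage[segment]
        return self.cache[segment]             # later queries: local RAM

store = {"seg-1": [1, 2, 3]}
peer = Peer(store)
peer.fetch("seg-1")   # loads from central storage
peer.fetch("seg-1")   # served entirely from the local cache
```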
That's definitely not all it does or all the benefits, but hopefully that's a good start.
Transactions as first-class entities
Transactions are just data like everything else, and you can add facts about them like anything else: for example, who created the transaction, or what the database looked like before and after it.
Additionally, you can subscribe to the queue of transactions if you want to watch for and react to events of a certain nature. This is very difficult in most other systems.
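A toy sketch of that subscription idea, assuming a hypothetical `TxQueue` (this is a model of the concept, not Datomic's actual transaction-report API): each committed transaction is published as data, and subscribers filter for the facts they care about.

```python
# Illustrative sketch of subscribing to a transaction queue.
# TxQueue, subscribe, and publish are hypothetical names, not Datomic's API.
class TxQueue:
    def __init__(self):
        self.subscribers = []

    def subscribe(self, callback):
        self.subscribers.append(callback)

    def publish(self, tx_id, datoms):
        # Each transaction report is itself just data.
        report = {"tx": tx_id, "datoms": datoms}
        for cb in self.subscribers:
            cb(report)

seen = []
q = TxQueue()
# React only to changes of one attribute, ignoring everything else.
q.subscribe(lambda r: seen.extend(
    d for d in r["datoms"] if d[1] == ":order/status"))
q.publish(42, [("order-1", ":order/status", "shipped"),
               ("order-1", ":order/total", 99)])
```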
Do transaction numbers have total order or just partial order? Total order is serializing. (And no, using real time as the transaction number doesn't help because it's impossible to keep an interesting number of servers time-synched.) Partial order is "interesting".
The transactor is a single point of failure.
However, since its only job is doing the transactions, the idea is it can be faster than a database server that does both the transactions and the queries.
How does somebody do read-"modify"-style transactions?
Say I want to bump some counter. So I delete the old fact and establish a new fact. But the new fact needs to be exactly 1 + the old value of the counter. With transactions as a simple "add this and remove that", you seemingly cannot do that. So it's not ACID. Right?
We are still finalizing the API for installing your own data functions. The :db.fn/retractEntity call in the tutorial is an example of a data function. (retractEntity is built-in).
If that were not the case, you could still model such an order-dependent update as the fact that the counter has seen one more hit. Let the final query reduce those facts to the final count, let the local cache implementation optimize that cost away for all but the first query, and then incrementally update further queries as new hits arrive.
That said, I'm pretty sure I've seen the simpler CAS semantics supported. (The CAS-successful update, if CAS is really supported, is still implemented as an "upsert", which means old counter values remain accessible if you query the past of the DB.)
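The compare-and-swap semantics mentioned here can be sketched as follows. This is a toy model, not Datomic's API: a hypothetical `Transactor` applies writes serially, and a `cas` operation succeeds only if the counter still holds the value the peer read, which gives the counter-bump the read-modify-write guarantee the question asks about.

```python
# Illustrative compare-and-swap sketch, modeled on the counter example.
# Transactor and cas are hypothetical names, not Datomic's actual API.
class Transactor:
    def __init__(self):
        self.facts = {}   # current value per (entity, attribute)

    def cas(self, entity, attribute, expected, new):
        """Succeed only if the current value matches `expected`."""
        current = self.facts.get((entity, attribute))
        if current != expected:
            raise ValueError("CAS failed: value changed underneath us")
        self.facts[(entity, attribute)] = new

t = Transactor()
t.facts[("page-1", ":hits")] = 7
t.cas("page-1", ":hits", 7, 7 + 1)    # read 7, write 8: succeeds
try:
    t.cas("page-1", ":hits", 7, 8)    # stale read of 7: rejected
except ValueError:
    pass                              # caller re-reads and retries
```

Because the transactor serializes all writes, a failed CAS tells the peer its read was stale; retrying with the fresh value makes the increment safe.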
Huh? How is that consistent with:
> access the data storage through a new distributed component called a transactor.
If "doing the transactions" consists of more than passing out incrementing transaction tokens, won't the transactor be a bottleneck?
The transactor is involved in just writes, not reads. (So that helps.) It's not distributed and cannot be distributed, in this system, because it ensures consistency, so yes, it is potentially a bottleneck. In blog comments by Rich Hickey, he states:
"Writes can’t be distributed, and that is one of the many tradeoffs that preclude the possibility of any universal data solution. The idea is that, by stripping out all the other work normally done by the server (queries, reads, locking, disk sync), many workloads will be supported by this configuration. We don’t target the highest write volumes, as those workloads require different tradeoffs."
Presumably, 1) the creators of Datomic think that performance can be good enough to be useful, and 2) this is a new model that probably requires testing to prove it is practical.
 Multiple people have linked to it, but for convenience: http://blog.fogus.me/2012/03/05/datomic/comment-page-1/#comm...
As a tangent, I'm really curious as to why this document is in a PDF file instead of simply being a web page. I can't see that doing much other than making it less convenient to read.
That said, it sounds like a database-as-a-service? If so, is the primary benefit the reduced database management load? Or is there some special sauce in here that makes it more capable than other RDMS or NoSQL databases?
The "special sauce" is that much of the work is done locally (in memory), you can use very powerful data manipulation, and it's ACID. That's my understanding so far.
Tickets for the conference are available, including Friday-only tickets for $250. Friday will include Rich's keynote and a keynote by Richard Gabriel as well as lots of other Clojure-y goodness. http://regonline.com/clojurewest2012
Edit: Rich's response here:
Seems to imply that non-cached performance won't be so bad anyway. Looking forward to seeing some benchmarks.
1. No concept of inference/reasoning
2. No mention of a graph
3. Interesting use of clientside caching / data-peering
4. Clojure serialization vs N3/Turtle/RDF
1. Quad stores are parameterized by graph, Datomic by time
2. subject-predicate-object model
3. query-anything ( including [ ?s ?p ?o] ??)
4. query anywhere (sending an rdf to a client for local query seems similar)
edit- I give up trying to get HN to render an ordered list. Any help would be... helpful.
 "A Note on Distributed Computing" (http://labs.oracle.com/techrep/1994/smli_tr-94-29.pdf)
 Please correct me if that synopsis is wrong
It seems as though a 'transaction' is defined as an atomic set of updates, but doesn't involve reads.
Any MVCC-style model allows full concurrency between readers and writers. The bigger problem is managing concurrency between conflicting writers in what amounts to a distributed database system. None of the material on Datomic's website explains how they intend to tackle that issue, which seems especially tricky with their model of distributed peers. All they say is that the Transactor is responsible for globally linearizing transactions and that this is better than existing models. However, if there is a genuine conflict, the loose coupling among peers seems to make the problem much worse than existing models, not better.
I'd love to know more details.
1. you can do synchronous transactions.
2. transactions can include data functions.
"The database can be extended with data functions that expand into other data functions, or eventually bottom out as assertions and retractions. A set of assertions/retractions/functions, represented as data structures, is sent to the transactor as a transaction, and either succeeds or fails all together, as one would expect."
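The quoted expand-then-apply-atomically behavior can be sketched like this. It's a toy model under my own assumptions, not Datomic's implementation: `:inc` stands in for a hypothetical data function, everything expands down to plain assertions and retractions first, and the whole set is applied to a copy so a failure changes nothing.

```python
# Illustrative sketch: a transaction as data that expands to
# assertions/retractions and succeeds or fails as a whole.
# The op names (":assert", ":retract", ":inc") are hypothetical.
def expand(op, db):
    """Recursively expand data functions down to asserts/retracts."""
    if op[0] in (":assert", ":retract"):
        return [op]
    if op[0] == ":inc":                       # hypothetical data function
        _, entity, attribute = op
        old = db.get((entity, attribute), 0)
        return [(":retract", entity, attribute, old),
                (":assert", entity, attribute, old + 1)]
    raise ValueError("unknown op")

def transact(db, ops):
    # Expansion may raise, in which case nothing has been applied yet.
    expanded = [e for op in ops for e in expand(op, db)]
    new_db = dict(db)                          # apply to a copy...
    for op in expanded:
        if op[0] == ":assert":
            new_db[(op[1], op[2])] = op[3]
        else:
            new_db.pop((op[1], op[2]), None)
    return new_db                              # ...so failure changes nothing

db = {("page-1", ":hits"): 7}
db = transact(db, [(":inc", "page-1", ":hits")])   # retract 7, assert 8
```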
That describes many methods of optimistic concurrency control, but it doesn't answer my question of how this is supposed to work in practice with high write contention, the higher latency of a distributed peer model, the long-running transactions the video mentions (or maybe that remark only applied to long-running queries), etc. My point being, if the distributed transaction problem were easily solved by sprinkling on optimistic multi-versioning concurrency control, it would have been solved a long time ago. There must be some special sauce they're not mentioning.
Thus, Datomic is well suited for applications that require write consistency and read scalability.
However, that doesn't mean it has slow writes - it should still do writes at least on a par with any traditional transactional database, and probably a good deal faster since it's append-only.
I'd also like to know more about how the app-side caching works. If I've got a terabyte of User records and want to query for all users of a certain type, does a terabyte of data get sent over the wire, cached, and queried locally? Only the fields I ask for? Something else?
2. The database is oriented around 'datoms', which are an entity/attribute/value/time. Each of these has its own (hierarchical) indexes, so you only end up pulling the index segments you need to fulfill a given query. You'd only pull 1TB if your query actually encompassed all the data you had.
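The "only pull the index segments you need" point can be illustrated with a toy sorted index. This is a sketch under my own assumptions, not Datomic's storage format: datoms sorted by (attribute, entity) let a query binary-search to just the slice covering one attribute, rather than scanning everything.

```python
# Illustrative sketch of an attribute-first sorted index (AEVT-style),
# where a query touches only the slice it needs. Not Datomic's format.
import bisect

class AttributeIndex:
    """Toy index: datoms stored sorted as (attribute, entity, value)."""
    def __init__(self, datoms):
        self.datoms = sorted(datoms)

    def lookup(self, attribute):
        # Binary-search the bounds of this attribute's segment.
        lo = bisect.bisect_left(self.datoms, (attribute,))
        hi = bisect.bisect_left(self.datoms, (attribute + "\uffff",))
        return self.datoms[lo:hi]          # only this slice is touched

idx = AttributeIndex([
    (":user/type", "u1", "admin"),
    (":user/type", "u2", "guest"),
    (":user/name", "u1", "Ann"),
])
# "All users of a certain type" reads the :user/type segment only.
admins = [e for a, e, v in idx.lookup(":user/type") if v == "admin"]
```

So in the terabyte-of-users scenario above, the query would pull the segments of the type index it needs, not the full data set.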
If you write a traditional shared-nothing web app client with a traditional bag-o-sprocs database server, you'll probably be good as long as your workload doesn't change much. Assuming your write volume never exceeds that of a single (as beefy as necessary) server (which seems to be working out so far for Hacker News!) then you're ok.
However, products/services evolve and requirements change. Let's assume, for example, that you want to do some heavy duty number crunching. This number crunching involves some critical business logic calculations. Some of those calculations are in sprocs, but some of them are in your application code's native language. How do you offload that work to another server? You may have to juggle logic across that sproc/app boundary back and forth. It's pretty rigid; change is hard.
You can think of Datomic as a way of eliminating your sproc language and moving the query engine, indexes, and data itself into your process. Basically, you get everything you need to write your own database server. Furthermore, you can write specialized database servers for specialized needs... as long as you agree to allow a single Transactor service to coordinate your writes.
Back to the big number crunching. You've got the my-awesome-app process chugging along & you don't want to slow it down with your number crunching, so you spin up a my-awesome-cruncher peer & the data gets loaded up over there. Now you have the full power of your database engine in (super fast) memory and you can take your database-client-side business logic with you!
Now let's say you're finding that you're spending a lot of CPU time doing HTML templating and other web-appy like things. Well, you can trivially make additional my-awesome-app peers to share the work load.
You can do all this from a very simple start: One process on one machine. Everything in memory. Plain-old-java-objects treated the same as datums. No data modeling impedance mismatch. No network latency to think about. You can punt on a lot of very hard problems without sacrificing any escape hatches. You get audit trails and recovery mechanisms virtually for free.
Again, all this assumes the write-serialization trade offs are acceptable. Considering the prevalence and success of single-master architectures in the wild, that's not a hugely unacceptable tradeoff. Furthermore, the append-only model may enable even higher write speeds than something like Postgres' more traditional approach.
I hope my rant is helpful :-)
If you performed this type of calculation before with a traditional database, you had to have a computer powerful enough to perform the calculation. In this model, you would still have that computer; it's just now a "peer".
If millions of people want the same piece of data that requires a huge calculation to get, then you would set up one powerful machine of your own just to do this calculation and then write the result to the database, so the many "thin" peers can just read the result.
The SaaS model they're offering won't work for the sorts of things I'm interested in.
Read the comment here; it gives some information on the problem you're describing: http://blog.fogus.me/2012/03/05/datomic/
This looks new: local query peers that cache enough data to perform queries (I don't understand how that works, but it looks like indices might be local, with some data also cached locally).
Also interesting that it seems to use DynamoDB under the hood.
The data storage system is strongly decoupled from the transactor and the peer, so there are a number of options, ranging from a filesystem to S3 to DynamoDB.
So if such a highly distributed system were to use Datomic, it would be harder to guarantee that each peer can serve both reads and (local) writes while partitioned from the transactor. One would need to program the software to log those new facts (writes) locally before submitting (syncing) them to the transactor, and make that log durable. One might also need to make the query/read cache durable, since there's no network to fetch it back over in case of a reboot of the peer. So it seems there's a missing local middleman/proxy that needs to be implemented to support such scenarios. At least, thanks to Datalog, the local cache could still be used together with this log, using db.with(log).
What do you think: is this use case simply implementable over/with Datomic, without asking it to do something out of its league?
Thus my question is: is introducing such a middleman into the system going to denature Datomic?
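The db.with(log) idea from the partition scenario above can be sketched very simply. This is my own toy model, not Datomic's API: while cut off from the transactor, queries run against a view that layers the durable local write log over the last cached snapshot.

```python
# Illustrative sketch of querying a cached snapshot plus a pending
# local write log during a network partition. Not Datomic's API;
# with_log is a hypothetical stand-in for the db.with(log) idea.
def with_log(cached_db, log):
    """Return a query view: cached snapshot overlaid with local writes."""
    view = dict(cached_db)
    for entity, attribute, value in log:
        view[(entity, attribute)] = value      # newest local fact wins
    return view

cached = {("John", ":street"): "23 Swift St."}
pending = [("John", ":street", "17 Maple St.")]   # not yet at transactor
view = with_log(cached, pending)
# Local queries see the pending write; the log is replayed to the
# transactor once the partition heals.
```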
1. the app developer can confidently predict which queries the app will need through its lifespan, and
2. the app developer is willing to program and configure a layer that can persist and make durable a cache that spans all the data needed to run those queries (thus, persisting locally what amounts to a dynamic shard of the DB), and
3. the app developer is willing to program a layer that can persist and make durable all writes intended for the Transactor, and synchronize those to the Transactor when the app recuperates from a network partition, and
3.1. the app developer is willing to plan for, or resolve, potential conflicts (in advance or as they occur), and is thus willing to sacrifice global consistency in the event of a network partition in order to obtain availability, and
4. the app developer is willing to plug into the query engine in such a way that queries will include the local write log when there's a network partition.
1. depends on the requirements but most small to medium apps can predict the queries they'll need;
2. seems to be quite easy for small to medium apps:
2.1 run all possible queries at regular times, and
2.2 use a durable key-value store to keep the db values;
3. (1) make sure you're subscribed to events on partition and recovery; (2) coordinate writes over the same key-value store, probably using Clojure's STM and/or Avout; (3) on network recovery, replay those writes not present in the central DB;
3.1 due to the immutable nature of things and total ordering of the DB transactions, I expect to see no issue regarding eventual consistency when write logs are replayed centrally after a local Peer recovers from a network partition;
4. considering how Datalog works and is integrated into the Peer, this seems like a piece of cake.
So isn't this quite feasible to support the highly distributed case for apps in which each local Peer represents its own logical, dynamic and relatively natural and autonomous shard of the database?
Additionally, if "local" means the user's client, how is the security of the data ensured?
Terracotta and other data grid architectures do something similar.
Things should be like this - intuitive, some seed data and kickstart code w/ just enough documentation for when you get stuck.
Is Datomic just for JVM languages?
At the moment, yes. We have ideas for how to enable Datomic on non-JVM languages while preserving as much of the embedded power as possible.
However, my big concern here would be security. You'd need to be able to supply a predicate for which datums are allowed to be synced to the client.
There is other novel stuff.