
Datahike: Durable Datalog database powered by an efficient Datalog query engine - tosh
https://github.com/replikativ/datahike
======
synthc
Nice to see an open-source Datomic clone, but it's not mature enough IMO. I
hope Datomic gets open-sourced now that the company behind it, Cognitect, was
acquired by Nubank.

~~~
whilo
Hey, one of the core architects of Datahike here. As a Clojure company we are
also super happy that Nubank made Clojure and Datomic much more credible with
this move. While Datomic is obviously much more mature, it is important to
understand that our goals have a different scope than Datomic's. Datomic is
mostly built as a more convenient backend database for corporate environments
and is highly tailored towards AWS and business settings where the cost of
operating and depending on these cloud services is acceptable; that covers
only the most profitable, but small, slice of the whole market. Even if
Datomic gets open-sourced, it will not automatically be rebuilt with other
design goals in mind.

Our goals, on the other hand, go beyond this backend market: we want to use
Datalog as a foundation for distributed systems and extend Datahike to all
endpoints, including the browser and IoT development:

[https://www.youtube.com/watch?v=A2CZwOHOb6U](https://www.youtube.com/watch?v=A2CZwOHOb6U)

Our main intention was never to just build an open-source Datomic. But it made
too much sense not to do it as a first step. In fact we also really hope that
Datomic will be open-sourced such that we can merge our efforts. But given the
current governance model of Clojure and Datomic we do not yet foresee that
open-sourcing Datomic alone would address a large section of our plans. We are
ahead of Datomic already in a few areas:

For instance, we have funded development of ClojureScript support underway,
and unlike Datomic, all our efforts were aimed at this from the beginning. In
fact, rather than having a top-down design that we then unbundle into
libraries, we provide a set of libraries and abstractions that can stand on
their own and that you can compose in different ways. This made it much easier
for us to evolve and reuse our stack despite the pivot we made from replikativ
to Datahike.

Regarding maturity we have worked hard during quarantine to address some of
our pain points:

1. We significantly improved our write throughput and Datomic performance is
in reach (close to release):
[https://github.com/replikativ/datahike/pull/201](https://github.com/replikativ/datahike/pull/201)
2. We have a first version of our server API available and will extend it in
the next months to provide Datomic-style local querying:
[https://github.com/replikativ/datahike-server/](https://github.com/replikativ/datahike-server/)
3. We recently provided Java bindings:
[https://lambdaforge.io/2020/05/25/java-api.html](https://lambdaforge.io/2020/05/25/java-api.html)

Over the last year we also built a cooperative that has more than 5 people
working on this full-time and we aim to grow even faster next year and really
bring Datalog to the community and the masses. If Datomic gets properly open-
sourced, we will get there even faster.

~~~
synthc
Thanks for clarifying. I agree that Datomic is expensive to run and that open-
sourcing it would not improve this overnight; a lightweight and cost-effective
alternative would be great.

How will you handle security when accessing a Datahike backend from the
browser? I've used Datomic from the browser indirectly in the past for
internal tools, using a custom REST API to run the queries, but for external
access it was not clear how to limit queries to the parts of the database the
user had permission to view.

What are your plans for IoT development? I found that Datomic is not a good
fit for time-series data; does Datahike offer any advantages?

~~~
whilo
Interesting. Honestly speaking, we have not thought about time-series data a
lot yet, but I think we should be able to provide custom indices and extend
Datalog with more efficient query primitives if necessary. Can you elaborate a
bit? A few years ago I stored HDF5 binary blobs for tensors of experimental
recordings (parameter evolution in spiking neural networks) alongside Datomic,
so it is definitely possible to integrate external index data structures, but
eventually the query engine will need to be aware of how to join them
efficiently.

W.r.t. security, our current approach is to shard access rights and encryption
at the database level and simply provide many databases, one per user. This is
obviously not the most space-efficient approach, but it is the most general
one. If users can share access keys and data, we can also do structural
sharing between these instances and factorize further. We envision single
queries potentially doing joins over dozens of distributed Datahike instances
in a global address space. Since the indices are amortized data structures, it
does not make much sense to encrypt chunks of them differently per user, as
this defeats the optimality guarantees of B+-trees, i.e. you could get very
bad scan behaviour over huge ranges of encrypted Datoms. How have you tried to
partition the data? This is an interesting problem.

We can also expose the datahike-server query endpoint directly and you can
write static checks for access right restrictions. We only do this so far to
limit the usage of builtin functions to safe ones, but you could also go ahead
and do the same for more complex access controls. Some work in this direction
for Datahike has also been done here:
[https://github.com/theronic/eacl](https://github.com/theronic/eacl) Doing
this openly on the internet will also require a resource model to fend off
denial-of-service attacks; fortunately, Datalog engines can have powerful
query planners and we can restrict our runtime to limited budgets as well.

~~~
synthc
For time-series data I encoded an [entityId, timestamp, attribute] tuple as a
big integer, using an order-preserving mapping to ensure that the datoms are
sorted by timestamp. This provided the right functionality; for example, using
seek-datoms we could retrieve the datoms with timestamps in some range, but
performance was poor. I think a custom index could help a lot here. We also
had problems with the database growing too large, and needed to manually shard
the database over time.
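For illustration, an order-preserving mapping like this can be done by packing
the fields into fixed-width bit ranges, so comparing the resulting integers is
the same as comparing the tuples lexicographically. A minimal sketch in Python
(the bit widths and field names here are assumptions, not the layout actually
used):

```python
# Pack an (entity-id, timestamp, attribute-id) tuple into one big integer
# with fixed bit widths per field; integer order then equals lexicographic
# tuple order. Widths are illustrative assumptions, not a real schema.
E_BITS, T_BITS, A_BITS = 40, 48, 20  # entity, millisecond timestamp, attribute

def encode(entity_id: int, ts_millis: int, attr_id: int) -> int:
    # Each field must fit its width, otherwise ordering breaks.
    assert 0 <= entity_id < (1 << E_BITS)
    assert 0 <= ts_millis < (1 << T_BITS)
    assert 0 <= attr_id < (1 << A_BITS)
    return (entity_id << (T_BITS + A_BITS)) | (ts_millis << A_BITS) | attr_id

def decode(key: int) -> tuple:
    attr_id = key & ((1 << A_BITS) - 1)
    ts_millis = (key >> A_BITS) & ((1 << T_BITS) - 1)
    entity_id = key >> (T_BITS + A_BITS)
    return (entity_id, ts_millis, attr_id)

# Order preservation: sorting the encoded keys sorts the tuples.
tuples = [(7, 1000, 3), (7, 999, 9), (6, 5000, 1)]
keys = sorted(encode(*t) for t in tuples)
assert [decode(k) for k in keys] == sorted(tuples)
```

With keys like this, a range scan over [entity, t0] .. [entity, t1] becomes a
contiguous scan in the sorted index, which is what seek-datoms exploits.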

A Datalog equivalent to TimescaleDB (which extends Postgres with time-series
optimizations and time-based table partitioning) would be great.

For client access I tried to define access rules based on attributes (similar
to how many GraphQL frameworks handle this), expressed as Datalog rules. For
example, users have permission to access :user/items and :items/blabla, so a
user X can access [X :user/items Y] and [Y :items/blabla Z]. Some experiments
were promising, but it was slow and I did not find a good way to integrate
this.
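A rough sketch of what such a rule set could look like in Datomic/Datahike-style
Datalog (the rule name and the :acl/readable? marker attribute are hypothetical,
not an existing API — this is just one way to phrase the idea above):

```clojure
;; Hypothetical sketch: an entity ?e is visible to ?user if it is the user
;; entity itself, or is reachable from a visible entity via an attribute
;; whitelisted with :acl/readable? (e.g. :user/items, :items/blabla).
[[(accessible ?user ?e)
  [(identity ?user) ?e]]            ; base case: the user's own entity
 [(accessible ?user ?e)
  (accessible ?user ?parent)
  [?attr :acl/readable? true]       ; only follow whitelisted attributes
  [?parent ?attr ?e]]]              ; ...to reach further visible entities
```

A datom [?e ?a ?v] would then be served only when (accessible ?user ?e) holds
and ?a is itself whitelisted; the recursion is likely where the slowness you
saw comes from, which is why a specialized index or rule evaluator would help.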

~~~
whilo
I see, so your problem was that you wanted to scan over all datoms for one
entity over a time period, and you would have needed an EVAT index? In
Datahike it would be fairly simple to add new indices like this.

Yes, access management must not incur a large overhead; that is why many
systems have a separate, restricted way to express and track rules. My hunch
is that it would still be better to keep it in Datalog and specialize the
query engine so that it is fast on these (potentially restricted) rules and
relations.

