
Crux: Open-source document database with bi-temporal graph queries - memexy
https://opencrux.com/
======
adamkl
Rather than just looking at the home page for Crux (which has a lot of
impressive tech jargon), I recommend watching the related video, “Temporal
Databases for Streaming Architectures” [0], which I found did a much better
job of explaining things.

I read a quote the other day where somebody suggested that the Clojure
ecosystem feels like it’s filled with a bunch of tools from the year 3000, and
I think that Crux is a good example of that. It looks very cool.

[0] [https://youtu.be/ykbYNBE-V3k](https://youtu.be/ykbYNBE-V3k)

~~~
yogthos
And here's a talk from JUXT (the company developing Crux) about it:
[https://www.youtube.com/watch?v=3Stja6YUB94](https://www.youtube.com/watch?v=3Stja6YUB94)

~~~
refset
The Strange Loop 2019 talk linked by the GP is more recent and contains more
technical details, but yes the Clojure/north talk is a nice bit of history
showing the unveiling :)

My personal favourite is the talk from ClojuTRE by the main architect behind
Crux, which looks under the hood a lot more and discusses many of the key
design trade-offs during initial implementation:
[https://youtu.be/YjAVsvYGbuU](https://youtu.be/YjAVsvYGbuU)

~~~
yogthos
Ah I haven't watched the ClojuTRE talk, thanks for the link. :)

------
memexy
I looked through the docs and these are the main points of "bi-temporal" and
"unbundled":

> Record the true history of your business whilst also preserving an immutable
> transaction record - this is the essence of bitemporality. Unlock the
> ability to efficiently query through time, as a developer and as an
> application user. Crux enables you to create retroactive corrections,
> simplify historical data migrations and build an integrated view of out-of-
> order event data.

> Apache Kafka for the primary storage of transactions and documents as semi-
> immutable logs.

> RocksDB or LMDB to host indexes for rich query support.
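
If I'm reading the docs right, here's a minimal sketch of what this looks
like from Clojure. The option map for starting a node seems to vary by Crux
version, so treat the set-up as illustrative:

    (require '[crux.api :as crux])

    ;; Start an in-memory node (start-up options differ across versions).
    (def node (crux/start-node {}))

    ;; `put` a document; the optional third element is the valid time,
    ;; so a fact can be recorded as of a past business date.
    (crux/submit-tx node
      [[:crux.tx/put
        {:crux.db/id :product/widget :price 10}
        #inst "2020-01-01"]])
    (crux/sync node) ; block until the transaction is indexed

    ;; Query the database as of a point in valid time.
    (crux/q (crux/db node #inst "2020-01-01")
            '{:find [e price]
              :where [[e :price price]]})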

~~~
Scarbutt
Sounds good except for "is a document db".

~~~
memexy
What's wrong with document DBs?

~~~
adamkl
Document DBs are ok if you only ever want to use the data in the same way that
you stored it. There are definitely use cases where that makes sense, but you
lose a lot of query power when compared to a relational database, which makes
it less adaptable to changing requirements.

That being said, I believe (it’s been a while since I looked at Crux) that it
actually maintains a separate set of Entity-Attribute-Value indices generated
from the contents of the immutable series of documents stored in
Kafka/RocksDB.

This should give you a similar level of query power to something like
Datomic.

Disclaimer, I haven’t had any actual hands-on experience with either Crux or
Datomic, so I could be wrong.

~~~
refset
That's a pretty accurate summary. Documents are very straightforward to reason
about at a transactional level, particularly when handling unpredictable
changes to data models.

Once documents are ingested into the EAV-like indexes you can query everything
much the same as a typical graph database, because all the ID references
within the documents translate into edges within a schemaless graph of
entities.
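
For example, with hypothetical data and a node started as in the earlier
sketch: one document holds another's :crux.db/id, and a Datalog query joins
across that edge:

    ;; :company/founder holds a plain ID reference, which the indexes
    ;; treat as an edge between the two entities.
    (crux/submit-tx node
      [[:crux.tx/put {:crux.db/id :person/alice :person/name "Alice"}]
       [:crux.tx/put {:crux.db/id :company/acme
                      :company/name "Acme"
                      :company/founder :person/alice}]])
    (crux/sync node)

    ;; Walk the edge, much as you would in a graph database.
    (crux/q (crux/db node)
            '{:find [?name]
              :where [[?c :company/name "Acme"]
                      [?c :company/founder ?p]
                      [?p :person/name ?name]]})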

Crux's native query language is a flavour of Datalog but we also have a SQL
layer in the works using Apache Calcite, for a more traditional relational
feel.

------
refset
Hi everyone, thanks for the interest!

I help steer the roadmap for Crux, which is openly visible on GitHub, and we
have a 1.9 release coming up in the next couple of weeks that will be the
most significant milestone since we launched last year.

Shortly after 1.9 we will be adding SQL query support, which uses Apache
Calcite to compile SQL into Datalog on-the-fly, and end-to-end JSON APIs. Both
of these together should really broaden the appeal of Crux far beyond Clojure
and the Java/JVM communities. Stay tuned!
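
To give a flavour of the idea (purely illustrative; the exact Datalog the
Calcite layer will produce may well differ):

    ;; SELECT NAME FROM PERSON WHERE AGE > 21
    ;; might compile on-the-fly to something along the lines of:
    '{:find [?name]
      :where [[?p :person/name ?name]
              [?p :person/age ?age]
              [(> ?age 21)]]}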

I would be very happy to answer any questions and hear feedback, as ever.

------
lichtenberger
Great to see the advances made in Crux :-)

I'm working on a versioned database system[1] in my spare time which offers
similar features and benefits. The core is Java based, whereas the server is
written in Kotlin using Vert.x. Python and TypeScript clients exist for the
non-blocking, asynchronous HTTP server.

The data store can easily be embedded into other Java/Kotlin based projects
without going through the server.

Lately I've changed a lot in the set-oriented query engine, which uses an
XQuery derivative to process both XML and JSON as well as structured data.
For now I've integrated rule-based index rewriting to answer path queries,
and path queries with one predicate, through the index, filtering false
positives if necessary. Next, I'll add a twig-based operator and AST rewrite
rules to answer multiple path queries with a single scan over smaller
subtrees.

Furthermore, I want to use the new Java Foreign Memory API to memory-map the
storage file and change the tuple-at-a-time iterator model to fetch batches
of tuples at once for better SIMD support through the JVM.

I've also thought about replacing the Kafka backend with SirixDB itself, as
SirixDB doesn't need a WAL for consistency... so SirixDB can hopefully scale
horizontally at some point in the future.

Some of the features so far:

    - storage engine written from scratch
    - completely isolated read-only transactions and one read/write
      transaction running concurrently, with a single lock to guard the
      writer. Readers are never blocked by the read/write transaction and
      execute without any latches/locks.
    - variable-sized pages
    - lightweight buffer management with a "kind of" pointer swizzling
    - dropping the need for a write-ahead log due to atomic switching of an
      UberPage
    - optional rolling Merkle hash tree of all nodes, built during updates
    - ID-based diff algorithm to determine differences between revisions,
      optionally taking the (secure) hashes into account
    - serialization of edit operations for changed subtree roots, to make
      comparisons between consecutive revisions (or subtrees thereof)
      incredibly fast (no diff computation needed at all)
    - non-blocking REST API, which also takes the hashes into account to
      throw an error if a subtree has concurrently been modified in the
      meantime during updates
    - versioning through a huge persistent and durable, variable-sized page
      tree using copy-on-write
    - storing delta page-fragments using a patented sliding snapshot
      algorithm
    - a special trie, which is especially good for storing records with
      numerically dense, monotonically increasing 64-bit integer IDs. We
      make heavy use of bit shifting to calculate the path to fetch a record
    - time- or modification-counter-based auto commit
    - versioned, user-defined secondary index structures
    - a versioned path summary
    - indexing every revision, such that a timestamp is only stored once in
      a RevisionRootPage. The resources stored in SirixDB are based on a
      huge, persistent (functional) and durable tree
    - sophisticated time-travel queries

I'm spending a lot of my spare time on the project and would love to spend
even more, so give it a try :-) Any help is more than welcome.

Kind regards, Johannes

[1] [https://sirix.io](https://sirix.io) and
[https://github.com/sirixdb/sirix](https://github.com/sirixdb/sirix)

------
slifin
I might need to look at Crux again; from my perspective it was missing pull
syntax and schema-on-write.

But reading in this thread that documents are converted into EAVT for every
attribute? Sounds interesting

~~~
dustingetz
Without schema-on-write, the A in EAVT has no meaning. A tells us nothing
about V. That makes it basically impossible for declarative systems like
Hyperfiddle to understand your data structures and auto-generate UIs and such
from the attributes. The fourth goal of Datomic is to "enable declarative data
programming in applications" [1]. That's the design goal that makes Datomic so
interesting to me, and Crux does not share it. Which is fine. Crux have given
great reasons for doing things that way.

[1] [https://www.infoq.com/articles/Datomic-Information-Model/](https://www.infoq.com/articles/Datomic-Information-Model/)

~~~
refset
Within Crux's new transaction functions you are able to run queries against
the entire database and (very soon) also against speculative transactions, to
enforce all manner of constraints and control whether the transactions
succeed/fail. Schema-on-write is therefore a matter of funnelling all `put` &
`delete` operations through these functions. I suspect someone could even
create a library of transaction functions to emulate the entire Datomic data
model, if so desired.
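
Here's a rough sketch of such a funnelling function, reusing the node from
earlier (illustrative only; the precise transaction function form may differ
between releases):

    ;; Install a transaction function that only admits documents with a
    ;; string :person/name -- a toy stand-in for a real schema check.
    (crux/submit-tx node
      [[:crux.tx/put
        {:crux.db/id :validated-put
         :crux.db/fn '(fn [ctx doc]
                        (if (string? (:person/name doc))
                          [[:crux.tx/put doc]] ; accept: emit the real put
                          false))}]])          ; reject: abort the transaction

    ;; All writes are funnelled through the function instead of raw puts.
    (crux/submit-tx node
      [[:crux.tx/fn :validated-put
        {:crux.db/id :person/bob :person/name "Bob"}]])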

I still believe schema-on-read is more desirable in general, though, as you
can simply extend a query to only care about the AV combinations which
conform to some definition for that A. The Calcite SQL integration we've been
building works like this. Conformance could also be processed asynchronously,
with documents labelled as "conformed" as and when, to reduce the burden
during general queries.

(also, hey!)

------
acd
The time history feature is very smart!

~~~
refset
Aha, well it's certainly been a learning curve for me!

I found the public MIT lectures on "retroactive data structures" (by Erik
Demaine) very helpful to clarify the mental model and properties of temporal
databases. Point-in-time bitemporal queries are fairly simple compared to the
kinds of things some organisations use data warehouses for (e.g. temporal
range queries across N dimensions), but surprisingly few teams think about
representing time so clearly within their transactional OLTP/HTAP systems, so
even point-in-time queries seem more novel than they really should.
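
For a small illustration of the simple point-in-time case, with hypothetical
data and a node as in the earlier sketches:

    ;; Record a fact as of 2020-01-01, then retroactively correct it at
    ;; the same valid time once better information arrives.
    (crux/submit-tx node
      [[:crux.tx/put {:crux.db/id :order/1 :status :shipped}
        #inst "2020-01-01"]])
    (crux/submit-tx node
      [[:crux.tx/put {:crux.db/id :order/1 :status :cancelled}
        #inst "2020-01-01"]])
    (crux/sync node)

    ;; A valid-time query sees the corrected history...
    (crux/entity (crux/db node #inst "2020-01-02") :order/1)
    ;; => {:crux.db/id :order/1, :status :cancelled}

    ;; ...while fixing the transaction time as well, via
    ;; (crux/db node valid-time transaction-time), recovers what we
    ;; believed before the correction arrived.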

Everything is moving in the right direction though as immutable data becomes
the default.

~~~
lukashrb
Ty for the link. Do you have more material that helped you with your work on
Crux? And do you mind sharing?

~~~
refset
The best place to start for context would be Håkan's ClojuTRE talk:
[https://www.youtube.com/watch?v=YjAVsvYGbuU](https://www.youtube.com/watch?v=YjAVsvYGbuU)

And here are the slides for that talk, which include a few references:
[https://juxt.pro/hakan-raberg-the-design-and-implementation-of-a-bitemporal-dbms.pdf](https://juxt.pro/hakan-raberg-the-design-and-implementation-of-a-bitemporal-dbms.pdf)

Feel free to reach out to the team, or find us on Zulip/Slack, if you want to
hear more or just chat :)

