Crux: Open-source document database with bi-temporal graph queries (opencrux.com)
161 points by memexy 5 months ago | 33 comments

Rather than just looking at the home page for Crux (which has a lot of impressive tech jargon), I recommend that you watch the related video, “Temporal Databases for Streaming Architectures”, on YouTube [0], which I found did a much better job of explaining things.

I read a quote the other day where somebody suggested that the Clojure ecosystem feels like it’s filled with a bunch of tools from the year 3000, and I think that Crux is a good example of that. It looks very cool.

[0] https://youtu.be/ykbYNBE-V3k

And here's a talk from JUXT (company developing Crux) about it https://www.youtube.com/watch?v=3Stja6YUB94

The Strange Loop 2019 talk linked by the GP is more recent and contains more technical details, but yes the Clojure/north talk is a nice bit of history showing the unveiling :)

My personal favourite is the talk from ClojuTRE by the main architect behind Crux, which looks under the hood a lot more and discusses many of the key design trade-offs during initial implementation: https://youtu.be/YjAVsvYGbuU

Ah I haven't watched the ClojuTRE talk, thanks for the link. :)

I wonder what stops people from adopting these tools/tech. E.g. a database like Crux sounds pretty compelling to me.

Well, I think Postgres will remain a very safe default choice in new projects for most organisations for a long time yet, because the skills to work with it are common and relatively cheap.

However, Crux is much easier to justify with stakeholders if you have a complex problem revolving around ad-hoc graph joins or bitemporal queries & audit requirements. Or if you simply have high single-threaded transaction throughput requirements. Postgres in particular doesn't have pleasant answers to this intersection of problems.

I think the tables will inevitably turn against currently-safe choices like Postgres though, as systems like Crux are going to continue to be able to take advantage of the latest and greatest ideas, relatively unencumbered by decades of antiquated design decisions and legacy code. This is where the whole Clojure philosophy really shines in Crux, as all the layers interact through clean and composable Clojure protocols that facilitate new kinds of database modularity. These protocols allow the community to take advantage of whatever wonderful cloud services they might want to use, without any additional ceremony to interact with the core development team.

EDIT: Finally, as new generations of developers emerge the pressure for the industry to standardise on a better relational query language will build. Fingers are crossed for Datalog!

I looked through the docs and these are the main points of "bi-temporal" and "unbundled":

> Record the true history of your business whilst also preserving an immutable transaction record - this is the essence of bitemporality. Unlock the ability to efficiently query through time, as a developer and as an application user. Crux enables you to create retroactive corrections, simplify historical data migrations and build an integrated view of out-of-order event data.

> Apache Kafka for the primary storage of transactions and documents as semi-immutable logs.

> RocksDB or LMDB to host indexes for rich query support.
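To make the bitemporality quote above a bit more concrete, here is a minimal Python sketch. It is not Crux's API or storage format: `BitemporalStore`, `Version`, the integer timestamps and the account example are all made up for illustration. The idea it tries to show is that every write carries a valid time (when the fact was true in the business) and a transaction time (when it was recorded), and that nothing is overwritten, so a retroactive correction is just another append.

```python
from dataclasses import dataclass
from typing import Any, List, Optional


@dataclass(frozen=True)
class Version:
    entity_id: str
    valid_time: int  # when the fact became true in the real world
    tx_time: int     # when the store recorded it
    doc: Any


class BitemporalStore:
    """Hypothetical in-memory model, not Crux's real index layout."""

    def __init__(self) -> None:
        self._log: List[Version] = []  # append-only, never mutated
        self._tx_clock = 0             # stand-in for a transaction log offset

    def put(self, entity_id: str, doc: Any, valid_time: int) -> int:
        """Append a new version and return its transaction time."""
        self._tx_clock += 1
        self._log.append(Version(entity_id, valid_time, self._tx_clock, doc))
        return self._tx_clock

    def entity(self, entity_id: str, valid_time: int,
               tx_time: Optional[int] = None) -> Any:
        """Return the doc visible at (valid_time, tx_time), or None."""
        as_of_tx = self._tx_clock if tx_time is None else tx_time
        visible = [
            v for v in self._log
            if v.entity_id == entity_id
            and v.valid_time <= valid_time
            and v.tx_time <= as_of_tx
        ]
        if not visible:
            return None
        # Latest valid time wins; ties are broken by the latest transaction.
        return max(visible, key=lambda v: (v.valid_time, v.tx_time)).doc


store = BitemporalStore()
store.put("acct", {"balance": 100}, valid_time=10)  # recorded at tx 1
store.put("acct", {"balance": 150}, valid_time=20)  # recorded at tx 2
store.put("acct", {"balance": 120}, valid_time=10)  # retroactive fix, tx 3

print(store.entity("acct", valid_time=15))             # {'balance': 120}
print(store.entity("acct", valid_time=15, tx_time=2))  # {'balance': 100}
```

Asking for valid time 15 returns the corrected balance, while adding `tx_time=2` replays what the store believed before the correction was recorded.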

It's worth noting that Crux supports a healthy range of options for transaction log and document storage beyond Kafka, including a variety of JDBC backends (SQLite, Postgres etc.)

Kafka is usually more than most people want or need. There was a blog post recently about a new Firebase-like Clojure web framework which uses Crux + Digital Ocean's managed Postgres service: https://findka.com/blog/migrating-to-biff/

Disclosure: working on Crux :)

> including a variety of JDBC backends (SQLite, Postgres etc.)

How straightforward would it be to add a new JDBC-based node implementation? I'm noticing that SQL Server is apparently missing, which is a bit of a shame given how many off-the-shelf products use it (and could therefore probably integrate with Crux if Crux did so, too).

It's usually no more than a day or two's work for us to get a new dialect figured out and to wire up CI tests with containers etc.

It wouldn't be infeasible for someone new to Clojure to manage it in about the same time, too. The hard part is the SQL, not the Clojure.

Fortunately there actually is SQL Server support already: https://github.com/juxt/crux/blob/master/crux-jdbc/src/crux/...

...and I'll make sure the full range of backends is made clearer somewhere. Good feedback!

Sounds good except for "is a document db".

What's wrong with document DBs?

Document DBs are ok if you only ever want to use the data in the same way that you stored it. There are definitely use cases where that makes sense, but you lose a lot of query power when compared to a relational database, which makes it less adaptable to changing requirements.

That being said, I believe (it’s been a while since I looked at Crux) it actually maintains a separate set of Entity-Attribute-Value indices generated from the contents of the immutable series of documents stored in Kafka/RocksDB.

This should give you a similar level of query power when compared to something like Datomic.

Disclaimer, I haven’t had any actual hands-on experience with either Crux or Datomic, so I could be wrong.

That's a pretty accurate summary. Documents are very straightforward to reason about at a transactional level, particularly when handling unpredictable changes to data models.

Once documents are ingested into the EAV-like indexes you can query everything much the same as a typical graph database, because all the ID references within the documents translate into edges within a schemaless graph of entities.
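As a rough illustration of that idea (and only that: Crux's real indexes and Datalog engine work differently), here is a small Python sketch that flattens documents into (entity, attribute, value) triples and then joins across them. The `id` key, the `manages` attribute and the helper names `doc_to_triples`, `index` and `match` are assumptions invented for this example.

```python
from typing import Any, Dict, Iterable, List, Optional, Set, Tuple

Triple = Tuple[str, str, Any]


def doc_to_triples(doc: Dict[str, Any]) -> List[Triple]:
    """Flatten one document into (entity, attribute, value) triples."""
    eid = doc["id"]
    triples: List[Triple] = []
    for attr, value in doc.items():
        if attr == "id":
            continue
        # A collection-valued attribute yields one triple per element.
        values = value if isinstance(value, (list, set, tuple)) else [value]
        for v in values:
            triples.append((eid, attr, v))
    return triples


def index(docs: Iterable[Dict[str, Any]]) -> Set[Triple]:
    """Schemaless 'index': just the set of all triples (values must be hashable)."""
    return {t for d in docs for t in doc_to_triples(d)}


def match(triples: Set[Triple], e: Optional[str] = None,
          a: Optional[str] = None, v: Any = None) -> List[Triple]:
    """Return triples matching the given pattern; None acts as a wildcard."""
    return [
        (te, ta, tv) for te, ta, tv in triples
        if (e is None or te == e)
        and (a is None or ta == a)
        and (v is None or tv == v)
    ]


docs = [
    {"id": "alice", "name": "Alice", "manages": ["bob", "carol"]},
    {"id": "bob", "name": "Bob", "manages": ["dave"]},
    {"id": "carol", "name": "Carol"},
    {"id": "dave", "name": "Dave"},
]
triples = index(docs)

# Graph join: follow the "manages" edges from alice, then look up each name.
names = {
    name
    for _, _, report in match(triples, e="alice", a="manages")
    for _, _, name in match(triples, e=report, a="name")
}
print(sorted(names))  # ['Bob', 'Carol']
```

Nothing about `manages` was declared up front; the reference only becomes an edge because its value happens to be another document's ID.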

Crux's native query language is a flavour of Datalog but we also have a SQL layer in the works using Apache Calcite, for a more traditional relational feel.

Not GP but my guess is that "document DB" is really a term that can mean all things to all people, and is even used as a trademark (by AWS as name for their Mongo-API-compatible product). As used in this context, "JSON store" would be more appropriate maybe. I know that in academic DB literature, "document DB" refers to any storage that leaves the original serialization intact (if it has a serialized form to begin with). But outside of that context, if people were to google "document database", they probably commonly expect to find a product that can store, retrieve, fulltext-search, version-control, manage the access permissions and transform pipelines of real documents, those being either arbitrary binary document formats, or more specifically markup (XML, SGML, HTML) documents. I'd also suggest vendors come up with more precise nomenclature in their own best interest, because the way it's now is that laypeople will be directed towards the products with the most ad spent rather than what they're looking for in the majority of cases.

These are definitely valid points, thanks. I would like to see Crux evolve into something more comprehensive like MarkLogic eventually, partly to not disappoint anyone with those common expectations of what a "document database" ought to have, but also because it feels like the right direction to move in if we want to see an offline-first world succeed in the future...and pick-up where Lotus Notes stopped innovating!

If we had native JSON support already I suspect we would have opted for "JSON Store", but we've only been able to support edn since we launched and "edn Store" isn't too helpful.

An official Lucene index is very likely going to be added for full-text search at some point. I know of two users in the community who have already integrated Lucene themselves, which is actually fairly straightforward as long as you don't need to do searches in the past or can cope with returning results from across all of time.

MarkLogic today advertises data integration and NoSQL, though, rather than being a "document DB", which I guess is even worse as a moniker.

Indeed, it's an impressively eclectic product, considering how niche it seems to still be.

Incidentally they have some fairly nice materials about their own bitemporal query support and its use across industries, e.g.:



Perhaps you don't know enough about the subject ;)

I too had that opinion once, until I started real DDD and event sourcing.

An opinion can change quickly ;)

Could you expand on this a bit or point to articles that talk about it?

It sounds interesting

I can only guess what was meant, but I suspect the main argument is that projection code in an event sourced system is much more nimble when it's not tied down to an underlying schema. Documents make focusing on language level types much easier.

That said, we've not built Crux with domain-level event sourcing in mind in particular. There is certainly overlap in the bitemporal model Crux has and what I gather some people are doing with retroactive events.

This blog post on Datomic and event sourcing is interesting and relevant for Crux also (Crux has been heavily inspired by Datomic): https://vvvvalvalval.github.io/posts/2018-11-12-datomic-even...

I would advise you to look up domain events and follow the rabbit hole from there ;)

Perhaps "ddd quickly" is a good 66 page summary.

Are there any tools you recommend for DDD? Have you seen much justification for bitemporal or retroactive events, as in https://dddeurope.com/2018/speakers/thomas-pierrain/ ?

Hi everyone, thanks for the interest!

I help steer the roadmap for Crux, which is very openly visible on GitHub, and we have a 1.9 release coming up in the next couple of weeks which will be the most significant milestone since we launched last year.

Shortly after 1.9 we will be adding SQL query support, which uses Apache Calcite to compile SQL into Datalog on-the-fly, and end-to-end JSON APIs. Both of these together should really broaden the appeal of Crux far beyond Clojure and the Java/JVM communities. Stay tuned!

I would be very happy to answer any questions and hear feedback, as ever.

Great to see the advances made in Crux :-)

I'm working on a versioned database system[1] in my spare time which offers similar features and benefits. The core is Java based, whereas the Server is written in Kotlin using Vert.x. A Python client and, currently, a TypeScript-based client exist for the non-blocking, asynchronous HTTP server.

The data store can easily be embedded into other Java/Kotlin based projects without having to use the Server indirection.

Lately I changed a lot of stuff regarding the set-oriented query engine which uses an XQuery derivative to process both XML and JSON as well as structured data. For now I've integrated rule-based index rewriting to answer path queries and path queries with one predicate through the index, filtering false positives if necessary. Next, I'll add a twig-based operator and AST rewrite rules to answer multiple path queries with a single scan over smaller subtrees.

Furthermore, I want to use the new Java Foreign Memory API to memory map the storage file and change the tuple-at-a time iterator model to fetch batches of tuples at once for better SIMD support through the JVM.

I've also thought about replacing the Kafka backend with SirixDB itself, as SirixDB doesn't need a WAL for consistency... so SirixDB can hopefully horizontally scale at some point in the future.

Some of the features so far:

    - storage engine written from scratch
    - completely isolated read-only transactions and one read/write transaction concurrently with a single lock to guard the writer. Readers will never be blocked by the single read/write transaction and execute without any latches/locks.
    - variable sized pages
    - lightweight buffer management with a "kind of" pointer swizzling
    - dropping the need for a write ahead log due to atomic switching of an UberPage
    - rolling merkle hash tree of all nodes built during updates optionally
    - ID-based diff-algorithm to determine differences between revisions taking the (secure) hashes optionally into account
    - serialization of edit operations for changed subtree roots to make comparisons between consecutive revisions or subtrees thereof incredibly fast (no-diff computation needed at all)
    - non-blocking REST-API, which also takes the hashes into account to throw an error if a subtree has been modified in the meantime concurrently during updates
    - versioning through a huge persistent and durable, variable sized page tree using copy-on-write
    - storing delta page-fragments using a patented sliding snapshot algorithm
    - using a special trie, which is especially good for storing records with numerically dense, monotonically increasing 64-bit integer IDs. We make heavy use of bit shifting to calculate the path to fetch a record
    - time or modification counter based auto commit
    - versioned, user-defined secondary index structures
    - a versioned path summary
    - indexing every revision, such that a timestamp is only stored once in a RevisionRootPage. The resources stored in SirixDB are based on a huge, persistent (functional) and durable tree 
    - sophisticated time travel queries
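
The bit-shifting point in the list above can be sketched in a few lines of Python. This is only a guess at the general technique, not SirixDB's actual page layout: the fan-out of 10 bits per level, the depth of 4, and the names `trie_path` and `IdTrie` are assumptions chosen for illustration. With dense integer IDs, each level's slot is just a fixed-width slice of the ID, so finding a record needs no key comparisons.

```python
from typing import Any, Dict, List


def trie_path(record_id: int, bits_per_level: int = 10, levels: int = 4) -> List[int]:
    """Slot index at each level, root first, obtained by shifting and masking."""
    if record_id < 0 or record_id >= 1 << (bits_per_level * levels):
        raise ValueError("record_id does not fit in this trie")
    mask = (1 << bits_per_level) - 1
    return [
        (record_id >> (bits_per_level * level)) & mask
        for level in range(levels - 1, -1, -1)
    ]


class IdTrie:
    """Hypothetical fixed-depth trie keyed by dense integer IDs."""

    def __init__(self, bits_per_level: int = 10, levels: int = 4) -> None:
        self.bits_per_level = bits_per_level
        self.levels = levels
        self.root: Dict[int, Any] = {}

    def put(self, record_id: int, record: Any) -> None:
        path = trie_path(record_id, self.bits_per_level, self.levels)
        node = self.root
        for slot in path[:-1]:
            node = node.setdefault(slot, {})  # create inner nodes lazily
        node[path[-1]] = record

    def get(self, record_id: int) -> Any:
        node: Any = self.root
        for slot in trie_path(record_id, self.bits_per_level, self.levels):
            if slot not in node:
                return None
            node = node[slot]
        return node


print(trie_path(1025))  # [0, 0, 1, 1]
trie = IdTrie()
trie.put(1025, "record-1025")
print(trie.get(1025))   # record-1025
```

Because IDs increase monotonically, consecutive records land in neighbouring slots of the same leaf, which is presumably part of why this shape suits dense IDs.
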
As I'm spending a lot of my spare time on the project and would love to spend even more time, give it a try :-) Any help is more than welcome.

Kind regards Johannes

[1] https://sirix.io and https://github.com/sirixdb/sirix

I might need to look at Crux again; from my perspective it was missing pull syntax and schema-on-write.

But I'm reading in this thread that documents are converted into EAVT for every attribute? Sounds interesting.

Things have been moving quickly in recent months! We've already done some prototyping on a pull syntax (using EQL) and will start proper work on it after the 1.9 release is out.

Alongside some less-exciting changes, 1.9 will include full transaction function support, so schema-on-write is much simpler to achieve now. It took us time to figure out how to resolve it whilst not completely breaking bitemporality & eviction. Here are a few tests showing how to use transaction functions now: https://github.com/juxt/crux/blob/b13b4da988ed7e91dc7685e5d5...

> documents are converted into EAVT for every attribute?

The indexes are certainly schemaless, which means that every document attribute gets turned into a set of one or more triples that can be joined against without needing to declare anything about the attributes upfront, but internally the indexes don't follow the same EAVT pattern you may be familiar with from Datomic etc. Check out the ClojuTRE talk (linked in these comments) for a good overview of roughly what the internal indexes look like. It's a little out of date now, but still gives a strong impression of how things work today.

Without schema-on-write, the A in EAVT has no meaning. A tells us nothing about V. That makes it basically impossible for declarative systems like Hyperfiddle to understand your data structures and auto-generate UIs and such from the attributes. The fourth goal of Datomic is to "enable declarative data programming in applications" [1]. That's the design goal that makes Datomic so interesting to me, and Crux does not share it. Which is fine. Crux have given great reasons for doing things that way.


Within Crux's new transaction functions you are able to run queries against the entire database and (very soon) also against speculative transactions, to enforce all manner of constraints and control whether the transactions succeed/fail. Schema-on-write is therefore a matter of funnelling all `put` & `delete` operations through these functions. I suspect someone could even create a library of transaction functions to emulate the entire Datomic data model, if so desired.

However, I still believe schema-on-read is more desirable in general, as you can simply extend a query to only care about the AV combinations which conform to some definition for that A. The Calcite SQL integration we've been building works like this. Conformance could also be processed asynchronously and documents be labelled as "conformed" as/when to reduce the burden during general queries.
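The difference between the two approaches can be shown with a small Python sketch. It does not use Crux's transaction function API or its Calcite integration; `SCHEMA`, `put_checked` and `conforming` are hypothetical names, and the per-attribute predicates are an assumed way of writing "some definition for that A". Schema-on-write rejects a bad value at transaction time, while schema-on-read stores everything and lets the query ignore attribute-value pairs that do not conform.

```python
from typing import Any, Callable, Dict, List, Tuple

Triple = Tuple[str, str, Any]
Schema = Dict[str, Callable[[Any], bool]]

# Hypothetical per-attribute definitions: attribute name -> predicate on the value.
SCHEMA: Schema = {
    "age": lambda v: isinstance(v, int) and not isinstance(v, bool) and v >= 0,
    "email": lambda v: isinstance(v, str) and "@" in v,
}


def put_checked(db: List[Triple], triple: Triple, schema: Schema) -> bool:
    """Schema-on-write: refuse the write if a declared attribute's value is invalid."""
    _, attr, value = triple
    check = schema.get(attr)
    if check is not None and not check(value):
        return False  # the whole write is rejected, nothing is stored
    db.append(triple)
    return True


def conforming(triples: List[Triple], schema: Schema) -> List[Triple]:
    """Schema-on-read: keep only triples whose attribute is declared and whose value conforms."""
    return [(e, a, v) for e, a, v in triples if a in schema and schema[a](v)]


# Schema-on-read: everything was stored as-is, the query filters afterwards.
raw = [
    ("u1", "age", 34),
    ("u2", "age", "unknown"),       # stored, but ignored by conforming queries
    ("u3", "email", "c@example.com"),
    ("u4", "nickname", "dd"),       # no definition for this attribute
]
print(conforming(raw, SCHEMA))
# [('u1', 'age', 34), ('u3', 'email', 'c@example.com')]

# Schema-on-write: the bad value never reaches the database.
db: List[Triple] = []
print(put_checked(db, ("u1", "age", 34), SCHEMA))         # True
print(put_checked(db, ("u2", "age", "unknown"), SCHEMA))  # False
```

In the read-side version the non-conforming rows are still there for a later, more lenient query, which is the flexibility being argued for above.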

(also, hey!)

The time history feature is very smart!

Aha, well it's certainly been a learning curve for me!

I found the public MIT lectures on "retroactive data structures" (by Erik Demaine) very helpful to clarify the mental model and properties of temporal databases. Point-in-time bitemporal queries are fairly simple compared to the kinds of things some organisations use data warehouses for (e.g. temporal range queries across N dimensions), but surprisingly few teams think about representing time so clearly within their transactional OLTP/HTAP systems, so even point-in-time queries seem more novel than they really should.

Everything is moving in the right direction though as immutable data becomes the default.

Ty for the link. Do you have more material that helped you with your work on crux? And do you mind sharing?

The best place to start for context would be Håkan's ClojuTRE talk: https://www.youtube.com/watch?v=YjAVsvYGbuU

And here are the slides for that talk, which include a few references: https://juxt.pro/hakan-raberg-the-design-and-implementation-...

Feel free to reach out to the team, or find us on Zulip/Slack, if you want to hear more or just chat :)
