Show HN: SirixDB – Bitemporal binary JSON database system and event store (github.com/sirixdb)
109 points by lichtenberger on Nov 13, 2023 | 16 comments
I had already posted the project a couple of years ago, and it gained some interest. A lot has been done since then: performance work, a completely new JSON store, a REST API, refactored internals, an improved JSONiq-based query engine that supports updates and set-oriented join optimizations, a (by now dated) web UI, a new Kotlin-based CLI, and Python and TypeScript clients to ease the use of Sirix. The first prototypes of a precursor date back to 2005.

So, what is it all about?

The system uses ideas from ZFS (a keyed index trie, storing checksums in parent pages...) and Git (a persistent index structure that shares unchanged pages between revisions) but appends new tree roots on each commit [1][2].
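To illustrate the page sharing between revisions, here is a toy Python sketch of a path-copying trie. It is not SirixDB's actual code: `Page`, `put`, `get`, `FANOUT` and `HEIGHT` are made-up names, and everything lives in memory instead of in an append-only file.

```python
from dataclasses import dataclass
from typing import Optional, Tuple

FANOUT = 4   # children per page (hypothetical value)
HEIGHT = 3   # page levels, so keys range over 0 .. FANOUT**HEIGHT - 1


@dataclass(frozen=True)
class Page:
    children: Tuple[Optional[object], ...] = (None,) * FANOUT


def put(page, key, value, level=HEIGHT - 1):
    """Return a new root; only the pages on the path to `key` are copied."""
    page = page or Page()
    slot = (key // FANOUT ** level) % FANOUT
    child = value if level == 0 else put(page.children[slot], key, value, level - 1)
    return Page(page.children[:slot] + (child,) + page.children[slot + 1:])


def get(page, key, level=HEIGHT - 1):
    """Walk from a revision's root down to the value, or None if absent."""
    while page is not None and level >= 0:
        page = page.children[(key // FANOUT ** level) % FANOUT]
        level -= 1
    return page


# Each commit appends a new root; old roots stay valid and readable.
revisions = [None]                              # revision 0: empty
revisions.append(put(revisions[-1], 5, "a"))    # revision 1
revisions.append(put(revisions[-1], 42, "b"))   # revision 2
```

Here `revisions[2].children[0] is revisions[1].children[0]` holds: the subtree containing key 5 was not touched by the second commit, so both roots point to the very same page object.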

It is a JSON DBS. The system stores fine-grained JSON nodes. Thus, there's almost no limit to the structure and size of an object. Objects can be arbitrarily nested, and updates are cheap.

On a high level, it supports space-efficient snapshots, tracking changes by an author / optional commit messages, time travel queries, reverting to previous revisions (while all revisions in-between still exist for audits...), or retrieving the changes of whole (sub)trees.

On the one hand, it is thus a bitemporal DBS; on the other hand, it can be used as a simple event store. It stores the state after an event or a change occurs and tracks the changes.

Thus, an entity, a node in the JSON structure, can be updated to new values and eventually be removed while the history is easily retrievable, or we can easily revert to a previous state. The system assigns a unique ID to each new node, which never changes and is never reused (even after the deletion of the node). Thus, the system stores the state after the change/event and the event itself (the change event).
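A small Python sketch of what stable node IDs plus retrievable history mean in practice. This is only an illustration under simplifying assumptions: `VersionedNodes` and its methods are hypothetical names, and every operation is committed as its own revision, which is not how the real system batches changes in a transaction.

```python
import itertools

TOMBSTONE = object()  # marks a removal in the append-only log


class VersionedNodes:
    def __init__(self):
        self._ids = itertools.count(1)   # IDs only ever grow, never reused
        self._log = {}                   # node_id -> [(revision, value), ...]
        self.revision = 0

    def _record(self, node_id, value):
        self.revision += 1
        self._log.setdefault(node_id, []).append((self.revision, value))
        return self.revision

    def insert(self, value):
        node_id = next(self._ids)
        self._record(node_id, value)
        return node_id

    def update(self, node_id, value):
        return self._record(node_id, value)

    def remove(self, node_id):
        return self._record(node_id, TOMBSTONE)

    def value_at(self, node_id, revision):
        """State of the node as of `revision`; None if absent or removed."""
        state = None
        for rev, value in self._log.get(node_id, []):
            if rev > revision:
                break
            state = None if value is TOMBSTONE else value
        return state


db = VersionedNodes()
a = db.insert("draft")   # revision 1, node ID 1
db.update(a, "final")    # revision 2
db.remove(a)             # revision 3
b = db.insert("other")   # revision 4, node ID 2 (ID 1 is not reused)
```

`db.value_at(a, 1)` still returns `"draft"` after the node has been removed, and the log entries themselves are the change events.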

The leaf pages of the index structures are not simply copied during a write. Instead, a sliding window algorithm is applied, such that only modified nodes and nodes that fall out of the sliding window have to be written. The window length is configurable. This avoids both the write peaks that periodic full snapshots would cause and the need to read a long chain of incremental changes in between.
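The sliding window idea can be sketched in a few lines of Python. This is a simplified model, not the real implementation: a leaf page is a dict of slot to value, one fragment is written per revision, deletions are ignored, and `WINDOW`, `write_fragment` and `read_page` are made-up names.

```python
WINDOW = 3  # hypothetical window length (number of fragments a reader merges)


def write_fragment(fragments, changes):
    """Append the page fragment for the next revision.

    The fragment holds the slots modified now, plus any slot whose only
    copy sits in the fragment that is about to leave the window.
    """
    fragment = dict(changes)
    if len(fragments) >= WINDOW:
        leaving = fragments[-WINDOW]
        still_visible = fragments[-(WINDOW - 1):] if WINDOW > 1 else []
        for slot, value in leaving.items():
            if slot not in fragment and not any(slot in f for f in still_visible):
                fragment[slot] = value
    fragments.append(fragment)


def read_page(fragments, revision):
    """Rebuild the full page by merging at most WINDOW fragments."""
    page = {}
    start = max(0, revision - WINDOW + 1)
    for fragment in fragments[start:revision + 1]:  # oldest to newest
        page.update(fragment)
    return page


fragments = []
write_fragment(fragments, {"a": 1, "b": 1})  # revision 0
write_fragment(fragments, {"a": 2})          # revision 1
write_fragment(fragments, {"a": 3})          # revision 2
write_fragment(fragments, {"a": 4})          # revision 3: "b" is carried over
```

In revision 3 the fragment is `{"a": 4, "b": 1}`: slot `"b"` was last written in revision 0, which falls out of the window, so it is rewritten once. No revision ever writes the whole page unless everything changed, and no read merges more than `WINDOW` fragments.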

Thus, it's best suited for fast flash drives with fast random reads and sequential writes. Data is never overwritten; thus, audit trails come for free.

Another aspect is that the system does not need a WAL (which is basically a second data store): a commit atomically switches a root index page, and only a single read/write transaction (txn) is permitted at a time, concurrently and in parallel to N read-only txns, which are bound to specific revisions when they start. Reads do not involve any locks. [2]
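An in-memory analogue of the atomic root switch, as a hedged Python sketch. `TinyStore` and `_WriteTxn` are invented for this example; it copies the whole dict per commit instead of sharing pages, and it says nothing about how the real system orders its disk writes.

```python
import threading
from types import MappingProxyType


class TinyStore:
    def __init__(self):
        # Immutable snapshots; the tuple index is the revision number.
        self._committed = (MappingProxyType({}),)
        self._write_lock = threading.Lock()  # at most one read/write txn

    def read(self, revision=None):
        """Bind a reader to a revision. One reference read, no lock."""
        snapshots = self._committed
        return snapshots[-1 if revision is None else revision]

    def write(self):
        return _WriteTxn(self)


class _WriteTxn:
    def __init__(self, store):
        self._store = store

    def __enter__(self):
        self._store._write_lock.acquire()
        self.data = dict(self._store._committed[-1])  # private working copy
        return self

    def __exit__(self, exc_type, exc, tb):
        try:
            if exc_type is None:
                snapshot = MappingProxyType(dict(self.data))
                # The commit is a single reference swap to a new root list.
                self._store._committed = self._store._committed + (snapshot,)
            # On error nothing was published, so there is nothing to undo.
        finally:
            self._store._write_lock.release()
        return False


store = TinyStore()
with store.write() as txn:
    txn.data["k"] = 1
old = store.read()            # reader bound to revision 1
with store.write() as txn:
    txn.data["k"] = 2
try:
    with store.write() as txn:
        txn.data["k"] = 3
        raise RuntimeError("abort")
except RuntimeError:
    pass                      # aborted txn leaves no trace
```

The reader holding `old` keeps seeing `k == 1` after the second commit, and the aborted third transaction never becomes visible because its working copy was simply dropped.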

A path summary, an unordered set of all paths to leaf nodes in the tree, is built and enables various optimizations. Furthermore, a rolling hash is optionally built, whereby all ancestor node hashes are adapted during inserts.
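How ancestor hashes can be adapted on insert without rehashing whole subtrees, as a minimal Python sketch. The hash function here (own hash plus `PRIME` times each child hash, modulo 2^64, ignoring child order) is an assumption chosen for the example and not necessarily the one SirixDB uses; `Node`, `own_hash` and `full_hash` are hypothetical names.

```python
import hashlib

MASK = (1 << 64) - 1
PRIME = 1_000_003  # arbitrary multiplier for this sketch


def own_hash(label):
    """Deterministic 64-bit hash of a node's own content."""
    return int.from_bytes(hashlib.sha256(label.encode()).digest()[:8], "big")


class Node:
    def __init__(self, label, parent=None):
        self.label, self.parent, self.children = label, parent, []
        self.hash = own_hash(label)
        if parent is not None:
            parent.children.append(self)
            # Adapt each ancestor by a delta: one multiply-add per level.
            delta, ancestor = self.hash, parent
            while ancestor is not None:
                delta = (delta * PRIME) & MASK
                ancestor.hash = (ancestor.hash + delta) & MASK
                ancestor = ancestor.parent


def full_hash(node):
    """Recompute from scratch; used only to check the incremental result."""
    total = own_hash(node.label)
    for child in node.children:
        total = (total + full_hash(child) * PRIME) & MASK
    return total


root = Node("root")
a = Node("a", root)
b = Node("b", a)      # updates the hashes of "a" and "root"
c = Node("c", root)   # updates the hash of "root" only
```

After these inserts, `root.hash == full_hash(root)` holds, so the cost of keeping the hashes current is proportional to the depth of the inserted node rather than to the size of the tree.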

A dated Jupyter notebook with some examples can be found in [3], and overall documentation in [4].

The query engine[5] Brackit is retargetable (a couple of interfaces and rewrite rules have to be implemented for DB systems) and especially finds implicit joins and applies known algorithms from the relational DB systems world to optimize joins and aggregate functions due to set-oriented processing of the operators.[6]

I've given an interview in [7], but I'm usually very nervous, so don't judge too harshly.

Give it a try, and happy coding!

Kind regards

Johannes

[1] https://sirix.io | https://github.com/sirixdb/sirix

[2] https://sirix.io/docs/concepts.html

[3] https://colab.research.google.com/drive/1NNn1nwSbK6hAekzo1Yb...

[4] https://sirix.io/docs/

[5] http://brackit.io

[6] https://colab.research.google.com/drive/19eC-UfJVm_gCjY--koO...

[7] https://youtu.be/Ee-5ruydgqo?si=Ift73d49w84RJWb2




Let's see if I can make the links clickable.

Was coding for fun today and was looking for a toy “database”. This is a bit too much, I guess :-) Ended up with flat-file JSON. Will probably regret that later today. Good luck with the project, hope it will be (even more?) successful!

[1] https://sirix.io | https://github.com/sirixdb/sirix

[2] https://sirix.io/docs/concepts.html

[3] https://colab.research.google.com/drive/1NNn1nwSbK6hAekzo1Yb...

[4] https://sirix.io/docs/

[5] http://brackit.io

[6] https://colab.research.google.com/drive/19eC-UfJVm_gCjY--koO...

[7] https://youtu.be/Ee-5ruydgqo?si=Ift73d49w84RJWb2


Thank you, @BozeWolf for the clickable links!

I'd be honored if you took a closer look, of course, and best of all, obviously, would be contributions. Thanks!


Congrats on this - love the bitemporal aspect. It was a real struggle for me in past analytics work, where we spent a lot of time recomputing key metrics 'as of' certain dates for reporting / auditing.

Been following this https://news.ycombinator.com/item?id=38108044 as well, might interest you!


Thanks, Dolt is awesome. I think it's probably the only DBS with branching/merging capabilities as of now.

Sirix was built from the ground up with (bi)temporality and easy audits / an append-only paradigm in mind.

The other very similar DBS in this regard seems to be Datomic (as it also uses a persistent index structure), but it doesn't version the pages themselves.


TerminusDB also supports branching and merging. [1]

1: https://terminusdb.com/


Oh yes, I've heard about TerminusDB, but completely forgot that it exists. Thanks :)


This looks really cool, but it would be nice if there was a clear "how to run" section in the README. I saw the Dockerfile and docker-compose file and tried to give it a shot - it immediately fails. Digging deeper, it seems like running on a Mac isn't supported at all. And the instructions assume a great deal of familiarity with Gradle. That's useful for Java devs who want to contribute, maybe, but I have no idea how to build this to even test it out as a REST user.

Some clear guidance on how (and on what) to get it running would be useful!


This could be very interesting, too (our query engine now with full JSON support -- using sophisticated set-oriented join and aggregate optimizations):

Separating Key Concerns in Query Processing - Set Orientation, Physical Data Independence, and Parallelism

http://wwwlgis.informatik.uni-kl.de/cms/fileadmin/publicatio...

I've also ported their indexing ideas regarding XML to JSON, such that we can easily index whole paths with typed values as described in the README :-) The indexes themselves are also versioned, of course, and always kept up to date.

First AST index rewrite rules for the query engine Brackit have also been added (Brackit is retargetable, so other data stores can easily implement a couple of interfaces).


Cool seeing this posted here. I remember making a logo for a Hacktoberfest[0].

It was a stylized 'S' made to look like a space-time light cone (or a rotated infinity symbol).

[0] https://github.com/sirixdb/sirix/pull/105


Thanks, I remember your PR, thanks again :-)


It was my best work, well, really my only such work, being a back-end dev. It was an exciting diversion to have the PR accepted.


Same here, was a backend software engineer, now even embedded ;-)


If anyone is up to building a new frontend, that would be awesome (of course, work could also be split between interested people) :-)

https://github.com/sirixdb/sirix/issues/627


Can someone (an admin) maybe add the direct link to https://github.com/sirixdb/sirix in the URL field? Totally forgot that it's possible to add both a text and a URL ;)


> storing checksums in parent page

A Merkle tree?


Yes, basically :)



