
Show HN: SirixDB – versioning through efficient snapshotting - lichtenberger
I've already posted yesterday, but I'd really love to get comments. Any kind of questions, suggestions and help would be greatly appreciated, as it's an Open Source project of mine (and was for others during my studies at the University of Konstanz 6 years ago).

Since then I have spent countless hours bringing forth the idea of a versioned storage system, especially well suited for analytical tasks on time-varying data.

I'd especially love to discuss what documentation you need, which next steps are necessary (JSON, Cloud...), API additions or changes...

I've updated the README quite a bit, such that setting up the asynchronous, RESTful HTTP(S) server is easier :-) However, I could use some help with the Docker stuff.

[http://sirix.io](http://sirix.io)
======
lichtenberger
Key features are:

\- log-structured storage system with copy-on-write semantics especially well
suited for SSDs (random reads, writes are batched and synced to disk when a
commit is issued)

\- implements a novel versioning algorithm called sliding snapshot, which
balances read/write-performance and has other beneficial characteristics. We
also implement full, incremental and differential versioning at the page-level

\- stores hashes of the page-fragments in parent pointers in our main hash
array based trie structure in the indirect pages as in ZFS. In the future
these can be used to validate the integrity of the whole resource

\- compression of each page, as well as encryption in the future

\- for each XDM/XML-node in our on-disk structure we optionally store the
descendant-count, the child-count as well as a hash of the content

\- a diff algorithm, which uses our stable node-identifiers and optionally
hashes for comparisons of node-pairs in different revisions

\- a RESTful, asynchronous, temporal API written with Vert.x in Kotlin

\- several temporal XPath axis extensions, which could also be used for JSON
in the future

\- several XQuery functions to open, diff, commit... a resource

\- opening revisions of resources either by an ID or via a given timestamp.
For a timestamp, the revision is located by binary search: either a revision
with exactly that timestamp is found, or the revision closest to the given
point in time is opened
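
The timestamp lookup in the last point can be sketched roughly as follows. This is a minimal Python illustration, not the SirixDB code: `find_revision` and the plain list of commit times are hypothetical, and breaking ties towards the earlier revision is an arbitrary choice made here.

```python
from bisect import bisect_right


def find_revision(commit_times, timestamp):
    """Return the index of the revision committed closest to `timestamp`.

    `commit_times` is assumed to be sorted ascending, with index i holding
    the commit time of revision i (hypothetical layout for this sketch).
    """
    if not commit_times:
        raise ValueError("resource has no revisions")
    # Binary search: position of the first commit time greater than timestamp.
    pos = bisect_right(commit_times, timestamp)
    if pos == 0:
        return 0  # timestamp lies before the first revision
    if pos == len(commit_times):
        return len(commit_times) - 1  # timestamp lies after the last revision
    before, after = commit_times[pos - 1], commit_times[pos]
    # Pick the nearer neighbour; on a tie, prefer the earlier revision.
    return pos - 1 if timestamp - before <= after - timestamp else pos
```

With commit times `[100, 200, 400]`, a lookup for 200 hits revision 1 exactly, while a lookup for 310 falls between two revisions and opens revision 2 because it is nearer.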

~~~
platform
Could you describe the major differences and overlaps between your solution
and Datomic and MarkLogic temporal documents
([https://docs.marklogic.com/guide/temporal/intro](https://docs.marklogic.com/guide/temporal/intro))?

A comparison with any other database's 'time travel' feature would be helpful
as well.

Thank you

~~~
lichtenberger
Thanks so much for asking. I'm not sure how they actually store their
bitemporal documents, but I think it must be some kind of huge B(+)-tree. I
think the cool thing is that we have revision root pages and are able to
reconstruct each revision in roughly the same time.

We currently have a tree, very similar to how ZFS stores the objects on-disk.
It's more or less a form of a hash array based trie. The number of levels of
the indirect pages currently is static, but I'll change this and only create a
new level once it's really needed as in ZFS.

Changes to a resource in Sirix never happen in place. Instead, Sirix does a
copy-on-write of the involved page and creates a pointer to the former
version. This is necessary for our versioning algorithms, as reading a page
has to fetch at most N former versions of it, depending on whether the page
has been modified (usually N should be between 2 and 5 or something like
that). The index structures currently are AVL trees (thus also versioned),
simply stored as data records in the leaves, but it would be best to plug in a
B-tree in the future.
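
To make the copy-on-write idea concrete, here is a small Python sketch. It is only an illustration under stated assumptions: it uses plain incremental versioning with a full copy whenever the chain would exceed `max_fragments`, not the sliding snapshot algorithm, and `PageFragment`, `write_page` and `read_page` are hypothetical names. Record deletion is left out.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class PageFragment:
    records: dict                        # slot -> value (only changed slots unless full)
    previous: Optional["PageFragment"]   # pointer to the former version
    is_full: bool                        # True if this fragment holds every record


def chain_length(fragment):
    """Fragments to fetch, newest first, up to and including the nearest full one."""
    n = 0
    while fragment is not None:
        n += 1
        if fragment.is_full:
            break
        fragment = fragment.previous
    return n


def read_page(fragment):
    """Reconstruct the page by merging fragments; newer fragments win."""
    records = {}
    while fragment is not None:
        for slot, value in fragment.records.items():
            records.setdefault(slot, value)
        if fragment.is_full:
            break
        fragment = fragment.previous
    return records


def write_page(current, changes, max_fragments):
    """Copy-on-write: `current` is never mutated, a new fragment is returned."""
    if current is None or chain_length(current) + 1 > max_fragments:
        # Chain would get too long: write a full copy so reads stay bounded.
        full = read_page(current)
        full.update(changes)
        return PageFragment(full, current, True)
    # Otherwise store only the changed records and point to the former version.
    return PageFragment(dict(changes), current, False)
```

Every older fragment stays reachable and unchanged, so any past version of the page can still be read, and a read never walks more than `max_fragments` fragments.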

On a higher level I have added many operations which are usually not found in
XML databases. For instance, as our internal tree structure is a kind of
persistent DOM with a firstChild/rightSibling/leftSibling/parent encoding, I
added move operations to move subtrees. A path summary stores all paths in the
resource and is kept up to date at all times, as are optional indexes on
paths, elements, attributes or content-and-structure.
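
As a rough illustration of why such an encoding makes subtree moves cheap, here is a toy in-memory Python sketch. The `Node` fields mirror the firstChild/rightSibling/leftSibling/parent pointers mentioned above, but the class and the helper functions are hypothetical and ignore versioning, the path summary and index maintenance.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Node:
    key: int
    name: str
    parent: Optional[int] = None
    first_child: Optional[int] = None
    left_sibling: Optional[int] = None
    right_sibling: Optional[int] = None


def detach(nodes, key):
    """Unlink a node (and implicitly its subtree) from parent and siblings."""
    node = nodes[key]
    if node.left_sibling is not None:
        nodes[node.left_sibling].right_sibling = node.right_sibling
    elif node.parent is not None:
        nodes[node.parent].first_child = node.right_sibling
    if node.right_sibling is not None:
        nodes[node.right_sibling].left_sibling = node.left_sibling
    node.parent = node.left_sibling = node.right_sibling = None


def move_as_first_child(nodes, key, new_parent):
    """Move the subtree rooted at `key` so it becomes the first child of `new_parent`."""
    ancestor = new_parent
    while ancestor is not None:
        if ancestor == key:
            raise ValueError("cannot move a node into its own subtree")
        ancestor = nodes[ancestor].parent
    detach(nodes, key)
    node, parent = nodes[key], nodes[new_parent]
    node.parent = new_parent
    node.right_sibling = parent.first_child
    if parent.first_child is not None:
        nodes[parent.first_child].left_sibling = key
    parent.first_child = key


def children(nodes, key):
    """Child keys in document order, following right-sibling pointers."""
    result = []
    child = nodes[key].first_child
    while child is not None:
        result.append(child)
        child = nodes[child].right_sibling
    return result
```

Only a handful of pointers around the moved node change; the descendants are never touched, so the whole subtree moves along with its root.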

The interesting thing is that with the help of extensions to Brackit(.org),
which already has some basic JSON navigation primitives built in, I was able
to add temporal axes. I added several XPath axes to navigate in time, for
instance first::*, last::*, all-time::*, next::*, previous::*, future::*,
past::*... We also allow diffing, opening a resource in a specific
revision...
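
One way to picture these axes is as selections over the revisions in which a node exists. The Python toy model below reflects only my reading of the axis names; the exact semantics in Sirix may differ (for example in whether the current revision is included), and `temporal_axis` is a hypothetical helper, not part of any API.

```python
def temporal_axis(axis, revisions, current):
    """Return the revisions selected by `axis`, seen from revision `current`.

    `revisions` is the ascending list of revision numbers in which the
    context node exists (assumed model for this sketch).
    """
    i = revisions.index(current)  # raises ValueError if the node is absent
    axes = {
        "first": revisions[:1],                          # earliest revision
        "last": revisions[-1:],                          # most recent revision
        "previous": revisions[i - 1:i] if i > 0 else [], # directly before
        "next": revisions[i + 1:i + 2],                  # directly after
        "past": revisions[:i],                           # all earlier revisions
        "future": revisions[i + 1:],                     # all later revisions
        "all-time": list(revisions),                     # every revision
    }
    return axes[axis]
```

Seen from revision 2 of a node that exists in revisions 1 to 4, `future` would select revisions 3 and 4 in this model, while `next` would select only revision 3.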

~~~
platform
Interesting. I assume last:: means 'most recent'.

maintaining pre-calculated 'most recent' is very useful (as long as I can ask
'what was most recent, say, yesterday at 1pm').

Most 'hand-made' append-only data schema designs suffer from not being able to
return 'joined' most-recent datums quickly,

because a 'usual' implementation requires doing a join with select max on the
business or system (or both) timestamp.

My understanding is that both DBs I mentioned previously do copy-on-write for
temporal data, but I am not sure if they do any document-level diffs.
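
For readers unfamiliar with the 'join with select max' pattern, here is a small sketch using Python's built-in `sqlite3` module. The `price` table, its columns and the sample rows are made up for illustration; only a system timestamp is modelled.

```python
import sqlite3


def build_demo_db():
    """Create an in-memory, append-only table with a few sample rows."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE price (item TEXT, value INTEGER, system_time INTEGER)"
    )
    conn.executemany(
        "INSERT INTO price VALUES (?, ?, ?)",
        [("apple", 10, 1), ("apple", 12, 5), ("pear", 7, 2), ("pear", 9, 8)],
    )
    return conn


# Join every row against the newest system_time per item at or before the
# requested point in time.
AS_OF_QUERY = """
SELECT p.item, p.value
FROM price AS p
JOIN (
    SELECT item, MAX(system_time) AS system_time
    FROM price
    WHERE system_time <= ?
    GROUP BY item
) AS latest
  ON latest.item = p.item AND latest.system_time = p.system_time
ORDER BY p.item
"""


def most_recent_as_of(conn, timestamp):
    """Return (item, value) pairs as they were 'most recent' at `timestamp`."""
    return conn.execute(AS_OF_QUERY, (timestamp,)).fetchall()
```

The aggregate subquery has to be evaluated for every such question, which is the cost being described for hand-made append-only schemas.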

~~~
lichtenberger
Yes, the revisions can be reconstructed in roughly the same time; for
instance, it doesn't matter whether you open the most recent revision or any
past revision regarding system/transaction time. We also do not have to store
the transaction time more than once (in the revision root page).

And if you do a fine-grained modification, we usually do not copy and rewrite
the whole page with currently 512 records (it depends on the versioning
algorithm you use).

------
jarym
I’ve been looking for something like this for a while for a project I had in
mind.

I’m going to spend the weekend looking into this and will be providing
feedback after.

~~~
jarym
Ok, I read through the github wiki and everything I could find - pretty
comprehensive stuff.

Currently chunking my way through the phd paper. I created something similar
to this but wholly in-memory in JavaScript for a client project.

My idea is a policy management tool so versioning and frequent change are
quite important.

One thing I don't yet have a grip on is how this performs for my use-case
(which will be a multi-tenant SaaS solution). I'm going to run some tests to
evaluate what's possible here.

All in all, really good effort. I think for this project to be a success it
will take people who know where to apply it / what to do with it. Maybe a
suggestion for you is to add a section on the gh readme to highlight potential
use cases as you see them (will give people who come across it a better idea
of how they can apply this type of software).

Thank you!

~~~
lichtenberger
Very interesting, let me know if you need help or simply create an issue or
whatever. Keep me updated :-)

------
lichtenberger
I just published an article:

[https://medium.com/@jojolichtenberger/how-we-built-an-asynch...](https://medium.com/@jojolichtenberger/how-we-built-an-asynchronous-temporal-restful-api-based-on-vert-x-4570f681a3)

------
lichtenberger
Does anyone have an idea why Travis doesn't build with Java 11 because of
Mockito, and how to fix it?

[https://github.com/sirixdb/sirix/issues/59](https://github.com/sirixdb/sirix/issues/59)

I've mentioned that I guess it's a transitive dependency and has to do with
the order in which the dependencies are used, but I'm not sure. Does anyone
know how to tell Maven to use the newer version?

Locally I'm getting a build success.

~~~
lichtenberger
[https://travis-ci.org/sirixdb/sirix/builds/473332290](https://travis-ci.org/sirixdb/sirix/builds/473332290)

~~~
lichtenberger
Okay, I guess there simply needs to be a newer Maven version on Travis, and I
messed up the release flag for the `maven-compiler-plugin`. That said, locally
it already works with Java 11 :-) ... just Travis fails as of now.

~~~
lichtenberger
Working with Java11 now :-)

------
diggernet
A documentation suggestion:

"Why should you even bother?" and "Features in a nutshell" should be near the
top, between "Table of contents" and "Getting started", because those are the
things people will want to know first.

~~~
lichtenberger
Hmm, I wasn't sure whether a really short description followed by setting up
everything to get started wouldn't be best.

That said, I have to look at that section, as it might even be a bit dated.

~~~
lichtenberger
What do others think?

