
Graviton Database: ZFS for key-value stores - autopoiesis
https://github.com/deroproject/graviton
======
aftbit
>Graviton is currently alpha software.

More like the "BTRFS for key-value stores" ;)

Kidding aside, I dislike when new unproven software claims the name of
industry standards like this. When I saw the headline, I was hoping this
somehow actually leveraged ZFS's storage layer, but actually it is just a new
database that thinks Copy-on-Write is cool.

~~~
innagadadavida
Title is very clickbaity; this is just another KV store, completely unrelated
to ZFS.

~~~
mayama
Even their README is clickbaity then. I quickly glanced at their repo and
thought it was somehow related to ZFS before reading the comments here.

------
ysleepy
Nice!

I implemented pretty much the same trade-off set in an authenticated storage
system.

single writer, radix merkle tree, persistent storage, hashed keys, proofs.

I guess it is a local maximum within that trade-off space.

I like how the time travelling/history is always touted as a feature (which it
is), but it really just means the garbage collector/pruning part of the
transaction engine is missing. Postgres and other mvcc systems could all be
doing this, but they don't. The hard part of the feature is being able to turn
it off.

I'll probably have a look around later, the diffing looks interesting, not
sure yet if it's done using the merkle tree (likely) or some commit walking
algorithm.

~~~
mulander
> I like how the time travelling/history is always touted as a feature (which
> it is), but it really just means the garbage collector/pruning part of the
> transaction engine is missing. Postgres and other mvcc systems could all be
> doing this, but they don't.

Postgres actually did tout it as a feature in "THE IMPLEMENTATION OF POSTGRES"
by Michael Stonebraker, Lawrence A. Rowe and Michael Hirohama[1] search for
"time travel" in the PDF. I added the relevant quotes below for easier access
;)

This was back when PostgreSQL had the postquel language (before SQL was
added); there was special syntax to access data at specific points in time:

> The second benefit of a no-overwrite storage manager is the possibility of
> time travel. As noted earlier, a user can ask a historical query and
> POSTGRES will automatically return information from the record valid at the
> correct time.

Quoting the paper again:

> For example to find the salary of Sam at time T one would query:
    
    
        retrieve (EMP.salary)
        using EMP [T]
        where EMP.name = "Sam"
    
    

> POSTGRES will automatically find the version of Sam’s record valid at the
> correct time and get the appropriate salary.

[1] -
[https://dsf.berkeley.edu/papers/ERL-M90-34.pdf](https://dsf.berkeley.edu/papers/ERL-M90-34.pdf)

~~~
rmetzler
Is this still possible with Postgres?

~~~
mulander
Yes and no; or to be precise, to a certain degree, but not through an exposed
language feature.

PostgreSQL still does copy-on-write, so the old versions of the row exist and
are present in storage. However, there is now an autovacuum process going over
the records regularly, marking those no longer seen by any transaction as
reusable, so eventually the old records get overwritten.

You can get at the older versions of the rows directly on disk, or perhaps it
would be possible to get the db to return such older versions of the rows. It
seems that by default even trying to get at them with `ctid` is not possible,
so that may require hacking PostgreSQL itself or using an extension, which
seems to actually exist[1].

[1] - [https://github.com/omniti-labs/pgtreats/tree/master/contrib/pg_dirtyread](https://github.com/omniti-labs/pgtreats/tree/master/contrib/pg_dirtyread)

------
derefr
Does anyone know of an embedded key-value store that _does_ do
versioning/snapshots, but _doesn’t_ bother with cryptographic integrity (and
so gets better OLAP performance than a Merkle-tree-based implementation)?

My use-case is a system that serves as an OLAP data warehouse of
representations of how another system’s state looked at various points in
history. You’d open a handle against the store, passing in a snapshot version;
and then do OLAP queries against that snapshot.

Things that make this a hard problem: The dataset is too large to just store
the versions as independent copies; so it really needs _some_ level of data-
sharing between the snapshots. But it also needs to be fast for reads,
especially whole-bucket reads—it’s an _OLAP_ data warehouse. Merkle-tree-based
designs really suck for doing indexed table scans.

But, things that can be traded off: there’d only need to be one (trusted)
writer, who would just be batch-inserting new snapshots generated by reducing
over a CQRS/ES event stream. It’d be that (out-of-band) event stream that’d be
the canonical, integrity-verified, etc. representation for all this data.
These CQRS state-aggregate snapshots would just be a cache. If the whole thing
got corrupted, I could just throw it all away and regenerate it from the
CQRS/ES event stream; or, hopefully, “rewind” the database back to the last-
known-good commit (i.e. purge all snapshots above that one) and then
regenerate only the rest from the event stream.

I’m not personally aware of anything that targets exactly this use case. I’m
working on something for it myself right now.

Two avenues I’m looking into:

• something that acts like a hybrid between LMDB and btrfs (i.e. a B-tree with
copy-on-write ref-counted pages shared between snapshots, where those
snapshots appear as B-tree nodes themselves)

• “keyframe” snapshots as regular independent B-trees, maybe relying on L2ARC-
like block-level dedup between them; “interstitial” snapshots as on-disk HAMT
‘overlays’ of the last keyframe B-tree, that share nodes with other on-disk
HAMTs, but only within their “generation” (i.e. up to the next keyframe), such
that they can all be rewritten/compacted/finalized once the next keyframe
arrives, or maybe even converted into “B-frames” that have forward-references
to data embedded in the next keyframe.
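For the first avenue, a toy path-copying tree shows the core CoW mechanic: an update copies only the nodes on the root-to-leaf path, so every earlier root remains a live, cheap snapshot sharing all untouched subtrees. This is a minimal in-memory sketch, not LMDB's actual page format, and all names are illustrative:

```go
package main

import "fmt"

// node is an immutable tree node; updates copy the path from root to
// leaf and share every untouched subtree between versions.
type node struct {
	key, val    string
	left, right *node
}

// insert returns a new root; only O(depth) nodes are copied.
func insert(n *node, key, val string) *node {
	if n == nil {
		return &node{key: key, val: val}
	}
	cp := *n // copy this node only
	switch {
	case key < n.key:
		cp.left = insert(n.left, key, val)
	case key > n.key:
		cp.right = insert(n.right, key, val)
	default:
		cp.val = val
	}
	return &cp
}

func get(n *node, key string) (string, bool) {
	for n != nil {
		switch {
		case key < n.key:
			n = n.left
		case key > n.key:
			n = n.right
		default:
			return n.val, true
		}
	}
	return "", false
}

func main() {
	var snapshots []*node // each retained root is an "online" snapshot
	root := insert(nil, "a", "1")
	root = insert(root, "m", "2")
	snapshots = append(snapshots, root) // snapshot 0

	root = insert(root, "m", "99")      // later write
	snapshots = append(snapshots, root) // snapshot 1

	v0, _ := get(snapshots[0], "m")
	v1, _ := get(snapshots[1], "m")
	fmt.Println(v0, v1) // old version still readable: prints "2 99"
}
```

In a real on-disk variant, the copied nodes become freshly allocated pages and purging a snapshot just decrements refcounts on the pages it uniquely owns.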

~~~
random3
You need something like HBase, but embedded. MVCC would give you the snapshot
isolation (perhaps there's something with fewer guarantees?) and you'd need
lexicographic key ordering to do efficient scanning. You'd only need the LSM-style
memory layout if you kept a write-ahead log from which to recover.

LevelDB / RocksDB (and related) may be close, but not sure about MVCC aspects
(see [https://www.cockroachlabs.com/blog/cockroachdb-on-rocksd/](https://www.cockroachlabs.com/blog/cockroachdb-on-rocksd/))

~~~
derefr
You misinterpreted, I think. The point isn’t “snapshot isolation” in the MVCC
sense (working with multiple snapshots-in-progress); it’s the ability to, in
essence, work with the database the way git works with commits: opening a
transaction “on top of” a base commit, then “committing” that transaction to
create a new commit object, with its own explicit ref, where you can later
“check out” an arbitrary ref.

Except, unlike git, this database wouldn’t need to be able to create new
commits off of anywhere but the HEAD; and also wouldn’t need to be able to
have more than one in-progress write transaction open at a time. No need for
MVCC at all; and no need for a DAG. The “refs” would always just be a dense
linear sequence.

Also, unlike git (or a cryptographically-verified / append-only store),
there’s no need to keep around “deleted” snapshots. It would actually be a
huge benefit to be able to purge arbitrary snapshots from the database,
without needing to do a mark-and-sweep liveness pass to write out a new copy
of the database store.

The key constraint that differentiates this from e.g. a Kafka Streams-based
CQRS/ES aggregate, is that you should be able to reopen and work with any
historical database version _instantly_ , with equal-amortized-cost lookups
from any version, without needing to first do O(N) work to “replay” from the
beginning of time to the snapshot, or to “rewind” from the HEAD to the
snapshot. This system would need all snapshots to be equally “hot” / “online”
for query access, not just the newest one.

In other words, such a database should work just like filesystem snapshots do
in a copy-on-write filesystem.
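A minimal Go sketch of that git-like surface, assuming a single writer and dense integer refs; full map copies stand in for a structure-sharing tree purely to keep the example short (which is exactly what a real store must avoid), and all type and method names are hypothetical:

```go
package main

import "fmt"

// Store keeps a dense linear sequence of snapshots: one writer, refs
// are just indices, any ref is instantly "hot" for reads, and purging
// a ref needs no mark-and-sweep rewrite of the store.
type Store struct {
	roots []map[string]string // roots[ref] = that snapshot's state; nil = purged
}

// Commit applies writes on top of HEAD and returns the new ref.
func (s *Store) Commit(writes map[string]string) int {
	next := map[string]string{}
	if n := len(s.roots); n > 0 {
		for k, v := range s.roots[n-1] {
			next[k] = v
		}
	}
	for k, v := range writes {
		next[k] = v
	}
	s.roots = append(s.roots, next)
	return len(s.roots) - 1
}

// At opens a read handle on any historical ref in O(1): no replay,
// no rewind from HEAD.
func (s *Store) At(ref int) map[string]string { return s.roots[ref] }

// Purge drops one snapshot; with refcounted CoW pages this would just
// decrement counts on the pages that ref uniquely owns.
func (s *Store) Purge(ref int) { s.roots[ref] = nil }

func main() {
	s := &Store{}
	r0 := s.Commit(map[string]string{"sam": "1000"})
	r1 := s.Commit(map[string]string{"sam": "1200"})
	fmt.Println(s.At(r0)["sam"], s.At(r1)["sam"]) // prints "1000 1200"
}
```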

~~~
ec109685
It seems like coupling a database checkpoint process with the file system’s
snapshot process should be theoretically possible:

1) The database is informed a snapshot is needed

2) The database finalizes any in-progress writes and starts logging new writes
to another file

3) Take the file system snapshot

4) Inform the database the snapshot is done

With step 3 complete, the file system snapshot should be a perfect and quick
representation of the database at that point in time (when the database was
informed it should stop logging).
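The four steps can be sketched with an in-memory toy, where []string slices stand in for log files and copying the sealed log stands in for the filesystem snapshot (e.g. `zfs snapshot`); the type and method names are hypothetical:

```go
package main

import "fmt"

// db is a toy single-writer database that appends writes to a log.
type db struct {
	active []string   // log currently receiving writes
	frozen [][]string // logs sealed by past checkpoints
}

func (d *db) write(rec string) { d.active = append(d.active, rec) }

// checkpoint performs the four steps: seal the active log, start a new
// one, and return a copy representing the point-in-time snapshot.
func (d *db) checkpoint() []string {
	sealed := d.active // step 2: finalize in-progress writes
	d.frozen = append(d.frozen, sealed)
	d.active = nil                           // step 2: new writes go to a fresh log
	snap := append([]string(nil), sealed...) // step 3: take the "snapshot"
	return snap                              // step 4: snapshot is done
}

func main() {
	d := &db{}
	d.write("a=1")
	d.write("b=2")
	snap := d.checkpoint()
	d.write("a=3") // post-snapshot write lands only in the new log
	fmt.Println(len(snap), len(d.active))
}
```

The key property is that nothing written after the cut can leak into the snapshot, because the writer has already switched log files before the snapshot is taken.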

~~~
derefr
File system snapshots do have the analogous desired properties _for files_;
but filesystem snapshots are actually quite heavyweight, because
they deal with dirents, inodes, extents, etc. CoW filesystem snapshots are
designed for ops-task-granularity usage, e.g. daily backups; not for per-
transaction historical archiving. CoW filesystems tend to fall over once you
get to 100K snapshots. (I tested!) A database that took a snapshot after every
CQRS/ES transaction, could be expected to potentially have _billions_ of
snapshots.

A system that did its snapshots “inline” to itself, by e.g. managing a pool of
pages with a free-list the way LMDB does — but where Txs ultimately add a new
version _to_ the root bucket as a snapshot, rather than replacing the root
bucket page with themselves — would get a lot closer to allowing one to have
at least tens-of-millions of snapshots online. At that point, to achieve a
billion snapshots online, you “only” need to shard your timeline across a
cluster of 100 nodes.

This is precisely one of the experiments I’m trying. :)

------
moralestapia
I love the idea but I think you (author) need a lot of time/support polishing
this. You need a team probably.

Also,

>Superfast proof generation time of around 1000 proofs per second per core.

Does this limit in _any_ way things like read/write performance or usability
in general?

~~~
jjirsa
> You need a team probably

The cardinal rule of database development:

[http://www.dbms2.com/2013/03/18/dbms-development-marklogic-hadoop/](http://www.dbms2.com/2013/03/18/dbms-development-marklogic-hadoop/)

~~~
moralestapia
Yup, I do not mean to discourage the authors.

I truly like the project and I have a few things in mind that could make use
of it already. (Heck, one of them is pretty much Graviton + a front end).

But I cannot just jump into it as there's some real money involved and no one
wants to _experiment_ with that.

I see a bright future for Graviton, once it becomes tested and stable in
production environments.

~~~
jopari
I'm sure the devs would be interested in hearing about your use cases, should
you open an issue on the repo or get in touch via
[https://dero.io/#contacts-section](https://dero.io/#contacts-section).

(I'm not on the team, just interested in the project.)

------
Rochus
What is the use case? Why is it important that "All keys, values are backed by
blake 256 bit checksum"?

~~~
naivedevops
ZFS stores checksums of files to detect bit rot. Since they are comparing
their database to ZFS, I guess it stores the checksums for the same reason. If
bit rot occurs, you don't need to discard the entire database, just the
affected entry. If the entry had already been there for some time, you might
even be able to restore it from a backup.

~~~
GordonS
Isn't a 256-bit Blake hash a little OTT, versus a simple CRC, or even a
faster, smaller hash like MurmurHash or Jenkins-one-at-a-time?

~~~
jlokier
It's a cryptographic hash, so it will detect tampering with the data, which a
simple CRC, MurmurHash or Jenkins would not.

~~~
GordonS
Still, I'd like an option to use a faster, more efficient CRC or hash - bit
rot is usually the main threat, rather than tampering. Not to mention that if
a user can tamper with the data they can probably just create a new hash at
the same time.

Using a cryptographic hash as a souped-up CRC seems rather odd, given how many
more CPU cycles and RAM it will use, but I don't know the reasoning behind the
decision; there must be one.

~~~
jlokier
> if an attacker can tamper with the data they can probably just create a new
> hash at the same time

That's true for ordinary databases, but this was developed for a blockchain
and uses a Merkle hash tree.

An attacker can only tamper with the data and create a new hash for a data
item by also creating a new hash for every node up to the root of the tree. In
a blockchain context, even that isn't enough, they'd have to modify the
blockchain nodes as well, as I presume they periodically record tree root
hashes.

The hash tree gives it some other interesting features too. O(n) diff time,
where n is the number of changes output in the diff, is probably due to having
a hash tree.

The fast diff would also work with a non-cryptographic hash, but it would be
considered not quite reliable enough against occasional, random errors. With a
cryptographic hash, for non-security purposes we treat the values as reliably
unique for each input. For example, see Git, which depends on this property.
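The pruning that makes the diff proportional to the number of changes is easy to sketch: if two subtrees have equal root hashes, they are assumed identical and skipped. This toy uses SHA-256 from Go's standard library in place of Graviton's BLAKE-256, and assumes same-shaped trees for brevity:

```go
package main

import (
	"crypto/sha256"
	"fmt"
)

// tree is a node of a binary hash tree: leaves hash key+value, internal
// nodes hash the concatenation of their children's hashes.
type tree struct {
	hash        [32]byte
	key, val    string // set on leaves only
	left, right *tree
}

func leaf(key, val string) *tree {
	return &tree{hash: sha256.Sum256([]byte(key + "\x00" + val)), key: key, val: val}
}

func branch(l, r *tree) *tree {
	return &tree{hash: sha256.Sum256(append(l.hash[:], r.hash[:]...)), left: l, right: r}
}

// diff reports keys whose leaves differ between two same-shaped trees,
// pruning any subtree pair whose root hashes match.
func diff(a, b *tree, changed *[]string) {
	if a.hash == b.hash {
		return // identical subtree: skip it entirely
	}
	if a.left == nil { // reached a differing leaf
		*changed = append(*changed, b.key)
		return
	}
	diff(a.left, b.left, changed)
	diff(a.right, b.right, changed)
}

func main() {
	v1 := branch(branch(leaf("a", "1"), leaf("b", "2")), branch(leaf("c", "3"), leaf("d", "4")))
	v2 := branch(branch(leaf("a", "1"), leaf("b", "2")), branch(leaf("c", "30"), leaf("d", "4")))
	var changed []string
	diff(v1, v2, &changed)
	fmt.Println(changed) // only the changed leaf is visited: prints "[c]"
}
```

The unchanged left half of the tree is never descended into, which is the property that lets a hash-tree diff cost O(changes × depth) instead of O(dataset).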

------
bdcravens
You can run a Graviton database. You can also run a database on a Graviton:

[https://aws.amazon.com/about-aws/whats-new/2020/07/announcing-preview-for-amazon-rds-m6g-and-r6g-instance-types/](https://aws.amazon.com/about-aws/whats-new/2020/07/announcing-preview-for-amazon-rds-m6g-and-r6g-instance-types/)

For best results, run Graviton on a Graviton:

[https://aws.amazon.com/ec2/graviton/](https://aws.amazon.com/ec2/graviton/)

~~~
Mister_Snuggles
Naming things is difficult.

~~~
ethbr0
Not according to git!

We can just call this: ad58cd9088995cfb528187b11c275dad60ce2ec5

And the chip: 59b54f61dd17c27744e884542e35b34172e2cc79

So easy!

------
TomTinks
This is definitely something to look into. So far Dero looks like a pretty
solid project with out-of-the-box thinking.

------
byteshock
If latency and performance are an issue, there are also solutions like RocksDB
or LevelDB.

~~~
jopari
There's a brief comparison with RocksDB and LevelDB in the README file, which
concludes: "If you require a high random write throughput or you need to use
spinning disks then LevelDB could be a good choice unless there are
requirements of versioning, authenticated proofs or other features of Graviton
database."

~~~
byteshock
This was a reply to another comment in the thread that suggested a user use
SQLite. I commented using the Octal iOS app. Not sure why it didn’t post
correctly...

------
ramoz
Comparison to Badger? Badger is also Go-native and, for me, has been
exceptional at scale and for read-heavy workloads on SSD.

Ref: [https://github.com/dgraph-io/badger](https://github.com/dgraph-io/badger)

~~~
jopari
I think the key differentiating feature of Graviton is the tree of
authenticated proofs of data consistency. (AFAICT this is particularly
important for scalably updating and verifying a large blockchain history.)

~~~
ramoz
Ah, figured as much but am not as familiar with that use case. Thanks!

------
BryanG2
Someone paste timing results of diffing for very large data sets.

------
nickcw
What I'd really like is a multiprocess safe embeddable database written in
pure Go. So a database which is safe to read and write from separate
processes.

Unfortunately I don't think this one is multiprocess safe.

~~~
sneak
I too feel the “pure Go” pull, but is your use case so precarious or latency-
sensitive that you can’t simply use SQLite? That’s what I do in these
situations.

------
AtlasBarfed
... doesn't cassandra do a lot of this?

~~~
ramoz
Cassandra is not, traditionally/practically, an embedded db.

