
Show HN: Noms – A new decentralized database based on ideas from Git - ahl
https://medium.com/@aboodman/noms-init-98b7f0c3566#.ojb6eaz94
======
nartz
So, I realize this project is early, but it would be EXTREMELY helpful to walk
through someone's use case - like, who is the target here? A business analyst
who iterates on cleaning / analyzing small Excel CSVs? Or someone else?

After watching the screencast, all I saw was a bunch of commands explained
(I could have read the docs for that). Instead, I'd like to walk through a use
case where this solves _someone's_ problem.

~~~
wingerlang
He mentioned at least one. His friend went to a cabin where there was little
to no internet connection, then he updated the database on a local device. And
later on, other database nodes would just pull the updated data.

Seems like it is for personal use. Maybe something to build apps on top of.

~~~
tamana
That doesn't differentiate noms from postgres or any other multimaster
database.

Does it do clever merging?

~~~
wingerlang
I just went by the use case he mentioned in the video. I don't really know
about these technical details for databases etc.

------
im_down_w_otp
GC I can see the shape of a solution for, since you can use something like a
per-object DVVset to determine the minimum set of unresolved histories
required to avoid losing data during conflicts while not unnecessarily
ballooning the size of the dataset.

However, the inner-object conflict-resolution problem seems a lot harder to
solve given that there's no obvious join-semilattice for arbitrary
fields/data. Can you discuss what conflict-resolution strategies you're
working on for auto-resolution and/or what metadata you intend to provide to
the end-user in the event that you're going to punt resolution to them to
handle?

Given this is supposed to be for collaborative workloads, the conflict-
resolution issue seems to be a cornerstone. Git handles this by inserting
sibling sections into the documents and forcing the end-user to manually deal
with fixing problems, which is often fraught with pain and peril, and doesn't
seem like a strategy that would work for something that's a database (as
opposed to something that's a workflow).

~~~
rafaelweinstein
This is a question that we've gotten quite a bit. It's our view that there's
no magic solution to conflicts. There are logical conflicts in the real world
that must be arbitrated.

That said, it's a surprisingly basic thing, but just knowing _what changed_
from party (a) and party (b)'s perspective (relative to their most recently
agreed-upon state) is somewhat rare or ad-hoc in existing systems. In noms,
you can directly compute exactly how state diverged and apply whatever
resolution strategy is suitable.

We have plans for applying default conflict resolution for changes to data-
types that - in many cases - will be correct, but in the end, there's no
avoiding that correctness can only be defined within a given specific domain.
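
The "what changed relative to the most recently agreed-upon state" idea is essentially a three-way merge. Here is a minimal sketch in Go of how that enables resolution strategies (this is not the Noms API; the flat string maps and the empty-string-as-deletion sentinel are simplifications invented for illustration):

```go
package main

import "fmt"

// threeWayMerge merges two descendants a and b of a common base.
// Edits that touch different keys merge automatically; keys changed
// by both sides to different values are reported as conflicts.
// An empty string stands in for "absent" to keep the sketch short.
func threeWayMerge(base, a, b map[string]string) (map[string]string, []string) {
	merged := map[string]string{}
	var conflicts []string
	keys := map[string]bool{}
	for _, m := range []map[string]string{base, a, b} {
		for k := range m {
			keys[k] = true
		}
	}
	for k := range keys {
		bv, av, cv := base[k], a[k], b[k]
		switch {
		case av == cv: // both sides agree (including both deleting it)
			if av != "" {
				merged[k] = av
			}
		case av == bv: // only b changed it relative to base
			if cv != "" {
				merged[k] = cv
			}
		case cv == bv: // only a changed it relative to base
			if av != "" {
				merged[k] = av
			}
		default: // both changed it, differently: a real logical conflict
			conflicts = append(conflicts, k)
		}
	}
	return merged, conflicts
}

func main() {
	base := map[string]string{"age": "30", "hair": "brown"}
	a := map[string]string{"age": "31", "hair": "brown"} // party (a) changed age
	b := map[string]string{"age": "30", "hair": "red"}   // party (b) changed hair
	merged, conflicts := threeWayMerge(base, a, b)
	fmt.Println(merged, conflicts) // map[age:31 hair:red] []
}
```

The disjoint edits merge cleanly with no user involvement; only a key both parties rewrote differently would surface as a conflict for domain-specific arbitration.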

~~~
espadrine
To me, there are two separate concerns: intent preservation and coherence.

Operations cause a change of state. The sequence of operations performed for a
given user action needs to ensure that it modifies the state of the database
in the way the user intended.

Coherence is about maintaining a design. A set of rules are conceived. For
each state, it is possible to determine whether the state is valid according
to them, and if so, the database is coherent.

To maintain coherence, it is possible to deny an operation, therefore breaking
user intent. To maintain user intent, it is possible to accept an operation
that leads to an invalid state. From what I understand, noms is heavily tilted
towards coherence, like git, but unlike git, it doesn't have the social "pull
request" aspect, nor a support for running tests.

You mention diffing as a plus, but realistically the only use of diffing is to
guess intent. Real intent can only be provided by a maximally rich set of
operations. Logs of SQL queries, for instance, are more likely to provide
insight than a cold diff. For a database that is tilted so far towards
maintaining coherence at the expense of user intent, it may be a good idea to
compensate by staying close to the operations.

Finally, if you decide to tilt towards intent preservation, there are
definitely approaches to automate merges, avoiding nagging the user to fix
conflicts. The most successful trivial solution remains latest-write-wins,
which gives surprisingly good results assuming a high data granularity and a
rich set of operations. But unless there is a way to automate coherence
validation (the equivalent of which in git projects is, I suppose, running the
tests), we can only rely on user attention… in which case, relying on the user
to fix coherence after the fact on a database that heavily preserves user
intent would be pretty much the same.

So… do you plan on supporting custom coherence rules? Alternatively, which
conflict resolutions are you leaning towards?

~~~
im_down_w_otp
What is "latest" in a situation where you're experiencing concurrent writes?

~~~
espadrine
Google's Spanner, for instance, relies on their TrueTime design, which
requires having a GPS clock and an atomic clock on each datacenter, I believe.
Most designs simply rely on NTP or a similar time synchronization system.

Another approach is to maintain a total order of writes. Assuming some form of
consensus protocol to determine write order, the uniqueness of the order
ensures synchronization. That design, however, tends to preserve user intent
less. Bitcoin has a form of that.

~~~
im_down_w_otp
Time isn't a reliable resource in this context.

~~~
espadrine
They mention striving to support many contexts. In the demo, they showcase
offline editing of a single CSV entry. If the granularity is the atomic types,
then a conflict only occurs when the very same field in the very same row is
concurrently edited.

Then, the system can show the conflict and offer a default that keeps the
operation with the highest timestamp, or if the timestamps are identical, the
one with the highest hash.
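
That default can be sketched in a few lines of Go. The `write` record, field names, and timestamps here are invented for illustration (nothing Noms defines): highest timestamp wins, with the hash of the value as a deterministic tiebreaker so every replica converges on the same winner regardless of the order it sees the writes in:

```go
package main

import (
	"bytes"
	"crypto/sha256"
	"fmt"
)

// write is a hypothetical record of one concurrent edit to a field.
type write struct {
	value string
	ts    int64 // wall-clock timestamp; unreliable, hence the tiebreak
}

// lww picks the write with the highest timestamp. Identical timestamps
// are broken by comparing the hash of the value, so the choice is
// deterministic and order-independent across replicas.
func lww(a, b write) write {
	if a.ts != b.ts {
		if a.ts > b.ts {
			return a
		}
		return b
	}
	ha := sha256.Sum256([]byte(a.value))
	hb := sha256.Sum256([]byte(b.value))
	if bytes.Compare(ha[:], hb[:]) > 0 {
		return a
	}
	return b
}

func main() {
	a := write{value: "red", ts: 1001}
	b := write{value: "brown", ts: 1000}
	fmt.Println(lww(a, b).value) // red: the later timestamp wins
}
```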

------
zphds
Going through the SDK docs, why was a scheme like
http://localhost:8000::people chosen instead of the plain old
http://localhost:8000/people? Are there any benefits? If yes, curious to know
what they are.

~~~
aboodman
Thanks for the help everyone with this most important aspect of the system ;).

To clarify, we don't think of these specs as URLs. The part before the final
double colon is a URL. To parse one, you find the final double colon and take
everything to its left as a URL.

There's some info on the syntax here:

[https://github.com/attic-labs/noms/blob/master/doc/spelling.md#spelling-databases](https://github.com/attic-labs/noms/blob/master/doc/spelling.md#spelling-databases)

Though it's not presented as a formal grammar in that doc, our most important
criteria for the syntax were:

      - unambiguousness
      - interacts well with the shell, since we frequently use these as part of command lines
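
The "split on the final double colon" rule can be sketched in a few lines of Go (a simplification for illustration, not the actual Noms spec parser; real specs also cover things like local and in-memory databases):

```go
package main

import (
	"fmt"
	"strings"
)

// splitSpec parses a spec of the form <database-URL>::<dataset> by
// finding the *final* double colon; everything to its left is treated
// as the database URL. Error handling is deliberately minimal.
func splitSpec(spec string) (db, dataset string, ok bool) {
	i := strings.LastIndex(spec, "::")
	if i < 0 {
		return "", "", false
	}
	return spec[:i], spec[i+2:], true
}

func main() {
	db, ds, _ := splitSpec("http://localhost:8000::people")
	fmt.Println(db, ds) // http://localhost:8000 people
}
```

Anchoring on the last `::` is what keeps the syntax unambiguous even though the URL part itself contains colons.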

~~~
batbomb
> To clarify, we don't think of these specs as URLs.

But everyone else will because you are including the protocol, and at the end
of the day, they are a uniform way of identifying a resource, so they are
functionally URIs.

Otherwise, you should probably either conform to the HTTP(S) protocol spec or
make up your own, e.g.
noms+http://dbinstance.noms.foo::database/dataset

SQLAlchemy and most DB URIs are good examples of how to do this. For example,
you can connect to a MySQL database instance and give it a default
namespace/schema/database.

Part of the issue here is the ambiguity between a database, a database
instance/server/host, a dataset/table, a catalog/namespace/schema, and what
all those words and concepts mean. There's little consensus across fields,
because even if computer scientists say "Okay, this is what a dataset actually
is", somebody, whether it's a biologist or a physicist, will throw up their
arms in protest.

------
pinko
My question is on scalability. You say "large datasets" on the website. What
is large? 1x/10x/100x terabytes? 1x/10x/100x petabytes?

What kind of access rates? Etc.

Very general answers are okay -- I'm trying to wrap my head around whether
this is even in the right ballpark for my world.

Distinguishing current proof-of-concept vs. design-goal scale is okay too.

Thanks!

~~~
aboodman
Honest answer is: we don't know yet -- we're working our way up from the
bottom.

But we (cautiously) don't see any reason why the basic design shouldn't scale
to very large (e.g. petabyte) datasets, and that is our eventual goal.

That said, we do think there are a lot (even maybe the majority) of use cases
in the GB-TB range.

~~~
aartur
Isn't the append-only design unsuitable for scenarios where many
updates/deletes are made? If you update/delete 1GB of your 2GB database each
day, then after a year the database is 365GB in size, but the live data is
only 2GB.

I think the git-like features (history, merging) are very helpful for internal
work, but when the dataset must be published, I think in most cases only the
newest snapshot should be made available. But then the question is what format
should it have...?

~~~
aboodman
It just depends on the details. If you have a dataset in which 50% of values
change every day, and it doesn't compress well, then yeah, your Noms archive
of that entire dataset is going to grow quickly.

In such situations, you could either (eventually, when it is implemented)
prune old data, or aggregate the changes into bigger blocks.

------
tlb
Strawman marketing alert: "The most common way to share data today is to post
CSV files on a website". Maybe there are a bunch of people that still do that
somewhere, but if so, they ain't early adopters of decentralized database
technology and so not your target customers. It's always better to talk about
what your most likely customers are doing now.

~~~
archgoon
This is actually extremely common.

For example, if you browse the UC Irvine ML datasets

[https://archive.ics.uci.edu/ml/index.html](https://archive.ics.uci.edu/ml/index.html)

You'll find that many are in csv format.

If you do a search on data.gov

[http://catalog.data.gov/dataset#sec-res_format](http://catalog.data.gov/dataset#sec-res_format)

You'll see that it's about as popular as JSON.

Also, the World Health Organization

[http://www.who.int/tb/country/data/download/en/](http://www.who.int/tb/country/data/download/en/)

Also, many of the datasets at kaggle are in csv format.

[https://www.kaggle.com/datasets](https://www.kaggle.com/datasets)

And this isn't that surprising: it's human-readable, it gets the job done, and
zipping gives decent compression.

I'm not sure who you think the target market for this would be, but I'm sure
that if it's an efficient local format, you could probably get the ML crowd on
board.

~~~
aboodman
Right. A shocking amount of public data is distributed this way.

Also, we routinely talk to developers who complain about the difficulty of
consuming data snapshots from partners, parsing it, trying to understand how
it has changed since last time, etc.

With high value datasets, people frequently build an API to combat these
problems. But it's hard to design a good API, and even if you succeed, it has
to be secured, documented, scaled, and maintained indefinitely.

~~~
spotman
So if Noms takes off, would you see 'download a Noms dataset by clicking here'?

Or would it be 'use this hostname to sync the dataset to your own computer'?

Or would it be a DSN sort of thing, where you just instantiate a client and
you're on your way?

Or, some combo?

~~~
aboodman
In some glorious future world you might see things like:

``` <a href="http://www.who.int/tb/country/data/download/en/::case-data/by-country">Download the Data</a> ```

~~~
kragen
Hmm, if you want people to be able to link to Noms datasets on the web, maybe
you should switch to using URLs to name the datasets, instead of a two-part
identifier with a URL separated from a dataset name by a "::"? Darcs and Git
seem to get by more or less with URLs and relative URLs; do you think that
could work for Noms too?

The super REST harmonious way to do this would be to define a new media-type
for Noms databases with a smallish document that links to the component parts.
Like torrent files, but using URLs (maybe relative URLs) instead of SHA1
hashes for the components, maybe?

~~~
aboodman
This is a good point. We never thought of these strings as URLs, but there are
places that only accept URLs (the href attribute, for example) where it would
be nice to use them.

The way we have it now is nice in that any valid URL can be used to locate a
database. I am loathe to restrict that.

Interesting point though - thank you!

~~~
kragen
Sure, I hope the ideas are useful! As some other commenters have said, if you
just use # instead of ::, I think the problem goes away?

~~~
aboodman
The hash portion of a URL is not transmitted to the server by browsers, so it
wouldn't help in the case of putting the string into a URL bar or a hyperlink.

~~~
kragen
If the resource you're linking to is a database (or, to speak more strictly,
if its only representation is a resource of a noms-database media type),
rather than an HTML page or something, can't the browser be configured to pass
it off to a Noms implementation, complete with the dataset identifier within?
I mean, that's what people do with page numbers in PDF files, right?

~~~
aboodman
Hm. True.

------
joshmarlow
Very cool. I would love to have something like this production ready. Some
day...

Anyone who finds this interesting may also be intrigued by Irmin [0] - a
library for applications to persist data in a git-compatible format.

[0] - [https://github.com/mirage/irmin](https://github.com/mirage/irmin)

~~~
seliopou
Docker for Mac > About Docker > Acknowledgements, Cmd+F: Irmin

Not only has Irmin been around longer (with full JS support thanks to
js_of_ocaml), it also has a pretty big deployment under its belt.

------
latortuga
At first glance, this reminds me a little bit of datomic - all data history is
preserved/deduplicated, fork/decentralization features. Can you comment on how
it compares?

~~~
aboodman
Thanks, we will take that as a compliment.

I feel weird speaking for them, but at a product level, I think it's fair to
characterize Datomic as an application database -- competing with things like
mongo, mysql, rethink, etc.

While Noms might be a good fit for certain kinds of application databases
(cases where history, or sync, is really important), we're really focused more
on archival, version control, and moving data between systems than on being an
online transactional database.

Also, at a technical level, unless I'm wildly mistaken, I don't believe that
Datomic is content-addressed, and I wouldn't call it "decentralized" (though
that word is a bit squishy).

------
lachenmayer
This looks really exciting, congrats to the team for launching!

Could you tell us a bit about how this compares to dat?
[http://dat-data.com/](http://dat-data.com/)

~~~
aboodman
Dat is (currently) focused on synchronizing files in a peer-to-peer network.

Noms can store files, but it is much more focused on structured data. You put
individual values (numbers, strings, rows, structs, etc) into noms, using a
type system that noms defines, and this allows you to query, diff, and
efficiently update that data.

Also Noms isn't peer-to-peer (although we hypothesize that it could run
reasonably on top of an existing network like IPFS).

------
aboodman
Hi all. I'm one of the creators of Noms. Happy to answer any questions!

~~~
bra-ket
does noms understand sql and can it do joins?

~~~
daveloyall
It doesn't support queries. It's a datastore, not a database. This is from
their FAQ. [https://github.com/attic-labs/noms/blob/master/doc/faq.md](https://github.com/attic-labs/noms/blob/master/doc/faq.md)

~~~
fizzbatter
I mean, by their definition it is a database, but I can understand your usage.
Then again, they both say it is a database, and in your link, they say it
"isn't quite there yet", so /shrug heh.

------
was_boring
It's an interesting idea.

The HN title suggested it's a database, which made me really curious as I can
finally stop using history tables (or wal logging, or the other myriad ways of
seeing a point in time). However, that doesn't seem to be the case here?

That said, the idea of "git as a datastore" does seem akin to "blockchain as
data verification". Combine those two ideas together, get PWC involved and you
have multimillion dollar deals coming in for audit protection.

~~~
aeijdenberg
I've been working on something pretty akin to what you describe, hosted
verifiable data structures (logs and maps). Rather than Blockchain it uses the
same data structures as Certificate Transparency to provide equivalent
functionality. Would love to get some feedback if you had the time to look:
[https://www.continusec.com/](https://www.continusec.com/)

------
pinko
Here's a relevant (albeit 4-year-old) StackExchange thread, "Is there a Git
for data?":

[http://opendata.stackexchange.com/questions/748/is-there-a-git-for-data](http://opendata.stackexchange.com/questions/748/is-there-a-git-for-data)

~~~
abhishivsaxena
I once wrote a super simple "git in JS" for data objects. It was less than a
couple hundred lines.

But there's also -
[https://github.com/mirage/irmin](https://github.com/mirage/irmin)

------
kragen
I've been wanting something like Noms for a while. Prolly trees sound really
promising.

In intro.md, you suggest, "If you wanted to find all the people of a
particular age AND having a particular hair color, you could construct a
second map having type Map<String, Set<Person>>, and intersect the two sets."
In that case, how should I keep the two maps in sync? Do I need to atomically
update the logic of all the instances of the application to modify both maps
instead of just one? Or do I keep the second map (the hair color index) in a
separate index database and update the index whenever I pull changes from a
remote database? (What does the API look like for getting notified of new
changes that haven't been indexed yet?)

I see that "noms sync" does both push and pull. Does that mean I can't pull
data from a database I can't write to? How does that work over HTTP — do I
need to use a special HTTP server that knows how to accept and authenticate
write requests, or can I just dump a Noms dataset in a directory and serve it
up with Apache?

Forgive me if these questions are obvious — I've read the docs I could find,
but I haven't read any of the code beyond the hr sample.

~~~
aboodman
> Do I need to atomically update the logic of all the instances of the
> application to modify both maps instead of just one? Or do I keep the second
> map (the hair color index) in a separate index database and update the index
> whenever I pull changes from a remote database? (What does the API look like
> for getting notified of new changes that haven't been indexed yet?)

Currently, you have to manually keep an index up to date. But keep in mind
that internally this is what all databases are doing -- manually reflecting
changes into indexes -- they just hide it from you.

Eventually, we imagine that there will be tools to declare indexes you want to
maintain and we'd do it for you. Note that because Noms is good at diffing,
calculating the changes that need to be re-indexed comes for free!
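
Roughly, the manual version looks like this (an illustrative sketch using plain Go maps rather than the Noms API; the `person` record and the hair-color index follow the example from intro.md, and the diff here is computed by naive comparison where Noms would hand you the changes directly):

```go
package main

import "fmt"

type person struct{ name, hair string }

// updateIndex applies the diff between the old and current person maps
// to a hair-color index (hair -> set of ids), touching only the entries
// that changed rather than rebuilding the whole index.
func updateIndex(index map[string]map[string]bool, old, cur map[string]person) {
	for id, p := range old { // drop entries whose hair changed or row was deleted
		if np, ok := cur[id]; !ok || np.hair != p.hair {
			delete(index[p.hair], id)
		}
	}
	for id, p := range cur { // add entries that are new or changed
		if op, ok := old[id]; !ok || op.hair != p.hair {
			if index[p.hair] == nil {
				index[p.hair] = map[string]bool{}
			}
			index[p.hair][id] = true
		}
	}
}

func main() {
	index := map[string]map[string]bool{}
	v1 := map[string]person{"p1": {"Ann", "red"}, "p2": {"Bob", "brown"}}
	updateIndex(index, map[string]person{}, v1)
	// Ann dyes her hair; only her index entries move.
	v2 := map[string]person{"p1": {"Ann", "brown"}, "p2": {"Bob", "brown"}}
	updateIndex(index, v1, v2)
	fmt.Println(len(index["brown"]), len(index["red"])) // 2 0
}
```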

------
tombert
I'm surprised you didn't use a functional language like Haskell or OCaml or
Rust to do this, since the article talks about love for functional
programming.

I'm not criticizing Go at all, it's just not really a functional language.

------
nathancahill
Excellent! This has been on my "things to build someday" list for a while now.
Excited to start playing with it.

~~~
DanWaterworth
Same here, though in my case it was on my list of things to continue building.

------
paxcoder
Pretty impressive work but seems like reinventing wheels. Why wasn't it built
upon existing tech?

I think the docs should enumerate the most important differences and use cases
for which it should be a better fit.

~~~
2bitencryption
To play devil's advocate, Git "reinvented the wheel," but it was a much nicer
wheel.

Not saying this is to databases what Git was to versioning, but there's a
reason to strive for that.

~~~
paxcoder
Git's author felt the alternative (gratis) systems were lacking. Noms's
author, on the contrary, praises Git but doesn't build on it. He chooses to
implement the same technology himself, and from the docs it's not clear to me
why that is.

~~~
shykes
Off the top of my head: git can only use SHA-1, which makes it unsuitable for
any use case where you need to cryptographically verify the origin of data (so
far nobody has been able to tell me definitively how secure git signed commits
and tags really are).

~~~
eridius
Assuming SHA1 has second pre-image resistance (which it currently still does),
the security of git signed commits/tags is the same thing as the security of
the private key used to sign the commits/tags.

------
fizzbatter
This is really interesting! What are some ideal use cases for the current
implementation? I've seen Git is considered a competitor, but Noms also
appears to be a generic database, so I would just like to hear some basic use
cases, if possible.

Eg: If used as a database, what applications would benefit from Noms?
Could/should this be used for personal storage? Could/should this be used for
code versioning (ie, Git)?

~~~
hibbelig
The way I read it, git is not a competitor but rather an inspiration. They are
taking ideas from git to apply them to a different domain.

~~~
fizzbatter
Fwiw, i was referring to this:
[https://news.ycombinator.com/item?id=12212276](https://news.ycombinator.com/item?id=12212276)

The author explicitly says Git is a competitor.

------
robzyb
Wow, this could be quite interesting.

Firstly, it would be cool if this could be a single gateway to "all the data
in the world". Right now it's a pain to find, say, energy generation
statistics for, say, Portugal, but it would be great if I could do something
like:

      noms get statistics.industry.energy.portugal.all();

Secondly, the versioning idea could have some really cool applications. For
example, I work in data analytics, and sometimes I want to transform some data
in an SQL table.

Doing transformations nicely is a bit difficult. Either I'm doing the
calculations in a column of a view, with the associated performance hit, or
I'm tacking columns onto the table, which quickly leads to a mess, especially
during the initial stages of analyses.

It would be so cool if I could treat the database as a constantly-evolving git
tree.

------
juol
Your mascot looks like it's giving an 'air' blowjob.

Otherwise looks like a cool project, keep up the good work!

~~~
czbond
I could not stop laughing after I read this...

------
shruubi
I really like the idea in theory, but seeing it in practice I feel the whole
thing is too concerned with being a wrapper around git handling for their
dataset files. I would much rather see diffs based around the records
themselves, and not so much the structure of the data.

~~~
mikergray
While Git is referenced as an inspiration, the implementation of Noms does not
use Git. Noms performs diffs on the data - as records or whatever other
structure you used in importing your data. CSV is but one example of a way to
import data into Noms, but since so much data is available in that format it
is an easy one to reference that most people know. Noms can also import JSON,
XML and many other data types if you are willing to write JS or Go code (more
to come). Thanks for taking a look at Noms!

------
phantom_oracle
I don't want to downplay this idea, it really is nice to see people doing
different/unique things with technology.

However, one question I have is:

Couldn't you just put CSV/JSON file(s) behind a VCS?

E.g. drop my CSV/JSON file(s) onto github.com and then they will be version-
controlled?

~~~
aboodman
You can, and people do that today. It has limitations though:

      * The data must be sorted in order for Git to provide good diffs
      * It does not scale very well. On my machine, Git refuses to diff files over 1GB (maybe there is a setting for that)
      * You must clone the entire repository onto your machine to work with it
      * There is no programmatic API -- you must work with the data and changes as text and line diffs

See
[https://www.youtube.com/watch?v=Zeg9CY3BMes](https://www.youtube.com/watch?v=Zeg9CY3BMes)
for a little bit more on this topic.

------
chenster
"...inspired by the elegance and power of Git for years.."

Definitely powerful, but elegance?

~~~
aboodman
If you ever look into the internals of how Git works, it is beautiful. Yeah,
the UI is kinda a mess, but the idea is inspired.

~~~
chenster
"UI is kinda a mess" \- amen.

------
woodcut
We've been struggling to manage a collection of periodically updated CSVs and
binaries a few GB in size. We struggled with Git-LFS and gave up, and were
considering (dreading) SVN, so this looks really promising. Cheers!

------
ah-
Can you elaborate a bit on how the hashing and chunking works? There's a
rolling hash for determining chunk boundaries, and also SHA-512/256 somewhere.

Does the same data chunked differently have a different hash?

~~~
aboodman
We never chunk the same data differently. An inviolable rule of Noms is that
the same logical value is always chunked the same way and always has the same
hash.

If I start with integers 1-1000000 and you start with integers 0-999999, and
we both make mutations to converge at the same list, we will end up with the
exact same tree, with the exact same hashes.

This is what makes efficient synchronization and diff of noms data possible.
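
Content-defined chunking is what makes that rule workable: chunk boundaries are chosen by a rolling hash over the bytes themselves, so they fall in the same places no matter how the value was built up. A toy sketch (the window size, modulus, and tiny target chunk size here are made up for illustration; Noms' actual rolling hasher aims for chunks around 4KB):

```go
package main

import (
	"crypto/sha512"
	"fmt"
)

// chunk splits data at content-defined boundaries: a cheap rolling sum
// over a small sliding window decides where chunks end. Because each
// boundary depends only on the bytes near it, the same data always
// produces the same chunks, and a local edit only disturbs nearby
// boundaries instead of shifting every chunk after it.
func chunk(data []byte) [][]byte {
	const window, modulus = 4, 7
	var chunks [][]byte
	start, h := 0, 0
	for i, b := range data {
		h += int(b)
		if i >= window {
			h -= int(data[i-window]) // slide the window forward
		}
		if h%modulus == 0 && i+1-start >= window {
			chunks = append(chunks, data[start:i+1])
			start = i + 1
		}
	}
	if start < len(data) {
		chunks = append(chunks, data[start:])
	}
	return chunks
}

func main() {
	data := []byte("hello world, hello noms, hello world")
	for _, c := range chunk(data) {
		sum := sha512.Sum512_256(c) // each chunk is then content-addressed
		fmt.Printf("%x %q\n", sum[:4], c)
	}
}
```

Two parties who converge on the same byte sequence get the same boundaries, hence the same chunk hashes, hence the same tree.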

~~~
ah-
Thanks. So it's building a hash tree with deterministic chunking and that
means you can cheaply update the hash after updating parts of the tree as you
only have to rehash certain bits?

Does that mean that your chunk sizes are kind of fixed? Do you think there's a
way to retain that advantage and be able to coalesce smaller chunks into
larger ones?

Say your smallest nodes are 4KB but for more efficient storage you might want
to go up to 4MB chunks. Could that be done while retaining the same hash for
the same underlying data?

~~~
bkalman
Keep in mind (if this wasn't clear) that the chunks are only probabilistically
4K: [https://github.com/attic-labs/noms/blob/master/go/types/rolling_value_hasher.go#L16](https://github.com/attic-labs/noms/blob/master/go/types/rolling_value_hasher.go#L16).
I.e. the thing that's "fixed" here is the chunk size we're aiming for. The
chunks themselves could be of any size.

In any case, that's a good question - we might want to do something about that
down the line. But if we did change that constant, the structure of the trees
would change, and all[1] the hashes would change.

[1] a small number will stay the same

------
anilgulecha
No one's mentioned this yet, but with a good (mongo-like) query interface,
this could add an important database to the offline-first movement.

(Right now pouchdb or gundb are the only available options.)

------
musicmatze
This looks really interesting. I've been thinking about the problem of
distributed issue tracking lately... and the set of sub-problems it has
(authorization and authentication, synchronization and so on)... I'm not sure
all these problems could be covered by this, but I guess at least the
"distributed" part could be covered by something like this.

------
cdbattags
I had an idea for this with a buddy in college after doing case study research
into Git. I've always considered this the next step into a decentralized world
outside of code and non-typed "text". I know CSVs were mentioned a few times;
are you looking to narrow in on a few specific file types for proof of
concept?

~~~
erikarvidsson
We have implemented a bunch of importers. One of them is CSV. Take a look at
[https://github.com/attic-labs/noms/tree/master/samples/go/csv](https://github.com/attic-labs/noms/tree/master/samples/go/csv)

We envision there to be tools that work on certain data types (Noms has a full
type system), for example an app that displays all geo locations in a dataset.

------
billconan
I'm curious about merging.

When there is a conflict, like when a file gets changed by different people,
how is merging performed?

~~~
aboodman
Not implemented yet, but here is the plan: [https://github.com/attic-labs/noms/issues/148](https://github.com/attic-labs/noms/issues/148)

------
kfk
Those are exactly the kind of ideas the finance world needs to get out of its
eternal mess of spreadsheets.

------
nkohari
This is really interesting, thanks for sharing it!

I haven't had a chance to dig into the code yet, but I notice that you say two
replicas of the same database can be disconnected, altered, and then merged.
Could you explain how Noms takes care of that, particularly in the case of
collisions?

------
ianai
This really piqued my interest and my "next big thing" sense.

------
pbkhrv
Something like this could be used as a backing store for package managers like
npm or apt or ruby gems or pypi.

------
sigi45
How do you handle hash collision?

~~~
aboodman
We assume that within a given version of the database format, there will never
be a collision. The chances of a SHA-2 collision are beyond astronomical, and
if you can create one, there are better things to do with your time than
bother us.

That said, hashes only get weaker over time. The chances of an MD5 collision
used to be astronomical; now they are not.

So it was important to us to have an escape hatch - a way to increase the
strength of the hash we use over time.

That's why we built a format version into Noms from the beginning. Our design
is predicated on the fact that within a given version of the format, there is
a 1:1 correspondence between hashes and values. Every value has exactly one
hash, and every hash encodes exactly one value.

In future versions of the format, we might change the hash function. In this
situation, we'd need to import data from the old format to the new format,
just like how you have to sometimes migrate traditional databases across
versions.
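
The 1:1 hash-to-value correspondence can be pictured as a content-addressed chunk store. This is a toy sketch, not Noms code; the `store` struct, its API, and the version string are all invented for illustration:

```go
package main

import (
	"crypto/sha512"
	"fmt"
)

// store is a toy content-addressed store: every value is keyed by the
// hash of its bytes, so within one format version a hash denotes
// exactly one value. The format version is kept alongside so a future
// migration (e.g. to a stronger hash) can rewrite the store wholesale.
type store struct {
	formatVersion string
	chunks        map[[32]byte][]byte
}

func newStore() *store {
	return &store{formatVersion: "example-1", chunks: map[[32]byte][]byte{}}
}

func (s *store) put(v []byte) [32]byte {
	h := sha512.Sum512_256(v)
	s.chunks[h] = v // writing the same value twice is a no-op: same hash
	return h
}

func (s *store) get(h [32]byte) []byte { return s.chunks[h] }

func main() {
	s := newStore()
	h1 := s.put([]byte("hello"))
	h2 := s.put([]byte("hello"))
	fmt.Println(h1 == h2, len(s.chunks)) // true 1: identical values dedupe
}
```

Migrating to a new hash function would mean walking every value, re-serializing it under the new format, and storing it under its new hash, which is why it is a format-version bump rather than a transparent change.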

------
mschaef
First off... I'm excited to see this project. There's a lot of potential here
and this looks like a good implementation of a nice concept. I have at least a
bit of authority behind that statement, since a few years ago, I had the
opportunity to build something similar (although smaller in ambition.) A
couple things to think about:

* Type accretion - This doesn't change the fact that database clients need to
be able to accept historical data formats if they need to access historical
data. The schema can't be changed for the older data objects without changing
the hashes for that data, so there's no way to do the equivalent of an SQL
schema migration. For simple schema changes like adding fields, this might not
be so hard to deal with, but some changes will be structural in nature and
change the relative paths between objects. (This adds complexity to the code
of database clients, as well as testing effort.)

* Security - Is there a way to secure objects stored within noms? Let's say I
store $SECRET into noms and get back a hash. Does it then become the case that
every user with access to the database and the hash can now retrieve the
$SECRET? What if permissions need to be granted or revoked to a particular
object after it's been stored? A field within a particular object? What if an
object shouldn't have been stored in the database at all and needs to be
obliterated? (This last problem gets worse if the object to be obliterated
contains the only path to data that needs to be retained.)

 __* Performance - The CAS model effectively takes the stored data, runs it
through a blender, and returns you a grey goo of hashes...this is good for
replication, but it means you can 't get much meaningful information out of a
hash. This tends to mean a lot of operations like you might find in an old-
school navigational database, and a huge dependency on the time to fetch an
object given a hash. Indices can help by reducing the complexity of the
traversals you need to do, but only if they're current and you have the index
you need.

 __* Data roll off - How do you expire off data so that it doesn 't just
monotonically increase in volume? Let's say there's an API to mark an object
as purgeable, the problem of identifying other purgeable objects turns into
effectively a garbage collection process. (git gc, etc.) There's also the
issue of the sheer number of objects that can be involved. The system I was
involved with had something like 500K objects/day that had to be purged after
120 days in the system. (Total of 60MM objects line and around 6TB or so)
Identifying 500K objects to purge and then specifying those to the data layer
for action is not necessarily an easy thing....

 __* Querying - Server side query logic (and an expression language) is
basically essential to performance. Otherwise, you wind up with a network
round trip for every edge of the graph you follow. Going back to my first
point, whatever querying language is used has to be flexible enough to handle
a schema that might be varying over time (through schema accretion).

All four of these bullet points are worthy of a great deal more discussion,
and I haven't even broached issues around conflict resolution, differencing,
UI concerns, etc. I think there are good approaches to managing lots of these
issues, but there's a bunch of engineering involved, as well as some close
attention to scope and goals...

~~~
aboodman
\- Type accretion: I don't think in general that schema changes like what
happens in sql databases works very well (I say this having worked on such
systems). In big systems, it's hard to get everyone to agree on a moment to
CHANGE THE SCHEMA. You can certainly do something like that in Noms -- just
write a new dataset and replace the old one. But being able to read old data
and leave old clients working I think is powerful. Couple this with the
structural typing that falls naturally out of Noms and - I think - you have a
more flexible way to change schemas over time.

\- Security: current thoughts: [https://github.com/attic-
labs/noms/issues/1183](https://github.com/attic-labs/noms/issues/1183)

\- Perf: I'm not really following you here. CAS has some positives and some
negatives for performance.

\- expiration: 1. there are a huge number of systems today that never delete
data. Taking advantage of that to make other operations faster makes sense. 2.
yeah, it's a gc problem. luckily gc is a well-studied problem. Also, as Noms
is a merkle tree and merkle trees are good at diff, we have some additional
leverage. We don't need to do a full scan everytime.

\- querying: disagree that it is essential to perf. Another option is to have
a schema that matches your access model. You can do that server-side in
addition (or instead) of having a query language.

===

It sounds like you have thought a lot about all of this! If you are
interested, your brain would be very appreciated in the github or slack.

~~~
mschaef
> It sounds like you have thought a lot about all of this!

Up until around 2014, I was heavily involved in the construction of a small
CAS (100MM objects online, around 5-6TB in size) for a client that needed to
replicate certain periodic calculations in a reliable way. It worked well, but
something like noms would have eliminated the need for a bunch of custom work.

> If you are interested, your brain would be very appreciated in the github or
> slack.

I'll take a look... thanks for the invite!

------
rejschaap
Interesting project, would just like to say that the Git workflow isn't that
great and CVS isn't that bad.

The Git workflow is quite complicated and will probably not appeal to people
who typically just use Excel for everything.

It is true that CVS is messy, but its strength is that it is really simple,
and it can easily be fixed.

Also, CVS can be versioned with Git quite well in many cases.

