
Noms – A versioned, forkable, syncable database - jaytaylor
https://github.com/attic-labs/noms.git
======
niftich
Open-source tech like this is nice. This could be used to build a distributed
document editing application, for example. Or any application where you want
to spin off multiple instances and reconcile the data later.

EDIT: At least one team is investigating layering Noms on top of IPFS [1]. I
guess the idea would be to construct something similar to GitTorrent [2];
layering various version-controlled datastores on various p2p protocols could
result in several viable architectures.

[1] [https://github.com/attic-labs/noms/issues/2123#issuecomment-...](https://github.com/attic-labs/noms/issues/2123#issuecomment-245162820)

[2] [https://github.com/cjb/GitTorrent](https://github.com/cjb/GitTorrent)

------
jbverschoor
Weird choice for the uri:
[http://localhost:8000::dbname](http://localhost:8000::dbname)

Why not [http://localhost:8000/dbname](http://localhost:8000/dbname) ?

~~~
lost_my_pwd
Wow, I missed that. That violates the URI spec, assuming the author(s) were
intending to use a URI.

[https://tools.ietf.org/html/rfc3986](https://tools.ietf.org/html/rfc3986)
(see "3.3. Path").

~~~
drostie
Right, it intentionally violates the URI spec by appending something to the
end of it. The data structure they're storing has a natural pair-of-structures
at its top level:

    
    
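        import Data.Text (Text)
    
        -- Placeholder aliases, assumed here only so the sketch type-checks
        type Path = Text
        type URL  = Text
    
        data Database = InMemory | LevelDB Path | ViaHTTP URL
    
        newtype DataSet = DS Text
        newtype Hash = Hash Text
        data Accessor = AccessDS DataSet | AccessValue (Either DataSet Hash) Path
    
        type DBAccessor = (Database, Accessor)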
    

They elected to basically encode a DBAccessor above as a string which you can
split on "::", with the URL above being stored on the left in the case of the
ViaHTTP databases.
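
The "::" encoding described above can be sketched in a few lines of Python. This is only an illustration of the split, not Noms' actual parser: the names `Spec` and `parse_spec` are hypothetical, and splitting on the last "::" (so that a database part containing "::" survives) is an assumption.

```python
from typing import NamedTuple


class Spec(NamedTuple):
    """Hypothetical container for the two halves of a spec string."""
    database: str  # e.g. "http://localhost:8000"
    accessor: str  # e.g. a dataset name such as "dbname"


def parse_spec(spec: str) -> Spec:
    """Split a "<database>::<accessor>" string on its last "::"."""
    # rpartition splits on the LAST "::", so a database part that itself
    # contains "::" (for example an IPv6 host like http://[::1]:8000)
    # stays intact. This choice is an assumption of the sketch.
    database, sep, accessor = spec.rpartition("::")
    if not sep or not database or not accessor:
        raise ValueError(f"expected '<database>::<accessor>', got {spec!r}")
    return Spec(database, accessor)
```

Calling `parse_spec("http://localhost:8000::dbname")` yields the URL on the left and the dataset name on the right, which also shows why the whole string is not itself a valid URI.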

~~~
aboodman
The string wasn't originally intended to be a URI, but I've been subsequently
convinced that it would be useful for it to be one. We'll change it
eventually.

------
tominous
Adam Leventhal (DTrace, OpenZFS) took a look at building a FUSE filesystem on
Noms using Go.

[http://dtrace.org/blogs/ahl/2016/08/09/nomsfs/](http://dtrace.org/blogs/ahl/2016/08/09/nomsfs/)

[https://news.ycombinator.com/item?id=12255450](https://news.ycombinator.com/item?id=12255450)

~~~
orblivion
Sounds like what you'd want for a self-hosted Dropbox clone. I wonder what
Syncthing, for instance, uses for reconciling differences on different
clients.

~~~
davidron
Syncthing doesn't reconcile differences. Instead a copy of the file is
created.

    
    
      Syncthing does recognize conflicts. When a file has been modified on two devices simultaneously, one of the files will be renamed to <filename>.sync-conflict-<date>-<time>.<ext>. The device which has the larger value of the first 63 bits for his device ID will have his file marked as the conflicting file. Note that we only create sync-conflict files when the actual content differs.
    

[https://docs.syncthing.net/users/faq.html](https://docs.syncthing.net/users/faq.html)
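
As a rough illustration of the rule quoted above (not Syncthing's code), the renaming and the tie-break might look like this in Python. The function names are hypothetical, and the exact date/time layout, the big-endian reading, and treating the device ID as raw bytes are all assumptions.

```python
import os
from datetime import datetime


def conflict_name(path: str, when: datetime) -> str:
    """Build "<filename>.sync-conflict-<date>-<time>.<ext>" for path."""
    stem, ext = os.path.splitext(path)  # ext keeps its leading dot
    stamp = when.strftime("%Y%m%d-%H%M%S")  # assumed date/time layout
    return f"{stem}.sync-conflict-{stamp}{ext}"


def first_63_bits(device_id: bytes) -> int:
    """Read the first 8 bytes (64 bits) big-endian and drop the lowest bit."""
    return int.from_bytes(device_id[:8], "big") >> 1


def marked_conflicting(id_a: bytes, id_b: bytes) -> bytes:
    """Return the device ID whose file is renamed: the larger 63-bit value."""
    return id_a if first_63_bits(id_a) > first_63_bits(id_b) else id_b
```

Note that nothing here merges content: one copy simply keeps the original name and the other is set aside.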

------
aboodman
Hi Hacker News. I'm one of the founders of the Noms project and Attic Labs,
the company behind it. Happy to answer any questions.

In the meantime, as long as I've got your attention, here are a few new things
we've been working on since the last time Noms was discussed here in August:

\- A prototype query language, and a demo of how to create indexes in Noms:
[https://www.youtube.com/watch?v=fv6_T5yaWns](https://www.youtube.com/watch?v=fv6_T5yaWns)

\- Support for merging concurrent (and potentially conflicting) changes:
[https://www.youtube.com/watch?v=--7dgoJBdjU](https://www.youtube.com/watch?v=--7dgoJBdjU)

~~~
rwmj
There's quite a dramatic claim on the website, "merge [...] changes
efficiently and correctly days, weeks, or years later." How does that work?
For example if you have two records saying userid 3's name is "ann" and userid
3's name is "jane", I don't see how you could merge those without extra
information or human input.

~~~
aboodman
The claim on the website is not meant to suggest that _any_ two changes can be
automatically merged. I will try to clarify that.

The world contains logical conflicts because physical constraints mean that
processes can operate disconnected from each other. No database can wave that
away.

Noms will automatically, efficiently, and correctly merge changes that _don't_
logically conflict. Which is a pretty cool and unique property in a
database.

If any conflicts are found, there is a callback to user software to perform a
resolution.

More info in the documentation:

[https://godoc.org/github.com/attic-labs/noms/go/merge](https://godoc.org/github.com/attic-labs/noms/go/merge)
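
To make the idea concrete, here is a minimal three-way merge over plain Python dicts with a conflict callback. It is a sketch of the general technique only and does not use or mirror the Noms merge package; `three_way_merge`, `MISSING`, and the resolver signature are hypothetical names chosen for this example.

```python
from typing import Any, Callable, Dict

MISSING = object()  # sentinel meaning "this key is absent on that side"

# Hypothetical callback signature: (key, base, ours, theirs) -> resolved value
Resolver = Callable[[str, Any, Any, Any], Any]


def three_way_merge(base: Dict[str, Any], ours: Dict[str, Any],
                    theirs: Dict[str, Any], resolve: Resolver) -> Dict[str, Any]:
    """Merge two edited copies of base; call resolve only on real conflicts."""
    merged = {}
    for key in base.keys() | ours.keys() | theirs.keys():
        b = base.get(key, MISSING)
        o = ours.get(key, MISSING)
        t = theirs.get(key, MISSING)
        if o == t:        # both sides agree (including both deleting the key)
            value = o
        elif o == b:      # only theirs changed this key
            value = t
        elif t == b:      # only ours changed this key
            value = o
        else:             # both changed it differently: a logical conflict
            value = resolve(key, b, o, t)
        if value is not MISSING:
            merged[key] = value
    return merged
```

With the example from the question, if one side sets userid 3's name to "anne" and the other sets it to "jane", only that key reaches the callback; an unrelated rename of userid 1 and an added userid 4 merge without any human input.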

~~~
nradov
IBM Domino (aka Lotus Notes) has been automatically, efficiently, and
correctly merging changes that don't logically conflict since 1989. How is the
functionality in Noms unique?

------
haalcion3
I've been wanting to use something like this.

 _But..._

* It's a big jump from relational or NoSQL DBs, so there aren't (m)any adapters that I can see for it for JPA, ActiveRecord, etc.

* I'd really like to see a benchmark for each noms implementation compared to postgres, mysql, oracle, and mssql server, if there is a way to do apples-to-apples.

* "noms" is unfortunately really bad for SEO because noms is a common word in French. If it could be nomsdb or nomnomnoms or something less exactly French, that'd be better. It's going to be tough to find support online easily otherwise.

* SQL compatibility.

* Fault tolerance (how easily does it corrupt), HA, mirroring, full/partial replication, sharding, archival, partial history truncation, etc.

It seems a little like a dolphin jumping into a pool of hungry sharks. It
might be more evolved and more capable in some ways, but it's going to get its
ass handed to it on speed and lack of features.

Still- I can't wait to try it.

~~~
actuallyalys
>It seems a little like a dolphin jumping into a pool of hungry sharks. It
might be more evolved and more capable in some ways, but it's going to get its
ass handed to it on speed and lack of features.

I'm inclined to agree for large, centralized databases, but I wonder if this
would be a good fit for places where sqlite is used? This seems like it could
be a good foundation for situations where you want to sync information without
a central server, like between devices. An Access/Filemaker clone built on top
of this would be cool, too.

------
kccqzy
I think there is a very similar library in Haskell called project m36. Here's
its github page on transactions:
[https://github.com/agentm/project-m36/blob/master/docs/trans...](https://github.com/agentm/project-m36/blob/master/docs/transaction_graph_operators.markdown)

------
duck
Previous discussion from back in August:
[https://news.ycombinator.com/item?id=12211754](https://news.ycombinator.com/item?id=12211754)

------
marknadal
Noms is a great example of the power of decentralized database technology, the
interesting research that goes into such systems, and wonderful documentation
to browse.

I do want to note some tradeoffs with Content-Addressed and Append-Only
systems, as my work on a similar project ( an Open Source Firebase,
[https://github.com/amark/gun](https://github.com/amark/gun) ) made me move
away from those ideas (even though they are great ideas).

\- Content-Addressed stores are going to revolutionize data integrity and
efficiency. But they do have a trade-off: it is a lot harder to read the
data if you do not already know the data you are trying to read! The bottom of
the repo mentions, for instance, that a query system has not yet been built. In
my experience, the reason is that it is difficult to build query systems
on Content-Addressed stores, which is the price of all the gains you can
get from them.

\- Append-Only gives you rich features like offline-first support and (if
implemented) lovely things like rewind/fastforward data time travel. All very
cool. However, do not forget that this then also makes it difficult for you to
retrieve the latest whole snapshot of your data. So you are not going to get
the read performance that you could.
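
For readers unfamiliar with the two terms, a toy content-addressed, append-only store can be sketched as follows. This is not how Noms or gun is implemented; the class name, the use of SHA-256, and JSON serialization are assumptions made for the illustration.

```python
import hashlib
import json


class ContentStore:
    """Toy store: values are keyed by the hash of their serialized bytes."""

    def __init__(self):
        self._chunks = {}  # hex digest -> serialized bytes
        self._log = []     # append-only list of committed digests

    def put(self, value) -> str:
        data = json.dumps(value, sort_keys=True).encode("utf-8")
        digest = hashlib.sha256(data).hexdigest()
        self._chunks[digest] = data  # identical content is stored only once
        return digest

    def get(self, digest: str):
        # Reading by address is direct, but you must already know the digest.
        return json.loads(self._chunks[digest])

    def commit(self, value) -> str:
        digest = self.put(value)
        self._log.append(digest)  # history is never rewritten
        return digest

    def head(self):
        """Latest committed value."""
        return self.get(self._log[-1])

    def find(self, predicate):
        # Without an index, a lookup by content has to scan every chunk.
        return [v for v in map(json.loads, self._chunks.values()) if predicate(v)]
```

The `find` method is where the trade-off described above shows up: without some extra index structure, a query by field value degrades to a full scan.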

But the only way for us, as a community of people playing around
with databases, to figure out what the best system is, is for people to
build and experiment. Which is part of the reason why Noms is so cool. It is an
invitation to others to actually join, play, and experiment with database
technology in an open and encouraging environment. That is incredibly valuable
and needed!

~~~
aboodman
1\. We have prototyped basic query functionality already
([https://www.youtube.com/watch?v=fv6_T5yaWns](https://www.youtube.com/watch?v=fv6_T5yaWns))
and Noms was designed from the beginning to support efficient indexes and
range scans. So I'm not sure why it would be harder for us to support a query
language than any other database.

2\. It's true that content addressing can hurt data locality, which in turn
hurts read performance. However, there are things you can do to get a lot of
that back.

------
100ideas
Looks farther along than [http://dat-data.com/](http://dat-data.com/), another
commendable distributed VCS for data. One distinction is that dat provides
additional utilities for querying and compositing the data structures
represented in any csv, json, and yaml files that it stores.

~~~
100ideas
One of the other design goals of Dat is to support continually-divergent
forking, which they perceive as being useful for communities of analysts
processing common datasets but to ultimately different ends. Of course, you
never have to merge forks in git, but in its current form they (dat devs) say
that it's not really ideal.

------
angel-manuel
I really like this. I've always thought that git needed to support diff modes
other than line-based text diffs. Line-based diffs fit most programming
languages, but what you really want is to see the differences between ASTs
(consider those absurd change counts when you only change the indentation, or
imagine a normal diff of LISP source). Maybe there's some way of replacing git
with noms to get there (even if it may be killing flies with cannonballs).

~~~
WorldMaker
For what it is worth, in my experiments most ASTs (the rare exception being
something like Roslyn's C#/VB ASTs) don't do well in "degenerate states" such
as partially finished files. (A good source control system should let you
commit unfinished work.) I did have great success using syntax highlighting
tokenizers. I was able to create really nice-looking character-based diffs
that were relatively semantic, quite quickly. I've not tried to use that as
the basis of diffs for something like git, though I've suggested trying it
before.

Python code, if interested:
[https://github.com/WorldMaker/tokdiff](https://github.com/WorldMaker/tokdiff)
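
A minimal sketch of the token-based approach, for illustration only: this is not tokdiff's implementation, and it swaps a real syntax-highlighting tokenizer for a crude regular expression (an assumption made to keep the example dependency-free). The names `tokenize` and `token_diff` are hypothetical.

```python
import difflib
import re

# Crude stand-in for a syntax-highlighting tokenizer: runs of whitespace,
# runs of word characters, or any other single character.
TOKEN = re.compile(r"\s+|\w+|[^\w\s]")


def tokenize(source: str) -> list:
    """Split source into tokens; joining them reproduces the input exactly."""
    return TOKEN.findall(source)


def token_diff(old: str, new: str) -> list:
    """Return (tag, old_text, new_text) for each changed run of tokens."""
    a, b = tokenize(old), tokenize(new)
    matcher = difflib.SequenceMatcher(None, a, b, autojunk=False)
    changes = []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag != "equal":
            changes.append((tag, "".join(a[i1:i2]), "".join(b[j1:j2])))
    return changes
```

Because the tokenizer never fails on incomplete input, this kind of diff keeps working on half-finished files, which is the advantage over AST-based diffs mentioned above.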

------
rpedela
Very interesting, I think we need a git for data. What is the performance of
diffs and merges? At what data size does it become too slow?

~~~
aeharding
If by "Git for data" you mean accumulate-only (or append-only), immutable data
stores... There are already many existing solutions. It's always good to see
alternatives, though!

~~~
rspeer
Can you point me at some? Because I've tried a few immutable data stores and
been disappointed every time. Given about 10 GB of JSON structures, I keep
finding things that can't outperform the boring combo of:

* Convert the versioned data to tab-separated values

* COPY it into Postgres every time

* Hope Postgres can act immutable enough even though it wasn't designed to be

The closest I've come to improving this situation was Kyoto Cabinet (unusable
license) and rolling my own damn hashtable (it worked okay but adding new
kinds of indexes was just unmaintainable, there's a reason databases should be
made by experts).
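
The "convert to tab-separated values" step in the list above might look roughly like this in Python, targeting the text format that PostgreSQL's `COPY ... FROM` reads (tab-delimited, `\N` for NULL, backslash escapes). The function names and the choice to serialize nested values as JSON text are assumptions of this sketch, not a description of the poster's actual pipeline.

```python
import json

# Characters that must be backslash-escaped in COPY's text format.
_ESCAPES = {"\\": "\\\\", "\t": "\\t", "\n": "\\n", "\r": "\\r"}


def copy_field(value) -> str:
    """Render one value as a field in COPY text format."""
    if value is None:
        return "\\N"  # COPY's default NULL marker
    # Strings pass through; numbers, booleans, and nested structures
    # are written as JSON text (an assumption of this sketch).
    text = value if isinstance(value, str) else json.dumps(value)
    return "".join(_ESCAPES.get(ch, ch) for ch in text)


def to_copy_rows(records, columns):
    """Yield one tab-separated line per record, in the given column order."""
    for rec in records:
        yield "\t".join(copy_field(rec.get(col)) for col in columns) + "\n"
```

The resulting lines can be written to a file or streamed to `COPY table (columns...) FROM STDIN`; keys missing from a record become NULLs.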

~~~
sjezewski
(Disclaimer - I work at pachyderm)

[http://pachyderm.io](http://pachyderm.io)

Pachyderm is git for data. We work hard to make sure we can store data of
different types (binary, text, json) efficiently. We also work hard to give
you good mechanisms to read the data in a distributed way. I'd be curious how
this suits your purposes.

~~~
rspeer
Just started looking at Pachyderm.

While I can see how a git-based filesystem can help with some use cases, does
it do any kind of indexing at all? I see that the FAQ recommends exporting the
data from Pachyderm into PostgreSQL, which leaves me where I am now.

------
mahyarm
I hope there is a prune option to delete very old commits.

~~~
aboodman
There isn't yet, but there could be (ala shallow clone in Git).

------
benjismith
Nice. I'm super excited about this!

I've hand-rolled something a lot like this already for the Shaxpir backend,
but it would be really nice to have a well-engineered database that already
supports this kind of model, out of the box.

------
jimktrains2
I've been working on some syncing addressbook, calendar, password manager, and
notes applications. My idea was to use mdns to announce presence and git to
sync, but this might be (more) useful

~~~
aboodman
Noms should definitely be more useful in that scenario. We have some customers
who were using Git the way you describe and replaced it with Noms and have
been very happy with the results.

------
kfk
Have you had a look at the finance world? Git for data seems to be something
we in finance really need, especially the possibility of seeing all the
changes and of reconciling things.

~~~
barrkel
diff for financial data, with attendant workflow for breaks, is an already
existing whole market segment. Duco, the startup I work for, is tackling it as
a service.

------
sroussey
How does this compare to gun.js.org?

------
ReAzem
Their logo is a squirrel giving an invisible blowjob

~~~
a3n
It's an otter, floating on its back.

------
WayneBro
How is it that you have 2 reference implementations, written in 2 different
cross platform environments, yet there is no support for Windows?

Why would I use this if I can't use it everywhere?

~~~
ekianjo
Most devs don't use windows these days.

~~~
nbevans
Good devs use all three platforms. Or at least two.

~~~
ekianjo
Use, because they have to. That being said, they usually develop more in one
than the others. I have not yet met anyone who was equally proficient in
developing software across all platforms/environments.

~~~
WayneBro
> Use, because they have to.

...in your very uninformed opinion...

