
Osm-p2p: a peer-to-peer distributed OpenStreetMap database - gmaclennan
http://www.digital-democracy.org/blog/osm-p2p/
======
pfraze
Cool, this is made by substack (the author of browserify) using internal
protocol-modules of the dat project. Glad to see it launch.

[https://twitter.com/substack](https://twitter.com/substack)

[https://github.com/mafintosh/hyperlog](https://github.com/mafintosh/hyperlog)

[http://dat-data.com/](http://dat-data.com/)

------
Doctor_Fegg
This is certainly worthwhile and interesting in a lot of ways, but I'm
uncomfortable with calling it OpenStreetMap. OSM's raison d'être is to be
collaborative and purely factual, whereas this is providing "your own, private
OpenStreetMap" ([http://www.digital-democracy.org/blog/openstreetmap-
without-...](http://www.digital-democracy.org/blog/openstreetmap-without-
servers/)). Naming it something like CommunityMap, while nodding to the fact
it's based on parts of the OSM stack, would have been clearer and kinder.

(Also trademark issues, but let's not get into those here.)

~~~
chippy
The project seeks to both accept OSM.org data and to contribute back:

> Here’s what we would like to have soon, to better interoperate with the
rest of the Open Street Map ecosystem:

> import public osm data from a region into osm-p2p

> export osm-p2p edits back to public open street map

------
Bedon292
So this is focused on offline editing and sharing? Very cool, though I
initially thought it was going to be a p2p torrent style of keeping OSM data
synced across the internet.

~~~
substack
The focus is offline, but the underlying techniques work just as well across
the public internet. For example, you could use
[https://www.npmjs.com/package/webrtc-
swarm](https://www.npmjs.com/package/webrtc-swarm) to sync the hyperlogs:

    
    
        var wswarm = require('webrtc-swarm')
        var signalhub = require('signalhub')
        var swarm = wswarm(signalhub('p2p-map', ['https://signalhub.mafintosh.com']))
    
        var osm = require('osm-p2p')()
        swarm.on('peer', function (peer, id) {
          peer.pipe(osm.log.replicate()).pipe(peer)
        })

------
mynewtb
How are conflicts handled?

~~~
gmaclennan
If two or more peers edit the same record, it doesn't create a conflict;
instead the record simply has two versions in the database, like a fork in a
git repo. These can be merged at any time in the future, but until then both
versions continue to exist and replicate. For more about why we designed it
like this see:
[https://github.com/digidem/osm-p2p-db/blob/master/doc/archit...](https://github.com/digidem/osm-p2p-db/blob/master/doc/architecture.markdown#p2p-replication)
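
A minimal sketch of that fork-and-merge behaviour, using a hypothetical in-memory model (this is not the osm-p2p API, just an illustration of the version-DAG idea hyperlog is built on):

```javascript
// Hypothetical in-memory model of a record's version DAG: every version
// links to the versions it supersedes, and the "heads" (versions nothing
// links to) are the current values. Concurrent edits fork; a merge is
// just a new version linking to both heads.
function createRecord () {
  var versions = {} // versionId -> { value, links }
  return {
    put: function (id, value, links) {
      versions[id] = { value: value, links: links || [] }
    },
    heads: function () {
      var linked = {}
      Object.keys(versions).forEach(function (id) {
        versions[id].links.forEach(function (l) { linked[l] = true })
      })
      return Object.keys(versions).filter(function (id) { return !linked[id] })
    }
  }
}

var rec = createRecord()
rec.put('v1', { name: 'spring' })
// two peers edit v1 concurrently: the record forks, nothing is rejected
rec.put('v2a', { name: 'spring (dry)' }, ['v1'])
rec.put('v2b', { name: 'spring (flowing)' }, ['v1'])
console.log(rec.heads().sort()) // [ 'v2a', 'v2b' ] -- both forks coexist
// a later merge links to both heads, collapsing them into one
rec.put('v3', { name: 'spring (seasonal)' }, ['v2a', 'v2b'])
console.log(rec.heads()) // [ 'v3' ]
```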

~~~
pfraze
How similar is this to how CouchDB handles conflicts?

~~~
gmaclennan
Substack can probably give a better answer, but my understanding is that
CouchDB only holds version history since the last replication between
clients; conflicts then need to be resolved before replication can continue,
after which the version history is lost. With osm-p2p no data is ever
deleted, it is all just in the underlying hyperlog. It's more like git than
CouchDB: each record has a complete history, and can be forked and merged and
will continue to sync/replicate.

~~~
rakoo
As someone who's been playing with CouchDB and has become somewhat biased
towards it, I feel I need to make a small correction here. Long story short:
CouchDB can behave exactly like hyperlog if you want it to, or it can behave
as something that gives the user the best information it can to resolve
conflicts.

You can think of CouchDB not as a key->value store, nor even a key->document
store, but rather a key-> _tree of documents_ , where each node is the same
document at a different revision at some point in time. The root of the tree
is the initial revision, and the leaves are the latest revisions. If there is
only one leaf (because the others have been marked as deleted, which is not a
real deletion, just a marker that the leaf is "not a possible current
value"), then it is the correct value of the document; but if there are
multiple leaves, there is no single correct value, only multiple choices.
When you write a new revision, the parent node (it can be a leaf or an
internal node, it can even be a deleted leaf!) becomes an internal node with
a new child. Just like git. Except CouchDB doesn't give you any assurance
that you'll be able to retrieve the internal nodes' _content_ ; the only
nodes you're 100% sure to have access to are the leaf nodes. However, you
will have full access to _how you got where you are_ (even though you
usually don't need it, because the really important thing you're interested
in is the current possible values).

Because you still need to work, and you still need a revision to work on,
CouchDB gives you a way to automatically select one of those conflicting
revisions and pretend it's the correct one; but flip one bit in the query
(just add the parameter "conflicts=true") and it will give you all the
conflicting revisions so you, the user, can make a choice. You don't have to;
it will continue to work without it, but at some point you'd better clean the
db and clearly state what the truth is.
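
That automatic selection is deterministic, so every peer independently agrees on the same "winner". Roughly, CouchDB prefers the leaf with the longest revision path and breaks ties by comparing revision ids; the real rule lives in CouchDB's source, so treat this as an approximation:

```javascript
// Approximate CouchDB winner selection among conflicting leaf revisions:
// prefer the longest revision path, break ties by the lexicographically
// greater revision id. Deterministic, so no coordination is needed --
// but the "losing" revisions are still stored and replicated.
function pickWinner (leaves) {
  return leaves.slice().sort(function (a, b) {
    if (a.pathLength !== b.pathLength) return b.pathLength - a.pathLength
    return a.rev > b.rev ? -1 : 1
  })[0]
}

var leaves = [
  { rev: '2-b91b', pathLength: 2 },
  { rev: '2-a5c0', pathLength: 2 },
  { rev: '3-f002', pathLength: 3 }
]
console.log(pickWinner(leaves).rev) // '3-f002' -- longest path wins
```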

The other way to use it is quite common: create a unique id out of each and
every version of your object, and store them all in CouchDB. You'll have the
assurance that old versions won't be removed, but you'll have to tweak
CouchDB a bit to "group" relevant ids together (typically with a view). In
this usage each key will have a single node, which is both a root and a leaf
node.
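
A sketch of that second pattern, with a plain object standing in for CouchDB and a prefix scan standing in for a view (the `docId::versionId` key scheme is just an illustration):

```javascript
// Immutable version-per-key pattern: every write gets its own key, so
// nothing is ever overwritten; a "view" groups versions by document id.
var db = {}

function putVersion (docId, versionId, value) {
  db[docId + '::' + versionId] = value
}

// stand-in for a CouchDB view keyed on the document id
function versionsOf (docId) {
  var prefix = docId + '::'
  return Object.keys(db)
    .filter(function (k) { return k.indexOf(prefix) === 0 })
    .map(function (k) { return db[k] })
}

putVersion('node/42', 'v1', { lat: 1.0, lon: 2.0 })
putVersion('node/42', 'v2', { lat: 1.1, lon: 2.0 })
console.log(versionsOf('node/42').length) // 2 -- both versions retained
```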

Whichever way you want to use it, CouchDB's replication will make sure all
nodes end up on every peer, whether they are leaf nodes or internal nodes,
and whether the tree on a given peer is complete or not. The replication
happens in the background, in parallel; it doesn't block the user working on
their stuff, and isn't blocked by the user either.

CouchDB has unfortunately suffered from the misleading "revision" term, and
it was probably a bit too early in the game, but it pains me that such a
great DB is not considered more often as a viable alternative, not because it
lacks on the technical side, but probably because there isn't enough
information/blog posts about it.

~~~
substack
You can sort of do these operations with couchdb, but you've got to fight
against some bad assumptions for this use case of offline edits that could
span weeks or months. From
[https://wiki.apache.org/couchdb/Replication_and_conflicts](https://wiki.apache.org/couchdb/Replication_and_conflicts):

    
    
        With CouchDB, you can sometimes get hold of old revisions of a document.
        For example, if you fetch /db/bob?rev=v2b&revs_info=true you'll get a list
        of the previous revision ids which ended up with revision v2b. Doing the
        same for v2a you can find their common ancestor revision. However if the
        database has been compacted, the content of that document revision will
        have been lost. revs_info will still show that v1 was an ancestor, but
        report it as "missing".
    

That is a very dangerous feature for us. Also, the conflict avoidance
algorithm means that users can't work on branches in parallel like in git
because couchdb obstructs that workflow with 409 responses. This is a very bad
feature when you're collecting data in remote areas and replicating with other
databases and don't have the time, battery life, or the expert skills to
resolve a "conflict". A database should never have states where it rejects new
information.

~~~
rakoo
Note that, as stated, the _content_ of the historical revisions may be
missing, but the full lineage will still be there and will be replicated.
Actually what happens is that the "_id" field and the "_rev" field (the
revision number) will always be present, while the other fields may be
removed.

You are also never blocked by a 409. A 409 happens when CouchDB wants to help
you reduce the chance that a conflict happens, but you can bypass it and
create a conflict if you don't want to be bothered. In fact, that's how
replication works, and the only way conflicts do happen: when two dbs
replicate to each other, they may have different histories for the same
document, but both histories are sent to the other side so that each side can
have all the timelines. You have to use another, non-obvious endpoint for
that, see
[http://docs.couchdb.org/en/1.6.1/replication/conflicts.html?...](http://docs.couchdb.org/en/1.6.1/replication/conflicts.html?highlight=batch#conflicts-
in-batches):

    
    
      So this gives you a way to introduce conflicts within a single
      database instance. If you choose to do this instead of PUT, it means
      you don’t have to write any code for the possibility of 
      getting a 409 response, because you will never get one. Rather,
      you have to deal with conflicts appearing later in the database,
      which is what you’d have to do in a multi-master application anyway.
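
The endpoint in question is `_bulk_docs` with `new_edits: false`: the caller supplies documents with their existing revision ids and the target stores them verbatim, forks and all. A sketch of the payload, with no server involved (field names follow the CouchDB docs linked above; the example revision ids are made up):

```javascript
// Payload for POST /db/_bulk_docs with new_edits:false -- the target
// accepts these revision ids verbatim instead of generating new ones,
// which is how a conflicting branch is introduced without any 409.
function bulkDocsPayload (docs) {
  return {
    new_edits: false, // "trust my revision ids, don't invent your own"
    docs: docs
  }
}

var payload = bulkDocsPayload([
  // two leaves of the same document: a conflict, stored side by side
  { _id: 'bob', _rev: '2-aaa', name: 'Bob A' },
  { _id: 'bob', _rev: '2-bbb', name: 'Bob B' }
])
console.log(payload.new_edits) // false
```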

------
mbrock
I was looking at this stuff the other day after following links from the
"hyperlog" npm package. Kudos for this really great inspiring work.

