We ask for the commit we want and connect to a node with BitTorrent, but once connected we conduct this Smart Protocol negotiation in an overlay connection on top of the BitTorrent wire protocol, in what’s called a BitTorrent Extension. Then the remote node makes us a packfile and tells us the hash of that packfile, and then we start downloading that packfile from it and any other nodes who are seeding it using Standard BitTorrent.
The disadvantage here is that any given file in the repo could be stored in an ever-increasing number of packfiles. Each existing version of the repo will generate a new packfile to get it to the newest version, and it's up to the authenticated masters to generate and seed each of those packfiles, while peers either do or do not cache this replicated data. In short, this means of syndicating updates ignores the Merkle-DAG-ness (DAG-osity?) of Git.
The un-updateability of torrents is something that seems to seriously limit its use. There are a lot of interesting attempts to hack around this: LiveStreaming and Nightweb are two that spring to mind. https://www.tribler.org/StreamingExperiment/ https://sekao.net/nightweb/protocol.html
You can use it for git repos essentially out of the box by uploading your repo.
It is made of content addressed chunks which will get re-used on each re-upload.
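To make the chunk re-use concrete, here is a minimal sketch. It assumes fixed-size chunks and SHA-256 as the chunk id, and `upload`/`download` are hypothetical helpers, not the API of any particular storage system:

```python
import hashlib

CHUNK_SIZE = 4  # absurdly small, just to make the example readable


def upload(store: dict, data: bytes, chunk_size: int = CHUNK_SIZE):
    """Split data into fixed-size chunks and store each under its SHA-256.

    Returns (manifest, new_chunks): the ordered list of chunk ids and the
    number of chunks that were not already in the store.
    """
    manifest = []
    new_chunks = 0
    for offset in range(0, len(data), chunk_size):
        chunk = data[offset:offset + chunk_size]
        chunk_id = hashlib.sha256(chunk).hexdigest()
        if chunk_id not in store:
            store[chunk_id] = chunk
            new_chunks += 1
        manifest.append(chunk_id)
    return manifest, new_chunks


def download(store: dict, manifest: list) -> bytes:
    """Rebuild the original bytes from a manifest of chunk ids."""
    return b"".join(store[chunk_id] for chunk_id in manifest)


store = {}
manifest_v1, new_v1 = upload(store, b"aaaabbbbccccdddd")
manifest_v2, new_v2 = upload(store, b"aaaabbbbXXXXdddd")  # one chunk edited
```

The first upload writes four chunks and the second writes only the one that changed. With fixed-size chunks an insertion shifts every later chunk, so a real system would probably want content-defined chunk boundaries instead.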
Every IP owner will try to slow down the progress of these technologies, mainly by not adopting them. I think these technologies won't be adopted until the IP monetization problem is solved.
To me, this means that if the media are to be easily decentralized and shared, the monetization (how you get money for your work and how you give money when you benefit from it) must become equally easily decentralized and shared (between the publishers, authors, etc.).
I started building a toy version of this about 5 years ago but got distracted by work. Essentially the repo key encrypted the packfile, and the storage reliability layer used another key of its own to encrypt the chunks. The latter key would find the chunks, with enough reliability to re-create the encrypted packfile, which the former key could decrypt and apply to your repo.
A very fun problem in distributed systems and data structures.
(The one that I want most is the ability to rebalance on the fly, as storage-hosts become full.)
I am not involved with the development of torrents at all but (please bear with me until the end) my initial reaction is that we should think of the lack of ability of torrents to update as a feature and not as a bug.
Perhaps if the ability of torrents to update is a concern then it warrants a new peer to peer protocol? (Please note that this is not a case of http://xkcd.com/927/?cmpid=pscau as I am not advocating a new protocol for every use case)
It seems like we can sign torrent files with gpg keys. Perhaps I am wrong. Perhaps, we can allow updating in torrents if we require that the updates be signed with the same private key as the original torrent? Am I barking up the right tree here?
Edit: Oops. I edited this post before I saw the reply about BEP-0039. Apologies.
There is a new peer to peer protocol, it's even an IETF draft, it's called PPSP and is full of nice stuff:
Otherwise we need something else, which I hope to achieve in rakoshare (https://github.com/rakoo/rakoshare).
Edit: just realized I misread the question. No. In my case, the previous mame romset swarm is usually abandoned, and the new torrent takes all the traffic. The swarms are unique.
Simply signaling new Magnet URI's would have the disadvantage Gittorrent sought to avoid: resyndicating the entire contents with every single change: a killer for things like the Linux kernel or projects like Debtorrent. Git's merkle-dag avoids this problem, allows multiple concurrent versions to share the bulk of the content-indexing, and best-of-all-worlds solutions would preserve this capability.
So an example where this technology could be put to a unique use: Minecraft streamers and Let's Players sometimes like to distribute the world they are using. So they could make a repository of their world and distribute the read-only keys for it to other users. This would allow those users to play it, and even make temporary changes, since Sync kicks in and refreshes it, and the repository would be kept current as the world progresses. That should be viable right now with Sync.
Problem is, I think the bittorrent foundation is doing their damnedest to keep themselves firmly planted in the distribution and ownership of the technology. So we won't see an explosion of third party clients. I don't think it will see much adoption for this reason, and that's a real shame, because it would be a wonderful bit of kit for the internet.
The DAG-osity of git really helps here (because you only have to transfer what's really needed), and the "immutability" of git helps because if your project is popular and you update your branch, everyone will want to go from the old commit to the new commit, so everyone can share the diff between them directly.
"Thinking about 'meta' torrent file format."
Truly Meta - Meta:
There was a GSoC project in 2013 which did exactly this, using Freenet as decentralized storage backend with a Web of Trust for Spam resistant and updatable identities (note that in the gittorrent scheme once someone claimed a username, that username will stick there forever). It works and compared to GitTorrent it adds anonymity and upload-and-run.
A current article describing it is here: http://draketo.de/english/freenet/real-life-infocalypse
(it got referenced here, too: https://news.ycombinator.com/item?id=9562749 )
The GSoC project was done by Steve Dougherty: http://www.google-melange.com/gsoc/project/details/google/gs...
> I’d be happy to work on a project like this and make GitTorrent sit on top of it, so please let me know if you’re interested in helping with that.
Have a look at Gitocalypse: https://github.com/SeekingFor/gitocalypse
You're right, of course. This is just a first step.
One interesting followup idea might be that the BitTorrent library I'm using, webtorrent, also works in browsers over WebRTC. But I'm not using that because I wouldn't know what to do with a git cloned repo inside of a browser tab. Maybe someone else will though. :)
In comparison to decentralized Search and Community, decentralized file storage is easy. Conveniently, centralized repo hosting is the biggest problem. Not being able to Search / Comment / Report a bug during a DDOS decreases productivity, but not being able to push commits / run CI tools is a productivity halt.
The best next move might be to focus on decentralized repository hosting, solve that well, and allow users to conveniently mirror the GitTorrent repos on GitHub, giving the best of both worlds until Search and Community can also be solved well.
This may mean GitTorrent would need some form of post push hooks (i.e. to update mirrors or run CI). Which I'm sure is doable.
Obviously someone would need to build a user-friendly interface for all that, etc.
The owner of a project will keep a node always online to fight its own churn. There are no big files in the case of an issue tracker. This is not a video social network. Also, video social networks do work, e.g. private trackers; the only thing is that they use the web to publish magnet links, which can be done over a DHT (cf. PPSP and Tribler).
The post mentions using the blockchain for unique username registration and mapping to public key hashes, and as it turns out there's a project I and others have been working on that does exactly this called Blockstore.
Here's the link if anyone wants to check it out: https://github.com/namesystem/blockstore
The way it works is there's a mapping between a unique name and a hash in the blockchain, and then there's a mapping in a DHT from the hash to the data to be associated (which can be a plain old public key and can also be a JSON file that references a public key and other identity information).
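A rough sketch of that two-level lookup, with plain dicts standing in for the blockchain and the DHT. The function names and the choice of SHA-256 over a JSON record are assumptions for illustration, not Blockstore's actual API or hashing scheme:

```python
import hashlib
import json


def content_hash(data: bytes) -> str:
    """Hash used as the DHT key (SHA-256 here is an assumption)."""
    return hashlib.sha256(data).hexdigest()


def register(names: dict, dht: dict, name: str, record: dict) -> str:
    """Claim a unique name: name -> hash 'on chain', hash -> data in the DHT."""
    if name in names:
        raise ValueError(f"name already registered: {name}")
    data = json.dumps(record, sort_keys=True).encode()
    digest = content_hash(data)
    dht[digest] = data
    names[name] = digest
    return digest


def resolve(names: dict, dht: dict, name: str) -> dict:
    """Look the name up, fetch the data, and check it against the hash."""
    digest = names[name]
    data = dht[digest]
    if content_hash(data) != digest:
        raise ValueError("DHT data does not match the registered hash")
    return json.loads(data)


names, dht = {}, {}
register(names, dht, "alice", {"pubkey": "hypothetical-public-key"})
profile = resolve(names, dht, "alice")
```

The point of the split is that only the small, fixed-size hash needs to live in the blockchain, and whoever fetches the record from the DHT can verify it against that hash without trusting the DHT node that served it.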
That's great, thanks! I should just use this (preferably with the DHT I'm already using to look up Git commits) instead of reimplementing myself.
What do you think about the idea of making pluggable modules to connect Blockstore with web frameworks (Django, Rails), without the framework/website authors having to get involved in understanding Bitcoin themselves?
As far as Django goes, Blockstore is on PyPI so you can just install and import the library.
I think it'd be great to have modules for Rails, Node, etc.
I saw your tweet and I'll shoot you an email. Also feel free to open an issue on github.com/namesystem/blockstore and we can discuss the idea openly there.
Some Linux distributions have experimented with delta-based package repositories; examples are deltup for Arch Linux and rpm-delta for RPM-based distros. Some of the known issues are:
- choosing the number and spacing between deltas. Fine-grained deltas require more storage space, coarse-grained deltas require more download bandwidth.
- retiring old deltas: periodically deleting all deltas older than a certain version, replacing them with the full package of that version. Again a trade-off between storage space and download bandwidth.
For Git repositories, this would roughly translate to:
- choosing the number, history spacing, and size of the Git packs per repository.
- retiring old Git packs: periodically deleting Git packs older than a particular revision, replacing them with a bare repository at that revision.
* The contents of this directory are this list of files whose contents have these SHAs.
This is called a "tree".
A tree is itself an object with its own SHA, and can appear in another tree.
To see this for yourself, in any git repository run `git cat-file -p HEAD`. You'll see the (more or less) raw commit object for HEAD, which will point at a tree SHA. To see the contents of that tree, run `git cat-file -p <the tree SHA>`. That tree object has a one-to-one correspondence with what you'll see on disk in the objects directory (if the object has not been put in a pack file).
Above I have more or less fully described the contents of the files found in `.git/objects`.
The delta'ing doesn't happen until later, if and when packfiles are constructed. But they're just a storage/bandwidth optimisation. AFAICT, these deltas have nothing to do with what you might think of as "git diff", which is just some fancy porcelain which looks at objects.
The nice property of the construction is that given a large tree, even if nested, if you change a single file in that tree, you will only change as many trees as the file is deep in the tree, so computing changes between two nearby trees can usually be done quickly.
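Here is a small sketch of that property. It uses nested dicts as directories and a simplified listing format, not git's real object encoding, and `tree_hashes` is a made-up helper:

```python
import copy
import hashlib


def tree_hashes(root: dict) -> dict:
    """Map every directory path in `root` to a hash of its listing.

    Directories are dicts, files are bytes. A directory's hash covers the
    names and hashes of its entries, so it changes whenever a descendant does.
    """
    hashes = {}

    def walk(node: dict, path: str) -> str:
        entries = []
        for name in sorted(node):
            child = node[name]
            if isinstance(child, dict):
                entries.append(f"tree {walk(child, path + name + '/')} {name}")
            else:
                entries.append(f"blob {hashlib.sha1(child).hexdigest()} {name}")
        digest = hashlib.sha1("\n".join(entries).encode()).hexdigest()
        hashes[path] = digest
        return digest

    walk(root, "/")
    return hashes


old = {
    "README": b"hello",
    "docs": {"index.md": b"docs"},
    "src": {"main.c": b"int main(void);", "lib": {"a.c": b"a", "b.c": b"b"}},
}
new = copy.deepcopy(old)
new["src"]["lib"]["a.c"] = b"a, edited"  # one file, three levels deep

before, after = tree_hashes(old), tree_hashes(new)
changed = sorted(path for path in before if before[path] != after[path])
print(changed)  # ['/', '/src/', '/src/lib/']
```

Only the three trees on the path down to the edited file get new hashes; `/docs/` keeps its old one, so a comparison between the two versions can skip that whole subtree as soon as it sees the hashes match.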
Not using swarming brings back all of the old problems of NAT traversal, asymmetric upload/download bandwidth, throttling, censorship etc.
* Higher popularity => More peers => Higher probability that multiple peers want the same packfiles.
* Higher popularity => More commits => More permutations of packfiles => Lower probability that multiple peers want the same packfiles (and stronger trends toward small/inefficient packfiles).
* More frequent synchronizations (peers always online) => More immediacy => Smaller packfiles => Higher probability that multiple peers want the same packfiles.
* Less frequent synchronizations (peers go offline regularly) => Less immediacy => Bigger packfiles => Lower probability that multiple peers want the same packfiles.
It would be really interesting to see how these competing pressures play out (either by doing some math or randomized experiments).
If the main goal here is strictly decentralization (without concern for performance or availability[F1]), then one might look at swarming as a nice-to-have behavior which only happens in some favorable circumstances. However, by latching onto the "torrent" brand, I think you set up some expectations for swarming/performance/availability.
([F1] Availability: If Seed-1 recommends a certain packfile, then the only peer which is guaranteed to have that packfile is Seed-1 -- even if there are many seeds with a full git history. If Seed-1 goes offline while transmitting that packfile, how could a leech continue the download from Seed-2? The #seeds wouldn't intuitively describe the reliability of the swarm... unless one adds some special-case logic to recover from unresolvable packfiles.)
Could this be mitigated with some constraints on how peers delineate packfiles?
Like so many here, you have a single view of how bittorrent should be used, based on current filesharing practices, so you believe we need to map gittorrent to filesharing and have those packfiles be as static as possible in order to be shared at large.
You need to go back to the root of the problem, which is simple: there is a resource you're interested in, and instead of getting this resource from a single machine and clogging their DSL line, you want to get this resource from as many machines as possible to make better use of the network.
How does gittorrent work?
- The project owner commits and updates a special key in the DHT that says "for this repo, HEAD is currently at 5fbfea8de70ddc686dafdd24b690893f98eb9475"
- You're interested in said repo, so you query the DHT and you know that HEAD is at 5fbfea8de70ddc686dafdd24b690893f98eb9475
- Now you ask each peer who has 5fbfea8de70ddc686dafdd24b690893f98eb9475 for their content
- Each peer builds the diff packfile and sends it through bittorrent. Technically it's another swarm with another infohash, but you don't care; it's only ephemeral anyway. The real swarm is 5fbfea8de70ddc686dafdd24b690893f98eb9475.
Because of this, higher popularity will mean more peers in the swarm, whatever the actual packfile to be exchanged is. Bittorrent the way you know it is not used as-is, because there is information specific to gittorrent that helps make better use of it.
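The steps above can be sketched with an in-memory stand-in for the DHT. Everything here (the `Peer` class, `publish_head`, `find_seeders`, the `example-repo` key) is hypothetical and only meant to show that the swarm is keyed on the commit, not on a packfile:

```python
from dataclasses import dataclass, field


@dataclass
class Peer:
    name: str
    commits: set = field(default_factory=set)  # SHAs this peer can serve


def publish_head(dht: dict, repo_key: str, sha: str) -> None:
    """Owner side: point the repo's mutable DHT key at the new HEAD."""
    dht[repo_key] = sha


def find_seeders(dht: dict, peers: list, repo_key: str):
    """Client side: look up HEAD, then find every peer that has it."""
    head = dht[repo_key]
    return head, [p for p in peers if head in p.commits]


HEAD = "5fbfea8de70ddc686dafdd24b690893f98eb9475"
dht = {}
peers = [
    Peer("owner", {HEAD, "older-commit"}),
    Peer("mirror", {HEAD}),
    Peer("stale", {"older-commit"}),
]
publish_head(dht, "example-repo", HEAD)
head, seeders = find_seeders(dht, peers, "example-repo")
```

Both `owner` and `mirror` come back as seeders no matter which commit the client is starting from, which is why popularity helps even if every client ends up needing a slightly different packfile.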
Great comment, thank you. But I think the infohash should actually be shared; packfiles are pretty deterministic in practice. So you'd be getting the diff packfile from the person who just made it, and anyone else who already did.
(If I find packfile generation to not be deterministic enough, I think I'll switch to using a custom packfile generation that is always deterministic.)
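To illustrate why determinism matters for sharing a swarm, here is a toy sketch. It is not the real git packfile format, and a real BitTorrent infohash is computed over the torrent's info dictionary rather than the raw bytes; `deterministic_pack` and `swarm_id` are made-up names:

```python
import hashlib


def deterministic_pack(objects: dict) -> bytes:
    """Serialize objects (id -> bytes) in sorted id order.

    Sorting removes any dependence on the order in which a peer happened
    to collect the objects, so equal object sets give equal bytes.
    """
    parts = []
    for object_id in sorted(objects):
        body = objects[object_id]
        header = object_id.encode() + b" " + str(len(body)).encode() + b"\n"
        parts.append(header + body)
    return b"".join(parts)


def swarm_id(pack: bytes) -> str:
    """Stand-in for an infohash: SHA-1 over the pack bytes."""
    return hashlib.sha1(pack).hexdigest()


# Two peers hold the same objects but discovered them in a different order.
peer_a = {"obj1": b"alpha", "obj2": b"beta", "obj3": b"gamma"}
peer_b = {"obj3": b"gamma", "obj1": b"alpha", "obj2": b"beta"}

id_a = swarm_id(deterministic_pack(peer_a))
id_b = swarm_id(deterministic_pack(peer_b))
```

Both peers arrive at the same id, so a downloader can fetch pieces from either of them. Any nondeterminism in the real thing (compression settings, delta choices, object ordering) would break this and split the swarm.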
As much as we all hate DNS, having the ability to kick a squatter off someone's name is probably a good thing. Personally I don't think having a crazy hash for identification is a bad thing. Rather, what you need to do is just have some sort of reasonable personal contact book so you only have to deal with the crazy hash once (when you decide you want to remember that someone is who they said they were).
Domain names only ever run code in an ostensibly sandboxed vm.
If you want to cross reference the identity with an email address you could use a keyserver. If you want to cross reference the identity with a domain name, you could use a TXT record.
> We plan to transition to hosting it on IPFS as soon as DNS -> IPNS -> IPFS naming proves robust, so we can use nice looking URLs.
Of course that has the downside that large entities grab all the good names and the little people are left with the leftovers.
There is GitLab.
...Over the web...
I mean, Git and Linux are developed via email lists, which is a decentralized way of sending pull/merge requests, isn't it? I guess you could argue the mailing list is hosted on a server, fine. So then fine, usenet.
Yeah, email and nntp are old crufty technologies and there are obviously advantages to having a web based interface and a central place to go for a project. But git itself certainly supports a decentralized mechanism for merging.
So then the question really is, how do you decentralize a web interface, isn't it? The dvcs itself isn't the problem.
That should be possible via browser support for site-based URL handlers; for browsers without such support, you could also have a URL on the main GitLab site that people can register with to redirect to their "home" instance.
I have a mild bias against altcoins, and have heard bad things about Namecoin in particular: that the anti-spam incentives aren't good, leading to illegal files stored in the blockchain itself, and that there's no compact representation (like Bitcoin's Simplified Payment Verification) for determining whether a claimed name is valid without consulting a full history.
As I understand it, these two design flaws combine to mean that you have to store some very illegal files to use a namecoin resolver, which doesn't sound good to me. (I may be mistaken, since the bad things I heard about Namecoin came from Bitcoin people..)
I made that statement because a lot of people don't realize that Bitcoin, ignoring the fact that you can trade it, has solved a fundamental consensus problem in distributed systems that we should care about and use :)
Security-wise Namecoin is a weaker Blockchain, but I think in this case it's not that important. -- Are not the anti-spam measures hurting users as much as spammers in this case? Since you end up with higher-costs for what is otherwise practically free with a centralized service.
From the looks of it there is no reason that Bitcoin's Simplified Payment Verification wouldn't be usable with Namecoin either.
edit: Also this may be interesting to take a look at https://people.csail.mit.edu/nickolai/papers/vandenhooff-ver...
Especially when it has the side effect of inhibiting what otherwise might be a compelling solution.
Also: some plants are serious jerks.
This number however is:
0.0 1 10 11 100 101 110 111 1000 ...
Decentralized git pull for only a given hash isn't all that interesting?
If one could pull in updates and/or push changes -- that would be "decentralized git". This example is more "ipfs.io as a transport for git-releases", rather than "ipfs.io as a transport for git"?
Yeah! I was a Gitchain backer. The difference is that Gitchain stored the actual git commits in the blockchain, and I leave the actual commits on the hard disks of each BitTorrent seeder.
I mean, sorta. It was also because running a service is expensive, and containing abuse is a constant thankless treadmill.
> We’re quickly heading towards a single central service for all of the world’s source code.
Far from it? Not that a fully-decentralized system seems bad, but there are many things that aren't github. I don't even have anything of interest on github.
npm install gittorrent
git clone gittorrent://github.com/cjb/gittorrent
Or for true decentralization (doesn't get the wanted sha1 from github.com):
And plenty of people schooling me that git is already distributed (git yes, github no). I am happy that someone is working on this.
It is an implementation of a standard without the standard being defined so other implementations can spring up.
Raised issue: https://github.com/cjb/GitTorrent/issues/12
What I wanted I already got: confirmation by the OP that this is in fact a reference implementation and not the be-all and end-all.
Were that clear to start with, I wouldn't have commented other than to say, awesome! This is extremely exciting to me.
If someone builds on this, as discussed elsewhere in the thread, to make a decentralized service that mimics 'social' functionality such as issues and pull requests, I will strongly consider using it instead of GitHub (depending on the UI, stability, etc.).
I don't even have any particularly popular repos, so there is no real reason for anyone to care about the above, but, y'know, HN comments approving of the idea don't necessarily translate into actual interest in the product, so now you know there's at least one person in the latter category. :)
What's stopping a decentralized github from meeting the same fate as the newsgroups, where the data set gets too large to handle?!
I think that Github is more than just a repository, it's a community. I kinda quit Facebook and signed up to Github instead :P And if it weren't for Github I would have never touched git.
Startup idea: Create something like Github and assembly.com, but with a complete tool-set (git+vps)
I think this is a great piece to build a "pay for branch merging" way of promoting open source.. use decentralized currency and voila, automated programming.
So as in distributed databases, you either:
* need to acquire an exclusive lock on the repository metadata, or
* accept that your push will eventually be discarded because you did not have up-to-date metadata
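The second option is basically a compare-and-swap on the branch head. A minimal single-process sketch, with a hypothetical `RepoMetadata` class (in a real decentralized setup there is no single object like this to call, which is exactly the hard part):

```python
import threading


class RepoMetadata:
    """Holds the current head of one branch (hypothetical, in-memory)."""

    def __init__(self, head: str):
        self.head = head
        self._lock = threading.Lock()

    def push(self, expected_head: str, new_head: str) -> bool:
        """Accept the push only if the pusher saw the current head."""
        with self._lock:  # short critical section, not a long-held repo lock
            if self.head != expected_head:
                return False  # stale metadata: the push is discarded
            self.head = new_head
            return True


repo = RepoMetadata("c0")
seen_by_alice = repo.head
seen_by_bob = repo.head
alice_ok = repo.push(seen_by_alice, "c1-alice")
bob_ok = repo.push(seen_by_bob, "c1-bob")  # Bob's view is now stale
```

Alice's push lands and Bob's is rejected, so Bob has to fetch, rebase or merge, and try again. That is the same behaviour you get today when pushing a non-fast-forward to a central server.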
We have both electron and NW.js now, that should be easy.
It'd be interesting to see something like gitlab have some sort of federation support, where multiple instances can talk to one another (a la xmpp/smtp), so as to clone/send pull-requests, etc. across different instances.
> "imagine someone arguing that we can do without BitTorrent because we have FTP. We would not advocate replacing BitTorrent with FTP, and the suggestion doesn’t even make sense! First — there’s no index of which hosts have which files in FTP, so we wouldn’t know where to look for anything."
Actually there is. It's called a mirror list. Most FTP-based repositories support this.
> "And second — even if we knew who owned copies of the file we wanted, those computers aren’t going to be running an anonymous FTP server."
Except bittorrent does turn your client into a server. Many clients silently punch holes in your firewall via UPnP, so you don't always realise you're running a server, but it does still happen.
And as for anonymous FTP servers, it depends on what you mean there. If you mean anonymous access, then that's not only supported, but actually the norm. If you mean the server itself is anonymous, then it should be noted that neither github nor torrent seeding peers are anonymous either.
> "Just like Git, FTP doesn’t turn clients into servers in the way that a peer-to-peer protocol does. So that’s why Git isn’t already the decentralized GitHub — you don’t know where anything’s stored, and even if you did, those machines aren’t running Git servers that you’re allowed to talk to. I think we can fix that."
Hang on, a moment ago you _didn't_ want to run servers. Now you're complaining that git clones aren't servers?
Then there's the matter of the github competitors, of which there are many. gitlab, gitbucket, etc. Some open source, some closed but free, but all of them largely offer the same features as github.
These days it seems trendy to use bittorrent as a bootstrap for all kinds of wacky and wonderful problems, but using bittorrent for a protocol that's already distributed and already pretty saturated with github competitors; well it just seems redundant.
His argument may have some holes, and people are probably mostly ignorant of those holes so they can't critique them, but I don't think they're in awe of the idea because their awe hasn't been dispelled by identifying those holes.
This is a very neat idea, and none of the issues with the setup argument dispel that.
"using bittorrent for a protocol that's already distributed and already pretty saturated with github competitors; well it just seems redundant."
No, no, it doesn't seem redundant. DHT-based distributed indexing is so incredibly, fundamentally different from a mirror list for files in FTP or from a series of GitHub clones. It's owner-agnostic. It just exists, by virtue of having participants, with no overhead. I don't have to select a target host or find a server or even identify where my particular file (or git repository, or username, or whatever) lives. It's .. unification. It's elegant and reduces complexity and makes the whole ecosystem simpler.
Maybe I'm blinded by my own awe, but, I love this idea.
Also, yes, some people have thought of components of this before, but I haven't really seen the full stack laid out vertically like this, combined with a narrative that makes me so excited about it.
There's a few more concerns I have:
1) The whole point of git is versioning; a model like this makes versioning several orders of magnitude harder.
2) If I'm pulling from repositories to install on servers, I'd rather grab them from known trusted sources rather than "the anonymous ether"
I'm normally really receptive to new distribution models, so I don't mean to be negative for negative's sake. But I'm struggling to see the practical upsides of this.
Generating a file that hashes to an existing hash is called a Preimage attack, and SHA-1 (the algorithm used by bittorrent) isn't, for now and as far as we know, vulnerable to any.
So it's one of those situations where everyone was right: it's so impractical to exploit that it's as good as not vulnerable even though it's mathematically possible.
An argument could be made that git ought to be augmented with something like this: why run distributed protocol A on distributed protocol B; maybe we should just run distributed protocol AB from the get-go?