Mercurial 2.0 has added largefiles extension (older revisions are downloaded on demand) (selenic.com)
66 points by ognyankulev on Nov 2, 2011 | 53 comments



This could actually give Mercurial a big edge over Git for development environments where large binary files are a core part of your workflow - like game development. Products like Perforce are a big hit in games precisely because they are really good at handling this specific class of file.

It's a shame, because I hate using Mercurial, but this would give me a very strong reason to use it for my game projects instead of Git.


Why do you hate using Mercurial? Do you like to type things like this:

  git fetch <project-to-union-merge>
  GIT_INDEX_FILE=.git/tmp-index git-read-tree FETCH_HEAD
  GIT_INDEX_FILE=.git/tmp-index git-checkout-cache -a -u
  git-update-cache --add -- $(GIT_INDEX_FILE=.git/tmp-index git-ls-files)
  cp .git/FETCH_HEAD .git/MERGE_HEAD
  git commit
instead of this:

  hg pull --force <project-to-union-merge>
  hg merge
  hg commit
Mercurial "just works", and its commands are less arcane.

To be fair, git is now much easier to use. But also to be fair, mercurial has become much more powerful. In mercurial, doing straightforward things is simple, and doing complicated things is more complex, which is the way it should be, IMHO.

It even has rebase, although one might argue that is not a great differentiating feature for a REPOSITORY.
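
For the curious, a minimal sketch of the rebase workflow (the revision number is made up):

  # rebase ships with Mercurial but is off by default; enable it in ~/.hgrc:
  #   [extensions]
  #   rebase =

  # move changeset 42 and its descendants onto the tip of the default branch
  hg rebase -s 42 -d default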


> You like to type things like this:

Why would I have to type that? I have been using git for 4 years and have never had to type git-update-cache --add or copy .git/FETCH_HEAD. Are you spreading FUD?


It's an edge case, but still true... Not something you'd do more than once a year probably ;)


What does that even do? I've never had to do anything remotely that complicated with git.

Try using patch queues for a while and tell me that Mercurial "just works" and isn't arcane. Nonsense.


mq is pretty much the same as stg in my experience. Is there any big difference?


Union-merging is insane, and thus should require some equally insane understanding of what you're doing before you can do it. Anyone that's merging two unrelated repositories should understand what's going on under the hood anyways.

What you're doing in git actually makes sense: since a normal pull is just fetch+checkout+merge+commit, you have to go into the plumbing and trick every step into fetching, checking out, and merging using the wrong repositories. In the future, this could be made porcelain as it has been in hg, but do people really want to do this? It's much safer and saner to use submodules for 99% of use cases.
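
For the common case, submodules look something like this (a sketch; the URL and path are made up):

  # pin a second repository at a fixed revision inside this one
  git submodule add git://example.com/assets.git assets
  git commit -m "Add assets submodule"

  # in a fresh clone, fetch the submodule contents
  git submodule update --init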


I still think people prefer git to mercurial these days (if we ignore github) for the same reason they prefer vim: because real programmers use butterflies. http://xkcd.com/378/



What in particular do you hate about Mercurial? I'm curious as a contributor to hg if there's anything in particular that might be fixable (obviously there are some sacred cows, but many things can be fixed just by turning on an extension).


There are lots of little usability issues I hate, but it's not really fair to call those the reason, since git has similar usability issues.

It really all comes down to patch queues: They're awful. They're so awful that they make me dread writing code and make me wish I was using Subversion or CVS. Any time I need to move changes that aren't ready for trunk yet from one machine to another I can expect to spend hours figuring out what went wrong with mq this time and probably lose some work.

Exporting/importing patches, etc - none of it has ever worked right for me on the first try and I've never found documentation that explains a way to use mq without pain and suffering.

In comparison, importing/exporting patches with git is a snap.
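
For anyone unfamiliar, that git flow is roughly (a sketch):

  # export the last three commits as mailbox-format patch files
  git format-patch -3

  # apply them in another clone, preserving authorship and messages
  git am 0001-*.patch 0002-*.patch 0003-*.patch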

The lack of a good equivalent to git's rebase -i is also unfortunate, but not really a showstopper. Hell, maybe mercurial has an equivalent by now.


You don't have to use mq ever. I don't really use it anymore. I still have it enabled for the 'strip' subcommand (so I can discard revisions if I really need to), but I don't even use that. Instead, it's all rebase and histedit, with a sprinkling of bookmarks so I can keep track of many lightweight branches at once.

Importing and exporting patches with hg is easy: http://selenic.com/hg/help/export http://selenic.com/hg/help/import (those links are the same output as 'hg help $topic')
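
In practice that's just (a sketch; the revision number and filename are made up):

  # write changeset 42 out as a patch, with full metadata
  hg export -r 42 > fix.patch

  # apply it in another repository as a new changeset
  hg import fix.patch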

As for rebase -i, you want http://mercurial.selenic.com/wiki/HisteditExtension.
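
Roughly (a sketch; histedit is currently an out-of-tree extension, and the revision number is made up):

  # after installing the extension, enable it in ~/.hgrc:
  #   [extensions]
  #   histedit = /path/to/histedit.py

  # interactively reorder, fold, or drop everything since revision 42
  hg histedit 42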


> want http://mercurial.selenic.com/wiki/HisteditExtension.

That is another problem -- extensions. You go and read how to do something in Mercurial, try it on your machine, and it doesn't work. After a while you realize you have to enable an extension. Why not just enable all the useful extensions?


For one: Because many, many features have sharp edges I don't want exposed to newbies. 'git reset --hard HEAD^' for example.

Another: anything that's in hg proper has really strict backwards compatibility promises. Extensions let us screw around with a concept until the UI is _right_.


Makes sense.

Maybe it is already implemented, but I imagine it would be possible to let the user know that they are attempting to use an extension that is disabled, and to suggest how to enable it.

Something like how Ubuntu's command-not-found package works: if I type a command that is missing but installable via apt, a suggestion is printed, "It looks like you are trying to use this command; it can be installed via apt-get ..."


Already done (for extensions that ship with hg):

  % hg fetch
  hg: unknown command 'fetch'
  'fetch' is provided by the following extension:
  
      fetch  pull, update and merge in one command
  
  use "hg help extensions" for information on enabling extensions
but yeah, the story for out-of-tree extensions could probably be better. I'll mull that over (and also try to upstream some of my extensions again).


That is great! Thanks. I might have to take a second look at hg.


Use branches if you want to have changes that aren't ready for stable.

I have no understanding of why anyone would want to throw patches around instead of simply using branches.
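
For moving work-in-progress between machines, that can be as simple as (a sketch; the bookmark name and path are made up):

  # mark the work-in-progress head with a lightweight bookmark
  hg bookmark wip-parser

  # push the bookmark and its changesets to another machine
  hg push -B wip-parser ssh://laptop//home/me/repo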


Have you looked at git-annex?

http://git-annex.branchable.com/


Yeah, and web development. 4 gig repositories are fairly common on big web development projects with big graphics files taking up a lot of space.


What don't you like about Mercurial?


I'm the developer of git-annex (http://git-annex.branchable.com/), which is AFAIK the closest equivalent for git. I only learned about the Mercurial bfiles extension (which became the largefiles extension) after designing git-annex.

The designs are obviously similar at a high level, but one important difference is that git-annex tracks, in a fully distributed manner, which git repositories currently contain the content of a particular large file. The mercurial extension is, AFAIK, rather more centralized; while it can transfer large file content from multiple stores it can't, for example, transfer a large file from a nearby client that happens to currently have a copy, which git-annex can do (if a remote is set up). This location tracking also allows me to have offline archival disks whose content is tracked with git-annex. If I ask for an archived file, git-annex knows which disks I can put online to retrieve it.

Another difference is that the mercurial extension always makes available all the large files for the currently checked out tree. git-annex allows a tree to be checked out with large files not present (they appear as broken symlinks); you can ask it to populate the tree and it retrieves the files as a separate step. This is both more complex and more flexible. For example, I have a git repository containing a few terabytes of data. It's checked out on my laptop's 30 gb SSD. Only the files I'm currently using are present on my laptop, but I can still manage all the other files, reorganizing them, requesting ones I need, etc.
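
Concretely, that workflow looks something like (a sketch; the paths are made up):

  # fetch the content of one directory onto this machine
  git annex get photos/2010/

  # free the local space again; annex refuses to drop the last remaining copy
  git annex drop photos/2010/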

git-annex also has support for special remotes, which are not git repositories, but in which large files are stored. So large files can be stored in Amazon S3 (or the Internet Archive S3), in a bup repository, or downloaded from arbitrary urls on the web.
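
Setting one up is roughly (a sketch; the remote name and bucket are made up, and the S3 credentials come from the environment):

  # create an S3 special remote (credentials via AWS_ACCESS_KEY_ID etc.)
  git annex initremote mys3 type=S3 encryption=shared bucket=my-annex-bucket

  # copy a large file's content into it
  git annex copy img-0124.png --to mys3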

Content in special remotes is tracked the same as in other remotes. This lets me do things like this (the first file is one of my Grandfather's engineering drawings of Panama Canal locks):

  joey@gnu:~/lib/big/raw/eckberg_panama>git annex whereis img-0124.png
  whereis img-0124.png (5 copies)
      5863d8c0-d9a9-11df-adb2-af51e6559a49 -- turtle (turtle internal drive)
      7e55d8d0-81ab-11e0-acc9-bfb671110037 -- archive-panama (internet archive http://www.archive.org/details/panama-canal-lock-design-papers)
      905a3a64-4149-11e0-8b3f-97b9501cdcd3 -- passport (passport usb drive 1 terabyte)
      9b22e786-dff4-11df-8b4c-731a6178061c -- archive-leech (archive-6 sata drive)
      f4c185e2-da3e-11df-a198-e70f2c123f40 -- archive (archive-5 sata drive)
  ok
  joey@gnu:~/lib/big/raw/eckberg_panama>git annex get img-0124.png --from archive-panama
  get img-0124.png (from archive-panama...) ok
I'm hopeful that git will grow some internal hooks for managing large files that will improve git-annex and also allow others to develop extensions that, perhaps, behave more like the mercurial largefiles extension. I recently attended the GitTogether and this stuff was a major topic of discussion.


Does git-annex work on Windows? If not, do you have plans to port it? It's an important problem, and it'd suck to have a solution that doesn't work on all major platforms.


Not so far. Discussion is here: http://git-annex.branchable.com/install/#comment-4637ce9b32a...

It's fine on OSX though.


Have you seen https://github.com/apenwarr/bup? "file backup system based on the git packfile format. Capable of doing fast incremental backups of virtual machine images."


Sure, git-annex can use bup as a special remote. This way you get bup's nice properties of storing big stuff in git with binary deltas, with git-annex's nice properties of a normal-looking file in a git clone.
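
Configuring that is much like any other special remote (a sketch; the host and path are made up):

  # store annexed content in a bup repository on another host
  git annex initremote mybup type=bup buprepo=example.com:/big/bup
  git annex copy bigfile.mov --to mybup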


Another option available since Mercurial 1.5 is to put the large files in a subversion repository and reference it as a subrepository.

http://mercurial.selenic.com/wiki/Subrepository#SVN_subrepos...
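
Wiring that up takes an .hgsub file at the repository root mapping a working-directory path to the external SVN repository (the path and URL are made up):

  # contents of .hgsub
  assets = [svn]https://svn.example.com/project/assets

  # then record it
  hg add .hgsub
  hg commit -m "Add assets subrepo"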


That option works fine, provided that all of your binary assets sit at one location in the tree, and that you happen to have a Subversion server lying around. largefiles allows you to put your assets wherever you want, and lets you avoid setting up a Subversion server.


I don't get what it does. Does this extension make large binary files "diffable," as it states that's the problem it solves?


Binary files are already diffable, both in how they're stored (in fact, the only thing Mercurial stores internally is binary diffs), and in terms of sending around patches (that's what the Git patch format is for).

There are two problems that largefiles tries to solve: first, that while binary files are technically diffable, most of the popular ones store large amounts of compressed data, which means that their diffs are insanely poor. Combine that with the second problem, which is that distributed version control systems tend to include the entire history in every repo, and you've got a recipe for disaster: those 200 MB worth of textures that you just color-corrected are now going to be another 200 MB of data that every last developer needs to get whenever they attempt to fetch your repository.

largefiles solves this by saying that certain user-designated files are not actually stored in the repository. In their place, it stores stand-ins: one-line text files containing the SHA-1 hash of the file they represent. Whenever you update (checkout, in Git parlance) to a given revision, largefiles fetches your missing files on demand, either from the central store or (if available) from a per-user cache.

The benefit of this approach is that, if you just want the newest revision, you don't have to also fetch all the historical versions of all the assets. The downside is that a clone doesn't, by default, have the full, reconstructable history of the entire repository. Whether this trade-off works for you will largely depend on who you are and what your workflow is, but we've found many Kiln customers who find it to be an excellent trade-off.
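
Day to day, that looks roughly like this (a sketch; the filename is made up):

  # enable the bundled extension in ~/.hgrc:
  #   [extensions]
  #   largefiles =

  # track a binary as a largefile instead of a normal revlog entry
  hg add --large textures/terrain.psd
  hg commit -m "Color-correct terrain textures"

  # on update, only the large files needed for that revision are fetched
  hg update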


That's exactly what it doesn't do. It versions the checksum referring to a largefile along with the rest of your repository, meaning it's little more than sugar for a fancy "network symlink".

However this should be enough in most circumstances, e.g. to allow representing the complete state of a Debian package repository, without causing Mercurial to slow to a crawl. (Note I just made this use case up, it's just an example)


I wonder whether Kiln will end up using this.


Of course! largefiles is a direct descendant of kbfiles (our initial, Kiln-specific version of this functionality). We are really happy to see it integrated into the official Mercurial release, and will be supporting it within the next couple of weeks. We're just working on making sure that the switch is painless and transparent for everyone who's currently using kbfiles.


/me wonders when Mercurial will ever do anything other than copy BitKeeper. We've been doing this for years, my photos are in a ~100GB BK/BAM repo.

Release notes for BitKeeper version 4.1 (released 12-Oct-2007)

Major features

BAM support. BAM stands for "Binary Asset Management" and it adds support to BK for versioning large binaries. It solves two problems: a) one or more binary files that are frequently changed. b) collections of many large binaries where you only need a subset. The way it solves this is to introduce the concept of BAM server[s]. A BAM server manages a collection of binaries for one or more BAM clients. BAM clients may have no data present; when it is needed the data is fetched from the BAM server.

In the first case above, only the tip will be fetched. Imagine that you have 100 deltas, each 10MB in size. The history is 1GB but you only need 10MB in your clone.

In the second case, imagine that you have thousands of game assets distributed across multiple directories. You typically work only in one directory at a time. You will only need to fetch the subset of files that you need, the rest of the repository will have the history of what changed but no data (so bk log will work but bk cat will have to go fetch the data).


To really copy BitKeeper Mercurial would need to start charging for use and come out in Basic, Pro and Enterprise editions, with things like BAM support disabled at the lowest level. Fortunately they don't do this.


It's easier to copy than it is to invent stuff, sad but true. hg has a long history of doing illegal copies of BK tech, we could have sued them out of existence years ago. Imitation is the sincerest form of flattery, so I guess we should be flattered :)

Legalities aside, the point I've made for about a decade now is that it would be interesting to see a release announcement from hg where I went "That's cool! Why didn't we think of that?"

As for the 3 levels of product, um, when you build commercial products with commercial support, you don't do that? You really want only one offering? That doesn't play well in the commercial world, we tried that. If you are just taking a dig at commercial software, sorry about that, but we have to pay for dev somehow. I'd love a way to open source the thing and make money, haven't found it.


hg has a long history of doing illegal copies of BK tech

Huh? Are you claiming that Mercurial stole code from BitKeeper? Or does BitKeeper have patents on these fairly obvious concepts?

I mean, a large file store on a centralized server isn't exactly a novel concept. Other than that you don't provide any examples of "illegal copies".


It's a fairly long and pretty sordid story that ended with a certain hacker's employer sitting down with us and saying "We're not admitting that he did it, but just hypothetically speaking, suppose he did. What do you want?"

And we said "we want him to stop his illegal activity". And he finally did and we dropped it; we're not in the lawyering business. We could have made a pretty big stink about it, it's not one of open source's finer moments, but all we wanted was a level playing field, we got that, we moved on.


I'm guessing you're referring to this:

http://lwn.net/Articles/153990/

I remember that, long ago; I don't think there's anything wrong with not wanting users of your product to leverage it to build a replacement, nor with disallowing it in the license.

I think where people (including myself) have a problem, is claiming that mercurial somehow has ripped off something from BK. From vague memory, BK laid claim to using a DAG for a DVCS? I assume you must also think that if someone is first to implement something that can be trivially found in an introductory computer science text, they have a patentable claim?

Ironic that "largefiles" is a user contributed extension, and that only mercurial seems to be guilty of "copying," but git must be so different that it's unworthy of being sued?


This is a very serious allegation. It would be nice if you substantiated your claims. One could be tempted to write off unsubstantiated claims otherwise.

Hg 0.1 was released around 6 years ago, not a decade ago.

http://lkml.indiana.edu/hypermail/linux/kernel/0504.2/0670.h...


I'd be happy to do so if you can show me an outcome that is anything other than bad PR for us and an open source hacker looking like a jerk. We looked at this hard, and for us the least bad outcome was just to get the guy to stop and move on. Anything else was going to be like this thread: everyone saying it's not true and then, if they ever believed it was true, they'd still be pissed at us.

If some credible person in the open source community wants to talk to us about it, look at the evidence, and confirm the facts and relay that back without naming names, that's fine with me.


I think you've already crossed the bad PR threshold, at least on HN.

Substantiating your claims can only make your position more legitimate. It's hard to take this seriously based on hearsay.


I'll think about it. It's not an easy choice; the guy in question is someone I liked a lot, I tried to hire him, and he's a good guy other than this one issue. As much as I'd like to show you all that I'm right, I'm not sure that dragging someone through the mud is worth it.

I'll think on it. Thanks for the suggestion.


> we could have sued them out of existence years ago

Really? But you didn't, out of the goodness of your heart, right?


After a bit of digging, I learned that BitKeeper has, as part of its EULA, a provision that disallows its users from contributing to other source control projects[1]. Larry McVoy (aka luckydude) has actually tried to enforce the EULA by contacting users' employers to get them to stop contributing, but my guess is that he never took it to court because, well, he would be laughed right back out again.

As for Mercurial, one of the early developers, Bryan O'Sullivan, apparently worked for a company that used BK, and McVoy told him that he had to stop contributing[2] to the project, which he did[3]. O'Sullivan is now showing up in the commit history again, which I presume means he's no longer working for a company that uses BK.

[1] http://yro.slashdot.org/story/02/10/06/0518220/BitKeeper-EUL...

[2] http://en.wikipedia.org/wiki/Bitkeeper#Pricing_change

[3] http://article.gmane.org/gmane.comp.version-control.mercuria...


We didn't see the upside of suing, and we could sure as heck see the PR downside.

What would you have done? You've got a well-known hacker whom all of the open source guys will side with, so you lose the PR battle, but the guy is ripping off your technology, and his employer basically admitted that in front of your lawyer. What would you do?


Yeah, not sure, it's not an easy problem. Maybe go public? In the open source world, reputation is almost everything, and honesty figures heavily in that. If you have good proof, their reputation would be ruined.

I personally don't care if this was RMS himself or Torvalds, if they are copying code from others as their own, I lose respect for them and will refuse to use or promote their projects.

Maybe present it in an exploratory kind of blog post -- "So yeah we found this out and we don't know what to do, what does the community think?" Just make it public.


This is either him reminding everyone of the hissy-fit Bitkeeper threw when Tridgell telnet'd into a bitkeeper server and typed HELP, or the hissy-fit Bitkeeper threw when Bryan O'Sullivan dared to contribute to Mercurial while his employer held a license for Bitkeeper.

Either way, the reminder that Bitkeeper's licensing terms are so utterly ridiculous that either of the above cases were considered in any way, shape, or form "nefarious" only strikes me as a good way to scare away even more potential customers from their product.


Tridge is not exactly telling the whole story. That telnet thing is quite true, but the part he left out is that Linus was at his house running BK commands while Tridge was snooping the network to figure out the protocol. There isn't any chance that Tridge figured out what to do by telnetting to bkbits, and I'll back that up with a $10K challenge for anyone to write the same code Tridge wrote, in the same time frame, with the only resource being telnet to bkbits.net.

Go talk to Linus and see what he says about all this, don't take my word for it.


People don't take your word for it though, so it's much ado about nothing. The burden of proof is on you, not anyone else ("Extraordinary claims require extraordinary evidence") hence no need to "go talk to Linus," because your claims are so highly suspect.

Because, really, nobody thinks BK contains some kind of rocket science for anyone to rip off in the first place. If there is, IMNSHO I don't see it in the mercurial or git sources, which are readily available to anyone. Hence nobody takes seriously what you, or even a supposed "employer" or "lawyer", says either (and what your lawyer would say is even more highly suspect :)).

And if all this supposed restraint is due to some kind of self-interested, game theoretic calculation as you imply, why does that same thinking not restrain you from touting unverifiable claims? It's only generating "bad PR" for you which you wish to avoid, makes you look bad, and by extension your claims ever more doubtful. That can't be good for business.


Alright luckydude, moment of truth. Does your EULA say what another poster here quoted -- that it prevents the employees of the company from contributing to Open Source projects?


Wow, I sure hope that's not what people think.

Here's the scoop. We kinda invented the whole distributed source management thing; go to google groups (if they are still there) and do a date search for changeset before 1998. Now do one today. That's all us. (In case they aren't still there: there were something like 16 hits for changeset before us, and now there are zillions.) Is that proof we invented it? I dunno, it's something. We certainly raised awareness that Subversion, which started when we started, wasn't the way to go.

What our EULA said, and says, is that while you are using our system you can't contribute to an open source system that competes with us. If you want to work on some other open source project, you are fine. When you stop using our system, have the big fun, but while using it, no copying our stuff to some open source clone. Yeah, I know that makes many (all?) of you insanely pissed off. But we invented this stuff, we've been ahead of the curve a bunch of times, and open source is always right there ready to "rewrite" whatever we invent. Is it really so unreasonable that we don't want to hand the next cool thing (in source management, right, that's cool :) to people and say "here ya go, have at it, copy it"?

We don't think so, because we spent tons of time and money figuring out the right answer. It was a lot of work, and just like you, we'd like to reap some benefits from our work.


This isn't a new idea. Subversion has been supporting large binary files since before 2007.



