
Mercurial 2.0 has added the largefiles extension (older revisions are downloaded on demand) - ognyankulev
http://mercurial.selenic.com/wiki/LargefilesExtension
======
kevingadd
This could actually give Mercurial a big edge over Git for development
environments where large binary files are a core part of your workflow - like
game development. Products like Perforce are a big hit in games precisely
because they are really good at handling this specific class of file.

It's a shame, because I hate using Mercurial, but this would give me a very
strong reason to use it for my game projects instead of Git.

~~~
EGreg
Why do you hate using mercurial? You like to type things like this:

    
    
      git fetch <project-to-union-merge>
      GIT_INDEX_FILE=.git/tmp-index git-read-tree FETCH_HEAD
      GIT_INDEX_FILE=.git/tmp-index git-checkout-cache -a -u
      git-update-cache --add -- $(GIT_INDEX_FILE=.git/tmp-index git-ls-files)
      cp .git/FETCH_HEAD .git/MERGE_HEAD
      git commit
    

instead of this:

    
    
      hg pull --force <project-to-union-merge>
      hg merge
      hg commit
    

Mercurial "just works", and its commands are less arcane.

To be fair, git is now much easier to use. But also to be fair, mercurial has
become much more powerful. In mercurial, doing straightforward things is
simple, and doing complicated things is more complex, which is the way it
should be IMHO

It even has rebase, although one might argue that is not a great
differentiating feature for a REPOSITORY
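
For what it's worth, rebase ships as a bundled extension you enable in your
hgrc; a minimal sketch:

    
    
      $ cat >> ~/.hgrc <<EOF
      [extensions]
      rebase =
      EOF
      $ hg rebase -s <rev> -d default   # move a changeset (and its descendants) onto the default branch head
    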

~~~
rdtsc
> You like to type things like this:

Why would I have to type that? I have been using git for 4 years and never had
to type git-update-cache --add or copy .git/FETCH_HEAD. Are you spreading
FUD?

~~~
viraptor
It's an edge case, but still true... Not something you'd do more than once a
year probably ;)

------
joeyh
I'm the developer of [git-annex](<http://git-annex.branchable.com/>) which is
AFAIK the closest equivalent for git. I only learned about the mercurial bfiles
extension (which became the large files extension) after designing git-annex.

The designs are obviously similar at a high level, but one important
difference is that git-annex tracks, in a fully distributed manner, which git
repositories currently contain the content of a particular large file. The
mercurial extension is, AFAIK, rather more centralized; while it can transfer
large file content from multiple stores, it can't, for example, transfer a
large file from a nearby client that happens to currently have a copy, which
git-annex can do (if a remote is set up). This location tracking also allows
me to have offline archival disks whose content is tracked with git-annex. If
I ask for an archived file, git-annex knows which disks I can put online to
retrieve it.

Another difference is that the mercurial extension always makes available
_all_ the large files for the currently checked out tree. git-annex allows a
tree to be checked out with large files not present (they appear as broken
symlinks); you can ask it to populate the tree and it retrieves the files as a
separate step. This is both more complex and more flexible. For example, I
have a git repository containing a few terabytes of data. It's checked out on
my laptop's 30 GB SSD. Only the files I'm currently using are present on my
laptop, but I can still _manage_ all the other files, reorganizing them,
requesting ones I need, etc.
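
For instance, freeing and later re-fetching one file's content looks roughly
like this (a sketch; the path is made up, and `drop` refuses to remove a copy
unless it can verify another copy exists, or is forced):

    
    
      $ git annex drop video/raw-footage.mov    # free local space; annex verifies another copy exists
      drop video/raw-footage.mov ok
      $ git annex get video/raw-footage.mov     # fetch the content back from an available remote
      get video/raw-footage.mov (from origin...) ok
    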

git-annex also has support for special remotes, which are not git
repositories, but in which large files are stored. So large files can be
stored in Amazon S3 (or the Internet Archive S3), in a bup repository, or
downloaded from arbitrary urls on the web.
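
Setting one up looks roughly like this (a sketch; the remote name, bucket, and
URL are made up):

    
    
      $ git annex initremote mys3 type=S3 encryption=none bucket=my-annex-bucket
      initremote mys3 ok
      $ git annex copy bigfile.iso --to mys3      # store the large file's content in S3
      $ git annex addurl http://example.com/pub/dataset.tar.gz    # track a file by its web url
    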

Content in special remotes is tracked the same as in other remotes. This lets me
do things like this (the first file is one of my Grandfather's engineering
drawings of Panama Canal locks):

    
    
      joey@gnu:~/lib/big/raw/eckberg_panama>git annex whereis img-0124.png
      whereis img-0124.png (5 copies)
        5863d8c0-d9a9-11df-adb2-af51e6559a49 -- turtle (turtle internal drive)
        7e55d8d0-81ab-11e0-acc9-bfb671110037 -- archive-panama (internet archive http://www.archive.org/details/panama-canal-lock-design-papers)
        905a3a64-4149-11e0-8b3f-97b9501cdcd3 -- passport (passport usb drive 1 terabyte)
        9b22e786-dff4-11df-8b4c-731a6178061c -- archive-leech (archive-6 sata drive)
        f4c185e2-da3e-11df-a198-e70f2c123f40 -- archive (archive-5 sata drive)
      ok
      joey@gnu:~/lib/big/raw/eckberg_panama>git annex get img-0124.png --from archive-panama
      get img-0124.png (from archive-panama...) ok
    

I'm hopeful that git will grow some internal hooks for managing large files
that will improve git-annex and also allow others to develop extensions that,
perhaps, behave more like the mercurial largefiles extension. I recently
attended the GitTogether and this stuff was a major topic of discussion.

~~~
baq
does git-annex work on windows? if not, do you have plans to port it? it's an
important problem and it'd suck to have a solution which doesn't work on all
major platforms.

~~~
joeyh
Not so far. Discussion is here: [http://git-annex.branchable.com/install/#comment-4637ce9b32a...](http://git-annex.branchable.com/install/#comment-4637ce9b32abecf6ebf94c75f907f351)

It's fine on OSX though.

------
wcoenen
Another option available since Mercurial 1.5 is to put the large files in a
subversion repository and reference it as a subrepository.

[http://mercurial.selenic.com/wiki/Subrepository#SVN_subrepos...](http://mercurial.selenic.com/wiki/Subrepository#SVN_subrepositories)
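
Roughly (a sketch; the path and server URL are made up), you declare the
subrepository in .hgsub and commit it like any other file:

    
    
      $ cat .hgsub
      assets = [svn]https://svn.example.com/project/assets
      $ hg add .hgsub
      $ hg commit -m "track binary assets as an SVN subrepository"
    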

~~~
gecko
That option works fine, _provided_ that all of your binary assets sit at one
location in the tree, _and_ that you just happen to have a Subversion server
lying around at the time. largefiles allows you to put your assets where you
want, and allows you to avoid setting up a Subversion server.

------
protez
I don't get what it does. Does this extension make large binary files
"diffable," as it states that's the problem it solves?

~~~
gecko
Binary files are already diffable, both in how they're stored (in fact, the
_only_ thing that Mercurial stores internally is binary diffs), and in terms
of sending around patches (that's what the Git patch format is for).
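
For instance, Git will happily put a binary delta in a patch if you ask it to
(a minimal sketch; the filename is made up):

    
    
      $ git diff --binary HEAD~1 -- logo.png > logo.patch    # emits a "GIT binary patch" section
      $ git apply logo.patch
    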

There are two problems that largefiles tries to solve. First, while binary
files are technically diffable, most popular formats store large amounts of
compressed data, and a small edit to compressed data rewrites most of the
bytes, so the diffs end up nearly as large as the files themselves.
Combine that with the second problem, which is that distributed version
control systems tend to include the entire history in every repo, and you've
got a recipe for disaster: those 200 MB worth of textures that you just color-
corrected are now going to be another 200 MB of data that every last developer
needs to get whenever they attempt to fetch your repository.

largefiles solves this by saying that certain user-designated files are _not_
actually stored in the repository. Instead, it stores stand-ins: one-line text
files containing the SHA-1 hash of the file they represent. Whenever you
update (checkout, in Git parlance) to a given revision,
largefiles fetches your missing files on-demand, either from the central
store, or (if available) from a per-user cache.
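
Concretely, the workflow looks something like this (a minimal sketch; the
filename is made up):

    
    
      $ cat >> ~/.hgrc <<EOF
      [extensions]
      largefiles =
      EOF
      $ hg add --large textures/skybox.psd    # adds a one-line standin; the content goes to the largefiles store
      $ hg commit -m "add skybox texture as a largefile"
      $ hg update -r <rev>                    # pulls down matching large-file content on demand
    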

The benefit of this approach is that, if you just want the newest revision, you
don't have to also fetch all the historical versions of all the assets. The
downside of this approach is that a clone doesn't, by default, have the full,
reconstructable history of the entire repository. Whether this trade-off works
for you will largely depend on who you are and what your workflow is, but
we've found many Kiln customers who find it to be an excellent trade-off.

------
splicer
I wonder whether Kiln will end up using this.

~~~
gecko
Of course! largefiles is a direct descendant of kbfiles (our initial, Kiln-
specific version of this functionality). We are really happy to see it
integrated into the official Mercurial release, and will be supporting it
within the next couple of weeks. We're just working on making sure that the
switch is painless and transparent for everyone who's currently using kbfiles.

------
luckydude
/me wonders when Mercurial will ever do anything other than copy BitKeeper.
We've been doing this for years, my photos are in a ~100GB BK/BAM repo.

Release notes for BitKeeper version 4.1 (released 12-Oct-2007)

Major features

BAM support. BAM stands for "Binary Asset Management" and it adds support to
BK for versioning large binaries. It solves two problems: a) one or more
binary files that are frequently changed. b) collections of many large
binaries where you only need a subset. The way it solves this is to introduce
the concept of BAM server[s]. A BAM server manages a collection of binaries
for one or more BAM clients. BAM clients may have no data present; when it is
needed the data is fetched from the BAM server.

In the first case above, only the tip will be fetched. Imagine that you have
100 deltas, each 10MB in size. The history is 1GB but you only need 10MB in
your clone.

In the second case, imagine that you have thousands of game assets distributed
across multiple directories. You typically work only in one directory at a
time. You will only need to fetch the subset of files that you need; the rest
of the repository will have the history of what changed but no data (so bk log
will work, but bk cat will have to go fetch the data).

~~~
gaoshan
To really copy BitKeeper Mercurial would need to start charging for use and
come out in Basic, Pro and Enterprise editions, with things like BAM support
disabled at the lowest level. Fortunately they don't do this.

~~~
luckydude
It's easier to copy than it is to invent stuff, sad but true. hg has a long
history of doing illegal copies of BK tech, we could have sued them out of
existence years ago. Imitation is the sincerest form of flattery, so I guess
we should be flattered :)

Legalities aside, the point I've made for about a decade now is that it would
be interesting to see a release announcement from hg where I went "That's
cool! Why didn't we think of that?"

As for the 3 levels of product, um, when you build commercial products with
commercial support, you don't do that? You really want only one offering? That
doesn't play well in the commercial world, we tried that. If you are just
taking a dig at commercial software, sorry about that, but we have to pay for
dev somehow. I'd love a way to open source the thing and make money, haven't
found it.

~~~
rdtsc
> we could have sued them out of existence years ago

really? but you didn't, from the goodness of your heart, right?

~~~
luckydude
We didn't see the upside of suing and we could sure as heck see the PR
downside.

What would you have done? You got a well known hacker who all of the open
source guys will side with, so you lose the PR battle, but the guy is ripping
off your technology, his employer basically admitted that in front of your
lawyer. What would you do?

~~~
rdtsc
Yeah, not sure, not an easy problem. Maybe go public? In the open source
world, reputation is almost everything, and honesty figures heavily into it.
If you had good proof, their reputation would be ruined.

I personally don't care if this was RMS himself or Torvalds, if they are
copying code from others as their own, I lose respect for them and will refuse
to use or promote their projects.

Maybe present it in an exploratory kind of blog post -- "So yeah we found this
out and we don't know what to do, what does the community think?" Just make it
public.

~~~
msbarnett
This is either him reminding everyone of the hissy-fit Bitkeeper threw when
Tridgell telnet'd into a bitkeeper server and typed HELP, or the hissy-fit
Bitkeeper threw when Bryan O'Sullivan dared to contribute to Mercurial while
his employer held a license for Bitkeeper.

Either way, the reminder that Bitkeeper's licensing terms are so utterly
ridiculous that either of the above cases was considered in any way, shape,
or form "nefarious" only strikes me as a good way to scare away even more
potential customers from their product.

~~~
luckydude
Tridge is not exactly telling the whole story. That telnet thing is quite
true, but the part he left out is when Linus was at his house running BK
commands and Tridge was snooping the network to figure out the protocol.
There isn't any chance that Tridge figured out what to do by telnetting to
bkbits, and I'll back that up with a $10K challenge for anyone to write the
same code Tridge wrote, in the same time frame, with the only resource being
telnet to bkbits.net.

Go talk to Linus and see what he says about all this, don't take my word for
it.

~~~
garenp
People don't take your word for it though, so it's much ado about nothing. The
burden of proof is on you, not anyone else ("Extraordinary claims require
extraordinary evidence"), hence no need to "go talk to Linus," because your
claims are so highly suspect.

Because, really, nobody thinks BK contains some kind of rocket science for
anyone to rip off in the first place. If there were, IMNSHO I don't see it in
the mercurial or git sources, which are readily available to anyone. Hence
nobody takes seriously what you, or even a supposed "employer" or "lawyer",
says either (and what your lawyer would say is even more highly suspect :)).

And if all this supposed restraint is due to some kind of self-interested,
game theoretic calculation as you imply, why does that same thinking not
restrain you from touting unverifiable claims? It only generates the "bad PR"
you say you wish to avoid, makes you look bad, and by extension makes your
claims ever more doubtful. That can't be good for business.

