

Time Machine and Mail: a match made in hell - lisper
http://rondam.blogspot.com/2009/07/time-machine-and-mail-match-made-in.html

======
jff
The real question is why the hell Time Machine backs up at a file level
instead of at a block level.

Plan 9's Venti archival storage (<http://en.wikipedia.org/wiki/Venti> and
<http://plan9.bell-labs.com/sys/doc/venti/venti.html>) stores data as blocks,
which are referenced by SHA1 hashes. Since these blocks range between 512
bytes and 56 KB, a large file (like a mail file) gets split into many blocks.
The brilliant thing is that a given block will only ever be stored once; your
top-level filesystem then only has to keep track of which SHA1 hashes make up
a file.
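
To make that concrete, here's a toy version of the same idea in shell--a
sketch only, with fixed-size chunks and shasum standing in for Venti's
variable-size blocks and real index, and the file names made up:

    # Split a file into chunks, store each chunk once under its SHA1
    # hash, and record the ordered hash list as the file's "recipe".
    mkdir -p store
    split -b 8192 mailfile chunk.
    for c in chunk.*; do
        h=$(shasum "$c" | cut -d ' ' -f 1)
        [ -e "store/$h" ] || mv "$c" "store/$h"   # each block stored once
        echo "$h" >> mailfile.recipe
    done
    rm -f chunk.*   # leftovers were duplicates already in the store

Re-run it after appending to the file and all but the final old chunk hash
identically, so almost nothing new gets stored.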

With this system, people have been keeping daily snapshots of their
filesystems over the course of years, and the space consumption rate actually
tends to decrease over time--see the graphs at
<http://plan9.bell-labs.com/sys/doc/venti/venti.html>

~~~
rarrrrrr
Backup systems make tradeoffs to be space efficient (block-level backups) or
computationally efficient (full-file backups, whose cost is mostly I/O).

Apple made the choice to have Time Machine operate with little CPU burden.
While this would be a tremendously poor choice for an online storage system
like SpiderOak, Dropbox, SugarSync, etc., it probably makes sense for them,
since Time Machine is usually working with a local external drive.

I admire much of Venti's design, but last I checked, Venti didn't support
recovering the space from deleted items, except by way of making a new copy of
the file system.

~~~
wmf
_Apple made the choice to have Time Machine operate with little CPU burden._

I don't know why, given that my CPU is a sunk cost. (Also, I suspect that
Apple chose hard links instead of deltas for ease of implementation, not to
save CPU cycles.)

 _last I checked, Venti didn't support recovering the space from deleted
items, except by way of making a new copy of the file system._

I think the state of the art has moved on since Venti; Cumulus (and probably
tarsnap) implements garbage collection to recover space.
<http://cseweb.ucsd.edu/~mvrable/cumulus/>
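
For a sense of what that looks like, a minimal mark-and-sweep over a
content-addressed store like the toy one sketched upthread (not Cumulus's
actual segment-based scheme) is just:

    # Mark: every hash still named by a surviving recipe is live.
    cat *.recipe | sort -u > live-hashes
    # Sweep: delete stored blocks that no recipe references anymore.
    for b in store/*; do
        grep -qx "$(basename "$b")" live-hashes || rm "$b"
    done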

~~~
rarrrrrr
It was definitely not chosen for ease of implementation: Apple had to modify
core file system functionality, adding support for hard links to directories
(as well as files) in HFS+, which is fraught with peril if not implemented
just right. Users just don't want a backup utility to use any significant
amount of CPU and slow down their machine.

~~~
jrg
As a side note, it has always been possible (on UFS at least) to link
directories. It just gets a bit messy at fsck time, and older shells get a bit
confused by 'cd ..'.

Apple's implementation could have avoided linking directories--by creating
new directories each time but always linking the files inside them--though I
suspect that they decided it would be far quicker to replicate entire trees if
you knew nothing in them had changed.
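
That file-linking variant is essentially what rsync's --link-dest option
does, if you want Time Machine-style snapshots without directory hard links
(paths and dates below are invented):

    # New directories every time, with unchanged files hard-linked
    # into the previous snapshot; only changed files consume space.
    rsync -a --link-dest=../2009-07-13 /Users/ron/ /Volumes/backup/2009-07-14/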

------
Locke1689
_And for small files, like most mail messages tend to be unless they have a
lot of attachments, creating a hard link is no faster than actually copying
the file._

[citation needed]

~~~
jmtulloss
In my unscientific trial, it looks like the hard link is an order of magnitude
faster. Still, it takes time.

      $ du -sh README.markdown
      4.0K	README.markdown
    
      $ time ln README.markdown README.markdown.link
    
      real	0m0.002s
      user	0m0.000s
      sys	0m0.001s
    
      $ time cp README.markdown README.markdown.copy
    
      real	0m0.041s
      user	0m0.000s
      sys	0m0.002s

~~~
lisper
That's pretty unscientific :-) To measure anything that fast reliably you have
to do it more than once:

        [ron@mickey:~/foo]$ cat foo
        foo
        
        [ron@mickey:~/foo]$ time for i in {1..1000}; do cp foo foo.c.$i; done
        real	0m2.705s
        user	0m0.524s
        sys	0m2.190s
        
        [ron@mickey:~/foo]$ time for i in {1..1000}; do ln foo foo.l.$i; done
        real	0m2.618s
        user	0m0.486s
        sys	0m2.008s

~~~
jrockway
Now you are measuring the speed of the fork/exec call. To really do this right
you need to write a small script that copies the file 1000 times or creates a
hard link 1000 times.

~~~
lisper
No, fork is a lot faster than disk I/O:

        [ron@mickey:~]$ time for i in {1..1000}; do echo foo>/dev/null; done
        real	0m0.064s
        user	0m0.039s
        sys	0m0.025s

------
a2tech
Creating an Archive folder as he describes doesn't stop Time Machine from
creating all those hard links every time it runs, so his solution doesn't fix
anything. He should upgrade to a faster disk for his backups to solve his
issues. I have a very large collection of mailboxes on my machine and see none
of the issues he's describing.

~~~
mjtsai
Creating an archive folder does help because then Time Machine can create one
hard link for the directory rather than a folder containing thousands of hard
links to individual message files.

The archive folder is for Apple Mail, not Entourage. Entourage stores all the
mail in a single database so it doesn't matter which folders the messages are
in.

~~~
tesseract
Traditionally (and even now in most unices) hard links to directories were not
allowed, in order to prevent endless recursive cases. Apple added the feature
to HFS+ in 10.5 so that Time Machine could make use of the optimization you
describe. There are rules about directory hard links so that pathological
recursion still cannot happen:
<http://lists.apple.com/archives/darwin-dev/2007/Dec/msg00029.html>
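
You can see the traditional restriction at any shell prompt--plain ln refuses
to hard-link a directory (the exact error text varies by platform); on 10.5
the new behavior is reachable through the link() system call, under the rules
in that post, rather than through ln:

    $ mkdir d
    $ ln d d-link
    ln: d: Is a directory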

