Ask HN: Help me fix your backups (spreadsheets.google.com)
46 points by drewcrawford on July 7, 2010 | hide | past | favorite | 40 comments

  How is your data backed up?
  [X] I use Tarsnap
Where's the "I am Tarsnap" option?

Somewhat related: Are the responses to this going to be made public somewhere? I'd love to see what they look like (for obvious reasons) and was thinking just last week that maybe I should post a survey to HN and Reddit.

Lol. I knew it was you as soon as I saw "I use tarsnap" as the only one checked.

I love Tarsnap. I use you for a lot of very-critical-yet-small files.

My personal pain point is that I'd like to back up a few TB of data but can't afford to--it would be nice to back up an entire workstation. I'm interested in seeing if that's a sentiment that others share, or if it's just me.

My personal pain point is that I'd like to back up a few TB of data but can't afford to--it would be nice to back up an entire workstation.

Why do you have a few TB of data on a workstation? What sort of work are you doing?

Right now about a quarter of respondents have 1TB or more, so while I'm not in the majority, it seems I'm in good company. 47% of respondents seem to be in the 100GB-1TB bracket as of right now.

Here's the breakdown of my data (largish categories only)


* 5GB of codebases I've committed to in the last 3 years.

* 50GB worth of software, mostly development tools. Five different versions of Xcode; I actually use most of them every day (scary, I know…)

* 50GB of VM images I use for development. I could rebuild them, but I really don't want to.

* 80GB on my boot volume. Stuff I would back up only if I wanted a live, bootable backup; it would save me a lot of time during a restore.

* 5GB in ebooks; primarily used for reference / searching

* 20GB in music

* 20GB in scanned documents, mostly financial and legal paperwork, contracts, etc.

* 15GB of e-mail


* 250GB of samples/patches/instruments that I've built over the years and songs I've written (I play keyboards for fun). Not business critical, but I'd be sad if I lost it.

* 20GB of backups from computers I used to own growing up. The nostalgia of that game I was coding in middle-school, etc.

* 100GB in learning resources I've collected over the years that are out of print or otherwise unavailable anywhere else. Not useful to me (anymore), but sharing them with others is valuable to both me and them.

* 20GB of home video

* 13GB in personal photos


* 375GB of Apple Development videos. I refer to these fairly regularly. I typically watch a few each day to keep up-to-date.

* 20GB of Steam games

Looking at this list, sure, I could live without some of it. But a lot of it is pretty important, and it would be nice to back up most of it.

Back up the whole thing on hard disks and store one at your bank (if they offer safe deposit boxes where you live) and another at a relative's place. Encrypt and seal, of course.

Update that with "slow" data every month or couple of months. Decide what needs to be backed up daily/weekly and pay some money to do that online. How much would it be worth to you if you lost it in a second?

FWIW, I also have the same problem (I'm a Tarsnap user). I currently have just shy of 2TB of GIS data (1.6TB is elevation data alone), so right now I back up everything except my GIS data. I loves me some tarsnap for the other stuff, but right now it's not a solution for this much data.

This is definitely a pain point of mine. Luckily, the data doesn't change often, and if I needed to restore from data loss it wouldn't need to happen ASAP. So I'm thinking just saving the data to separate hard disks and storing those in a safety deposit box will fulfill my needs.

Does that sound right? Or am I missing some large dangers in this solution? (I'm pretty new to this stuff)

I'm not the OP but I'll add my datapoint here. I have that just because of digital photography. My current 12MP camera fills up a 16GB card easily if I am covering a 2-3 hour event for some friend. That adds up fast.

Right now I do a versioned rsync to a server I host myself that has a 4.5TB RAID5 array (just upgraded from 1TB). That was ~$400 in disks when I upgraded. Even at my current ~1TB usage that would be $3,600 per year in Tarsnap storage alone, without the bandwidth cost.

What I'm still missing is a way to back up that server. I could just colocate a second server for a fraction of the Tarsnap cost, and I don't really want to be backing up onto physical media (tapes or drives) and shuttling it to a deposit box. What I need is Tarsnap, only 10x cheaper. I'll accept lower redundancy than S3 for it, and especially slower turnaround on restores. S3 seems to be built for online transactional workloads, not backups, so it's a poor fit for these services. Even the reduced-redundancy version seems only marginally different.

Comparing S3 to an offsite RAID volume seems like apples to oranges. The biggest difference, I think, is that the data are stored in multiple data centers and Amazon takes care of all the issues that entails. There were a lot of interesting comments about S3 cost here: http://news.ycombinator.com/item?id=422225

It would be awesome if Amazon offered something analogous to a single offsite automagically-expanding RAID volume for a lot less money than S3.

> Comparing S3 to an offsite RAID volume seems like apples to oranges.

And that was exactly my point. Amazon is selling apples and I want an orange. Tarsnap is an apple pie and is the perfect kind of pie I want. But I want an orange pie.

I would be ok with a service that didn't give me online access to my data. Run a very large stack of removable SATA drive slots. Copy incoming files into two of them in two locations. Store them securely when they are full. Fetch them back when I need to restore. That should cut way down on the CapEx of all the NAS/SAN hardware S3 has to run.

Yes, something like this would be very nice. An analogous version would be a "Netflix-lite" for data: you mail in a portable HD, they slurp your data into (by default) your private store, then write the data (whether private or public) at the top of your request queue back onto the HD and mail it back to you.

Subject to the pricing, reliability, and privacy infrastructure being sane/good, I'd be very interested in that.

I see what you are saying. I would rather do a daily sync of any changed files, since the deltas would be small, and then have the option of a full restore by getting a disk couriered to me if I'm in a hurry, or fetching it online if I don't mind downloading files for a week or two.

What I want to avoid is to have to bother about backups in the normal case. I just want to set it up properly as a daily cron job and mostly forget about it. Putting physical disks in the mail or safety deposit boxes is too much trouble. The server I want to backup isn't even usually physically next to me.

This problem is only growing as people get better cameras and start shooting a bunch of video and stills that take up a bunch of space. An interesting solution would be a mix of this online service plus a NAS appliance to put on the home network that does fast filesharing and automatic backups locally and to the online service. When a disk in the appliance fails you call the guys and they send you a replacement drive with the data already in it.

Personally I use Puppet to handle configuration of all my systems. All code is on git/hg on codebasehq.com. This means that I don't need to backup entire systems, only databases and other user created content (which I use tarsnap for).
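That split -- rebuild systems from configuration management, archive only the irreplaceable data -- keeps backup sets small. A hypothetical crontab entry for the Tarsnap half might look like this (the paths and archive name are invented for illustration; keyfile and cache directory come from tarsnap.conf):

```shell
# Nightly at 03:00: archive database dumps and user uploads only.
# Note that % must be escaped as \% inside a crontab line.
0 3 * * * /usr/local/bin/tarsnap -c -f "app-data-$(date +\%Y-\%m-\%d)" /var/backups/db-dumps /srv/app/uploads
```

Because Tarsnap deduplicates across archives, the nightly full-archive invocation only uploads and stores the blocks that actually changed.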

Thanks for the recommendation. I just spent some time reading the Tarsnap website. It sounds like a very competitive solution, both in terms of operational transparency (detailed, no-gimmick website) and the layer of security.

Hope it's not a stupid question, but how do you deal with the "bus factor" (a.k.a. "truck factor")? In other words, if the service relies on encrypted S3 backups and the founder exits for whatever reason, is there a contingency plan?


how do you deal with the "bus factor"?

I don't. If I get hit by a bus, Tarsnap will almost certainly cease to exist.

That said, the service runs itself quite happily on its own (I've gone for months without touching any "live" code) so the odds are very good that Tarsnap users would have plenty of time to download their data between "email goes out announcing bus hit and account balances get refunded" and "service shuts down".

I intend to solve the bus-factor issue once Tarsnap gets larger, but right now it's not financially feasible to bring someone else in.

Just to note, I've had very good results with tarsnap, thanks cperciva!

I'm hoping that if the results are made public they'll be anonymised at least somewhat.

As a software development company, all of our code is in a git repository that has a remote on a VPS. Our documents are primarily Google docs. Important docs are checked into git.

Everything important that we have locally also exists at some remote location.

I assume you are using git to manage your home directory. I use gibak[1] for the same (built on top of git). Out of curiosity, what is your setup like?

[1] http://eigenclass.org/hiki/gibak-backup-system-introduction

Thanks, that is great feedback. Maybe so much is web-based now that backups aren't really relevant.

Is there any pain associated with restoring your dev environment? How long would it take you to get your text editor reinstalled, color scheme set back up, system preferences back, etc.?

Almost everything I care about is managed via git. That includes editor and shell configs and ~/bin -- restoring my own dev environment is so easy, I just randomly do it on different computers.
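For anyone curious what that looks like mechanically, here is a minimal local sketch of the configs-in-git idea. Everything here (the bare "remote", the vimrc file, the stand-in home directory) is invented for illustration; a real setup would clone from a remote host:

```shell
# Stand-in "remote" plus a working clone, all in the current directory.
git init -q --bare dotfiles.git
git clone -q dotfiles.git dotfiles
cd dotfiles
echo 'set number' > vimrc
git add vimrc
git -c user.email=me@example.com -c user.name=me commit -qm 'track vimrc'
git push -q origin HEAD
cd ..

# "Restoring" on a new machine is a clone plus symlinks into $HOME
# (a stand-in home directory is used here).
mkdir -p fakehome
ln -s "$PWD/dotfiles/vimrc" fakehome/.vimrc
```

From then on, config changes are just commits, and every machine that has pulled the repo is another backup copy.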

Every time I make a change worth saving and want to share it, I push it to gerrit, which means there are now two copies (one on my laptop and one "in the cloud" on my gerrit server).

Every time a change is reviewed, verified, and blessed as part of our code base, the reviewer or verifier pushes a button to submit the change, it's automatically sent up to github, github fires webhooks that automatically replicate the data down to a machine in the office, another copy on our build master (also in the office), and then, shortly after, another copy on every build slave.

Internally, files are stored via nfs or smb onto a solaris box that takes snapshots every 15m. Those snapshots are not stored externally.
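Those 15-minute snapshots are presumably something like a cron-driven zfs snapshot; the dataset, pool, and host names below are invented. Snapshots alone stay on the box, but an incremental zfs send could later produce the missing external copy without loading the live filesystem:

```shell
# Crontab entry: snapshot the shared dataset every 15 minutes
# (% must be escaped as \% inside crontab).
0,15,30,45 * * * * zfs snapshot tank/shares@auto-$(date +\%Y\%m\%d-\%H\%M)

# An incremental off-site copy between two snapshots would then be:
#   zfs send -i tank/shares@auto-OLD tank/shares@auto-NEW | \
#       ssh backuphost zfs recv backup/shares
```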

The biggest part of our data, though, is made up of VM images: big files that change a lot. Restoring them hurts, but anything that loads up I/O on that box makes VMware think the NFS server isn't responding, and it unmounts and remounts it (it even does this when nothing's using it -- though the NFS server itself works fine for other clients during this time).

I wouldn't mind a backup procedure there, but it'd have to not break that.

(sorry to not use your form... the further down I got, the less it applied to me)

"The biggest problem with backups is"

Missing option is impact on the computer while I'm using it. All of the transparent services (especially Time Machine) that I have experience with are abysmally bad at turning themselves off when the machine is being actively used.

That's my favorite aspect of Windows Home Server's backup. It backs up once a day while you're asleep. It'll even wake up your machine, back it up, and then put it back to sleep. And it's smart enough to only do this if your laptop is plugged into AC and not running on battery. And if any machine hasn't been backed up in a while, you get a Health warning in your system tray. And this is all pretty configurable.

How's that for low impact?

That's awesome! I'm very jealous.

Are you going to post the survey results on HN?

I bet a lot of people would find them interesting.

I got about halfway through the survey before realizing it was about workstation backup, not server backup.

Time Machine + Dropbox makes workstation backup a solved problem.

Slicehost Backup + Tarsnap makes server backup almost a solved problem. And I say almost because I don't like having no choice but to store my full bootable backups with the same provider that hosts my primary server. If I could store bootable weekly images off-site, I'd be golden.

> * Required

> Looks like you have a question or two that still needs to be filled out.

If I want to skip a question, I should be able to do so.

Agreed. If you give me a survey with every question marked "required" and I find anything I don't want to answer, you're out of luck on the whole survey, because I will just click back.

Dropbox solved everything backup related for me, and I love it.

On the Mac, Dropbox doesn't do so well with restoring metadata: http://www.haystacksoftware.com/blog/2010/06/the-importance-... http://www.haystacksoftware.com/arq/dropbox-backup-bouncer-t...

(Disclaimer -- I write backup software: http://haystacksoftware.com/arq/)

Arq looks nice. I wasn't able to digest the (S3) pricing table, but all in all it leaves a favourable impression of a well-designed and well thought out program.


S3 pricing is confusing. It boils down to this: $0.10/GB per month for storage, plus an insignificant amount in transaction costs;

After Nov 1, data transfer to S3 will be $0.10 per GB;

Restores (data transfer from S3) are $0.10 per GB (the first 1GB of transfer from S3 each month is free).
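Plugging the quoted rates into a quick back-of-envelope check makes the totals concrete. The 50GB data set and 2GB monthly delta below are made-up example numbers, not anything from the thread:

```shell
# awk handles the floating-point math; both rates are $0.10/GB,
# per the pricing summary above.
awk 'BEGIN {
    storage_rate = 0.10; transfer_rate = 0.10
    data_gb = 50; delta_gb = 2
    printf "storage: $%.2f/mo\n", data_gb * storage_rate
    printf "upload:  $%.2f/mo\n", delta_gb * transfer_rate
    printf "full restore: $%.2f one-off\n", data_gb * transfer_rate
}'
```

So a 50GB backup set runs about $5/month to store, pennies to keep updated, and another $5 to pull back down in full.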

What does this mean for regular people?

I tried to explain that in the blog post. Some quick examples:

* If you back up a Mac app using Dropbox and then restore it, nothing happens when you double-click it, because the executable file inside wasn't set to be executable.

* If you restore your downloaded files with Dropbox, the "quarantine" metadata won't be there, so you won't get the "This is an application downloaded from the Internet" warning.

* If you restore a "bundle" (e.g. an iPhoto Library), it won't look like a single file anymore -- it'll look like a folder; double-clicking it opens the folder instead of launching iPhoto.

* Modification timestamps are all set to the current time, so if you used that information to find things in the past, you're out of luck.

Absolutely agree.

Only problem: the 100GB option is too expensive for me (for now), which is a shame. Also, I'd love it if they had a much larger size as well, so I could quietly back up everything without worrying about space.

Currently using 60% of my 51GB. I've got all my work-related files there (I'm a web designer), my music, and a folder of collected inspiration. No videos.

I'm also on about 50% of my 50GB. The thing is, I'd rather dump everything in there and not worry about what is and is not "critical".

That's what I've done.

Including the needs for backing up confidential data? Hm.

One thing to point out: backups in the cloud are useless if you don't have internet access when you need to get at or restore the data.

If you want to fix my backups, make the data readily accessible when there isn't a net connection between me and the service.

One time this usually bites you is when you've just (re-)installed an OS and need the network driver, which is only available online. I always keep a backup of the drivers on my external hard disk for this situation.

Other services I worry about are Gmail and the rest of the Google apps. With email, at least, you can easily download the several GB of messages onto your hard disk, keeping a backup for when you or the service are offline.

Dropbox solves this nicely: it's a local folder that's synced to the cloud. If Dropbox is down or I'm offline, I still have access to my data.

Since I've not seen it mentioned yet: I use CrashPlan to back up all of my personal systems, both to a home server and to a remote peer. It's a bit RAM-heavy, but other than that it nicely stays out of the way and does its job. The killer feature is trivial peer-to-peer connectivity, which makes off-site backups easy. I'm not affiliated with the company -- just a happy customer.

What is this "trivial peer-to-peer connectivity" exactly?
