Instapaper's backup method (marco.org)
166 points by hugoahlberg on Nov 20, 2010 | 43 comments

Anyone else find this scheme completely atrocious?

1. Relying on a home computer on the critical path for data backup and persistence for a business

2. Relying on a high latency, low quality networking path between the slave db and the 'home mac' rather than a more reliable link between two machines in a datacenter.

3. A poor persistence model for long lived backups

4. No easy way to programmatically recover old backups

What's even more disturbing is that this isn't a new problem. It's not like we don't know how to back up databases. This solution seems very poorly thought out.

Regarding point #1 - Marco's "Home Computer" is a Mac Pro (per other posts he's made) - it has Xeon processors, ECC RAM, etc. Much closer to a server than what you can pick up at Best Buy for $399.

It's not about performance nor price, but conditions in which the machine operates. Many servers used nowadays are cheaper than high-end desktop machines.

Obviously most people misunderstood my GP post...

The point wasn't that his system was in some way adequate because he happened to be using one good piece of kit, but that that particular piece of kit was better than average.

Additionally, the GGP post makes some assumptions about acceptability of backup procedures that may not be correct - for example, that in Instapaper's case anything but the most current backup copy would be useful, and therefore long term storage of older copies isn't of primary concern.

The scenario he presents of being able to recover from an unintentionally broad delete or update query would seem to only work in the simplest of databases. He says:

- Instantiate the backup (at its binlog position 259)
- Replay the binlog from position 260 through 999
- Replay the binlog from position 1001 through 1200

And you’ll have a copy of the complete database if that destructive query had never happened.

This only works if the changes in positions 1001-1200 were unaffected by the undesired changes in position 1000. Seems rather unlikely to me, but maybe in the case of his particular schema it works out.
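To make that caveat concrete, here is a toy sketch in Python. It is not MySQL's binlog format: the events, the `replay` helper and the dict standing in for the database are all made up for illustration. It replays a log while skipping the destructive event at position 1000, and includes one later event whose effect depends on the state it runs against.

```python
# Toy model of binlog replay; NOT MySQL's actual binlog format.
# Each event is (position, function that mutates the db dict).
# All names here are hypothetical and exist only for this illustration.

def replay(snapshot, events, skip=()):
    """Apply events in position order to a copy of snapshot, skipping some."""
    db = dict(snapshot)
    for position, apply_event in sorted(events, key=lambda e: e[0]):
        if position in skip:
            continue
        apply_event(db)
    return db


def set_row(key, value):
    def apply_event(db):
        db[key] = value
    return apply_event


def delete_all(db):
    db.clear()


def append_if_present(key, suffix):
    # Statement-style event: its effect depends on the state it runs against.
    def apply_event(db):
        if key in db:
            db[key] += suffix
    return apply_event


snapshot = {"a": "1"}
events = [
    (260, set_row("b", "2")),
    (1000, delete_all),                    # the destructive query
    (1001, set_row("c", "3")),             # independent of position 1000
    (1002, append_if_present("a", "x")),   # depends on whether "a" survived
]

as_it_happened = replay(snapshot, events)
recovered = replay(snapshot, events, skip={1000})

print(as_it_happened)  # {'c': '3'}
print(recovered)       # {'a': '1x', 'b': '2', 'c': '3'}
```

Event 1002 did nothing in production because row "a" was already gone, but in the recovered copy it modifies "a". Whether that is the state you actually want depends on the schema and on what the later queries meant.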

on delicious, we had a thing that would serialize a user to disk for every day they were active. inactive users were not re-serialized.

this let us have day-to-day backups of individual users. this was necessary when broken clients would delete all the user's items. so we could easily restore an individual user (or do a historical recovery.)
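A rough sketch of that kind of per-user, per-day serialization, in Python. The directory layout, JSON format and function names are assumptions made for the example, not a description of what delicious actually ran.

```python
import json
import tempfile
from datetime import date
from pathlib import Path

# Hypothetical layout: <root>/<YYYY-MM-DD>/<user_id>.json


def snapshot_active_users(users, active_ids, day, root):
    """Write one JSON file per active user for the given day; skip inactive users."""
    day_dir = Path(root) / day.isoformat()
    day_dir.mkdir(parents=True, exist_ok=True)
    written = []
    for user_id in sorted(active_ids):
        path = day_dir / f"{user_id}.json"
        path.write_text(json.dumps(users[user_id], sort_keys=True))
        written.append(path)
    return written


def restore_user(user_id, day, root):
    """Load one user's data as it was serialized on the given day."""
    path = Path(root) / day.isoformat() / f"{user_id}.json"
    return json.loads(path.read_text())


root = tempfile.mkdtemp()
users = {"alice": {"items": ["u1", "u2"]}, "bob": {"items": ["u3"]}}

day1 = date(2010, 11, 19)
snapshot_active_users(users, {"alice"}, day1, root)  # bob was inactive

# A broken client wipes alice's items the next day.
users["alice"]["items"] = []
day2 = date(2010, 11, 20)
snapshot_active_users(users, {"alice"}, day2, root)

# Restore just this one user from the previous day's snapshot.
restored = restore_user("alice", day1, root)
print(restored)  # {'items': ['u1', 'u2']}
```

Because each file covers one user on one day, a restore touches only that user and never requires rolling back the whole database.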

that's why I never have a DELETE in any query, only UPDATE and a state field (i.e. deleted)

performance advantage here as well since indexes aren't rebuilt and no table lock

Indexes can certainly update. Pretty sure InnoDB does not table lock for delete either.

Also from a privacy perspective you can't keep people's data around forever.

even with innodb you will still find yourself running optimize.

check out what wordpress does (from wp-content/plugins/akismet/akismet.php):

  if ( (mt_rand(1, 10) == 3) ) {
    // WP 2.0: run this one time in ten
and then in that function, after the DELETE, is this:

  $wpdb->query("OPTIMIZE TABLE $wpdb->comments");
I am sure there are plenty of people out there having fun trying to work out why their tables suddenly lock and they see an optimize process running randomly. I am also sure it runs fine in their unit tests when they have 1 post and 2 comments.

all this because they DELETE :) They have all comments, those that have been approved, those that are in moderation, and all spam, in the same table - so if they don't delete, the table would become unmanageable. So it is the design at fault, and the wrong solution. If you search source at github or somewhere similar, you will find projects with OPTIMIZE everywhere - solving a real problem entirely the wrong way

I got used to it because the advantages just far outweigh the disadvantages. Records don't disappear for malicious reasons or because of mistakes - you can purge records marked delete every 30 days with a background process, if you like - but I no longer, ever, type that keyword into an app.
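A minimal sketch of that pattern, using Python's built-in sqlite3 rather than MySQL so it runs anywhere. The `items` table, the `deleted_at` column and the helper names are assumptions for illustration. The application only ever issues UPDATE, and the single DELETE lives in the optional background purge.

```python
import sqlite3

# Hypothetical schema: deleted_at is NULL for live rows, a date string otherwise.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE items ("
    " id INTEGER PRIMARY KEY,"
    " url TEXT NOT NULL,"
    " deleted_at TEXT"
    ")"
)
conn.executemany(
    "INSERT INTO items (url) VALUES (?)",
    [("http://example.com/a",), ("http://example.com/b",)],
)


def soft_delete(conn, item_id, when):
    """What the app does instead of DELETE: flag the row."""
    conn.execute("UPDATE items SET deleted_at = ? WHERE id = ?", (when, item_id))


def live_items(conn):
    """Every normal read has to filter out flagged rows."""
    rows = conn.execute("SELECT id FROM items WHERE deleted_at IS NULL ORDER BY id")
    return [row[0] for row in rows]


def purge(conn, cutoff):
    """Background job: physically remove rows flagged before the cutoff date."""
    cur = conn.execute(
        "DELETE FROM items WHERE deleted_at IS NOT NULL AND deleted_at < ?",
        (cutoff,),
    )
    return cur.rowcount


soft_delete(conn, 1, "2010-10-01")
visible = live_items(conn)
total_before = conn.execute("SELECT COUNT(*) FROM items").fetchone()[0]

# ISO date strings compare correctly as text; the cutoff is 30 days before Nov 20, 2010.
purged = purge(conn, "2010-10-21")
total_after = conn.execute("SELECT COUNT(*) FROM items").fetchone()[0]

print(visible, total_before, purged, total_after)  # [2] 2 1 1
```

Until the purge runs, a mistaken or malicious "delete" can be undone by setting `deleted_at` back to NULL.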

Maybe this is overly pedantic, but you can do whatever you want with people's data, so long as you inform them of your policies and they agree to them.

I doubt that's true, at least in the UK. Terms and conditions cannot trump your personal rights. We have these terms laid out in the Data Protection Act and I doubt you can sign this away by agreeing to a website policy that contradicts them.

Same Thing in Germany. The Law always has priority over whatever you write.

I just read the 1998 Data Protection Act, and it is very similar to dozens of laws that govern data privacy in the US.

There are 8 specific directives in the law. #1-6 are about consent, #7 is about security, and #8 is about correcting inaccuracies.

So again, if a website owner clearly informs the end-user about their policies, and the end-user agrees, then the website owner is in compliance with the law.

The specific example that motivated my point -- "you can't keep people's data around forever" -- is simply not true in the US or in the UK (if the '98 DPA is the only applicable law; there may be others I'm not aware of).

It seems unnecessarily exposed to an event affecting Marco's home - fire, burglary, natural disaster etc. It would appear more prudent to back up to a cloud location. Either, as he mentions, S3, or a VPS somewhere.

The problem with backing up with S3 is that if you ever stop paying for S3 you lose your backups. If you get sick for a month and your bank balance goes negative, or the IRS takes control over your account for something, there go your backups. I find that to be way more scary and likely than losing all of my DVD backups.

The cost of storage on Amazon is very low - roughly $3 per month for Marco's 22GB.

I imagine most people running a company would have a separate corporate account linked to a credit card, so that personal circumstances have less of a major effect month to month.

It doesn't matter how much it costs. I still don't want to lose my backups due to financial circumstances. I would say a fire destroying multiple safe places to store a backup (i.e., leaving a flash drive or DVD at my partner's home) is a lot less likely than financial mishap.

The solution to this then seems to be that Amazon should allow you to prepay for AWS credit so one does not need to worry about their bank accounts suddenly being frozen, or some other mishap, just that they have X months of runway in AWS credit for typical S3 charges.

That would be fantastic. They could also give you a discount for large deposits; i.e., deposit 3X your average monthly usage and save 5%.

Another issue is that your S3 credentials are stored on your primary server. An attacker who gains access to that machine will also gain access to off site backups, and can completely destroy your business.

I was under the impression that using S3's versioned object support, it's possible to set up an account that has the ability to write objects but not to delete previous versions.


See also the followup question:

Q: How can I ensure maximum protection of my preserved versions?

Versioning’s MFA Delete capability, which uses multi-factor authentication, can be used to provide an additional layer of security. By default, all requests to your Amazon S3 bucket require your AWS account credentials. If you enable Versioning with MFA Delete on your Amazon S3 bucket, two forms of authentication are required to permanently delete a version of an object: your AWS account credentials and a valid six-digit code and serial number from an authentication device in your physical possession

I agree completely. It doesn't seem wise, legally, to store one's business backups at home. Whether my thoughts are warranted or not I have no idea, but I'd just pay the small S3 fee and be on with it.

One of his earlier tweets that led to this article suggests it's not just backed up at his home:

“I should blog about Instapaper's backup setup sometime. It's pretty extensive. A lot of places would need to burn down to lose your data.”

Maybe he just likes having a complete copy of the production data on his local development instance? Great for local data mining too.

I agree with that.. Kinda expected a better setup for a real business.

I don't think backing up the entire db to a laptop is a good idea, since laptops can get both lost and stolen. As somebody who uses the service, I am not super-comfortable with knowing that a full copy of my account and everything I save is sitting on a laptop somewhere.

It would be much better if these dumps were made to S3, or somewhere else that is actually in a secure datacenter (and a step that includes the word 'encryption').

It's not explicitly stated in the article, but in the tweet that started it all[1], he mentioned it was a Mac Pro, rather than a laptop. So that's somewhat less likely to be stolen than a laptop that is taken out of the house regularly.

That said, I agree with you, and I hope it's at least encrypted.

[1] http://twitter.com/#!/marcoarment/status/6035374438621184

ye not much better

there is a reason datacenters were built

thinking about this after I left my comment, having all that data on your local machine is just crazy - you are one browser exploit or break-in away from having it fall into somebody else's hands. It isn't professional for a web service to be doing this - esp one that is now charging some customers.

In principle, this is true, but we're talking Instapaper here. The only sensitive data that could be in a list of URLs is if you were making a bunch of porn or subversive literature to "Read Later." It's not on a par with financial info or even personal notes.

Who are you to say what conclusions can or cannot be drawn between a person's name and a list of URLs they chose to read? I can think of many undesirable and potentially erroneous conclusions that could be made.

[Disclaimer: Instapaper fan here, so my opinions might be biased. It is probably the application I love the most on my iPad and iPod Touch. Thanks Marco!]

Marco has recently left his position as the CEO of Tumblr; and I think concentrates on Instapaper much more than ever (I assume it was mostly a weekend project before, requiring simple fixes); therefore I have no doubt he will be making the service more reliable and better in the future (switch to S3 or similar).

Also, don't forget that Instapaper web service is currently free, although the iOS applications are not (There is a free lite version too.) There is a recently added subscription option (which AFAIK currently doesn't offer any additional thing); and I hope it will only make the service even better.

About security, I do not consider my Instapaper reading list too confidential, so the thought of the backup computer being stolen doesn't trouble me much. Of course, your mileage might vary. As far as I know, some Instapaper accounts do not even have passwords; you just log in with your email address.

He was actually the CTO of tumblr.

Are the primary and backup DBs in the same data center? If so, how would you restore from an "unplanned event" there? I ask because I faced that situation once years ago, and very quickly learned that uploading tens of GB of data from an offsite backup will keep your site offline for hours.

In the end I ended up _driving_ a copy of the DB over to a data center. Adding a slaved-replica in another location is pretty easy these days.

"Never underestimate the bandwidth of a station wagon full of tapes hurtling down the highway" -- Andrew Tanenbaum

Would the people who are upset that Marco is using his 'home' computer feel the same if he instead said it was at his office? Offices get broken into or have equipment stolen too - I'm not sure why people think this is so irresponsible given that he works from home now.

FYI You could run either tarsnap or SpiderOak directly on the server for a prompt offsite backup. Both have excellent support for archiving many versions of a file, with de-duplication of the version stream, and no limits on how many historical versions are kept.

Also, "gzip --rsyncable" increases the compressed size by only about 1%, but makes deduplication between successive compressed dump files possible.

(I cofounded SpiderOak.)

That's really an amazing system. Super redundant.

A relatively easy boost, which he briefly mentioned, would be to also store the data in S3. That should be easy enough to automate, and could provide a somewhat-reliable off-site backup.

However, Instapaper has the benefit of a (relatively) small DB. 22GB isn't too bad. I don't know how well this would scale to a 222GB DB with proportionally higher usage rates. It'd be possible, but it would have to be simplified, no?

I'd call S3 super-reliable rather than somewhat-reliable:

"Amazon S3 is designed to provide 99.999999999% durability of objects over a given year. This durability level corresponds to an average annual expected loss of 0.000000001% of objects. For example, if you store 10,000 objects with Amazon S3, you can on average expect to incur a loss of a single object once every 10,000,000 years. In addition, Amazon S3 is designed to sustain the concurrent loss of data in two facilities."

It's slow as a result...but that's the trade-off you're looking for in a backup.
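The arithmetic in that quoted durability figure can be checked in a few lines of Python. This only reproduces the numbers in the quote. It treats the design target as a per-object annual loss rate and says nothing about what S3 delivers in practice; the variable names are just for this example.

```python
from fractions import Fraction

# Figures taken from the quoted passage; exact fractions avoid float rounding.
durability = Fraction(99_999_999_999, 100_000_000_000)  # 99.999999999%
annual_loss_rate = 1 - durability                        # per object, per year
objects = 10_000

expected_losses_per_year = objects * annual_loss_rate
years_per_lost_object = 1 / expected_losses_per_year

print(float(annual_loss_rate * 100))  # 1e-09 (percent), i.e. 0.000000001%
print(years_per_lost_object)          # 10000000
```

So the "one object every 10,000,000 years" line follows directly from the eleven nines and the 10,000-object example.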


Yes, unless there is a billing problem with your account, someone is causing a problem with your account, etc. As an Amazon S3 and AWS user I would say they are fairly reliable overall - nowhere near super-reliable!

I upvoted this not because I think personal laptops and Time Machine are a good process for db backups, but because making backups is still a huge pain and problematic area, so the more attention it gets, the better.

Marco has now updated his system with automatic S3 backup: http://www.marco.org/1630412230

are those binlogs timestamped? what wonderful graphs you could make!

I just dump data into Venti and dump my 4gb Venti slices encrypted to DVD and keep an encrypted copy of my vac scores distributed around my systems.

If you're doing full dumps every few days, you're doing it wrong.
