
Instapaper's backup method - hugoahlberg
http://www.marco.org/1630412230
======
lockesh
Anyone else find this scheme completely atrocious?

1\. Relying on a home computer on the critical path for data backup and
persistence for a business

2\. Relying on a high latency, low quality networking path between the slave
db and the 'home mac' rather than a more reliable link between two machines in
a datacenter.

3\. A poor persistence model for long lived backups

4\. No easy way to programmatically recover old backups

What's even more disturbing is that this isn't a new problem. It's not like we
don't know how to back up databases. This solution seems very poorly thought
out.

~~~
zdw
Regarding point #1 - Marco's "home computer" is a Mac Pro (per other posts
he's made) - it has Xeon processors, ECC RAM, etc. Much closer to a server
than what you can pick up at Best Buy for $399.

~~~
adambyrtek
It's not about performance or price, but about the conditions in which the
machine operates. Many servers in use nowadays are cheaper than high-end
desktop machines.

~~~
zdw
Obviously most people misunderstood my GP post...

The point wasn't that his system was in some way adequate because he happened
to be using one good piece of kit, but that that particular piece of kit was
better than average.

Additionally, the GGP post makes some assumptions about acceptability of
backup procedures that may not be correct - for example, that in Instapaper's
case anything but the most current backup copy would be useful, and therefore
long term storage of older copies isn't of primary concern.

------
ams6110
The scenario he presents of being able to recover from an unintentionally
broad delete or update query would seem to only work in the simplest of
databases. He says:

 _\- Instantiate the backup (at its binlog position 259)_

 _\- Replay the binlog from position 260 through 999_

 _\- Replay the binlog from position 1001 through 1200_

 _And you’ll have a copy of the complete database as if that destructive query
had never happened._

This only works if the changes in positions 1001-1200 were unaffected by the
undesired changes in position 1000. Seems rather unlikely to me, but maybe in
the case of his particular schema it works out.
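The replay steps quoted above can be sketched as `mysqlbinlog | mysql` pipelines. A rough illustration, assuming a restored backup; the positions and binlog file name are illustrative (real binlog positions are byte offsets, not neat event counters):

```python
# Sketch of the point-in-time recovery described in the quote: restore
# the backup, then replay the binlog on either side of the bad event.
# Positions and the binlog file name are made up for illustration.

def replay_commands(binlog, segments, mysql_cmd="mysql"):
    """Build a mysqlbinlog | mysql pipeline for each (start, stop) segment."""
    return [
        f"mysqlbinlog --start-position={start} "
        f"--stop-position={stop} {binlog} | {mysql_cmd}"
        for start, stop in segments
    ]

# Replay everything except the destructive query at position 1000.
for cmd in replay_commands("binlog.000001", [(260, 999), (1001, 1200)]):
    print(cmd)
```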

------
joshu
on delicious, we had a thing that would serialize a user to disk for every day
they were active. inactive users were not re-serialized.

this let us have day-to-day backups of individual users. this was necessary
when broken clients would delete all the user's items. so we could easily
restore an individual user (or do a historical recovery.)
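That scheme can be sketched roughly like this; the file layout and names are my own invention, not how delicious actually did it:

```python
import json
import tempfile
from datetime import date
from pathlib import Path

def snapshot_active_users(users, active_ids, outdir):
    """Write a dated JSON snapshot for each user active today.

    Inactive users are not re-serialized; their last snapshot stays on
    disk, so restoring one user (or doing a historical recovery) is just
    a matter of reading back the right file.
    """
    outdir = Path(outdir)
    outdir.mkdir(parents=True, exist_ok=True)
    stamp = date.today().isoformat()
    written = []
    for uid in active_ids:
        path = outdir / f"{uid}-{stamp}.json"
        path.write_text(json.dumps(users[uid]))
        written.append(path)
    return written

# Only u1 was active today, so only u1 gets a fresh snapshot.
users = {"u1": [{"url": "http://a.example"}], "u2": [{"url": "http://b.example"}]}
with tempfile.TemporaryDirectory() as d:
    paths = snapshot_active_users(users, {"u1"}, d)
    print([p.name for p in paths])
```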

~~~
bl4k
that's why I never have a DELETE in any query, only UPDATE and a state field
(i.e. deleted)

there's a performance advantage here as well, since indexes aren't rebuilt and
there's no table lock
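A minimal sqlite3 sketch of that soft-delete pattern (table and column names invented here):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE items (id INTEGER PRIMARY KEY, url TEXT, "
    "state TEXT DEFAULT 'active')"
)
conn.executemany("INSERT INTO items (url) VALUES (?)", [("a",), ("b",)])

# "Delete" by flipping the state field instead of issuing a DELETE,
# so the row can still be restored later (or purged on a schedule).
conn.execute("UPDATE items SET state = 'deleted' WHERE url = ?", ("a",))

live = conn.execute("SELECT url FROM items WHERE state = 'active'").fetchall()
print(live)  # only the rows still marked active
```

Queries then have to filter on the state field everywhere, which is the usual cost of this approach.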

~~~
joshu
Indexes can certainly update. Pretty sure InnoDB does not table lock for
delete either.

Also from a privacy perspective you can't keep people's data around forever.

~~~
portman
Maybe this is overly pedantic, but you can do whatever you want with people's
data, so long as you inform them of your policies and they agree to them.

~~~
heyitsnick
I doubt that's true, at least in the UK. Terms and conditions cannot trump
your personal rights. These rights are laid out in the Data Protection Act,
and I doubt you can sign them away by agreeing to a website policy that
contradicts them.

~~~
albertzeyer
Same thing in Germany. The law always takes priority over whatever you write.

------
mseebach
It seems unnecessarily exposed to an event affecting Marco's home - fire,
burglary, natural disaster etc. It would appear more prudent to back up to a
cloud location. Either, as he mentions, S3, or a VPS somewhere.

~~~
tlack
The problem with backing up to S3 is that if you ever stop paying for S3, you
lose your backups. If you get sick for a month and your bank balance goes
negative, or the IRS takes control of your account for something, there go
your backups. I find that far more scary, and more likely, than losing all of
my DVD backups.

~~~
lordmatty
The cost of storage on Amazon is very low - roughly $3 per month for Marco's
22GB.

I imagine most people running a company would have a separate corporate
account linked to a credit card, so that personal circumstances have less of a
major effect month to month.

~~~
tlack
It doesn't matter how much it costs. I still don't want to lose my backups due
to financial circumstances. I would say a fire destroying multiple safe places
to store a backup (i.e., leaving a flash drive or DVD at my partner's home) is
a lot less likely than financial mishap.

~~~
arfrank
The solution to this, then, seems to be that Amazon should allow you to prepay
for AWS credit, so you don't need to worry about your bank account suddenly
being frozen, or some other mishap, just that you have X months of runway in
AWS credit for typical S3 charges.

~~~
tlack
That would be fantastic. They could also give you a discount for large
deposits; i.e., deposit 3X your average monthly usage and save 5%.

------
bl4k
I don't think backing up the entire db to a laptop is a good idea, since
laptops can get both lost and stolen. As somebody who uses the service, I am
not super-comfortable with knowing that a full copy of my account and
everything I save is sitting on a laptop somewhere.

It would be much better if these dumps were made to S3, or somewhere else that
is actually in a secure datacenter (and a step that includes the word
'encryption').

~~~
larrywright
It's not explicitly stated in the article, but in the tweet that started it
all[1], he mentioned it was a Mac Pro, rather than a laptop. So that's
somewhat less likely to be stolen than a laptop that is taken out of the house
regularly.

That said, I agree with you, and I hope it's at least encrypted.

[1] <http://twitter.com/#!/marcoarment/status/6035374438621184>

~~~
bl4k
yeah, not much better

there is a reason datacenters were built

thinking about this after I left my comment, having all that data on your
local machine is just crazy - you are one browser exploit or break-in away
from having it fall into somebody else's hands. It isn't professional for a web
service to be doing this - especially one that is now charging some customers.

~~~
petercooper
In principle, this is true, but we're talking Instapaper here. The only
sensitive data that could be in a list of URLs is if you were marking a bunch
of porn or subversive literature to "Read Later." It's not on a par with
financial info or even personal notes.

~~~
mwg66
Who are you to say what conclusions can or cannot be drawn between a person's
name and a list of URLs they chose to read? I can think of many undesirable
and potentially erroneous conclusions that could be made.

------
ludwigvan
[Disclaimer: Instapaper fan here, so my opinions might be biased. It is
probably the application I love the most on my iPad and iPod Touch. Thanks
Marco!]

Marco has recently left his position as the CEO of Tumblr, and I think he now
concentrates on Instapaper much more than ever (I assume it was mostly a
weekend project before, requiring simple fixes); therefore I have no doubt he
will make the service more reliable and better in the future (a switch to
S3 or similar).

Also, don't forget that the Instapaper web service is currently free, although
the iOS applications are not (there is a free lite version too). There is a
recently added subscription option (which AFAIK doesn't currently offer
anything additional), and I hope it will only make the service even better.

About security, I do not consider my Instapaper reading list too
confidential, so I don't have much trouble with the thought of the backup
computer being stolen. Of course, your mileage may vary. As far as I know,
some Instapaper accounts don't even have passwords; you just log in with your
email address.

~~~
stumm
He was actually the CTO of tumblr.

------
rarrrrrr
FYI You could run either tarsnap or SpiderOak directly on the server for a
prompt offsite backup. Both have excellent support for archiving many versions
of a file, with de-duplication of the version stream, and no limits on how
many historical versions are kept.

Also, "gzip --rsyncable" increases the compressed size by only about 1%, but
makes deduplication between successive compressed dump files possible.

(I cofounded SpiderOak.)

------
dcreemer
Are the primary and backup DBs in the same data center? If so, how would you
restore from an "unplanned event" there? I ask because I faced that situation
once years ago, and very quickly learned that uploading tens of GB of data
from an offsite backup will keep your site offline for hours.

In the end I ended up _driving_ a copy of the DB over to a data center. Adding
a slaved-replica in another location is pretty easy these days.

~~~
ams6110
"Never underestimate the bandwidth of a station wagon full of tapes hurtling
down the highway" -- Andrew Tanenbaum

------
rbarooah
Would the people who are upset that Marco is using his 'home' computer feel
the same if he instead said it was at his office? Offices get broken into or
have equipment stolen too - I'm not sure why people think this is so
irresponsible given that he works from home now.

------
zbanks
That's really an amazing system. Super redundant.

A relatively easy boost, which he briefly mentioned, would be to also store
the data in S3. That should be easy enough to be automated, which could
provide a somewhat-reliable off-site backup.

However, Instapaper has the benefit of a (relatively) small DB. 22GB isn't too
bad. I don't know how well this would scale to a 222GB DB with proportionally
higher usage rates. It'd be possible, but it would have to be simplified, no?
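Automating the S3 upload could be as simple as cron plus the stock AWS CLI. A minimal sketch, assuming the CLI is installed and configured; the bucket name and key layout are invented here:

```python
from datetime import datetime, timezone

def s3_backup_cmd(dump_path, bucket="instapaper-backups"):
    """Build an `aws s3 cp` command uploading a date-stamped dump.

    The bucket name and key layout are made up for illustration;
    credentials come from the AWS CLI's own configuration.
    """
    stamp = datetime.now(timezone.utc).strftime("%Y%m%d-%H%M%S")
    return ["aws", "s3", "cp", dump_path, f"s3://{bucket}/mysql/{stamp}.sql.gz"]

# e.g. hand this to subprocess.run() from a nightly cron job
print(s3_backup_cmd("/backups/latest.sql.gz"))
```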

~~~
jforman
I'd call S3 super-reliable rather than somewhat-reliable:

"Amazon S3 is designed to provide 99.999999999% durability of objects over a
given year. This durability level corresponds to an average annual expected
loss of 0.000000001% of objects. For example, if you store 10,000 objects with
Amazon S3, you can on average expect to incur a loss of a single object once
every 10,000,000 years. In addition, Amazon S3 is designed to sustain the
concurrent loss of data in two facilities."

It's slow as a result...but that's the trade-off you're looking for in a
backup.

<http://aws.amazon.com/s3/faqs/#How_durable_is_Amazon_S3>

~~~
slaven
Yes, unless there is a billing problem with your account, someone causing
problems with your account, etc. As an Amazon S3 and AWS user I would say they
are fairly reliable overall - nowhere near super-reliable!

------
philfreo
I upvoted this not because I think personal laptops and Time Machine are a
good process for db backups, but because making backups is still a huge pain
and problematic area, so the more attention it gets, the better.

------
hugoahlberg
Marco has now updated his system with automatic S3 backup:
<http://www.marco.org/1630412230>

------
japherwocky
are those binlogs timestamped? what wonderful graphs you could make!

------
konad
I just dump data into Venti and dump my 4gb Venti slices encrypted to DVD and
keep an encrypted copy of my vac scores distributed around my systems.

If you're doing full dumps every few days, you're doing it wrong.

