
Dribbble offline due to Postgres problem - ericras
https://twitter.com/dribbble/status/334484687131123712
======
rosser
This looks like disk page corruption.

In such a case, the _very first_ thing you do — aside from verifying that your
backups, replication, and WAL archiving are all working — before you even try
diagnosing your problem any further than "something weird is happening" is
_make a filesystem-level copy of your PostgreSQL cluster_. (If you're running
LVM, ZFS, a SAN, or any other thing that lets you take an instantaneous
filesystem snapshot, so much the better. Do that, and then copy _it_.) Then,
and only then, should you even contemplate trying to un-fuck your database.

I cannot stress this point enough. That means make a tarball, or cp the
directory, or rsync it, or whatever _disk file level_ tool you prefer, and
specifically and emphatically _not_ pg_dump. (It's probably not going to make
things any worse, but if you do have corrupted disk pages, pg_dump isn't
likely work, anyway.)

Flailing around trying to fix things can sometimes make them worse. If you're
working on your already broken data, and break it further without the safety
net of a fs-level backup, you're ... well, you're worse off than you were five
minutes prior, aren't you?

~~~
astine
Better, if you _do_ have backups, just drop your database, reconstitute a new
one (using your schema definition file which you should have saved,) and load
your most recent backup. I've dealt with disk file level corruption and the
best way to deal with it IMHO is not to deal with it at all.

------
PhantomGremlin
Since "dribbble.com" appears to be "taking a timeout for maintenance" I can't
at a glance determine what, exactly, they do. So maybe they're a bunch of
hobbyists. But if not ...

I don't understand how a "real" company, even in todays overheated environment
of soon-to-fail hipster startups, doesn't do (at least) one of these two
things: a) have competent employees on staff that are intimately familiar with
their critical infrastructure and how to support it or b) pay, yes, gasp, pay
some other company for professional _support_ services. A glance at the
Wikipedia entry for Postgres shows some possible companies that do that. Or,
as already mentioned, there are mailing lists.

But twitter? Really? That's support? For someone other than a hobbyist running
a website in their spare time?

Go ahead, flame me. But my first impression is "amateur hour". My apologies if
dribbble.com really is an amateur effort.

~~~
Xylakant
Even people intimately familiar with a pieces of infrastructure - hardware or
software - can't possibly know all failure modes. Asking on twitter about an
obscure error that you've never seen doesn't cost anything and may be very
well just one way of finding a fix. It's akin to asking on IRC or writing a
mail to a mailing list - something which you seem to condone. I fail to see
how it's any less professional.

~~~
papsosouid
>It's akin to asking on IRC or writing a mail to a mailing list

No it isn't. It is akin to shouting at random people on the street "help I
don't know what to do!". Asking on IRC on a mailing list is going to a group
of people who are there specifically to help people with problems on that
subject.

~~~
Xylakant
If I had the reasonable expectation to find somebody qualified willing to help
on the street with practically no effort I'd go that route as well. It's just
very time consuming to do so, so that's why I usually don't.

~~~
papsosouid
That's precisely the point though, it isn't a reasonable expectation. The odds
of a postgresql expert just happening to follow them on twitter are poor. The
odds of multiple postgresql experts being on the postgresql mailing list is
quite high.

~~~
Xylakant
Sure, the chances of finding an expert are low, but you'd be surprised how
well it works. See, the tweet was posted to HN by someone and the chances of
finding a Postgres expert here are significantly higher. More important is
that the costs are low, even lower than writing to a ML. So why not try?

------
joevandyk
Everyone should be running pg_basebackup + pg_receivexlog on a separate
machine, preferably at a different data center.

These let you go back to any point in time. If you ran 'delete from orders
where id=id', you can restore to the transaction before you ran that command.

[http://www.packtpub.com/how-to-postgresql-backup-and-
restore...](http://www.packtpub.com/how-to-postgresql-backup-and-restore/book)
contains more information.

Also, postgresql 9.3 (out in a few months) supports disk page checksums which
can detect filesystem failures immediately.

~~~
jordanthoms
I'm using <https://github.com/wal-e/wal-e> which ships the logs to S3 with
good results

~~~
joevandyk
That's good as well!

If you are hosting your database on ec2, using only wal-e means that all of
your data is hosted on amazon. If they were to cancel your service, you'd lose
your data.

Running pg_basebackup+pg_receivexlog on a different provider is cheap
insurance against that.

Be sure to test how fast wal-e can restore your data, btw. Restoring from s3
was significantly slower than restoring from a local disk (in my testing a
while ago).

~~~
wiredfool
I ship wal logs to s3 for backup, and I do test restores on EC2.

What I've found is that most of the files can be grabbed really quickly, in
the 3mb/sec range, but there's always a handful that run at 300k/sec. Running
the downloads 8 or more at a time tends to help with that so that we're mostly
maxing out the local network.

------
parfe
The error in question is raised at
<http://doxygen.postgresql.org/sequence_8c_source.html#l01083>

My _guess_ is hard drive failure. Hope they have good backups.

~~~
mattryanharris
And if they don't?

~~~
hijinks
then they have learned an important messages.. raid or replication isn't a
backup

~~~
mattspitz
Can you further explain why raid/replication isn't a backup? Sure, you can't
backup beyond what you've replicated (locally or across a network), but for
most folks, that should be fine.

~~~
redler
Trolling, I assume? Just in case not: your own oh-shit moments (forgot that
WHERE clause?), coding errors and oversights, bad luck, adversary SQL
injection because the contractor who wrote some obscure admin page two years
ago didn't know what he was doing, customer screw-ups (your web app customer
fat-fingers their own critical data out of existence and begs you to roll them
back to yesterday), etc.

If you have a real business you need a multi-generational backup scheme of
some kind.

------
sickpig
Did they try to post about their problem on postgresql mailing list (both
pgsql bugs and general apply here)?

Postgres community is quite responsive and usually give very effective advice,
as long as you use the proper channel to communicate.

~~~
pilif
Also, #postgres on Freenode. No matter when or what you ask, one of the
regulars will usually respond instantly with very high-quality advice. No
single product, no matter whether it was free software or not has ever reached
this kind of support quality for me.

------
bifrost
I feel for you guys, but it sounds like a disk error. If you're not experts,
you need to do one of the following A) rollback to a snapshot B) restore from
backup. If you have neither of those, take a sql backup, and pray you can
recreate the broken data.

------
taitems
Might as well comment here as tweets to the account seem to do nothing. Every
25th click or so has always resulted in a Chrome "no content" error. It's a
frequent problem, and it's been happening for years.

~~~
samuelfine
You should probably email support@dribbble.com. Twitter isn't an "official"
support channel, so your tweets are far less likely to get noticed.

In the interest of saving time: browser / version / OS? Any browser extensions
installed?

~~~
taitems
Agreed, but as a follower I see how active they are and how often they respond
to users.

For clarity: Mac and Windows Chrome sans extensions, but these days I'm
running a few like Adblock and Chime.

------
19_ploT
It's back but I think one day of data is lost!

