
Dribbble is Back With a Day of Data Lost
http://blog.dribbble.com/post/50490891209/regarding-our-recent-site-outage
======
replicatorblog
There is a lot of armchair sysadmining going on, but remember: the team that
built Dribbble is essentially 4 people total, 2.5 of them engineers, working
with no outside funding. The fact that they've built the designer's equivalent
of GitHub and keep it running as smoothly as they do is amazing. It's fine to
offer suggestions, but this is a minor blip in an otherwise impeccable record
of performance.

~~~
patio11
If there's one social norm I'd love for HN, it would be "If you build things,
we're on your side." (I hope that the normative intent of this is clear enough
to not require 2 paragraphs of inoculations against nitpickery. On second
thought, if there were two social norms I'd like for HN, that plus "Default to
not nitpicking.")

~~~
tptacek
Strongest possible agree.

If there's one thing my gut says has changed for the worse since I joined,
it's the cant away from supporting people who build things to tearing them
down.

I asked in a "how do we improve HN comments" thread awhile back if Paul Graham
could just add this to the guidelines, but it got drowned out by all the nerdy
feature requests and didn't get much discussion.

------
NelsonMinar
Good disclosure on the part of Dribbble.

I have some sympathy; I've seen a Linux server randomly corrupt its file
cache, no idea why. Google's study found 8% of DIMMs experienced at least one
memory error a year. If you can't trust your RAM, what can you trust?
<http://research.google.com/pubs/pub35162.html>
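
The ~8%-of-DIMMs-per-year figure adds up quickly at fleet scale. A rough
back-of-envelope sketch (assuming independence between DIMMs, which is a
simplification; the study itself found errors are highly correlated):

```python
def p_at_least_one_error(p_dimm: float, n_dimms: int) -> float:
    """Probability that at least one of n DIMMs sees an error in a year,
    assuming each DIMM independently errors with probability p_dimm."""
    return 1.0 - (1.0 - p_dimm) ** n_dimms

# A single server with 8 DIMMs at the study's ~8%/DIMM/year rate:
p = p_at_least_one_error(0.08, 8)  # roughly a coin flip (~0.49)
```

So without ECC, a well-stocked server has close to even odds of seeing at
least one memory error per year under these assumptions.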

~~~
manmal
In ECC we trust.

~~~
NelsonMinar
One of the great ironies of modern computing is that we stopped building ECC
into consumer hardware right when we got enough RAM to really need it. What
fraction of server hosting has ECC RAM? No one seems to know if Amazon EC2
does, for instance, which suggests it probably doesn't.

~~~
nodesocket
I would be really surprised if the memory on EC2 is not ECC. 32GB of ECC only
runs $399.00 on Crucial.

------
jakerocheleau
It could have been a lot worse, for sure. I'm just happy they resolved the
issue and it's back online with minimal damage.

Building & maintaining a website is always a learning experience because there
are so many different areas to study.

------
hijinks
Not trying to be an ass here, but something doesn't add up. I understand the
memory-corruption idea, but I wouldn't expect that to replicate to the other
PostgreSQL server. So am I right in thinking there was never a slave here?

~~~
pilif
It really depends on how replication is configured and what the exact issue
was. Postgres replication works either by directly streaming the WAL or by
shipping archived WAL files. If those files were corrupted on the master, the
slave would receive the corrupted files too.

Now, the files (and, when streaming directly, the packets) have a header
containing metadata, and the WAL entries themselves have a fixed format, so
the slave would most likely have detected the corruption (unless you were
really unlucky, in which case the corruption would silently replicate over to
the slave).

But detection would just cause the slave to stop replicating. Unless you
actively check that your slaves are still healthy, still streaming from the
master, and that replication lag is reasonably low, you won't notice
replication stopping. When you then fail over, you get the state the database
was in when the first corrupted packet arrived.

So either you monitor your slaves, or you use synchronous replication, which
ensures your data has reached the slaves before a commit returns, but that has
serious performance costs.

BTW: I would guess this was far more likely an issue with their storage than
with RAM.
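
The lag check described above boils down to comparing the master's current
WAL position with the slave's replay position. A minimal sketch of the LSN
arithmetic, assuming the "hi/lo" hex LSN format Postgres reports (in a real
setup you'd fetch the two positions from the master and slave yourself, e.g.
via the `pg_current_xlog_location()` / `pg_last_xlog_replay_location()`
functions of that era):

```python
def lsn_to_bytes(lsn: str) -> int:
    """Convert an 'XX/YYYYYYYY' Postgres LSN string to an absolute
    byte position in the WAL stream (hi word << 32, plus lo word)."""
    hi, lo = lsn.split("/")
    return (int(hi, 16) << 32) | int(lo, 16)

def replication_lag_bytes(master_lsn: str, slave_lsn: str) -> int:
    """How many bytes of WAL the slave still has to replay."""
    return lsn_to_bytes(master_lsn) - lsn_to_bytes(slave_lsn)

lag = replication_lag_bytes("0/3000148", "0/3000000")  # 328 bytes behind
```

Alerting when this number stops shrinking (or stops changing at all) is the
cheap way to catch a slave that has silently halted replication.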

~~~
hijinks
thanks for the explanation

------
timmm
The links at the top don't really work.

------
benjaminwootton
Perhaps we will start to see posts about people abandoning Postgres and moving
back to MongoDB, completing the circle?

~~~
camus
The point is, no software can prevent hardware failure, not even MongoDB ;)

~~~
brokenparser
Except for programs that listen to sensor input specifically to ensure safe
operating conditions. If it's a deliberate design choice not to allow
operation when external factors are out of bounds, I'd consider it a success
when the control software decides to shut down in that situation. In doing so,
it has prevented the hardware from entering a potentially devastating failure
mode. Sorry to nitpick.

------
camus
Glad you're back; hope you'll do what it takes to avoid another data loss ;)
Good luck.

