When doing something really critical (such as playing with the master database late at night) ALWAYS work with a checklist. Write down WHAT you are going to do, and if possible, talk to a coworker about it so you can vocalize the steps. If there is no coworker, talk to the rubber ducky or stapler on your desk. This will help you catch mistakes. Then, when the entire plan looks sensible, go through the steps one by one. Don't deviate from the plan. Don't get distracted and start switching between terminal windows. While making the checklist, ask yourself whether what you're doing A) is absolutely necessary and B) risks making things worse. Even when the angry emails are piling up, you can't allow that pressure to cloud your judgment.
Every startup has moments when last-minute panic-patching of a critical part of the server infrastructure is needed, but if you use a checklist you're not likely to mess up badly, even when tired.
> YP thinks that perhaps pg_basebackup is being super pedantic about there being an empty data directory, decides to remove the directory. After a second or two he notices he ran it on db1.cluster.gitlab.com, instead of db2.cluster.gitlab.com
Good lesson on the risks of working on a live production system late at night when you're tired and/or frustrated.
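One cheap guard against the db1/db2 mix-up is to make destructive commands depend on a hostname check. A minimal sketch, assuming a made-up helper called confirm_host (not part of any standard tool) and uname -n as the source of the current hostname:

```bash
# Sketch: refuse to continue unless the shell is on the host you think it is.
# confirm_host is a hypothetical helper, not a standard command.
confirm_host() {
    expected=$1
    # The optional second argument exists only so the check can be tested
    # without actually being on that machine.
    actual=${2:-$(uname -n)}
    if [ "$expected" != "$actual" ]; then
        echo "Refusing: expected host '$expected' but this is '$actual'" >&2
        return 1
    fi
}

# Usage (the data directory path is a placeholder):
# confirm_host db2.cluster.gitlab.com && rm -rf /path/to/data
```

It won't save you from typing the wrong hostname into the guard itself, but it does force you to state which machine you believe you are on before anything gets deleted.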
pushd dir ; find . -type f -ls | less ; find . -type f -exec rm '{}' \; ; popd ; rm -rf dir
i2-badge ()
{
    # Set the iTerm2 badge text; the escape sequence expects it base64-encoded.
    printf "\e]1337;SetBadgeFormat=%s\a" "$(printf '%s' "$1" | base64)"
}
Here's a generator for Bash: http://bashrcgenerator.com/. The prompt's format string is stored in the $PS1 variable.
I daresay having the hostname as part of the prompt saves a lot of trouble.
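For what it's worth, here is roughly what a hostname-in-prompt setup can look like in Bash. This is only a sketch: the "db*" and "*prod*" patterns are assumptions about how production hosts are named, and prompt_color_for_host is a made-up helper.

```bash
# Sketch: show user@host in the prompt, in red on hosts that look like production.
# The patterns below are assumptions; adjust them to your own naming scheme.
prompt_color_for_host() {
    case "$1" in
        db*|*prod*) printf '%s' 31 ;;  # ANSI red for hosts treated as production
        *)          printf '%s' 32 ;;  # ANSI green for everything else
    esac
}

color=$(prompt_color_for_host "$(uname -n)")

# \u = user, \h = short hostname, \w = working directory, \$ = # for root, else $
# \[ and \] wrap the non-printing colour codes so Bash computes the line width correctly.
PS1='\[\e['"$color"'m\]\u@\h\[\e[0m\]:\w\$ '
```

Put it in ~/.bashrc on each machine (or ship it with your config management) and a production shell looks different from a staging one at a glance.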
Far, far too many companies get production going and then just check to see that certain things "completed successfully" or didn't throw an overt alert in terms of their safety nets.
Just because things seem to be working doesn't mean they are or that they are working in a way that is recoverable.
Virtually the only way to lose data is to not have backups. We live in such fancy times that there's no reason to ever lose data that you care about.
My uneducated guess (this one hit a friend of mine): expired/revoked AWS credentials combined with a backup script that doesn't exit(1) on failure and just writes the exception trace to stderr.
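To illustrate that failure mode, here is a sketch of a wrapper that does fail loudly. backup_to is a hypothetical helper, and the checks (non-zero exit status, empty output file) are just the two most obvious ones, not a complete verification:

```bash
# Sketch: run a dump command, write its output to a file, and return non-zero
# if the command fails or produces nothing. backup_to is a hypothetical helper.
backup_to() {
    out=$1
    shift
    # "$@" is the dump command and its arguments, run exactly as given.
    if ! "$@" > "$out"; then
        echo "backup command failed: $*" >&2
        return 1
    fi
    # -s is true only if the file exists and is larger than zero bytes.
    if [ ! -s "$out" ]; then
        echo "backup file is empty: $out" >&2
        return 1
    fi
}

# Usage (path and database name are placeholders):
# backup_to /backups/db.sql pg_dump mydb || exit 1
```

The point is that the cron job or scheduler sees a non-zero exit code and can alert on it, instead of a stack trace quietly going to stderr.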
followed by
> So in other words, out of 5 backup/replication techniques deployed none are working reliably or set up in the first place.
is no way to be running a public service with paying customers. Did the person who set up that S3 job simply write a script or something and just go "yep, it's done" and walk away? Seriously?
Unfortunately, this kind of situation, "only the ideal case ever worked at all", is not uncommon. I've seen it before ... when doing things the right way, dotting 'I's and crossing 'T's, takes an experienced employee a good week or two, it's very tempting for a lean startup to bang out something that seems to work in a couple of days and move on.
I'm not complaining, but backup/restore is an important part, with 100% test coverage and daily backup/restore runs.
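A daily restore run doesn't have to be elaborate. As a sketch only, assuming the backup is a tar archive and that one known file inside it is a good enough canary (both are assumptions, and verify_archive_restores is a made-up name):

```bash
# Sketch: prove an archive can actually be restored by unpacking it into a
# scratch directory and checking that an expected file comes out non-empty.
verify_archive_restores() {
    archive=$1
    expected_file=$2
    scratch=$(mktemp -d) || return 1
    if tar -xf "$archive" -C "$scratch" && [ -s "$scratch/$expected_file" ]; then
        status=0
    else
        echo "restore check failed for $archive" >&2
        status=1
    fi
    rm -rf "$scratch"   # always clean up the scratch copy
    return "$status"
}

# Usage (archive path and file name are placeholders):
# verify_archive_restores /backups/nightly.tar dump.sql || exit 1
```

For a real database you'd want to load the dump into a throwaway instance and run a few queries, but even this much catches the "backup is empty or unreadable" case.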
> At this point frustration begins to kick in. Earlier this night YP explicitly mentioned he was going to sign off as it was getting late (23:00 or so local time), but didn’t due to the replication problems popping up all of a sudden.
This is why I'm not a fan of emergency pager duty.
I very much appreciate their forthrightness and the way they conduct their company generally. Having said that, I have the code I work on, related content, and a number of clients on the service.
[edit for additional point]
They need the infrastructure guy they've been looking for sooner rather than later. I hope there's good progress on that front.
Is this an SOA where important data might lie in another service or data store, or is this a monolithic app and DB that is responsible for many (or all) things?
What was stored in that database? Does this affect user data? Code?
Regarding what's gone: the production PostgreSQL database. This suggests that the code itself is fine, but the mappings to the users are gone. But git is a distributed VCS after all, so all the code should be on the developers' machines as well.