This is not good! We don't want to scare people into writing less of these. We want to encourage people to write more of them. An MBA style "due to a human error, we lost a day of your data, we're tremendously sorry, we're doing everything in our power yadayada" isn't going to help anybody.
Yes, there's all kinds of things they could have done to prevent this from happening. Yes, some of the things they did (not) do were clearly mistakes that a seasoned DBA or sysadmin would not make. Possibly they aren't seasoned DBAs or sysadmins. Or they are but they still made a mistake.
This stuff happens. It sucks, but it still does. Get over yourselves and wish these people some luck.
In software there is still a certain arrogance of quickly calling the user (or other software professional) stupid, thinking it can't happen to you. But in reality given enough time, everyone makes at least one stupid mistake, it's how humans work.
Even when it was a suicidal pilot flying the plane into a mountain on purpose. Someone had to supervise him (there are two crew members in the cockpit for a reason), someone gave him a medical, there is automation in the cockpit that could have at least caused an alarm, etc.
So even when the accident is ultimately caused by a pilot's actions, there is always a chain of events where if any of the segments were broken the accident wouldn't have happened.
While we can't prevent a bonkers pilot from crashing a plane, we could perhaps prevent a bonkers crew member from flying the plane in the first place.
Aka the Swiss cheese model. You don't want to let the holes align.
This approach is widely used in accident investigations, and not only in aviation. Most industrial accidents are investigated like this, trying to understand the entire chain of events so that processes can be improved and the problem prevented in the future.
Oh and there is one more key part in aviation that isn't elsewhere. The goal of an accident or incident investigation IS NOT TO APPORTION BLAME. It is to learn from it. That's why pilots in airlines with a healthy safety culture are encouraged to report problems, unsafe practices, etc. and this is used to fix the process instead of firing people. Once you start to play the blame game, people won't report problems - and you are flying blind into a disaster sooner or later.
So you could argue that there have been a lot of post-mortems through the ages, with great ideas thrown around on how to avoid crimes being committed (at least against me/us). It's not just about locking people up.
When's the last time someone accidentally committed armed robbery?
But good prevention is really hard, education being a good example. Today most people have internet access and you can educate yourself there (Wikipedia, Khan Academy, YouTube...). Access to education is not a problem - getting people to educate themselves is. Nerds do it on their own, many don't. It takes individual effort to get children's minds to learn. You need teachers who like teaching and can get children excited about the world. It's not as easy as giving everyone an iPad, funding without understanding the problems doesn't work. (I guess there are situations where funding easily solves problems depending on your country and school.)
(I'm not against funding education, I just wish it would happen in a smarter way.)
On the railways in Britain the failures were extensively documented. Years ago it was possible for a single failure to cause a loss. But over the years the systems have been patched and if you look at more recent incidents it is always a multitude of factors aligning that cause the loss. Sometimes it's amazing how precisely these individual elements have to align, but it's just probability.
As demonstrated by the article here, we are still in the stage where single failures can cause a loss. But it's a bit different because there is no single universal body regulating every computer system.
E.g. in the article's case it is clear that there is some sort of procedural deficiency there that allows the configuration variables to be set wrong and thus cause a connection to the wrong database.
Another one is that the function that has directly caused the data loss DOES NOT CHECK for this.
Yet another WTF is that if that code is meant to ever run on a development system, why is it in a production codebase in the first place?
And the worst bit? They throw their arms up in the air, unable to identify the reason why this happened. So they are leaving open the possibility of another similar mistake happening in the future, even though they have removed the offending code.
Oh and the fact that they don't have backups except for those of the hosting provider (which really shouldn't be relied on except as the last hail Mary solution!) is telling.
That's not a robust system design, especially if they are hosting customers' data.
And while this sounds overly simplistic, the simplest way this could have been avoided is enforcing production hygiene. No developers on production boxes. Ever.
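As one hedged sketch of what that hygiene can look like in code (the function names and URL markers here are invented for illustration, not from the article): a destructive dev-only helper that refuses to run when its target looks like production.

```python
import sys

# Hypothetical guard: refuse to run destructive maintenance helpers when the
# configured connection string looks like a production instance.
PROD_MARKERS = ("prod", "production")

def assert_not_production(database_url: str) -> None:
    """Abort immediately if the target database appears to be production."""
    if any(marker in database_url.lower() for marker in PROD_MARKERS):
        sys.exit("Refusing to run: target looks like production: " + database_url)

def reset_database(database_url: str) -> str:
    """A destructive dev-only helper; the guard runs before anything else."""
    assert_not_production(database_url)
    # ... drop and recreate the schema here (dev/test only) ...
    return "reset " + database_url

print(reset_database("postgres://dev-db.internal/app"))
```

It's a thin layer of cheese on its own, but combined with separate credentials and network isolation it makes the holes harder to align.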
1. With the flight deck door closed, the three flight attendants place a drink cart between first class and the attendant area/crew bathroom. There's now a ~4.5' barrier locked against the frame of the plane.
2. The flight deck door is opened; one flight attendant goes into the flight deck while one pilot uses the restroom. The flight deck door is left open but the attendant is standing right next to it (but facing the lone pilot). The other two attendants stand against the drink cart, one facing the passengers and one facing the flight deck.
3. Pilots switch while the third attendant remains on the flight deck.
4. After both pilots are done, the flight deck door is closed and locked and the drink cart is returned to wherever they store it.
Any action by a passenger would cause the flight deck door to be closed and locked. Any action by the lone pilot would cause alarm by the flight deck attendant. Any action by the flight deck attendant would cause alarm by the other two.
There was indeed a suicidal pilot who flew into a mountain - I'm not sure if you were deliberately referencing that specific case. He was alone in the cockpit: the other pilot had stepped out briefly, and he locked the door before anyone could re-enter. The lock cannot be opened by anyone from the other side, in order to avoid September 11th-type situations. It only engages for a brief period, but it can be reapplied from the pilot's side before it expires, an indefinite number of times.
I'm not saying that we can put that one down purely to human action, just that (to be pedantic) he wasn't being supervised by anyone, and there were already any number of alarms going off (and the frantic captain locked out on the other side of the door was well aware of them).
A similar procedure already exists for controlled rest during oceanic cruise at certain times, using the cabin crew to check that the remaining pilot is awake every 20 minutes.
That pilot shouldn't have been in the cockpit to begin with - his eyesight was failing, he had mental problems (he had been medically treated for suicidal tendencies), etc. This was not discovered or identified, due to deficiencies in the system (doctors didn't have a duty to report it, he withheld the information from his employer, etc.)
The issue with the door was only the last element of the chain.
There were changes as the result of this incident - the cabin crew member has to be in the cockpit whenever one of the pilots steps out, there were changes to how the doors operate, etc.
Not really sure what you can do about the suicidal tendencies, though. If you make pilots report medical treatment for suicidal tendencies, they aren't going to seek treatment for suicidal tendencies.
On the day of the crash he was not supposed to be on the plane at all - a paper from the doctors declaring him unfit for duty was found at his place after the crash. He kept it from his employer and it wasn't reported by the doctors either (they didn't have a duty to do so), so the airline had no idea. That made a few of the holes in the cheese align nicely.
Pilots already have an obligation to report when they are unfit for duty, no matter the reason (being treated for a psychiatric problem certainly applies, though).
What was/is missing is an obligation for doctors to report such an important issue to the employer when a crewman is unfit. It could be argued that this would be an invasion of privacy, but there are precedents - e.g. failed medicals are routinely reported to the authorities (not just for pilots, but also for car drivers, gun holders, etc., where the corresponding licenses are then suspended), as are discoveries of e.g. child abuse.
A software vendor cannot be held responsible for errors committed by the user.
That would be blaming a parachute maker for the death of the guy who jumped out of a plane without a parachute or with one rigged wrong despite the explicit instructions (or industrial best practices) telling him not to do so.
Certainly vendors need to make sure that their product is fit for the purpose and doesn't contain glaring design problems (e.g. the infamous Therac-25 scandal) but that alone is not enough to prevent a disaster.
For example, in the cited article there was no "software error". The data wasn't lost because of a bug in some 3rd-party code.
Data security and safety is always a process, there is no magic bullet you can buy and be done with it, with no effort of your own.
The Swiss cheese model shows this - some of the cheese layers are safeguards put in place by the vendor, the others are there for you to put in place (e.g. the various best practices, safe work procedures, backups, etc.). If you don't, well, you are making the holes easier to align, because there are now fewer safety layers between you and the disaster. By your own choice.
The common example is C: "C is a sharp tool, but with a sufficiently smart, careful and experienced developer it does what you want" ("you're holding it wrong").
Developers still do this to each other.
What happened is that users started blaming themselves for what was going wrong, or started thinking they needed a new PC because problems had become more frequent.
From the perspective of a software guy, it was obvious that Windows was the culprit, but people would assign blame elsewhere and frequently point the finger at themselves.
So yes - an FAA-style investigation would end up unraveling the nonsense and pointing to Windows.
That said, aviation-level safety means reliability, dependability, few single points of failure and... there are no private kit jets, darnit!
There is a continuum from nothing changes & everything works to everything changes & nothing works. You have to choose the appropriate place on the dial for the task. Sounds like this is a one-man band.
I understand what you are talking about, but aviation also has strong expectations of pilots.
Also, if the guy or gal has alcohol problems, it would likely be visible in their flying performance over time, it should be noticed during the periodic medicals, etc.
So while a drunk pilot could be the immediate cause of a crash, it is not the only one. If any of those other things I have mentioned functioned as designed (or were in place to start with - not all flying is airline flying!), the accident wouldn't have happened.
If you focus only on the "drunk pilot, case closed", you will never identify deficiencies you may have elsewhere and which have contributed to the problem.
Maybe they don't get fired if they report themselves unable to fly beforehand but I wouldn't quite call that a no blame culture.
There is also Crew Resource Management, which addresses the human factor in great detail (how people work with each other, and how prepared they are).
In general, what you learn from reading these things is that it's rarely one big error or issue - but many small things leading to the failure event.
1 - https://www.tsb.gc.ca/eng/rapports-reports/aviation/index.ht...
2 - https://www.ntsb.gov/investigations/AccidentReports/Pages/Ac...
3 - https://en.wikipedia.org/wiki/Crew_resource_management
Of course trying to assign blame is human nature, so the reports are not always completely neutral. When I read the actual NTSB report for Sullenberger's "Miracle on the Hudson", I was forced to conclude that while there were some things the pilots could in theory have done better, given the pilots' training and documented procedures, they honestly did better than could reasonably be expected. I am nearly certain that some of the wording in the report was carefully chosen to lead one to this conclusion, while still pointing out the places where the pilots' actions were suboptimal (and thus appearing facially neutral).
The "what can we do to avoid this ever happening again?" attitude applies to real air transit accident reports. Sadly, many general aviation accident reports really do just become "pilot error".
Some days it’s just an online community that gets burned to the ground.
Other days it’s just a service tied into hundreds of small businesses that gets burned to the ground.
Other days it’s a massive financial platform getting burned to the ground.
I’m responsible for the latter but the former two have had a much larger impact for many people when they occur. Trivialising the lax administrative discipline because a product isn’t deemed important is a slippery slope.
We need to start building competence in to what we do regardless of what it is rather than run on apologies because it’s cheaper.
The project never recovered.
The safety culture element highlighted is: not blaming a single person, but finding out how to prevent the accident that happened from happening again. Which is reasonable, because you don't want to impose strict rules that are expensive up front. This way you just introduce measures to prevent the same thing in the future, in the context of your project.
Thanks for clarifying!
Get some sleep, do a thorough investigation, and the results of that are the post mortem that we would like published and where you learn from.
Publishing some premature thoughts without actual insight is not helping anybody. It will just invite the hate that you are seeing in this thread.
It seems that people are annoyed mostly by "complexity gremlins". They are so annoyed that they miss the previous sentence: "we’re too tired to figure it out right now." The guys fucked up their system, restored it the best they could, tried to figure out what happened, but failed. So they decided to do PR right now, to explain what they know, and to continue the investigation later.
But people see just "complexity gremlins". The lesson learned is do not try any humor in a postmortem. Be as serious, grave, and dull as you can.
What is to stop developers from checking into GitHub "drop database; drop table; alter index; create table; create database; alter permission;"? They are automating environment builds, so that is more efficient, right? In my career, I have seen a Fortune 100 company's core system down and out for a week because of hubris like this. In large companies, data flows downstream from a core system. When you have to restore from backup, that cascades into restores in all the child systems.
Similarly, I once had to convince a Microsoft Evangelist who was hired into my company not to redeploy our production database every time we had a production deployment. He was a pure developer and did not see any problem with dropping the database, recreating the database, and re-inserting all the data. I argued that a) this would take 10+ hours, and b) the production database has data going back many years and the schema/keys/rules/triggers have evolved during that time - meaning that many of the inserts would fail because they didn't meet the current schema. He was unconvinced, but luckily my bosses overruled him.
My bosses were business types and understood accounting. In accounting, once you "post" a transaction to the ledger that becomes permanent. If you need to correct that transaction, then you create a new one that "credits" or corrects the entry. You don't take out the eraser.
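The posting rule above can be sketched in a few lines (a toy example, not real accounting software): corrections are new reversing entries, never edits to what was already posted.

```python
from dataclasses import dataclass

# Toy append-only ledger: mistakes are fixed with a reversing entry,
# never by editing or deleting what was already posted.
@dataclass(frozen=True)  # frozen: a posted entry cannot be mutated
class Entry:
    description: str
    amount: int  # cents; negative amounts reverse earlier entries

ledger: list[Entry] = []
ledger.append(Entry("invoice #1001", 5000))
ledger.append(Entry("invoice #1001 posted twice in error", 5000))   # the mistake
ledger.append(Entry("reversal of duplicate invoice #1001", -5000))  # the correction

balance = sum(e.amount for e in ledger)
print(balance)  # the books show the right total, with the full history intact
```

The same instinct - append and correct rather than drop and recreate - is exactly what keeps years of evolved production data safe.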
For example, if I open the comments on a “14 hours ago” post, I usually see a top comment about other comments (like yours).
I then feel so out of the loop, because I don’t see the “commenters” you are referring to - so the thread that follows seems off-topic to me.
Culturally speaking, we like to pat people on the back when they do something stupid and comfort them. But most of the time this isn’t productive, because it doesn’t instil the requisite fear when working out what decision to make.
What happens is we have growing complacency and disassociation from consequences.
Do you press the button on something potentially destructive because you are confident it is OK through analysis, good design and testing, or confident it is OK through trite complacency?
The industry is mostly the latter and it has to stop. And the first thing is calling bad processes, bad software and stupidity out for what it is.
Honestly these guys did good but most will try and hide this sort of fuck up or explain it away with weasel words.
But to get there you need to fear the bad outcomes.
I used to be extremely fearful of making mistakes, and used to work in a zero-tolerance fear culture. My experience and the experience of my teammates on the DevOps team? We did everything manually and slowly because we couldn’t see past our own fear to think creatively on how to automate away errors. And yes, we still made errors doing everything carefully, with a second set of eyes, late night deployments, backups taken, etc.
Once our leadership at the time changed to espouse a culture of learning from a mistake and not being afraid to make one as long as you can recover and improve something, we reduced a lot of risk and actually automated away a lot of errors we typically made which were caused initially by fear culture.
Just my two cents.
Manual is wrong for a start. That would increase the probability of making a mistake and thus increase risk for example. The mitigations are automation, review and testing.
I agree with you. Perhaps fear was the wrong term. I treat it as my personal guide to how uneasy i feel about something on the quality front.
People like you keep making the same mistake, creating companies/organisations/industries/societies that run on fear of failure. We've tried it a thousand times, and it never works.
You can't solve crime by making every punishment as harsh as death - we tried that in 1700s Britain and the crime rate was sky high.
This culture gave us disasters in USSR and famine in China.
The only thing that can solve this problem is structural change.
The fear I speak of is a personal barrier which is lacking in a lot of people. They can sleep quite happily at night knowing they did a shitty job and it's going to explode down the line. It's not their problem. They don't care.
I can't do that. Even if there are no direct consequences for me.
This is not because I am threatened but because I have some personal and professionals standards.
Yes, medical professionals use checklists. They also have a harsh and very unforgiving culture that fosters craftsmanship and values professionalism above all else. You see this in other high-stakes professions too.
You cannot just take the checklist and ignore the relentless focus on quality, the feelings of personal failure and countless hours honing and caring for the craft.
Developers are notorious for being lazy AF, so it's not hard to explain our obsession with "just fix the system". It's a required but not sufficient condition.
Everyone takes the job of a medical professional seriously, from the education system to the hospitals that employ them to the lawmakers to the patients.
When you pick a surgeon, you avoid the ones that killed people. Do you avoid developers that introduce bugs? We don't even keep track of that!
You can have the license taken away as a surgeon, I've never heard of anyone becoming unemployable as a developer.
You are not gonna get an equivalent outcome even if tomorrow all developers show up to work with the attitude of a heart surgeon.
However if suddenly all data loss and data breaches would result in massive compensation, and if slow and buggy software resulted in real lawsuits, you would see the results very quickly.
Basically same issues as in trading securities: no accountability for either developers or the decision makers.
Medical professionals operate in an environment where they don’t fully understand the systems they’re working with (human biology still has many unknowns), and many mistakes are irreversible.
If you look at the worst performing IT departments, they suffer from the same problems: they don’t fully understand how their systems work, and they lack easy ways to reverse changes.
Well, care to elaborate on this? What do we have to change, and to what end?
Blog posts analysing real-world mistakes should not be met with beratement.
In a blame-free environment you find the underlying issue and fix it. In a blame-full environment you cover up the mistake to avoid being fired, and some other person does it again later down the line.
There’s a third option where people accept responsibility and are rewarded for that rather than hide from it one way or another.
I have a modicum of respect for people who do that. I don’t for people who shovel it under a rug or point a finger which are the points you are referring to. I’ve been in both environments and neither end up with a positive outcome.
If I fuck up I’m the first person to put my hand up. Call it responsibility culture.
No one is saying don't take responsibility; they are saying, as I understood it:
Have a systematic approach to the problem. The current systems for preventing "drunk pilots" or the wiping of production DBs are not sufficient - improve the system! All the taking of responsibility and falling on one's own sword won't improve the process for the future.
If we take the example of the space industry, where triple-redundant systems are common (like life support):
It seems some people's view in the comments stream is:
"No, bugger that, the life-support systems engineers and co should just 'take responsibility' and produce flawless products. No need for these 3x backup systems."
The "system" approach is: there is some rate of failure x, and by having two backups we have now reduced the possibility of total failure by some amount y.
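As a back-of-the-envelope sketch of that arithmetic (the failure probability here is made up, and the layers are assumed to fail independently, which real systems only approximate):

```python
# If each independent layer fails with probability p, then all n layers
# fail together with probability p**n. Illustrative numbers only.
p = 0.01                   # assumed per-layer failure probability
single = p                 # one system, no backups
with_two_backups = p ** 3  # primary plus two backups must all fail

print(single)              # 0.01
print(with_two_backups)    # roughly one in a million
```

This is why redundancy pays off so steeply - and also why correlated failures (one bad config shared by all three layers) are the thing to watch for, since they break the independence assumption.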
Or in the case of production-dbs:
If I were the CEO and the following happens:
CTO-1: "Worker A has deleted the production DB. I've scolded him, he is sorry, he got docked a month's pay, he is feeling quite bad and has taken full responsibility for his stupid action. This probably won't happen again!"
CTO-2: "Worker A has deleted the production DB. We identified that our process/system for allowing dev machines to access production DBs was a terrible idea and an oversight; we now have measures abc in place to prevent that in the future."
I'd go with CTO-2 EVERY day of the week !
CTO-2 also has the responsibility of making sure that everyone is educated on this issue and can communicate those issues and worries (fears) to his/her level effectively because prevention is better than cure. Which is my other point.
That's what we're talking about. I hope you don't have direct reports.
Next time be honest "Just shut the conversation down, everyone's a dumbass, I'm right, you're dumb" it'll be quicker than all this back and forth actually trying to get to a stable middle ground :)
All I am calling for is people to take responsibility.
I would hate for that to be our system reliability improvement methodology.
Ok fine now I'm being slightly a "dk" but really ?
The "comfort" will come from taking responsibility and owning and correcting the problem such that you have confidence it won't happen again.
Platitudes to make someone feel better without action helps nobody.
The fear is a valuable feedback mechanism and shouldn't be ignored. It's there to remind you of the potential consequences of a careless fuckup.
Lots here misunderstood this, I think: clearly the point is not to berate people for making mistakes or to foster a "fear culture" in the sense of fear of personal attack, but rather not to ignore the internal/personal fear of fucking up, because you give a shit.
I'm sorry for your data loss, but this is a false and dangerous conclusion to make. You can avoid this problem.
There are good suggestions in this thread, but I suggest you use Postgres's permission system to REVOKE DROP action on production except for a very special user that can only be logged in by a human, never a script.
And NEVER run your scripts or application servers as a superuser. This is a dangerous antipattern embraced by many an ORM and library. Grant CREATE and DROP only to specific non-superuser roles.
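As a sketch of that kind of lockdown (the role name is made up; note that in Postgres, dropping a table is tied to ownership rather than to a grantable privilege, so the key move is making the application role a non-owner):

```sql
-- Hypothetical application role: can read and write rows, but owns nothing,
-- so it cannot DROP tables or the database.
CREATE ROLE app_rw LOGIN PASSWORD 'change-me';

REVOKE CREATE ON SCHEMA public FROM app_rw;
GRANT USAGE ON SCHEMA public TO app_rw;
GRANT SELECT, INSERT, UPDATE, DELETE ON ALL TABLES IN SCHEMA public TO app_rw;

-- Ensure tables created later by the owner are usable by the app as well.
ALTER DEFAULT PRIVILEGES IN SCHEMA public
    GRANT SELECT, INSERT, UPDATE, DELETE ON TABLES TO app_rw;
```

Migrations then run as the schema-owning role, ideally by a human or a tightly controlled pipeline, never by the always-on application credentials.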
Gone are the days of me just being able to run a simple script that accesses data read-only and exports the result elsewhere as an output.
Eventually, you come to realise that the more tech you've got, the more problems you have.
Now developers spend more time googling errors and plugging in libraries and webservices together than writing any actual code.
Sometimes I wish for a techless, cloudless revolution where we just go back to the foundations of computing and use plain text wherever possible.
I'm yet to encounter a point in my career where KISS fails me. OTOH this is nothing new, I don't have my hopes up that the current trends of overcomplicating things are going to change in the near future.
... because software in the 60s/70s/80s was so reliable and bug-free?!
Sometimes it's simpler to go back to a commandline utility, sometimes it's not.
You can get by with one strong lead defining services and interfaces for a bunch of code monkeys that write what goes behind them.
Given an existing monolithic codebase, you can’t specify what high level services should exist and expect juniors to not only make the correct decisions on where the code should land up but also develop a migration plan such that you can move forward with existing functionality and customer data rather than starting from zero.
While annoying technically, for early stage startups, performance problems caused by an overly large number of users are almost always a good problem to have and are a far rarer sight than startups that have over-architected their technical solution without the concomitant proven actual users.
It is great for data safety -- chown/chmod the file, and you can be sure your scripts won't touch this. And if you are accessing live instance, you can be pretty sure that you won't accidentally break it by obtaining a database lock.
Now, "csv" in particular is kinda bad because it never got standardized, so if you have complex data (punctuation, newlines, etc.), you might not be able to get the same data back using different software.
So consider some other storage formats -- there are tons. Like TSV (tab-separated-values) if you want it simple; json if you want great tooling support; jsonlines if you want to use json tools with old-school Unix tools as well; protobufs if you like schemas and speed; numpy's npy if you have millions of fixed-width records; and so on...
There is no need to bother with SQL if the app will immediately load every row into memory and work with native objects.
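For instance, the jsonlines option mentioned above needs nothing beyond the standard library (the file name is arbitrary):

```python
import json

# Minimal jsonlines round-trip: one JSON object per line, so ordinary
# Unix tools (grep, head, wc -l) still work on the file.
records = [
    {"name": "ada", "logins": 3},
    {"name": "grace", "logins": 7},
]

with open("users.jsonl", "w") as f:
    for rec in records:
        f.write(json.dumps(rec) + "\n")

with open("users.jsonl") as f:
    loaded = [json.loads(line) for line in f]

print(loaded == records)  # True
```

Unlike ad-hoc CSV, the quoting rules are nailed down by JSON, so embedded commas and newlines survive the round trip.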
Oh, the irony! Text with punctuation and newlines is complex data.
CSV is doomed but the world runs on it and pays its cost with engineer tears.
I have experienced pain with both characters (tab and comma), particularly when I am not the one creating the output file.
Commas are _way_ too common.
CSV is an awful format anyway.
You use this tsv with Unix tools like "cut", "paste", and ad-hoc scripts.
There is also "tsv" as defined by excel, which has quoting and stuff. It is basically a dialect of CSV (Python even uses the same module to read it), and has all the disadvantages of CSV. Avoid it.
Could you elaborate? I'm interested in the specific reasons.
If you use a relational database, the worst-case outcome is you hit tremendous scale and have to do something special later on. The likely case scenario is some wins from the many performance lessons databases have learned. Best case outcome is avoiding a very costly excursion relearning lessons the database community has known about since 1970 (like reinventing transactions).
Managing data with a csv (without a great reason) is like programming a complex GUI in assembly without a reason - it isn't going to look like a good decision in hindsight. Most data is not special, and databases are ready for it.
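Even staying inside the standard library gets you most of the way - SQLite ships with Python, and a minimal sketch shows the transactions mentioned above coming for free:

```python
import sqlite3

# "Most data is not special": a real relational database with real
# transactions, in a few lines. In-memory DB for the sketch.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (name TEXT PRIMARY KEY, logins INTEGER)")

with conn:  # commits on success, rolls back if an exception is raised
    conn.executemany("INSERT INTO users VALUES (?, ?)",
                     [("ada", 3), ("grace", 7)])

total = conn.execute("SELECT SUM(logins) FROM users").fetchone()[0]
print(total)  # 10
```

A half-failed batch insert rolls back to a consistent state - exactly the kind of lesson you'd otherwise relearn the hard way on top of a CSV.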
I hear this kind of thought-terminating cliche a lot on here and it makes absolutely no sense.
If # of users is a rough approximation of a company's success, and more successful companies tend to hire more engineers... then actually the majority of engineers would not have the luxury of not needing to think about scalability.
With engineering salaries being what they are, why would you think that "most people" are employed working on systems that only have 1000 users?
Please also consider that especially for smallish teams, microservices are not required to be the same as big corp microservices.
I have encountered a trend towards calling surprisingly many things non-monolithic a microservice. So what kind of microservice are you all referring to in your minds?
Edit: 2 questions were asked, too deep to reply.
1. You said 250 people, nothing about IT. Based on the info provided, this was the image reflected.
2. "The luxury of coordinating a monolith": done well, it is not much more complicated than coordinating the design of microservices; some can argue it is the same effort.
The grandparent comment's point is that a single person or team can deploy a monolith on Heroku and avoid a huge amount of complexity. Especially in the beginning.
I build and maintain our entire employee database with a Python script, from a weird non-standard XML-"like" daily dump from our payment system, and a few web services that hold employee data in other required systems. Our IT then builds/maintains our AD from a few PowerShell scripts, and finally we have a range of "micro services" that are really just independent scripts that send user data changes to the 500 systems that depend on our central record.
Sure, sure, we’re moving it to azure services for better monitoring, but basically it’s a few hundred lines of scripting that, combined with AD and ADDS, does more than a 1 million USD a year license IDM.
Just a few weeks ago, I set up a read-only user for myself, and moved all modify permissions to a role one must explicitly assume. Really helped me with peace of mind while developing the simple scripts that access data read-only. This was on our managed AWS RDS database.
By all means, find ways to fool-proof the architecture. But be prepared for scenarios where some destructive action happens to a production database.
The article isn’t claiming that the problem is impossible to solve.
On the contrary: “However, we will figure out what went wrong and ensure that this particular error doesn’t happen again.”.
No, you can't. No matter how good you are, you can always "rm -rf" your world.
Yes, we can make it harder, but, at the end of the day, some human, somewhere, has to pull the switch on the stuff that pushes to prod.
You can clobber prod manually, or you accidentally write an erroneous script that clobbers prod. Either way--prod is toast.
The word of the day is "backups".
Yes, backups are vitally important, but no, it is not possible to accidentally rm -rf with proper design.
It's possible to have the most dangerous credentials possible and still make it difficult to do catastrophic global changes. Hell it's my job to make sure this is the case.
Can you say more about this?
I understand rm -rf, but not sure how I could design that to be impossible for the most dangerous credentials.
I didn't mean you can only make it difficult; I meant you can make it almost impossible to harm a real production environment in such a nuclear way without herculean effort and, quite frankly, likely collusion between multiple parties.
The most dangerous credentials are cosmic rays and we use the Earth’s atmosphere and ECC to fight that.
Until we get our shit together and start formally verifying the semantics of everything, their conclusion is 100% correct, both literally and practically.
I have been running Postgres in production supporting $millions in business for years. Here's how it's set up. These days I use RDS in AWS, but the same is doable anywhere.
First, the primary server is configured to ship its write-ahead log (WAL) to a secondary server. What this means is that before a transaction completes on the master, the slave has written it too. This is a hot spare in case something happens to the master.
Secondly, WAL logs will happily contain a DROP DATABASE in them, they're just the transaction log, and don't prevent bad mistakes, so I also send the WAL logs to backup storage via WAL-E. In the tale of horror in the linked article, I'd be able to recover the DB by restoring from the last backup, and applying the WAL delta. If the WAL contains a "drop database", then some manual intervention is required to only play them back up to the statement before that drop.
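A restore like the one described above can be driven by a small recovery configuration; here is a sketch for PostgreSQL 12+ with WAL-E (the timestamp is illustrative, and you'd pick the moment just before the destructive statement):

```
# postgresql.conf fragment for point-in-time recovery: restore the base
# backup first, then replay archived WAL only up to just before the DROP.
restore_command = 'wal-e wal-fetch "%f" "%p"'
recovery_target_time = '2021-06-01 22:14:00'   # moment before the DROP
recovery_target_action = 'promote'             # come up read-write there
```

With `recovery_target_time` set, the server stops replaying WAL at that point instead of faithfully re-executing the fatal statement.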
Third is a question of access control for developers. Absolutely nobody should have write credentials for a prod DB except for the prod services. If a developer needs to work with data to develop something, I have all these wonderful DB backups lying around, so I bring up a new DB from the backups, giving the developer a sandbox to play in, and also testing my recovery procedure, double-win. Now, there are emergencies where this rule is broken, but it's an anomalous situation handled on a case by case basis, and I only let people who know what they're doing touch that live prod DB.
If you're using MySQL, it's called a binary log and not a write-ahead log; it was very difficult to find meaningful Google results for "MySQL WAL".
It's a real problem that we used to have trained DBAs who owned the data, whereas now devs and automated tools are relied upon; there isn't a culture or toolset built up yet to handle that.
It's nice to have that capability, but some databases are just too big to have multiple copies lying around, or to be able to create a sandbox for everyone.
> It’s tempting to blame the disaster on the couple of glasses of red wine. However, the function that wiped the database was written whilst sober.
It was _written_ then, but you're still admitting to the world that your employees do work on production systems after they've been drinking. Since they were working so late, one might think this was emergency work, but it says "doing some late evening coding". I think this really highlights the need to separate work time from leisure time.
In this case there were like 10 relatively easy things that could have prevented this. Your ability to mentally compile and evaluate your code before you hit enter is not a reliable way to protect your production systems.
Coding after drinking is probably not a good idea of course, but “think better” is not the right takeaway from this.
I’ve done some of my most productive work this way. Not on production systems fortunately, and not in a long time.
For ref: https://xkcd.com/323/
Maybe such a breathalyzer interlock could be installed on your workstation too. After all, your systems and processes should prevent you from stupid things.
If there was more of a cultural pressure against drunk driving and actual mechanisms to prevent it that aren't too difficult to maintain and utilize, things like the Daikou services ( https://www.quora.com/How-effective-is-the-Japanese-daikou-s... ) would pop up and take care of the other logistical problems of getting your car home. And the world would be all the better for it, because of less drunk driving and accidents.
This person had to publicly apologize to their customers. One or two low-cost guardrails could have prevented it and would probably have been worth the cost.
(But if your mission critical system relies on that scoreboard website, that's on you...)
I heard the same argument for electronic gun safety measures, except that no government agency even considers using them for their own guns. Why? They are not reliable enough yet.
A problem with devices of that type is that they only test for a potential source of inability to drive safely. What we want is to test for an inability to drive safely.
And while one is easy and might give some quick wins, the drawbacks scare me too much.
Actually, getting in the car while under the influence of both stress and alcohol sounds worse.
I know someone who had a glass of wine just before her daughter needed to be brought to the hospital. This was just two days ago. She simply concluded she could not drive. Luckily, she was able to get a taxi.
I'm sure that would save a lot of lives. An ambulatory medical service, if you will.
I once dislocated my shoulder while on a large trampoline and was unable to get up from my hands and knees due to the intense pain whenever the trampoline wobbled. The ambulance was redirected to more serious injuries three times. I was stuck in that position waiting for two hours before it arrived.
In that scenario it would also be appropriate to wait for a driver to sober up before driving you to the hospital if neither ambulance nor taxi were available (or delayed). One glass of wine would be out of most people's systems after two hours.
Thus poking a hole in the "drunk drive someone to the hospital" argument, which is what this was all about in the first place.
In my previous comment I just meant that sometimes ambulances can take a good while and a taxi might not.
In the unfortunate case of the trampoline there were several sober people with driver's license and cars available and a taxi would have been there immediately.
Unfortunately, they failed to get me out of there, meaning I still had to wait until an ambulance was available. It was beyond painful and exhausting, both physically and mentally. But it was still technically not an emergency.
While your friend made a call judging their own abilities and the level of emergency, that's exactly how it should be: cars should not stop us humans from making that decision.
(Fwiw, if you had just finished an alcoholic drink, a breathalyzer would show a much higher concentration even though the alcohol might not have kicked in yet, or there wasn't enough for it to kick in at all.)
That is because most companies these days have processes around drinking in the workplace, coming in drunk, and working drunk.
Most mistakes are made sober only in environments where drinking a couple of glasses of wine and then making a production change is considered unacceptable. In environments where drunk people work, mistakes are made drunk.
Also I agree with other comments: doing some work after a glass or two should be fine because you should have other defences in place. “Not being drunk” shouldn’t be the only protection you have against disaster.
But it's just a side-project and I will continue late night coding with a glass of wine. I find it hugely enjoyable.
I would have a different mind-set if I was writing software for power stations as a professional.
Normally, this would be fine. But, it appears the site has paying members. Presumably, it's not "just a side-project" to them. You owe them better than tinkering with prod while tipsy.
The bigger issue is the lack of guardrails to protect against accidental damage. This is also a common trait for hobby projects (after all, it's more fun to hack stuff together) but hopefully the maintainer will use this experience as a sobering reminder (pun intended) to put some guardrails in place now.
Of course, the next day, when re-reading the same pages, I would always discover that the previous day I had everything wrong: nothing was obvious, and all my reasoning under alcohol was false because it was simplistic and oblivious to any mathematical rigor.
Also, nothing is comparable between alcohol and psychedelics.
In short, ability to recall memories is at least in part dependent on being in a similar state to the time when memories are formed, e.g. something learned while being intoxicated will be more easily recalled only when intoxicated again.
Though I'm talking one or two drinks here, not firing up vscode after a night out or going through a bottle of rum.
We had several MySQL string columns stored as LONGTEXT in our database that should have been varchar(255) or so, and I was assigned to convert these columns to their appropriate size.
Being the good developer I was, I decided to download a snapshot of the prod database locally and checked the maximum string length we had for each column via a script. The script then generated a migration query that would alter the column types to match their maximum used length, with varchar(255) as the minimum.
I tested that migration and everything looked good; it passed code review and was run on prod. Soon after, we started getting complaints from users that their old email texts had been truncated. I then realized the stupidity of the whole thing: the local dump of the production database always wiped many columns clean for privacy, including the email body column. So the script thought that column had a max length of 0 and decided to convert it to varchar(255).
I realize the whole thing may look incredibly stupid; that's only because the naming of the db columns was in a foreign European language, so I didn't even know the semantics of each column.
Thankfully my seniors managed to restore that column and took the responsibility themselves since they had passed the review.
We still did fix those unusually large columns but this time by simple duplicate alter queries for each of those columns instead of using fancy scripts.
I think a valuable lesson was learned that day to not rely on hacky scripts just to reduce some duplicate code.
I now prefer clarity and explicitness when writing such scripts instead of trying to be too clever and automating everything.
Basically, you just blindly ran the migration on the data and checked that it didn't fail?
The lesson here is not about cleverness unfortunately.
So yes, I could have noticed the length of 0 if I had looked carefully amidst hundreds of rows, but since my faulty assumption that prod db = local db didn't even allow for this possibility, I didn't bother.
If it had been just 10 to 20 migration queries, that would have been a lot easier to validate, but then I wouldn't even have attempted to write a script.
It happens. “It worked in dev” is the database equivalent of “worked on my machine”.
The parent is either misrepresenting the situation or they didn’t do what they say they did.
Also, in any production setup, before the migration and in the same transaction you would have something along the lines of "check whether the column data is larger than the new limit, and abort if so", because you never know when such data can be added while you're working on the database.
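A minimal sketch of that kind of guard (all names and numbers are illustrative, and the max length would really come from a query against the live table, not a scrubbed dump):

```shell
#!/bin/sh
# Hypothetical pre-migration guard: refuse to shrink a column when
# existing data exceeds the new limit, and treat a max length of 0 as
# "this dump was probably scrubbed", aborting instead of truncating.
check_shrink() {
    max_len=$1    # e.g. from: SELECT MAX(CHAR_LENGTH(col)) FROM tbl;
    new_limit=$2  # the proposed varchar(new_limit)
    if [ "$max_len" -eq 0 ]; then
        echo "abort: max length is 0, column may be scrubbed in this dump"
        return 1
    fi
    if [ "$max_len" -gt "$new_limit" ]; then
        echo "abort: data length ($max_len) exceeds new limit ($new_limit)"
        return 1
    fi
    echo "ok to shrink to varchar($new_limit)"
}

check_shrink 0 255 || true    # scrubbed dump: the guard trips
check_shrink 120 255          # prints: ok to shrink to varchar(255)
```

Run inside the migration transaction, a failing guard rolls the whole thing back instead of silently truncating user data.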
But this happened as described, a local only script that generated a list of columns to modify then a migration to execute the alter queries for all of them.
for db in `mysql [...] | grep [...]`; do
    mysqldump [...] > "$db.sql"
done
git commit -a -m "Automatic backup"
git push [backup server #1]
git push [backup server #2]
git push [backup server #3]
I also have a single-entry-point script that turns a blank Linux VM into a production/staging server. If your business is more than a hobby project and you're not doing something similar, you are sitting on a ticking time bomb.
You can always switch to a more specialized solution if the repository size starts bugging you, but don't fall into the trap of premature optimization.
Using a real database backup solution isn't a premature optimization, it's basic system administration.
Also, resetting the repository once every 1-2 years and keeping the old one around for a while is fine for smaller setups.
Depending on your business size and the amount of resources you want to allocate towards "basic system administration", accomplishing the same task with fewer tools could have its advantages.
This isn't just about size though. You're storing all customer data on all developer machines. You're just one stolen laptop away from your very own "we take the security of our customers' data very seriously" moment.
I still think that database size is not the only consideration.
The mysqldump command is tweaked to use individual INSERT statements as opposed to one bulk INSERT, so the diff hunks are smaller.
You can also sed out the mysqldump timestamp, so there are no commits when there are no database changes, saving git repo space.
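Both tweaks can also be done with stock mysqldump flags rather than sed (the database name here is made up):

```
# One INSERT per row -> small, stable diff hunks; no dump date -> an
# unchanged database produces a byte-identical dump, hence no commit.
mysqldump --skip-extended-insert --skip-dump-date mydb > mydb.sql
```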
During this operation the server ran out of memory (presumably because of all the files I'd created), and before I knew it I'd managed to crash 3 services and corrupt the database (which was also on this host) on my first day. All while everyone else in the company was asleep :)
Over the next few hours, I brought the site back online by piecing commands together from the `.bash_history` file.
I actually waited until nightfall just in case I bumped the server offline, because we had low traffic during those hours.
I was only there for a short time though. Hopefully they figured things out.
It can even fail while in the middle of a transaction commit.
So transactions won't fix this.
Edit: and another problem is disk drives/controller caches lying and reporting write completion when not all data has actually reached stable storage
Now... our scenario was such that we could NOT lose those 7 hours, because each customer record lost meant a $5000 USD penalty.
What saved us is that I knew about the oplog (binlog in MySQL), so after restoring the backup I isolated the lost N hours from the log and replayed them on the database.
Lesson learned and a lucky save.
No one owned up to it, but had a pretty good idea who it was.
That sounds like you're putting (some of) the blame on whoever misclicked. As opposed to everyone who has allowed this insanely dangerous situation to exist.
Not immediately calling up your boss to say "I fucked up big" is not a mistake, it is a conscious bad action.
> in the dropdown menu of the MongoDB browser, exit & drop database were next to each other
So maybe they signed off for the night without realizing anything was wrong.
As it's said before: he made a mistake. The error was allowing the prod database to be port-forwarded from a non-prod environment. As head of eng, that was MY error. So I owned up to it and we changed policies.
Nice that you were a person he felt ok with sharing the mistake with, I suppose that's an important part of being head of eng.
There are ways around it, of course, but it prevents the scenario described above.
I haven't seen this design in practice using MongoDB Atlas or Compass, but would hope for an "Are you really sure?" confirmation in an admin UI.
Obviously, somehow the script ran on the database host.
Some practices I've followed in the past to keep this kind of thing from happening:
* A script that deletes all the data can never be deployed to production.
* scripts that alter the DB rename tables/columns rather than dropping them (you write a matching rollback script), for at least one schema-upgrade cycle. You can always restore from backups, but this makes rollbacks quick when you spot a problem at deployment time.
* the number of people with access to the database in prod is severely restricted. I suppose this is obvious, so I'm curious how the particular chain of events in TFA happened.
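The rename-instead-of-drop idea from the list above, sketched in SQL (table and column names are made up):

```
-- Forward migration: keep the old column around for one upgrade cycle.
ALTER TABLE users RENAME COLUMN legacy_score TO legacy_score_deprecated;

-- Matching rollback script, kept ready, only run if the deploy fails:
ALTER TABLE users RENAME COLUMN legacy_score_deprecated TO legacy_score;
```

The deprecated column is actually dropped only in a later release, once nothing has needed the rollback.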
Unfortunately, the "wipe & recreate database" script, while dangerous, is very useful; it's a core part of most of my automated testing because automated testing wipes & recreates a lot.
More likely, I'd suspect, is something like an SSH tunnel with port forwarding was running, perhaps as part of another script.
I.e. if Alice is your senior DBA who would have full access to everything including deleting the main production database, then it does not mean that the user 'alice' should have the permission to execute 'drop database production' - if that needs to be done, she can temporarily escalate the permissions to do that (e.g. a separate account, or separate role added to the account and removed afterwards, etc).
Arguably, if your DB structure changes generally are deployed with some automated tools, then the everyday permissions of senior DBA/developer accounts in the production environment(s) should be read-only for diagnostics. If you need a structural change, make a migration and deploy it properly; if you need an urgent ad-hoc fix to data for some reason (which you hopefully shouldn't need to do very often), then do that temporary privilege elevation thing; perhaps it's just "symbolic" but it can't be done accidentally.
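In PostgreSQL terms, that everyday/read-only split with explicit escalation could be sketched like this (role names are illustrative; `pg_read_all_data` exists from PG 14):

```
-- Everyday login: read-only, and does NOT inherit granted roles.
CREATE ROLE alice LOGIN NOINHERIT;
GRANT pg_read_all_data TO alice;

-- The dangerous privileges live in a separate, non-login role.
CREATE ROLE dba_breakglass NOLOGIN;
GRANT ALL PRIVILEGES ON DATABASE production TO dba_breakglass;
GRANT dba_breakglass TO alice;

-- Destructive work now requires a deliberate, visible step per session:
SET ROLE dba_breakglass;
```

Because of NOINHERIT, `alice` can never drop anything by accident; she has to type the `SET ROLE` first, which is exactly the "symbolic but can't happen accidentally" barrier described above.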
And of those people, there should be an even fewer number with the "drop database" privilege on prod.
Also, from a first glance, it looks like using different database names and (especially!) credentials between the dev and prod environments would be a good idea too.