The CTO further demonstrates his ineptitude by firing the junior dev. Apparently he never heard the famous IBM story, and will surely live to repeat his mistakes:
After an employee made a mistake that cost the company $10 million, he walked into the office of Tom Watson, the C.E.O., expecting to get fired. “Fire you?” Mr. Watson asked. “I just spent $10 million educating you.”
- Access control 101. Seriously, this is pure incompetence. It is the equivalent of having the power cord to the Big Important Money Making Machine snaking across the office and under desks. If you can't be arsed to ensure that even basic measures are taken to avoid accidents, acting surprised when they happen is even more stupid.
- Sensible onboarding documentation. Why would prod access information be stuck in the "read this first" doc?
- Management 101. You just hired a green dev just out of college who has no idea how things are supposed to work. You just fired him in an incredibly nasty way for making an entirely predictable mistake that came about because of your lack of diligence at your job (see above).
Also, I have no idea what your culture looks like, but you just told all your reports that honest mistakes can be fatal and their manager's judgement resembles that of a petulant 14-year-old.
- Corporate Communications 101. Hindsight and all that, but it seems inevitable that this would lead to a social media trash fire. Congrats on embarrassing yourself and your company in an impressive way. On the bright side, this will last for about 15 minutes and then maybe three people will remember. Hopefully the folks at your next gig won't be among them.
My takeaway is that anyone involved in this might want to start polishing their resumes. The poor kid and the CTO for obvious reasons, and the rest of the devs, because good lord, that company sounds doomed.
Note that I'm not talking about the situation in this article. That was a ridiculous situation and they were just asking for trouble. I'm asking about the perception that is becoming more and more common, which is that no matter what mistakes you make you should still be given a free pass regardless of severity.
Is it the quantity of mistakes? Severity of mistakes? At what point does the calculus favor firing someone over retaining them?
In this case, it's not a matter of degree, it's a matter of responsibility. The junior dev is absolutely not the one responsible for the prod db getting blown away, and is absolutely not responsible for the backups not being adequate. As somebody else mentioned, this is like leaving the electric cable to the production system strewn across the office floor, then firing the person who happened to trip over it.
I agree somebody's job should be in jeopardy, especially if the backups weren't adequate: not for a single mistake, but for a long series of bad decisions and shoddy oversight that led to this.
I like this definition.
If someone is still expected to be learning, mistakes (even large ones) are to be expected. Incompetence has to be judged against reasonable expectations. In the case here, there was a series of pretty severe mistakes, but deleting the database isn't what I'm thinking of.
Protecting the database was the job of the more experienced employees of the company, and ultimately the CTO. Some combination of people at the company were negligent, and the absence of actions taken to protect their systems shows a pattern of irresponsible behavior.
In my opinion, mistakes should never be considered the person's fault. The development process should be designed to prevent human mistakes. If mistakes happen, that only means the process has been designed poorly.
World's worst on-boarding guide!
I know startups are a limit case, but we didn't bother to make those sorts of distinctions for this article, so it's worth considering.
- Unless you have full time DBAs, do use a managed db like RDS, so you don't have to worry about whether you've set up the backups correctly. Saving a few bucks here is incredibly shortsighted: your database is probably the most valuable asset you have. RDS allows point-in-time restore of your DB instance to any second during your retention period, up to the last five minutes. That will make you sleep better at night.
- Separate your prod and dev AWS accounts entirely. It doesn't cost you anything (in fact, you get 2x the AWS free tier benefit, score!), and it's also a big help in monitoring your cloud spend later on. Everyone, including the junior dev, should have full access to the dev environment. Fewer people should have prod access (everything devs may need for day-to-day work like logs should be streamed to some other accessible system, like Splunk or Loggly). Assuming a prod context should always require an additional step for those with access, and the separate AWS account provides that bit of friction.
- The prod RDS security group should only allow traffic from whitelisted security groups also in the prod environment. For those really requiring a connection to the prod DB, it is therefore always a two-step process: local -> prod host -> prod db. But carefully consider why you are even doing this in the first place. If you find yourself doing this often, perhaps you need more internal tooling (like an admin interface, again behind a whitelisting SG).
- Use a discovery service for the prod resources. One of the simplest methods is just to set up a Route 53 Private Hosted Zone in the prod account, which takes about a minute. Create an alias entry like "db.prod.private" pointing to the RDS and use that in all configurations. Except for the Route 53 record, the actual address for your DB should not appear anywhere. Even if everything else goes sideways, and you've assumed a prod context locally by mistake and run some tool that is pointed to the prod config, the address doesn't resolve in a local context.
> - Unless you have full time DBAs, do use a managed db like RDS, so you don't have to worry about whether you've set up the backups correctly.
The real way to not worry about whether you've set up backups correctly is to set up the backups, and actually try and document the recovery procedure. Over the last 30 years I've seen situations beyond counting of nasty surprises when people actually try to restore their backups during emergencies. Hopefully checking the "yes back this up" checkbox on RDS covers you, but actually following the recovery procedure and checking the results is the only way to not have some lingering worry.
In this particular example, there might be lingering surprises like part of the data might be in other databases, storage facilities like S3 that don't have backups in sync with the primary backup, or caches and queues that need to be reset as part of the recovery procedure.
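As a rough illustration of "actually following the recovery procedure and checking the results", here is a minimal sketch in Python. SQLite stands in for the real database purely so the example is self-contained; the function name `backup_and_verify` and the row-count check are assumptions for illustration, not anything RDS provides. A real drill would restore into a scratch instance and run application-level checks as well.

```python
import sqlite3


def backup_and_verify(source: sqlite3.Connection, backup_path: str, table: str) -> bool:
    """Copy the database to backup_path, then reopen the copy and compare row counts."""
    # Take the backup using sqlite3's online backup API.
    target = sqlite3.connect(backup_path)
    try:
        source.backup(target)
    finally:
        target.close()

    # "Restore" by opening the backup file fresh, as you would in an emergency.
    restored = sqlite3.connect(backup_path)
    try:
        query = f"SELECT COUNT(*) FROM {table}"  # table name is trusted input here
        expected = source.execute(query).fetchone()[0]
        actual = restored.execute(query).fetchone()[0]
    finally:
        restored.close()
    return expected == actual
```

A row count is the weakest useful check; it only proves the backup file opens and is not empty or truncated.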
A lot of people pay the tax and never even try the lux.
"Prove it. If you don’t try it it doesn’t work. Backups and restores are continually tested to verify they work"
We've made things so reports and similar read-only queries can be done from properly firewalled/authenticated/sandboxed web interfaces, and write queries get done by migrations. It's very rare that we need to write to the database directly and not via some sort of admin interface like Django's admin, which makes it very hard to do bulk deletions (it will very clearly warn you).
I absolutely do. "Wrong terminal", "Wrong database", etc. mistakes are very easy to make in certain contexts.
The trick is to find circuit-breakers that work for you. Some of the above is probably overkill for one-person shops. You want some sort of safeguard at the same points, but not necessarily the same type.
This doesn't really do it for me, but one person I know uses iTerm configured to change terminal colors depending on machine, EUID, etc. as a way of avoiding mistakes. That works for him. I do tend to place heavier-weight restrictions, because they usually overlap with security and I'm a bit paranoid by nature and prefer explicit rules for these things to looser setups. Also, I don't use RDS.
I'd recommend looking at what sort of mistakes you've made in the past and how to adjust your workflow to add circuit breakers where needed. Then, if you need to, supplement that.
Except for the advice about backups and PITR. Do that. Also, if you're not, use version control for non-DB assets and config!
If you have an engineer that goes through that and shows real remorse, you're going to have someone who's never going to make that mistake (or similar ones) again.
So, we added a "roadblock" post auth with 2 actions: log out other sessions and log out this session.
Well, the db query for the first action (log out other sessions) was missing a where clause...a user_id!
Tickets started pouring in saying users were logged out and didn't know why. Luckily the on-call dev knew there was a recent release and was able to identify the missing where clause and added it within the hour.
The feature made it through code review, so the team acknowledged that everyone was at fault. Instead of being reprimanded, we decided to revamp our code review process.
I never made that kind of mistake again. To this day, I'm a little paranoid about update/delete queries.
1) Always have a USE statement (or equivalent);
2) Always start UPDATE or DELETE queries by writing them as SELECT;
3) Get in the habit of writing the WHERE clause first;
4) If your SQL authoring environment supports the dangerous and seductive feature where you can select some text in the window and then run only that selected text — beware! and
5) While developing a query to manipulate real data, consider topping the whole thing with BEGIN TRANSACTION (or equivalent), with both COMMIT and ROLLBACK at the end, both commented out (this is the one case where I use the run-selected-area feature: after evaluating results, select either the COMMIT or the ROLLBACK, and run-selected).
Not all of these apply to writing queries that will live in an application, and I don't do all these things all the time — but I try to take this stance when approaching writing meaningful SQL.
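Habits 2 and 3 from the list above (write the statement as a SELECT first, and write the WHERE clause first) can be sketched roughly like this. It is a minimal illustration using Python's built-in `sqlite3`; the helper name `preview_then_delete` and the `confirm` callback are made up for the example, and the table and WHERE text are assumed to come from the developer, not from user input.

```python
import sqlite3


def preview_then_delete(conn, table, where, params, confirm):
    """Run the WHERE clause as a SELECT first; delete only if confirm(count) is true."""
    # Same WHERE clause and parameters are used for the preview and the delete.
    count = conn.execute(
        f"SELECT COUNT(*) FROM {table} WHERE {where}", params
    ).fetchone()[0]
    if not confirm(count):
        return 0  # nothing was deleted
    cur = conn.execute(f"DELETE FROM {table} WHERE {where}", params)
    conn.commit()
    return cur.rowcount
```

Because the WHERE clause is a required argument, there is no way to call this helper without one, which is the point of habit 3.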
I will note that depending on your DB settings and systems, if you leave the transaction "hanging" without rolling back or committing, it can lock up that table in the DB for certain accessors. This is only for settings with high levels of isolation such as SERIALIZABLE, however. I believe if you're set to READ_UNCOMMITTED (this is SQL Server), you can happily leave as many hanging transactions as you want.
On that point, I'd love a database or SQL spec extension that provided a 'dry-run' mode for UPDATE or DELETE and which would report how many rows would be impacted.
123338817 rows will be affected
Oooops made an error!
I mean, if your DB supports transactions, you are in luck.
Start a transaction (that may be different among vendors - BEGIN TRANSACTION or SET AUTOCOMMIT=0 etc) and run your stuff.
If everything looks good, commit.
If you get an "Oooops" number of records, just roll back.
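The transaction approach can stand in for a dry-run mode. A minimal sketch, again with Python's `sqlite3` as a stand-in database: the helper `execute_with_row_limit` and its `max_rows` parameter are hypothetical names, and how transactions start differs between vendors, as noted above.

```python
import sqlite3


def execute_with_row_limit(conn, sql, params, max_rows):
    """Run a write statement in a transaction; roll back if it touches too many rows."""
    # With default settings, sqlite3 opens a transaction implicitly before
    # INSERT/UPDATE/DELETE, so nothing is permanent until commit().
    cur = conn.execute(sql, params)
    affected = cur.rowcount
    if affected > max_rows:
        conn.rollback()
        raise RuntimeError(
            f"{affected} rows would be affected (limit {max_rows}); rolled back"
        )
    conn.commit()
    return affected
```

The caller states how many rows they expect to touch, and anything above that is undone before it becomes visible to anyone else.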
There is even a song in Spanish about forgetting to add a WHERE in a DELETE: https://www.youtube.com/watch?v=i_cVJgIz_Cs
Raises questions about deployment. Didn't the on-call have a previous build they could roll back to? Customers shouldn't have been left with an issue while someone tried to hunt down the bug (which "luckily" they located); instead the first step should have been a rollback to a known good build and then the bug tracked down before a re-release (e.g. review all changesets in the latest).
Oops. That was missing a 'WHERE user_id=X'. Did not lose the client at the time (this was 15+ years ago), but that was a rough month. Haven't made that mistake again. It happens to all of us at some point though.
Could've had something like 'UPDATE ALL' required for queries without filtering.
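Short of changing the SQL spec, an application can approximate this with a check before the statement is sent. The following is a deliberately naive sketch: `check_filtered` and its `allow_all` flag are invented names, and the test is a plain text match, not a SQL parser, so a WHERE inside a string literal or subquery would fool it.

```python
import re

_WRITE = re.compile(r"^\s*(UPDATE|DELETE)\b", re.IGNORECASE)
_WHERE = re.compile(r"\bWHERE\b", re.IGNORECASE)


def check_filtered(sql: str, allow_all: bool = False) -> str:
    """Return sql unchanged, or raise if it is an UPDATE/DELETE with no WHERE clause."""
    if _WRITE.match(sql) and not _WHERE.search(sql) and not allow_all:
        raise ValueError("UPDATE/DELETE without WHERE; pass allow_all=True if intended")
    return sql
```

Passing `allow_all=True` plays the role of the explicit 'UPDATE ALL': touching every row is still possible, but only on purpose.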
We all assume that code (or a feature) that is not tested is broken.
a) The culture was very forgiving of honest mistakes; they were seen as a learning opportunity.
b) Posting a synopsis of your cockup made it easier for others to avoid the same mistake while we were busy making sure it would be impossible to repeat it in the future; also, it got us thinking of other, related failure modes.
c) My oh my was it entertaining! Pretty much the very definition of edutainment, methinks.
My only gripe with it was that I never made the honor roll...
Note to self: be very careful using Ziegler-Nichols to tune multi-kHp-systems. Very careful. Cough.
I lost a whole weekend of sleep in recovering that one from logs, and that was when I learned some good tricks for ensuring recoverability....
Sounds like an incompetent set of people running the production server.
A lot of people think that naming a folder on a local drive "Disaster Recovery" counts as having an offsite Disaster Recovery copy of backups. The number of large corporations whose backups are in the hands of such people is frightening.
This shop sounds like a raging tire fire of negligence.
Sure, but having the plaintext credentials for a readily-deletable prod db as an example before you instruct someone to wipe the db doesn't salvage competence very much.
Our CTO thought he was on his local dev box, and was frustrated that "something" was keeping him from clearing out his testing DB.
Did I get a medal for that? No. Nobody wanted to talk about it ever again.
Yesterday, I thought I was on my local machine and cleared the database, while I was in fact on the production server.
Luckily knodi123 caught it and killed the delete process. This is a reminder that *anybody* can make mistakes,
so I want to set up some process to make sure this can't happen, but meanwhile I would like to thank knodi123.
Lots of folks here are saying they should have fired the CTO or the DBA or the person who wrote the doc instead of the new dev. Let me offer a counter point. Not that it will happen here ;)
They should have run a post mortem. The idea behind it should be to understand the processes that led to a situation where this incident could happen. Gather stories, understand how things came to be.
With this information, folks can then address the issues. Maybe it shows that there is a serially incompetent individual who needs to be let go. Or maybe it shows a house of cards with each card placement making sense at the time and it is time for new, better processes and an audit of other systems.
The point is that this is a massive learning opportunity for all those involved. The dev should not have been fired. The CTO should not have lost his shit. The DB should have regularly tested backups. Permissions and access need to be updated. Docs should be updated to not have sensitive information. The dev does need to contact the company to arrange surrender of the laptop. The dev should document everything just in case. The dev should have a beer with friends and relax for the weekend and get back on the job hunt next week. Later, laugh and tell of the time you destroyed prod on your first day (and what you learned from it).
1. The CTO. As the one in charge of the tech, he allowed the loss of critical data. If anyone should be fired, it's the CTO. And firing this guy apparently will have the greatest positive impact on the company, assuming they can hire a better one. I think given how stupid this CTO is, that should be straightforward.
2. The executives who hired the CTO. People hire people similar to themselves; it seems the executive team is clueless about what kind of skills a CTO should have. These people will continue to fail the dev team by hiring incompetent people, or force them to work in a way that causes problems.
3. Senior devs in the team. Obviously these people did not test what they wrote. If anyone had ever dry-run the training doc, they would have prevented the catastrophe. It's a must-do in today's dev environment. The normal standard is to write automatic tests for everything though.
This junior dev is the only one who should not be fired...
The fact is, this kind of scenario is extremely common. Most companies I have worked for have the production database accessible from the office. It's a very obvious "no no", but it's typical to see this at small to medium sized companies. The focus is on rushing every new feature under the sun, and infrastructure is only looked at if something blows up.
Nobody should have been fired. Not the developer, not the senior devs, not the sysadmins, and not the CTO. This should have been nothing more than a wake-up call to improve the infrastructure. That's it. The only blame here lies with the CTO - not for the event having taken place, but only because their immediate reaction was to throw the developer under the bus. A decent CTO would have simply said "oh shit guys, this can't happen again. please fix it". If the other executives can't understand that sometimes shit happens, and that a hammer doesn't need to be dropped on anyone, then they're not qualified to be running a business.
His reaction shows that he is the number one person to fire, and for good reason.
What you said is true, but does not matter. The CTO has already shown that he is clueless...
As my mother said, if you put a good apple with bad apples, it's not the bad apples that become good.
You run a post mortem when you're back and running again. They may never be back and running again.
>They put full access plaintext credentials for their production database in their tutorial documentation
WHAT THE HELL. Wow. I'd be shocked at that sort of thing being written out in a non-secure setting, like, anywhere, at all, never mind in freaking documentation. Making sure examples in documentation are never real and will hard fail if anyone tries to use them directly is not some new idea; heck, there's an entire IETF RFC (#2606) devoted to reserving TLDs specifically for testing and example usage. Just mind blowing, and yeah there are plenty of WTFs there that have already been commented on in terms of backups, general authentication, etc. But even above all that, if those credentials had full access then "merely" having their entire db deleted might even have been a good case scenario vs having the entire thing stolen, which seems quite likely if their auth is nothing more than a name/pass and they're letting credentials float around like that.
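For what it's worth, checking that documentation only ever uses reserved names is easy to automate. A small sketch, assuming you have already pulled the hostnames out of your docs: RFC 2606 reserves the TLDs `.test`, `.example`, `.invalid` and `.localhost`, plus `example.com`, `example.net` and `example.org`; the function name `is_reserved_example_host` is made up for illustration.

```python
RESERVED_TLDS = {"test", "example", "invalid", "localhost"}
RESERVED_DOMAINS = {"example.com", "example.net", "example.org"}


def is_reserved_example_host(host: str) -> bool:
    """True if host falls under a name reserved by RFC 2606 for testing or documentation."""
    labels = host.lower().rstrip(".").split(".")
    # A reserved top-level label covers everything beneath it.
    if labels[-1] in RESERVED_TLDS:
        return True
    # Otherwise only the three reserved second-level domains qualify.
    return ".".join(labels[-2:]) in RESERVED_DOMAINS
```

A docs lint step could fail the build on any hostname for which this returns False.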
It's a real bummer this guy had such an utterly awful first day on a first job, particularly since he said he engaged in a huge move and sunk quite a bit of personal capital from the sound of it in taking that job. At the same time that sounds like a pretty shocking place to work and it might have taught a ton of bad habits. I don't think it's salvageable but I'm not even sure he should try, they likely had every right to fire him but threatening him at all with "legal" for that is very unprofessional and dickish. I hope he'll be able to bounce back and actually end up in a much better position a decade down the line, having some unusually strong caution and extra care baked into him at a very junior level.
It's depressing how many companies blindly throw unencrypted credentials around like this.
We have a password sheet. You have to be on the VPN(login/password). Then you can log in. Login/Password(diff from above)/2nd password+OTP. Then a password sheet password.
I'm still rooting out passwords from our repo with goobers putting creds in source code (yeah, not config files....grrrrr). But I attack them as I find them. I've only found 1 root password for a DB in there... and thankfully changed!
We suffer from NIH greatly. We end up rolling our own stuff because either we don't trust 3rd party stuff, or it doesn't work in our infrastructure. In this case, multiple access locks with GPG is sufficient.
I agree that a multi-recipient GPG protected file is sufficient for a small org. In fact, that's how I used to do it circa 2011. We found it worked quite well - we committed the GPG protected files to a version control system (git) and used githooks to make sure that only encrypted files were permitted, preventing users from intentionally/accidentally defeating gitignore.
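The hook itself isn't shown here, so the following is only a guess at the kind of check such a githook might run. It assumes the secrets are stored ASCII-armored (as produced by `gpg --armor`), in which case each file begins with the standard `-----BEGIN PGP MESSAGE-----` header; binary OpenPGP output would need a different test. The function names are hypothetical.

```python
from pathlib import Path

ARMOR_HEADER = b"-----BEGIN PGP MESSAGE-----"


def looks_gpg_armored(path: Path) -> bool:
    """True if the file starts with the ASCII-armor header of an OpenPGP message."""
    with open(path, "rb") as f:
        return f.read(len(ARMOR_HEADER)) == ARMOR_HEADER


def find_unencrypted(paths):
    """Return the paths that do not look encrypted; a hook would exit non-zero if any."""
    return [p for p in paths if not looks_gpg_armored(Path(p))]
```

A pre-commit script would pass in the staged files under the secrets directory and refuse the commit if the returned list is non-empty.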
Either that or this is a "Worst fuckup on the first day on job" fantasy piece - I refuse to acknowledge living in the world where alternatives have any meaningful non-zero probability of occurring.
And then it takes only one shitty manager, or manager in a bad mood, to get the innocent junior dev fired.
Not part of this story, but another pet peeve of mine is when I see scripts checking strings like "if env = 'test' else <runs against prod>". This sets up another WTF situation: if someone typos 'test', now the script hits prod.
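One way to avoid that trap is to make every environment an explicit, known value and have anything unrecognised fail instead of falling through to prod. A minimal sketch; the `Env` enum, the helper names and the hostnames (reserved example names) are all placeholders for illustration.

```python
from enum import Enum


class Env(Enum):
    TEST = "test"
    PROD = "prod"


def parse_env(raw: str) -> Env:
    """Map a string to a known environment; unknown values raise instead of meaning prod."""
    try:
        return Env(raw.strip().lower())
    except ValueError:
        raise ValueError(f"unknown environment {raw!r}; refusing to guess") from None


def database_host(env: Env) -> str:
    # Hostnames are placeholders using reserved example names.
    hosts = {Env.TEST: "db.test.example", Env.PROD: "db.prod.example"}
    return hosts[env]
```

With this shape, a typo like 'tset' stops the script with an error; prod is only reachable by spelling out 'prod'.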
but yeah - I agree with you...
No one goes out of their way to screw up; I'd recommend making it easier to recognize when you've made a mistake, and recover from it.
Except for critical business stuff, that needs severe "you cannot fuck this up" safeguards.
I wonder whether that single difference - blame the person vs fix the system/tools predicts the failure or success of an enterprise?
This is the sort of situation that makes for a great conference talk on how companies react to disaster, and how the lessons learned can either set the company up for robust dependable systems or a series of cascading failures.
Unfortunately, the original junior dev was living the failure case. Fortunately, he has learned early in his career that he doesn't want to work for a company that blames the messenger.
Neither is providing proof, so asking one party for proof and not the other is obviously absurd.
I don't know if I should laugh or cry here.
* new dev should only have full access to local and dev envs, no write access to prod
* you're backing up all of your databases, right? Nightly is mandatory, hourly is better
* if you don't have a DBA, use RDS
That'll prevent the majority of weapons grade mistakes.
Source: 15 years in ops
Make changes to Stage after doing a "Change Management" process (effectively, document every change you plan to do, so that an average person typing them out would succeed). Test these changes. It's nicer if you have a test suite, but you won't at first.
Once testing is done and considered good, then make changes in accordance to the CM on prod. Make sure everything has a back-out procedure, even if it is "Drive to get backups, and restore". But most of these should be, "Copy config to /root/configs/$service/$date , then proceed to edit the live config". Backing out would entail restoring the backed-up config.
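The "copy the config, then edit" back-out step is simple enough to script so nobody skips it. A minimal sketch following the `$service/$date` layout described above; the function names are made up, and the backup root is a parameter so the example does not assume `/root/configs` exists.

```python
import shutil
from datetime import date
from pathlib import Path


def backup_config(config_path: str, backup_root: str, service: str) -> Path:
    """Copy config_path to backup_root/service/YYYY-MM-DD/ and return the copy's path."""
    dest_dir = Path(backup_root) / service / date.today().isoformat()
    dest_dir.mkdir(parents=True, exist_ok=True)
    dest = dest_dir / Path(config_path).name
    shutil.copy2(config_path, dest)  # copy2 also preserves timestamps and mode bits
    return dest


def back_out(backup_copy: Path, config_path: str) -> None:
    """Restore the saved copy over the live config."""
    shutil.copy2(backup_copy, config_path)
```

Note that a second backup on the same day overwrites the first, so a real version might add a time or sequence number to the directory name.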
Edit: As an addendum, many places too small usually have insufficient, non-existent, or schrodinger-backups. Having a healthy living stage environment does 2 things:
1. You can stage changes so you don't get caught with your pants down on prod, and
2. It is a hot-swap for prod in the case Prod catches fire.
In all likelihood, "All" of prod wouldn't DIAF, but maybe a machine that houses the DB has power issues with their PSU's and fries the motherboard. You at least have a hot machine, even if it's stale data from yesterday's imported snapshot.
Ideally, you want test->stage->prod , with puppet and ansible running on a VM fabric. Changes are made to stage and prod areas of puppet, with configuration management being backed by GIT or SVN or whatever for version control. Puppet manifests can be made and submitted to version control, with a guarantee that if you can code it, you know what you're doing. Ansible's there to run one-off commands, like reloading puppet (default is checkins every 30min)
And to be even more safe, you have hot backups in prod. Anything that runs in a critical environment can have hot backups or otherwise use HAproxy. For small instances, even things like DRBD can be a great help. Even MySQL, Postgres, Mongo and friends all support master/slave or sharding.
Generally, get more machines running the production dataset and tools, so if shit goes down, you have: machines ready to handle load, backup machines able to handle some load, full data backups onsite, and full data backups offsite. And the real ideal is that the data is available on 2 or 3 cloud compute platforms, so absolute worst case scenario occurs and you can spin up VM's on AWS or GCE or Azure.
--Our solution for Mongo is ridiculous, but the best for backing up. The Mongo backup util doesn't guarantee any sort of consistency, so either you lock the whole DB (!) or you have the DB change underneath you while you back it up... So we do LVM snapshots on the filesystem layer and back those up. It's ridiculous that Mongo doesn't have this kind of transactional backup apparatus. But we needed time-series data storage. And mongodb was pretty much it.
You never don't have a DBA. If you don't know who it is, it's you! But there will always, always be someone who is held responsible for the security, integrity and availability of the company's asset.
E.g. when deciding how to back up your prod database, if you're thinking as both "personas" you'll come up with a strategy that safely backs up the database but also makes it easy for a non-privileged user to securely suck down a (optionally sanitised) version of the latest snapshot for seeding their dev environment with [ and then dog food this by using the process to seed your own dev environment ].
Some other quick & easy things:
- Design your terraform/ansible/whatever scripts such that anything touching sensitive parts needs out of band ssh keys or credentials. E.g. if you have a terraform script that brings up your prod environment on AWS, make sure that the aws credentials file it needs isn't auto-provisioned alongside the script. Instead write down on a wiki somewhere that the team member (at the moment, you) who has authority to run that terraform script needs to manually drop his AWS credentials file in directory /x/y/z before running the script. Same goes for ansible: control and limit which ssh keys can login to different machines (don't use a single "devops" key that everyone shares and imports in to their keychains!). Think about what you'll need to do if a particular key gets compromised or a person leaves the team.
- Make sure your backups are working, taken regularly, stored in multiple places and encrypted before they leave the box being backed up. Borgbackup and rsync.net are a nice, easy solution for this.
- Make sure you test your backups!
- Don't check passwords/credentials in to source code without first encrypting them.
- Use sane, safe defaults in all scripts. Like another poster mentioned, don't do if env != "test"; do prod_stuff();
- RTFM and spend the extra 20 minutes to set things up correctly and securely rather than walking away the second you've got something "working" (thinking 'I'll come back later to tidy this up' - you never will).
- Follow the usual security guidelines: firewall machines (internally and externally), limit privileges, keep packages up to date, layer your defences, use a bastion machine for access to your hosted infrastructure
- Get in the habit of documenting shit so you can quickly put together a straight forward on-boarding/ops wiki in a few days if you suddenly do hire a junior dev (or just to help yourself when you're scratching your head 6 months later wondering why you did something a certain way)
It turns out their production and QA database instances shared the same credentials, and one day somebody pointed a script that initializes the QA instances (truncate all tables, insert some dummy data) at the production master. Those TRUNCATE TABLE statements replicated to all their DB replicas, and within a few minutes their entire production DB cluster was completely hosed.
Their data thankfully still existed inside the InnoDB files on disk, but all the organizational metadata data was gone. I spent a week of 12 hour days working with folks from Percona to recover the data from the ibdata files. The old backup was of no use to us since it was several months old, but it was helpful in that it provided us a mapping of the old table names to their InnoDB tablespace ids, a mapping destroyed by the TRUNCATE TABLE statements.
(Well, then it was "land of the free, blacks excepted")
I knew I was being brought in to rearchitect the entire development process for an IT department and that I would make architectural mistakes no matter how careful I was and that I would probably make mistakes that would have to be explained to CxOs.
Whatever the answer he gave me, I remember being satisfied with it.
"The server has been down all day, and you are the only one who hasn't noticed. What did you break?"
"Well, I saw that all the files were moved to `/var/www/`, and figured it was on purpose."
Suffice it to say, I got that business to go from Filezilla as root to BitBucket with git and some update scripts.
Day 1 is when you setup your desk and get your login. Then go back to HR to do the last hiring paperwork.
It should take a good week before a new employee is able to fuck up anything. Really.
If you introduce a junior dev into this environment, then it's him that is going to discover those pitfalls, in the most catastrophic ways possible. But even experienced developers can blunder into pitfalls. At least twice I've accidentally deployed to production, or otherwise run a powerful command intended to be used in a development environment on production.
Each time, I carefully analyzed the steps that led up to running that command and implemented safety checks to keep that from happening again. I put all of my configuration into a single environment file so I see with a glance the state of my shop. I make little tweaks to the project all the time to maintain this, which can be difficult because the three devs on the project work in slightly different ways and the codebase has to be able to accommodate all of us.
While this is all well and good, my project has a positively decadent level of funding. I can lavish all the time I want in making my shop nice and pretty and safe.
A growing business concern can not afford to hire junior devs fresh out of code school / college. That's the real problem here. Not the CTO's incompetence, any new-ish CTO in a startup is going to be incompetent.
The startup simply hired too fast.
Seniority is what makes you not put the damn password into the setup document. That was an inexperienced-level mistake. Forgetting to replace it while you are setting up a day-one machine is a mistake that can happen to anyone.
If a shop is being held together with duct tape and elbow grease, then you should have known that going in, and developed personal habits to avoid this sort of thing. Being overworked and tired isn't an excuse. Sure, the company and investors have to bear the real consequences, but you as an IC can't disclaim responsibility.
After all, if the junior dev could do it, so can everybody else (and whoever manages to get their account credentials).
From Sr dev/lead dev, dev manager, architect, ops stack, all the directors, A/S/VPs, and finally the CTO. You could even blame the CEO for not knowing how to manage or qualify a CTO. Even more embarrassing is if your company is a tech company.
I think a proper due diligence would find the fault in the existing company.
It is not secure to give production access and passwords to a junior dev. And if you do, you put controls in place. I think if there is insurance in place some of the requirements would have to be reasonable access controls.
This company might find itself sued by customers for their prior and obviously premeditated negligence from lack of access controls (the doc, the fact they told you 'how' to handle the doc).
But figuring out who to blame is toxic. You've got to go for a blameless culture and instead focus on post mortems and following new and better processes.
Things can absolutely always go to shit no matter where you work or how stupidly they went to shit. What differentiates good companies from bad ones is whether they try to maximize the learning from the incident or not.
It was the second day, and I only wiped out a column from a table, but it was enough to bring business for several hundred people down for a few hours. It was embarrassing all round really. Live and learn though - at least I didn't get fired!
But the junior dev is not fully innocent either: he should have been careful about following instructions.
For extra points (to prove that he is a good developer), he should have caught the screw-up with the username/password in the instructions. Here's the approximate line of reasoning:
What is the username in the instructions responsible for? The production environment? Then what would happen if I actually ran my setup script against the production environment? Would the production database be wiped? Shouldn't we update the setup instructions and some other practices to make sure that can't happen by accident?
But it is very unlikely that this junior dev would be legally responsible for the screw-up.
A mentor was supervising me and continually told me to work slower but I was doing great performing some minor maintenance on a Clipper application and didn't even need his "stupid" help ... until I typed 'del .db' instead of 'del .bak'. Oooops!
Luckily the woman whose computer we were working on clicked 'Backup my data' every single day before going home, bless her heart, and we could copy the database back from a backup folder. A 16 year old me was left utterly embarrassed and cured of flaunting his 1337 skillz.
I could see having a backup that is hours old, and losing many hours of data, but not everything.
A company with 40 devs and over 100 employees that lost an entire production db would have surfaced here from the downtime. Other devs would corroborate the story.
I'm surprised a junior dev on his first day isn't buddied up with an existing team member.
In my line of work, an existing employee who transferred from another location would probably be thrown in at the deep end, but someone who is new would spend some time working alongside someone who is friendly and knowledgeable. This seems the decent thing to do as humans.
There is still a valuable lesson for the developer here though - double check everything, and don't screw up. Over the course of a programming career, there will be times when you're operating directly on a production environment, and one misstep can spell disaster - meaning you need to follow plans and instructions precisely.
Setting up your development environment on your first day shouldn't be one of those times, but those times do exist. Over the course of a job or career at a stable company, it's generally not the "rockstar" developers and risk-takers that get ahead, it's the slow and steady people that take the extra time and never mess up.
Although firing this guy seems really harsh, especially as he had just moved and everything, the thought process of the company was probably not so much that he messed up the database that day, but that they'd never be able to trust him with actual production work down the line.
> Over the course of a programming career, there will be times when you're operating directly on a production environment, and one misstep can spell disaster
These times should be extremely rare, and even in this case, they should've had backups that worked. The idea is to reduce the ability of anyone to destroy the system, not to "just be extra careful when doing something that could destroy the system."
> Although firing this guy seems really harsh, especially as he had just moved and everything, the thought process of the company was probably not so much that he messed up the database that day, but that they'd never be able to trust him with actual production work down the line.
Which tells me that this company will have issues again. Look at any high functioning high risk environment and look at the way they handle accidents, especially in manufacturing. You need to look at the overarching system that enabled this behavior, not isolate it down to the single person who happened to be the guy to make the mistake today. If someone has a long track record of constantly fucking up, yeah sure, maybe it's time for them to move on, but it's very easy to see how anyone could make this mistake and so the solution needs to be to fix the system not the individual.
In fact I'd even thank the individual in this case for pointing out a disastrous flaw in the processes today rather than tomorrow, when it's one more day's worth of expensiveness to fix.
Take a look at this:
All I'm saying is that there are times when it is vital to get things right. Maybe it's only once every 5 or 10 years in a DR scenario, but those times do exist. Definitely this company is incompetent, deserves to go out of business, and the developer did himself a favor by not working there long-term, although the mechanism wasn't ideal.
I'm just saying that the blame is about 99.9% with the company, and 0.1% for the developer - there is still a lesson here for the developer - i.e., take care when executing instructions, and don't rely on other people to have gotten everything right and to have made it impossible for you to mess up. I don't see it as 100% and 0%, and arguing that the developer is 0% responsible denies them a learning opportunity.
"Double check everything" is a good lesson, because we all can and should practice it.
"Don't screw up" is not useful advice because it's impossible. There's a reason we don't work like that... Who needs backups? Just don't screw anything up! Staging environment? Bah, just don't screw up deployments! Restricted root access? Nah, just never type the wrong command. No, we need systems that mitigate humans screwing up, because it will happen.
I think that they simply acted emotionally and out of fear, anger and stress. The vague legal threat and otherwise ignoring this dude both suggest it. The way events unfolded, it does not sound like much rational thinking was involved.
If I didn't read it wrong, they wrote the production DB credentials into the first-day local dev environment instructions! WTF.
This CTO sounds to me even worse than this junior developer.
<off-topic>A wonderful example of the shortcoming of the singular "they"... :P
Sometimes when I read stories like these, I think it's no wonder a company like WhatsApp can have a billion customers with less than 100 employees. And then I make some backups to get that cozy safe feeling again.
However, you can abdicate a responsibility. One can say for example that the CTO abdicated his responsibility to ensure the production database was protected and backed up.
Edit: not sure why I'm being down voted. I am correct: https://www.merriam-webster.com/dictionary/abdicate
It is, but it need not be. It's pretty easy to set up at least some backup system in such a way that whoever can wipe the production systems can't also wipe the back-up.
The CTO needs to put on a bit of theatre to avoid getting fired, because he is ultimately responsible.
I also had production access to everything from day one. The first thing I did was set up the hosts files on the various dev servers, including my own computer, so I couldn't access the databases from them.
I have to remote into another computer to access them.
Perhaps with hipster databases like MongoDB that are insecure out of the box, but most grown-up, sensible DBs have the concept of read-only users, and also it is trivial to set up such that you can e.g. DELETE data in a table without being able to DROP that table.
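To make the DELETE-but-not-DROP idea concrete, here is a minimal sketch. It is an analogy rather than the real thing: on a server database you would do this with user accounts and GRANTs, but SQLite has no users, so this uses Python's `sqlite3` authorizer hook as a stand-in. The table and the `deny_drop` function are made up for illustration.

```python
import sqlite3

def deny_drop(action, arg1, arg2, db_name, source):
    # Refuse DROP TABLE; allow every other action, including DELETE.
    if action == sqlite3.SQLITE_DROP_TABLE:
        return sqlite3.SQLITE_DENY
    return sqlite3.SQLITE_OK

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT)")
conn.executemany("INSERT INTO users (name) VALUES (?)", [("ann",), ("bob",)])

# From here on, the connection behaves like a restricted account.
conn.set_authorizer(deny_drop)

# Deleting rows is still permitted.
conn.execute("DELETE FROM users WHERE name = 'ann'")
remaining = conn.execute("SELECT COUNT(*) FROM users").fetchone()[0]

# Dropping the table is rejected by the authorizer.
try:
    conn.execute("DROP TABLE users")
    drop_blocked = False
except sqlite3.DatabaseError:
    drop_blocked = True
```

The point is only that "can remove rows" and "can destroy the table" are separate permissions, and a sensible setup hands out the first far more freely than the second.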
I'll wager any startup that does like you say has devs that do SELECT * FROM TABLE; on a 20-col, million+ row table only actually wanting one value from one row, but they filter in their own code... Yes I have seen this more times than I care to count.
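For anyone who hasn't seen the pattern, a small sketch of the difference, using an in-memory SQLite table. The `accounts` table and its columns are invented for the example; a real table would be far wider and larger.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE accounts (id INTEGER PRIMARY KEY, email TEXT, plan TEXT)")
conn.executemany(
    "INSERT INTO accounts (id, email, plan) VALUES (?, ?, ?)",
    [(i, f"user{i}@example.com", "free") for i in range(1, 1001)],
)

# Wasteful: pull every column of every row, then filter in application code.
all_rows = conn.execute("SELECT * FROM accounts").fetchall()
email_slow = next(row[1] for row in all_rows if row[0] == 42)

# Targeted: ask the database for the single value that is actually needed.
email_fast = conn.execute(
    "SELECT email FROM accounts WHERE id = ?", (42,)
).fetchone()[0]
```

Both give the same answer, but the first one ships the whole table over the wire to get it.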
But not in the commands to run in a local-dev-setup-guide that purges the db it points to.
If anyone should be fired for that, it's the CTO. He must suck at his job at the very least, and junior dev should get an apology.
Not focusing on devops and putting your prod db credentials in plain sight are VERY different things. It's really really easy to do, especially if you are using Heroku or something like that. Same thing goes with database backups. I worked at multiple startups (YC and non) and they all had the basics nailed down, even when they were just 3-5 employees.
Also, why even distribute the production credentials at all? Only the most senior DBAs or devs should have access to production credentials.
But there is no reason to write the production DB credentials in a document, especially as an "example". That is monumentally dumb. It amounts to asking for this to happen.
There's a reason why it's called disaster recovery and prevention.
I thought that was just SOP.
His was worse though, because he had specifically written a script to nuke all the data in the DB, intending it for test DBs of course. But after all that work, he was careless and ran it against the live DB.
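A cheap guard for that kind of script is to have it refuse to run unless the target is explicitly allowlisted. A rough sketch, assuming the script is handed a connection URL. `SAFE_HOSTS`, `assert_safe_to_wipe` and the example URLs are all hypothetical.

```python
from urllib.parse import urlparse

# Hypothetical allowlist of hosts that a destructive script may target.
SAFE_HOSTS = {"localhost", "127.0.0.1", "::1"}

def assert_safe_to_wipe(db_url: str) -> str:
    """Return the target host if it is allowlisted; otherwise raise."""
    host = urlparse(db_url).hostname
    if host not in SAFE_HOSTS:
        raise RuntimeError(f"refusing to wipe database on host {host!r}")
    return host

# A local test database passes the check.
local_host = assert_safe_to_wipe("postgres://dev:dev@localhost:5432/app_test")

# Anything else, such as a made-up production host, is rejected before any data is touched.
try:
    assert_safe_to_wipe("postgres://admin:secret@db.prod.example.com/app")
    prod_rejected = False
except RuntimeError:
    prod_rejected = True
```

It won't save you from every mistake, but an allowlist fails closed: a typo or a pasted prod URL stops the script instead of nuking live data.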
It was actually kind of enlightening to watch, because he was considered the "genius" or whatever of my cohort. To wit, there are different kinds of intelligence.
--Unfortunately apparently those values were actually for the production database (why they are documented in the dev setup guide i have no idea).--
Someone else should have been fired if this is true.
Firing this guy does nothing, fixing the problem does, but requires those higher up to admit the mistake was theirs to begin with.