Accidentally destroyed production database on first day of a job (reddit.com)
740 points by whistlerbrk on June 3, 2017 | 265 comments

Sorry, but if a junior dev can blow away your prod database by running a script on his _local_ dev environment while following your documentation, you have no one to blame but yourself. Why is your prod database even reachable from his local env? What does the rest of your security look like? Swiss cheese I bet.

The CTO further demonstrates his ineptitude by firing the junior dev. Apparently he never heard the famous IBM story, and will surely live to repeat his mistakes:

After an employee made a mistake that cost the company $10 million, he walked into the office of Tom Watson, the C.E.O., expecting to get fired. “Fire you?” Mr. Watson asked. “I just spent $10 million educating you.”

Seriously. The CTO in question is the incompetent one. S/he failed:

- Access control 101. Seriously, this is pure incompetence. It is the equivalent of having the power cord to the Big Important Money Making Machine snaking across the office and under desks. If you can't be arsed to ensure that even basic measures are taken to avoid accidents, acting surprised when they happen is even more stupid.

- Sensible onboarding documentation. Why would prod access information be stuck in the "read this first" doc?

- Management 101. You just hired a green dev just out of college who has no idea how things are supposed to work. You just fired him in an incredibly nasty way for making an entirely predictable mistake that came about because of your lack of diligence at your job (see above).

Also, I have no idea what your culture looks like, but you just told all your reports that honest mistakes can be fatal and their manager's judgement resembles that of a petulant 14 year-old.

- Corporate Communications 101. Hindsight and all that, but it seems inevitable that this would lead to a social media trash fire. Congrats on embarrassing yourself and your company in an impressive way. On the bright side, this will last for about 15 minutes and then maybe three people will remember. Hopefully the folks at your next gig won't be among them.

My take away is that anyone involved in this might want to start polishing their resumes. The poor kid and the CTO for obvious reasons, and the rest of the devs, because good lord, that company sounds doomed.

Yeah when I read that my first thought was that the CTO reacted that way because he was in fear of being fired himself. I wouldn't be at all surprised if he wrote that document or approved it himself.

So at what point are you allowed to fire someone for being incompetent? Blowing away the production database seems to rank pretty high.

Note that I'm not talking about the situation in this article. That was a ridiculous situation and they were just asking for trouble. I'm asking about the perception that is becoming more and more common, which is that no matter what mistakes you make you should still be given a free pass regardless of severity.

Is it the quantity of mistakes? Severity of mistakes? At what point does the calculus favor firing someone over retaining them?

> blowing away the production database seems to rank pretty high.

In this case, it's not a matter of degree, it's a matter of responsibility. The junior dev is absolutely not the one responsible for the prod db getting blown away, and is absolutely not responsible for the backups not being adequate. As somebody else mentioned, this is like leaving the electric cable to the production system strewn across the office floor, then firing the person who happened to trip over it.

I agree somebody's job should be in jeopardy, especially if the backups weren't adequate: not for a single mistake, but for a long series of bad decisions and shoddy oversight that led to this.

This has nothing to do with competence. Competence is not never making mistakes (if somebody tells you he never makes mistakes, it's actually more likely he's just incompetent enough to not notice them). Competence is arranging work in a way that mistakes don't result in a disaster. Which clearly wasn't the job of the junior dude, and very likely was the job of the CTO. I could easily count at least half a dozen ways in which the situation was a total fail before the mistake happened. So no, one shouldn't be given a free pass for mistakes. But one should expect mistakes to happen and deal with them as facts of life. As for killing a whole prod database, anybody who has spent some good time in ops/devops/etc. has war stories like this. Dropping the wrong instance, wiping the wrong system, disabling the network on the wrong system... A lot of people have been there and done that. If it hasn't happened to you yet and you're in a line of work where it can, it will. It will feel awful too. But it'll pass and you'll be smarter for it.

> Competence is arranging work in a way that mistakes don't result in a disaster.

I like this definition.

> So at what point are you allowed to fire someone for being incompetent?

If someone is still expected to be learning, mistakes (even large ones) are to be expected. Incompetence has to be judged against reasonable expectations. In the case here, there was a series of pretty severe mistakes, but deleting the database isn't what I'm thinking of.

Protecting the database was the job of the more experienced employees of the company, and ultimately the CTO. Some combination of people at the company were negligent, and the absence of actions taken to protect their systems shows a pattern of irresponsible behavior.

You fire people when they stop producing value for the company.

In my opinion, mistakes should never be considered the person's fault. The development process should be designed to prevent human mistakes. If mistakes happen, that only means the process has been designed poorly.

No kidding, he should get a bonus for finding a HUGE bug in their security lol.

You can't treat mistakes as no-ops. This event demonstrated lack of attention.

Don't be ridiculous - the first day at a new job can be incredibly stressful and disorienting. And even if somehow this doesn't apply to you, keep in mind that it does to a lot of people.

Ha! I missed that it was his first day.

World's worst on-boarding guide!

Firing should only really be an option when someone doesn't respond well to re-training and education.

It's hard to imagine a startup allowing themselves to die because they're trying to patiently re-train and educate someone.

I know startups are a limit case, but we didn't bother to make those sorts of distinctions for this article, so it's worth considering.

If the startup is large enough to have employees who can be fired then it's large enough to train/educate them. 5 whys may be too many for a small startup but 1 is certainly too few.

Here are some simple, practical tips you can use to prevent this and other Oh Shit Moments(tm):

- Unless you have full time DBAs, do use a managed db like RDS, so you don't have to worry about whether you've set up the backups correctly. Saving a few bucks here is incredibly shortsighted; your database is probably the most valuable asset you have. RDS allows point-in-time restore of your DB instance to any second during your retention period, up to the last five minutes. That will make you sleep better at night.

- Separate your prod and dev AWS accounts entirely. It doesn't cost you anything (in fact, you get 2x the AWS free tier benefit, score!), and it's also a big help in monitoring your cloud spend later on. Everyone, including the junior dev, should have full access to the dev environment. Fewer people should have prod access (everything devs may need for day-to-day work like logs should be streamed to some other accessible system, like Splunk or Loggly). Assuming a prod context should always require an additional step for those with access, and the separate AWS account provides that bit of friction.

- The prod RDS security group should only allow traffic from whitelisted security groups also in the prod environment. For those really requiring a connection to the prod DB, it is therefore always a two-step process: local -> prod host -> prod db. But carefully consider why you are even doing this in the first place. If you find yourself doing this often, perhaps you need more internal tooling (like an admin interface, again behind a whitelisting SG).

- Use a discovery service for the prod resources. One of the simplest methods is just to set up a Route 53 Private Hosted Zone in the prod account, which takes about a minute. Create an alias entry like "db.prod.private" pointing to the RDS and use that in all configurations. Except for the Route 53 record, the actual address for your DB should not appear anywhere. Even if everything else goes sideways, and you've assumed a prod context locally by mistake and run some tool that is pointed to the prod config, the address doesn't resolve in a local context.

You made a lot of insightful points here, but I'd like to chime in on one important point:

> - Unless you have full time DBAs, do use a managed db like RDS, so you don't have to worry about whether you've set up the backups correctly.

The real way to not worry about whether you've set up backups correctly is to set up the backups, and actually try and document the recovery procedure. Over the last 30 years I've seen situations beyond counting of nasty surprises when people actually try to restore their backups during emergencies. Hopefully checking the "yes back this up" checkbox on RDS covers you, but actually following the recovery procedure and checking the results is the only way to not have some lingering worry.

In this particular example, there might be lingering surprises: part of the data might be in other databases or in storage facilities like S3 that don't have backups in sync with the primary backup, or there might be caches and queues that need to be reset as part of the recovery procedure.
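To make the "actually try the restore" point concrete, here is a minimal sketch in Python. It uses SQLite's built-in online backup API purely as a stand-in for whatever backup system you really run; the file names, the `events` table, and the `backup_and_verify` helper are all hypothetical. The idea is only that a backup counts once you have reopened the copy and checked it against the source.

```python
import os
import sqlite3
import tempfile


def backup_and_verify(source_path: str, backup_path: str, table: str) -> bool:
    """Back up a SQLite database, then reopen the copy and sanity-check it.

    `table` is assumed to be a trusted identifier chosen by the operator.
    """
    src = sqlite3.connect(source_path)
    try:
        dst = sqlite3.connect(backup_path)
        try:
            src.backup(dst)  # SQLite online backup API (Python 3.7+)
        finally:
            dst.close()

        # The "restore" half: open the backup file from scratch and inspect it.
        restored = sqlite3.connect(backup_path)
        try:
            healthy = restored.execute("PRAGMA integrity_check").fetchone()[0] == "ok"
            query = f'SELECT COUNT(*) FROM "{table}"'
            same_rows = (
                src.execute(query).fetchone()[0]
                == restored.execute(query).fetchone()[0]
            )
        finally:
            restored.close()
    finally:
        src.close()
    return healthy and same_rows


def demo() -> bool:
    """Create a throwaway database, back it up, and verify the copy."""
    with tempfile.TemporaryDirectory() as tmp:
        source = os.path.join(tmp, "prod.db")      # hypothetical file names
        backup = os.path.join(tmp, "backup.db")
        con = sqlite3.connect(source)
        con.execute("CREATE TABLE events (id INTEGER PRIMARY KEY, name TEXT)")
        con.executemany("INSERT INTO events (name) VALUES (?)", [("a",), ("b",), ("c",)])
        con.commit()
        con.close()
        return backup_and_verify(source, backup, "events")
```

A row count is a weak check on its own. A real drill would also exercise the application against the restored copy and cover the out-of-band data (S3, caches, queues) mentioned above.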

"Backups are a tax you pay for the luxury of restore" [1]

A lot of people pay the tax and never even try the lux.

[1] http://highscalability.com/blog/2014/2/3/how-google-backs-up...

Good blog post. This, I suggest, is its most essential point:

"Prove it. If you don’t try it it doesn’t work. Backups and restores are continually tested to verify they work"

And put a firewall between your dev machines and your production database. All production database tasks need to be done by someone who has permission to cross into the production side -- a dev machine shouldn't be allowed to talk to it.

I would argue that no two machines should be allowed to talk to each other unless their operation directly depends on it. If I want to talk to the database, I have to SSH either to a worker machine and use the production codebase's shell, or directly to a DB machine and use a DB shell.

We've made things so reports and similar read-only queries can be done from properly firewalled/authenticated/sandboxed web interfaces, and write queries get done by migrations. It's very rarely that we'll need to write to the database directly and not via some sort of admin interface like Django's admin, which makes it very hard to do bulk deletions (it will very clearly warn you).

Would you recommend all these steps even for a single-person freelance job? Or is it overkill?

Depends. Do you make mistakes?

I absolutely do. "Wrong terminal", "Wrong database", etc. mistakes are very easy to make in certain contexts.

The trick is to find circuit-breakers that work for you. Some of the above is probably overkill for one-person shops. You want some sort of safeguard at the same points, but not necessarily the same type.

This doesn't really do it for me, but one person I know uses iTerm configured to change terminal colors depending on machine, EUID, etc. as a way of avoiding mistakes. That works for him. I do tend to place heavier-weight restrictions, because they usually overlap with security and I'm a bit paranoid by nature and prefer explicit rules for these things to looser setups. Also, I don't use RDS.

I'd recommend looking at what sort of mistakes you've made in the past and how to adjust your workflow to add circuit breakers where needed. Then, if you need to, supplement that.

Except for the advice about backups and PITR. Do that. Also, if you're not, use version control for non-DB assets and config!

For Windows servers I use a different colored background for more important servers.

I do this with bash prompt colors on all our servers. Prod is always red.

I don't do production support on freelance development jobs. Even if I have to sub the hours to one of my associates, I always have a gatekeeper. That being said, when I design systems, the only way to get to production is via automation, e.g. something gets promoted to a prod branch in GitHub, and production automation kicks off a backup and then applies said changes. The trick is to have a gatekeeper and never have open access to production. It's easy even as a one man shop. Git automation and CI are simple with tools like GoCD and other CI tooling and only take a day or two to set up, faster if you are familiar with them.

It depends on how much is at stake. If the product does not have users yet, then there is only a small downside to accidentally killing the database, so it probably makes sense to loosen some production database access restrictions in order to increase speed of development. But if you already have a legacy system on your hands with many users/data, then it's time to sacrifice some convenience of immediate production database access for security.

Depends on what you are hired for. If you are hired to create a web application and you spent time trying to create a stable environment with proper build processes it might be looked upon poorly. Everyone has different priorities and some have limited budgets.

I agree, it's the fault of the CTO. To me, the CTO sounds pretty incompetent. The junior engineer did them a favor. This company seems like it is an amateur hour operation, since data was deleted so easily by a junior engineer.

Yup, I've heard stories of junior engineers causing millions of dollars worth of outages. In those cases the process was drilled into, the control that caused it was fixed, and the engineer was not given a reprimand.

If you have an engineer who goes through that and shows real remorse, you're going to have someone who's never going to make that mistake (or similar ones) again.

Agreed. Several years ago as a junior dev I was tasked with adding a new feature- only allowing a user to have 1 active session.

So, we added a "roadblock" post auth with 2 actions- log out other sessions and log out this session.

Well, the db query for the first action (log out other sessions) was missing a where clause...a user_id!

Tickets started pouring in saying users were logged out and didn't know why. Luckily the on-call dev knew there was a recent release and was able to identify the missing where clause and added it within the hour.

The feature made it through code review, so the team acknowledged that everyone was at fault. Instead of being reprimanded, we decided to revamp our code review process.

I never made that kind of mistake again. To this day, I'm a little paranoid about update/delete queries.

We all make this mistake eventually, often in far more spectacular fashion. My lessons learned are

1) Always have a USE statement (or equivalent);

2) Always start UPDATE or DELETE queries by writing them as SELECT;

3) Get in the habit of writing the WHERE clause first;

4) If your SQL authoring environment supports the dangerous and seductive feature where you can select some text in the window and then run only that selected text — beware! and

5) While developing a query to manipulate real data, consider topping the whole thing with BEGIN TRANSACTION (or equivalent), with both COMMIT and ROLLBACK at the end, both commented out (this is the one case where I use the run-selected-area feature: after evaluating results, select either the COMMIT or the ROLLBACK, and run-selected).

Not all of these apply to writing queries that will live in an application, and I don't do all these things all the time — but I try to take this stance when approaching writing meaningful SQL.
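For what it's worth, lessons 2, 3 and 5 can be combined in application code too. Below is a rough sketch using Python's `sqlite3` module and the "log out other sessions" story from upthread. The `sessions` table, its columns, and the `logout_other_sessions` helper are made up for illustration, and the connection is assumed to be opened with `isolation_level=None` so that `BEGIN`/`COMMIT`/`ROLLBACK` are issued explicitly.

```python
import sqlite3


def logout_other_sessions(con: sqlite3.Connection, user_id: int, keep_session_id: int) -> int:
    """Deactivate a user's other sessions.

    Commits only if the UPDATE touches exactly the rows that a preview
    SELECT with the same WHERE clause said it would.
    """
    # Lessons 2 and 3: write the WHERE clause first and preview it as a SELECT.
    where = "user_id = ? AND id != ? AND active = 1"
    args = (user_id, keep_session_id)
    expected = con.execute(
        f"SELECT COUNT(*) FROM sessions WHERE {where}", args
    ).fetchone()[0]

    # Lesson 5: run the real statement inside an explicit transaction.
    con.execute("BEGIN")
    try:
        cur = con.execute(f"UPDATE sessions SET active = 0 WHERE {where}", args)
        if cur.rowcount != expected:
            raise RuntimeError(
                f"expected to change {expected} rows, statement changed {cur.rowcount}"
            )
        con.execute("COMMIT")
    except Exception:
        con.execute("ROLLBACK")
        raise
    return expected


def demo() -> tuple:
    """Build a tiny in-memory table and log user 1 out of all but session 1."""
    con = sqlite3.connect(":memory:", isolation_level=None)  # explicit transactions
    con.execute(
        "CREATE TABLE sessions (id INTEGER PRIMARY KEY, user_id INTEGER, active INTEGER)"
    )
    con.executemany(
        "INSERT INTO sessions (user_id, active) VALUES (?, 1)",
        [(1,), (1,), (2,), (2,), (1,)],
    )
    changed = logout_other_sessions(con, user_id=1, keep_session_id=1)
    still_active = con.execute("SELECT COUNT(*) FROM sessions WHERE active = 1").fetchone()[0]
    return changed, still_active
```

Sharing one `where` string between the SELECT and the UPDATE is the point: the statement that changes data cannot silently lose its filter without the preview losing it as well.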

#5 is the big one. I left WHERE off an UPDATE on "prod" data once. Fortunately it wasn't mission critical data that I ended up wiping and I was able to recover it from a backup DB. I never did anything in SQL without wrapping it in a transaction again.

I will note that depending on your DB settings and systems, if you leave the transaction "hanging" without rolling back or committing, it can lock up that table in the DB for certain accessors. This is only for settings with high levels of isolation such as SERIALIZABLE, however. I believe if you're set to READ_UNCOMMITTED (this is SQL Server), you can happily leave as many hanging transactions as you want.

> 2) Always start UPDATE or DELETE queries by writing them as SELECT;

On that point, I'd love a database or SQL spec extension that provided a 'dry-run' mode for UPDATE or DELETE and which would report how many rows would be impacted.

-- 123338817 rows will be affected

Oooops made an error!

> I'd love a database or SQL spec extension that provided a 'dry-run' mode for UPDATE or DELETE and which would report how many rows would be impacted.

I mean, if your DB supports transactions, you are in luck.

Start a transaction (that may be different among vendors - BEGIN TRANSACTION or SET AUTOCOMMIT=0 etc) and run your stuff.

If everything looks good, commit.

If you get OOOps number of records, just rollback.
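The transaction-as-dry-run approach is easy to wrap in a helper. This is a small sketch with Python's `sqlite3`; the `dry_run` function and the `cashouts` table are hypothetical, and the connection is assumed to use `isolation_level=None` so the transaction boundaries are explicit. Note that on a real server the statement still does the work and takes locks until the rollback, so it is a dry run only in the sense that nothing is kept.

```python
import sqlite3


def dry_run(con: sqlite3.Connection, sql: str, params: tuple = ()) -> int:
    """Execute a data-modifying statement, report the affected row count,
    and always roll back so nothing is persisted."""
    con.execute("BEGIN")
    try:
        return con.execute(sql, params).rowcount
    finally:
        con.execute("ROLLBACK")


def demo() -> tuple:
    """Dry-run an UPDATE with no WHERE clause and show the data survives."""
    con = sqlite3.connect(":memory:", isolation_level=None)  # explicit transactions
    con.execute("CREATE TABLE cashouts (user_id INTEGER, amount REAL)")
    con.executemany(
        "INSERT INTO cashouts VALUES (?, ?)",
        [(1, 10.0), (2, 20.0), (3, 30.0), (4, 40.0)],
    )
    affected = dry_run(con, "UPDATE cashouts SET amount = 0.00")
    total_after = con.execute("SELECT SUM(amount) FROM cashouts").fetchone()[0]
    return affected, total_after
```

If the reported count is the whole table when you expected one row, you have found the missing WHERE before it cost anything.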

MySQL Workbench won't let you run an UPDATE or DELETE without a WHERE by default.

Not sure why this isn't the default in many tools. That covers 99% of the hazards.

Put the following into your ~/.my.cnf to enable it for the command-line client:


Thanks for the tip. However, put that in the [mysql] section; otherwise you'll break the mysqldump command, which doesn't recognize that option.

We have to manually whitelist raw queries from Active Record that do full table scans, so this also helps with mistakes like this.

You are not alone.

There is even a song in Spanish about forgetting to add a WHERE in a DELETE: https://www.youtube.com/watch?v=i_cVJgIz_Cs

This is amazing.

> Luckily the on-call dev knew there was a recent release and was able to identify the missing where clause and added it within the hour.

Raises questions about deployment. Didn't the on-call have a previous build they could roll back to? Customers shouldn't have been left with an issue while someone tried to hunt down the bug (which "luckily" they located). Instead, the first step should have been a rollback to a known good build, with the bug tracked down before a re-release (e.g. review all changesets in the latest).

Yup, agreed entirely. I'm a bit fuzzy on the details of that decision...he very well may have rolled back and then deployed a change later in the evening. It was fixed by the time I checked my email after dinner.

UPDATE cashouts SET amount=0.00 <Accidental ENTER>

Oops. That was missing a 'WHERE user_id=X'. Did not lose the client at the time (this was 15+ years ago), but that was a rough month. Haven't made that mistake again. It happens to all of us at some point though.

I'm beginning to think this is a flaw in SQL. It's so easy to bust the entire table.

Could've had something like 'UPDATE ALL' required for queries without filtering.

just run BEGIN TRANSACTION before
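Until SQL grows something like 'UPDATE ALL', the same friction can be approximated in whatever layer sends ad-hoc queries. Here is a deliberately naive sketch in Python. The function names and the `allow_all` flag are invented for illustration, and the check is a plain regex rather than a SQL parser, so it can be fooled (for example by the word WHERE inside a string literal or a subquery).

```python
import re

# Statements that can rewrite or empty a whole table.
_WRITE = re.compile(r"^\s*(UPDATE|DELETE)\b", re.IGNORECASE)
_WHERE = re.compile(r"\bWHERE\b", re.IGNORECASE)


class UnfilteredStatementError(ValueError):
    """Raised when an UPDATE or DELETE has no WHERE clause."""


def is_unfiltered_write(sql: str) -> bool:
    """True if the statement is an UPDATE/DELETE with no WHERE keyword at all."""
    return bool(_WRITE.match(sql)) and not _WHERE.search(sql)


def check_statement(sql: str, allow_all: bool = False) -> str:
    """Return the statement unchanged, or refuse it.

    Passing allow_all=True plays the role of typing 'UPDATE ALL':
    touching every row has to be an explicit decision.
    """
    if is_unfiltered_write(sql) and not allow_all:
        raise UnfilteredStatementError(
            "refusing UPDATE/DELETE without WHERE; pass allow_all=True if intended"
        )
    return sql
```

With this in place, `check_statement("UPDATE cashouts SET amount=0.00")` raises, while the same call with `allow_all=True`, or with a `WHERE user_id = ?` filter, passes through.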

I'm guessing this feature was never tested properly

We should all assume that code (or a feature) that is not tested is broken.

At a former employer, we had a Megabuck Club scoreboard; you got your name, photo and a quick outline of what your (very expensive!) mistake had been posted on it. Terrific idea, as:

a) The culture was very forgiving of honest mistakes; they were seen as a learning opportunity.

b) Posting a synopsis of your cockup made it easier for others to avoid the same mistake while we were busy making sure it would be impossible to repeat it in the future; also, it got us thinking of other, related failure modes.

c) My oh my was it entertaining! Pretty much the very definition of edutainment, methinks.

My only gripe with it was that I never made the honor roll...

We had something similar at one of my jobs. It's hard to convey in text, but it was really a fun thing. Mind you, this was at a Fortune 100 company, and the CIO would come down for the ceremony to pass the award. We called it the build breakers award, and we had a huge trophy made up at a local shop, with a toilet bowl on it. If you broke the build and took down other developers, the passing-of-the-award ceremony was initiated: I would ping the CIO (as it was my dev shop), and he would come down and give a whole sarcastic speech about how the wasted money was all worth it because the developers got some screw-off time while the breaker fixed the build. It was all in good spirit, though, and people could not wait to get that trophy off their desk. It helped that the thing was probably as big as the Stanley Cup.

We built actual, physical thingamajigs; 'breaking a build' more often than not meant fragments of steel whizzing around the workshop while we were all getting doused in hydraulic fluid for our trouble.

Note to self: be very careful using Ziegler-Nichols to tune multi-kHp-systems. Very careful. Cough.

Yep. I had a junior working for me once a few years ago that made a rather unfortunate error in production which deleted all of several customers' data. I could tell he was on pins and needles when he brought it to me, so I let him off the hook right away and showed him the procedures to fix the issue. He said something about being thankful there was a way to fix the problem, and I just smiled and told him A) it would have been my fault if there hadn't been; and B) he wouldn't have had the access he did without safeguards in place. Then I told him a story about the time I managed to accidentally delete an entire database of quarantined email from a spam appliance I was working on several years earlier. Sadly, my CTO at the time did NOT prepare for that.

I lost a whole weekend of sleep in recovering that one from logs, and that was when I learned some good tricks for ensuring recoverability....

Agreed. Also, why didn't they have a backup of some sort? The hard drive on the server could have failed and it would have been just as bad.

Sounds like an incompetent set of people running the production server.

Like a lot of companies, I bet they HAVE backups but just never tested whether the backup process works. It is absurdly common...

This is trivial tho'. Just set up a regular refresh of the dev env via the backup system. Sure it takes longer because you have to read the tapes back but it's worth it for the peace of mind, and it means that every dev knows how to get at the tapes if they need to.

Well yes, but there's a string of WTFs here, lots of 'trivial' stuff that wasn't done!

From my experience, if I do not test my backups they stop working in about 1 year. So if I do not test backups for over a year then my assumption is that I probably do not have working backups.

Most likely something like this. There is probably backup software running but it's either nothing but failed jobs or misconfigured so the backups aren't working correctly.

My favourite in that regard is an anecdote shared by a customer; he said one of their techs had discovered (luckily in time) that the long-running automated backup script on a small, but important server wrote the backup volume to...


And here I've been wondering how could it get worse than writing to local storage ...

I can tell you since I work support for a backup product.

A lot of people think that naming a folder on a local drive "Disaster Recovery" counts as having an offsite Disaster Recovery copy of backups. The number of large corporations whose backups are in the hands of such people is frightening.

OP says the team is 40+ and CTO just let them all walk on a catwalk.

"It's your first day, we don't understand security so here's the combination to the safe. Have fun!!"

"we have a bunch of guns, we aren't sure which ones are loaded, all the safeties are off and we modified them to go off randomly"

"your first day's task will be to learn how to use them by putting them to the heads of our best revenue-generating sales people and pulling the trigger. don't worry it's safe, we'll check back in with you at the end of the day."

If someone on their first day of work can do this much damage, what could a disgruntled veteran do? If Snowden has taught us anything, it's that internal threats are just as dangerous as external threats.

This shop sounds like a raging tire fire of negligence.

He didn't follow the docs exactly. That doesn't matter, though, your first day should be bulletproof and if it's not, it's on the CTO. The buck does not stop with junior engineers on their first day.

> He didn't follow the docs exactly

Sure, but having the plaintext credentials for a readily-deletable prod db as an example before you instruct someone to wipe the db doesn't salvage competence very much.

I wouldn't be surprised if the actual production db was never properly named and was left with an example name.

Don't tell Etsy that

Thanks for the Tom Watson quote, I'd never heard it before, it's a good one. Also agree with everything else you just said, this is not the junior dev's fault at all.

He might be inept, but in this instance the CTO is mainly just covering his own ass.

"Yeah the whole site is buggered, and the backups aren't working - but I fired the Junior developer who did it" Is not how you Cover Your Ass ™.

You can be inept WHILE covering your ass. I'm not saying he's a genius.

Blaming the new guy is not covering your ass. Blaming the senior engineer who put the credentials in the document would be covering your ass.

I was on a production DB once, and ran SHOW FULL PROCESSLIST, and saw "delete from events" had been running for 4 seconds. I killed the query, and set up that processlist command to run every 2 seconds. Sure enough, the delete kept reappearing shortly after I killed it. I wasn't on a laptop, but I knew the culprit was somewhere on my floor of the building, so I grabbed our HR woman who was walking by and told her to watch the query window, and if she saw delete, I showed her how to kill the process. Then I ran out and searched office to office until I found the culprit -

Our CTO thought he was on his local dev box, and was frustrated that "something" was keeping him from clearing out his testing DB.

Did I get a medal for that? No. Nobody wanted to talk about it ever again.

Actually, the CTO should have mailed the dev team saying:


    Yesterday, I thought I was on my local machine and cleared the database, while I was in fact on the production server.
    Luckily knodi123 caught it and killed the delete process. This is a reminder that *anybody* can make mistakes,
    so I want to set up some process to make sure this can't happen, but meanwhile I would like to thank knodi123.



Sometimes I get reminded about how awesome some of the tech we use is, in this case, transactions :)

Oh god, this is the worst... People make errors, you help them and they don't give you any credit. Hope you are working somewhere else now.

Another stupid cto... You did a great job!

A use for HR!

My comment I left there:

Lots of folks here are saying they should have fired the CTO or the DBA or the person who wrote the doc instead of the new dev. Let me offer a counter point. Not that it will happen here ;)

They should have run a post mortem. The idea behind it should be to understand the processes that led to a situation where this incident could happen. Gather stories, understand how things came to be.

With this information, folks can then address the issues. Maybe it shows that there is a serially incompetent individual who needs to be let go. Or maybe it shows a house of cards with each card placement making sense at the time and it is time for new, better processes and an audit of other systems.

The point is that this is a massive learning opportunity for all those involved. The dev should not have been fired. The CTO should not have lost his shit. The DB should have regularly tested backups. Permissions and access need to be updated. Docs should be updated to not have sensitive information. The dev does need to contact the company to arrange surrender of the laptop. The dev should document everything just in case. The dev should have a beer with friends and relax for the weekend and get back on the job hunt next week. Later, laugh and tell of the time you destroyed prod on your first day (and what you learned from it).

The firing order, in theoretical order for preventing future problems:

1. CTO. As the one in charge of the tech, he allowed the loss of critical data. If anyone should be fired, it's the CTO. And firing this guy apparently will have the greatest positive impact on the company, assuming they can hire a better one. I think given how stupid this CTO is, that should be straightforward.

2. The executives who hired the CTO. People hire people similar to themselves; it seems the executive team is clueless about what kind of skills a CTO should have. These people will continue to fail the dev team by hiring incompetent people, or force them to work in a way that causes problems.

3. Senior devs in the team. Obviously these people did not test what they wrote. If anyone had ever done a dry run of the training doc, they would have prevented the catastrophe. It's a must-do in today's dev environment. The normal standard is to write automatic tests for everything, though.

This junior dev is the only one who should not be fired...

I'm amazed at how quickly everyone is trying to allocate blame, as if there must be someone upon whom to heap it all on. Commenters on both Reddit and HN are being high and mighty, offering wisdom that they would never have allowed this to take place, while eager to point fingers. I bet far more than half of these commenters have at one time or another worked for at least one company that had this kind of setup, and didn't immediately refuse to work on other tasks until the setup was patched. Hypocrites.

The fact is, this kind of scenario is extremely common. Most companies I have worked for have the production database accessible from the office. It's a very obvious "no no", but it's typical to see this at small to medium sized companies. The focus is on rushing every new feature under the sun, and infrastructure is only looked at if something blows up.

Nobody should have been fired. Not the developer, not the senior devs, not the sysadmins, and not the CTO. This should have been nothing more than a wake-up call to improve the infrastructure. That's it. The only blame here lies with the CTO - not for the event having taken place, but only because their immediate reaction was to throw the developer under the bus. A decent CTO would have simply said "oh shit guys, this can't happen again. please fix it". If the other executives can't understand that sometimes shit happens, and that a hammer doesn't need to be dropped on anyone, then they're not qualified to be running a business.

Well you need to consider the cto's reaction.

His reaction shows that he is the no. 1 to fire, and for good reason.

What you said is true, but does not matter. The CTO already showed that he is clueless...

You are right that this is an opportunity to learn because it is a demonstration of incompetence at many levels. However, this incompetence has consequences that might be fatal for the company. How much time and effort will be required to level up? As a CEO I would request an independent audit ASAP on this incident and see the real extent of the problem.

As my mother said, if you put a good apple with bad apples, it's not the bad apples that become good.

They are in no condition, yet, to run a post mortem. At this moment they're probably still trying to figure out how to get their data back or maybe just close up shop entirely.

You run a post mortem when you're back and running again. They may never be back and running again.

very fair point. My response was mostly in response to "fire the asshats."

>"i instead for whatever reason used the values the document had."

>They put full access plaintext credentials for their production database in their tutorial documentation

WHAT THE HELL. Wow. I'd be shocked at that sort of thing being written out in a non-secure setting, like, anywhere, at all, never mind in freaking documentation. Making sure examples in documentation are never real and will hard fail if anyone tries to use them directly is not some new idea, heck there's an entire IETF RFC (#2606) devoted to reserving TLDs specifically for testing and example usage. Just mind blowing, and yeah there are plenty of WTFs there that have already been commented on in terms of backups, general authentication, etc. But even above all that, if those credentials had full access then "merely" having their entire db deleted might even have been a good case scenario vs having the entire thing stolen which seems quite likely if their auth is nothing more than a name/pass and they're letting credentials float around like that.
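To make the RFC 2606 point concrete, here is a rough sketch of a check that hostnames used in documentation fall under the reserved names (the `.test`, `.example`, `.invalid` and `.localhost` TLDs, plus `example.com`, `example.net` and `example.org`). The function name and the idea of running it over your docs are my own illustration, not anything from the original post.

```python
# Names reserved by RFC 2606 for testing and documentation.
RESERVED_TLDS = {"test", "example", "invalid", "localhost"}
RESERVED_DOMAINS = {"example.com", "example.net", "example.org"}


def is_reserved_example_host(host: str) -> bool:
    """Return True if `host` falls under an RFC 2606 reserved name.

    Hypothetical helper: a docs linter could reject any hostname in a
    setup guide for which this returns False.
    """
    labels = host.lower().rstrip(".").split(".")
    if labels[-1] in RESERVED_TLDS:
        return True
    # example.com itself, or any subdomain of it such as db.example.com
    return ".".join(labels[-2:]) in RESERVED_DOMAINS
```

A setup guide whose sample connection string points at something like `db.example.com` simply fails to connect when pasted verbatim, which is the hard failure you want.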

It's a real bummer this guy had such an utterly awful first day on a first job, particularly since he said he engaged in a huge move and sunk quite a bit of personal capital from the sound of it in taking that job. At the same time that sounds like a pretty shocking place to work and it might have taught a ton of bad habits. I don't think it's salvageable but I'm not even sure he should try, they likely had every right to fire him but threatening him at all with "legal" for that is very unprofessional and dickish. I hope he'll be able to bounce back and actually end up in a much better position a decade down the line, having some unusually strong caution and extra care baked into him at a very junior level.

There's also a high chance that document was shared on Slack. In which case, they were one Slack breach away from the entire world having write access to their prod database.

It's depressing how many companies blindly throw unencrypted credentials around like this.

Tell me about it. Fortunately where I work is sane and reasonable about it.

We have a password sheet. You have to be on the VPN(login/password). Then you can log in. Login/Password(diff from above)/2nd password+OTP. Then a password sheet password.

I'm still rooting out passwords from our repo, with goobers putting creds in source code (yeah, not config files....grrrrr). But I attack them as I find them. I've only found 1 root password for a DB in there... and thankfully changed!
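For anyone doing the same kind of cleanup, a minimal sketch of a scanner for hardcoded credentials might look like the following. The regex and the list of suspicious names are assumptions picked for illustration; a real scan would need tuning and will produce both false positives and misses.

```python
import re

# Hypothetical pattern: a suspicious name, an assignment, then a quoted literal.
CREDENTIAL_RE = re.compile(
    r"""
    (pass(?:word|wd)?|secret|api_?key|token)\b   # suspicious variable name
    \s*[:=]\s*                                    # assignment or mapping
    ["']([^"']{4,})["']                           # quoted literal value
    """,
    re.IGNORECASE | re.VERBOSE,
)


def find_hardcoded_credentials(source: str) -> list[tuple[int, str]]:
    """Return (line_number, matched_name) for each suspicious assignment."""
    hits = []
    for lineno, line in enumerate(source.splitlines(), start=1):
        match = CREDENTIAL_RE.search(line)
        if match:
            hits.append((lineno, match.group(1)))
    return hits
```

Values read from the environment or a config loader are not quoted literals, so they pass; only literal strings assigned to a credential-looking name are flagged.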

A plaintext password sheet? Despite the layers of network access control, this is a horribly bad practice in our modern age. Vault is free and encrypted secret storage systems are hardly a new concept.

Not at all. The password sheet password is actually a GPG key. Everything stored encrypted.

We suffer from NIH greatly. We end up rolling our own stuff because either we don't trust 3rd party stuff, or it doesn't work in our infrastructure. In this case, multiple access locks with GPG is sufficient.

A response 8 days later is better than no response at all, right? :)

I agree that a multi-recipient GPG protected file is sufficient for a small org. In fact, that's how I used to do it circa 2011. We found it worked quite well - we committed the GPG protected files to a version control system (git) and used githooks to make sure that only encrypted files were permitted, preventing users from intentionally/accidentally defeating gitignore.
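A hook along those lines could be sketched roughly as below. This is not the hook described above, just an illustration under stated assumptions: protected files live under a hypothetical `secrets/` directory and are stored ASCII-armored (`gpg --armor`), so binary GPG output would be rejected too.

```python
import subprocess
import sys

ARMOR_HEADER = b"-----BEGIN PGP MESSAGE-----"
SECRETS_PREFIX = "secrets/"  # hypothetical directory holding protected files


def is_armored_pgp_message(data: bytes) -> bool:
    """True if the blob starts with the ASCII-armor header for a PGP message."""
    return data.lstrip().startswith(ARMOR_HEADER)


def staged_paths() -> list[str]:
    """Paths added, copied, or modified in the index."""
    out = subprocess.run(
        ["git", "diff", "--cached", "--name-only", "--diff-filter=ACM"],
        check=True, capture_output=True, text=True,
    ).stdout
    return [p for p in out.splitlines() if p]


def main() -> int:
    """Return 1 (abort the commit) if any staged secret is not encrypted."""
    bad = []
    for path in staged_paths():
        if not path.startswith(SECRETS_PREFIX):
            continue
        # ":<path>" names the staged version of the file, not the working copy.
        blob = subprocess.run(
            ["git", "show", f":{path}"], check=True, capture_output=True
        ).stdout
        if not is_armored_pgp_message(blob):
            bad.append(path)
    for path in bad:
        print(f"refusing to commit unencrypted file: {path}", file=sys.stderr)
    return 1 if bad else 0

# To use as .git/hooks/pre-commit, end the file with: sys.exit(main())
```

Because the check reads the staged blob rather than the working tree, it still catches a plaintext file that was force-added past gitignore.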

Slack getting hacked would definitely be a mess. There's going to be so many cloud credentials, passwords, keys, customer info...

The exact same slack that he remained in for several hours after being fired. Even worse way to provoke a response from a disgruntled employee...

The document is probably in Google Docs too.

Plot twist: CTO or senior staff needed to cover something up (maybe a previous loss of critical business data) and arranged for this travesty to be likely to happen, provided a sufficient number of junior devs went through the "local db setup guide" mockery of a doc.

Either that or this is a "Worst fuckup on the first day on job" fantasy piece - I refuse to acknowledge living in the world where alternatives have any meaningful non-zero probability of occurring.

There are no upper bounds on incompetence. I've seen enough WTFs even in companies that didn't seem particularly dysfunctional, and that had some very competent people.

And then it takes only one shitty manager, or manager in a bad mood, to get the innocent junior dev fired.

Yeah I also kind of thought this was fake. Could be real but...

People will screw up, so you have to do simple things to make screwing up hard. The production credentials should never have been in the document. Letting a junior have prod level access is not that far out of the normal in a small startup environment, but don't make them part of the setup guide. Sounds like they also have backup issues, which points to overall poor devops knowledge.

Not part of this story, but another pet peeve of mine is when I see scripts checking strings like "if env = 'test' else <runs against prod>". This sets up another WTF situation: if someone typos 'test', the script now hits prod.
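One way to avoid that trap, sketched here with made-up environment names and hosts, is to look the environment up in an explicit table and fail on anything unrecognised, so there is no `else` branch that can fall through to prod.

```python
# Hypothetical environment names and hosts, for illustration only.
DATABASE_HOSTS = {
    "test": "db.test.example.com",
    "staging": "db.staging.example.com",
    "prod": "db.prod.example.com",
}


def resolve_database_host(env: str) -> str:
    """Map an environment name to a host, failing loudly on anything unknown.

    There is deliberately no default: a typo such as "tset" raises
    instead of silently selecting production.
    """
    try:
        return DATABASE_HOSTS[env]
    except KeyError:
        known = ", ".join(sorted(DATABASE_HOSTS))
        raise ValueError(
            f"unknown environment {env!r}; expected one of: {known}"
        ) from None
```

With this shape, reaching prod requires typing "prod" exactly; every mistake ends in an exception rather than a query against the wrong database.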

Heh, or take a Netflix Chaos Monkey approach and have a new employee attempt to take down the whole system on their first day and fire any engineers who built whatever the new employee is actually able to break!

Why fire them? It's valuable experience that you are paying a lot for them to gain. Better: hold a postmortem, figure out what broke, and make the people who screwed it up originally fix it. Keep people who screw things up, as long as they also fix it.

I wasn't serious about "firing" -- but was just maintaining the spirit of what happened to the OP on reddit...

but yeah - I agree with you...

Sounds like a technique taught at The Pirate School of Management.

We've been calling that the "Volkswagen pattern"

> so you have to do simple things to make screwing up hard

No one goes out of their way to screw up; I'd recommend making it easier to recognize when you've made a mistake, and recover from it.

Except for critical business stuff, that needs severe "you cannot fuck this up" safeguards.

Yeah, another case of "blame the person" instead of "blame the lack of systems". A while back, there was a thread here on how Amazon handled their s3 outage, caused by a devops typo. They didn't blame the DevOp guy, and instead beefed up their tooling.

I wonder whether that single difference - blame the person vs fix the system/tools predicts the failure or success of an enterprise?

I think it's a major predictor for how pleasant it is for anyone to work at the company, and thus a long term morale and hiring issue.

This is the sort of situation that makes for a great conference talk on how companies react to disaster, and how the lessons learned can either set the company up for robust dependable systems or a series of cascading failures.

Unfortunately, the original junior dev was living the failure case. Fortunately, he has learned early in his career that he doesn't want to work for a company that blames the messenger.

The Amazon DevOp guy was fired for that mistake, just FYI.

Have any proof of this?

Why do you all act like only one party needs proof, just because they are refuting what was said originally, which was also stated without proof?

Neither is providing proof, so asking one party for proof and not the other is obviously absurd.

In the absence of proof one way or the other, people believe what seems more reasonable to them. Not particularly absurd at all.

That's a pretty huge claim you've got there, with no proof.

Assuming the details are correct, this should be considered a win by the junior dev. It only took a day to realize that this is a company he really, really doesn't want to try to learn his profession at.

He should get that laptop back to them IMMEDIATELY. These sound like exactly the sort of douches who would try to charge him with theft. (Edit: Why is it not surprising they don't have a protocol in place for managing dismissing staff and, like, getting their stuff back?)

Well, the customer database with important data just got nuked, so even if there is a protocol, the people who would normally do the steps have different things in mind. The laptop and such are the least of their concerns.

Nobody hires you if things are perfect. They hire you because there's a problem. It might be a startup or a company just starting a tech sector. Either way they are in their infancy.

This isn't imperfection. This is beyond incompetence into some sort of Dunning-Kruger zen state. The story describes failures so egregious that the principals have no business taking money from customers.

> The CTO told me to leave and never come back. He also informed me that apparently legal would need to get involved due to severity of the data loss.

I don't know if I should laugh or cry here.

Guaranteed the CTO is busily rewriting the developer guide and excising all production DB credentials from the docs so that he can pretend they were never there. While the new guy's mistake was unfortunate in a very small way, the errors made by the CTO and his team were unfortunate in a very big way. The vague threat of legal action is laughable, and the reaction of firing the junior dev who stumbled into their swamp of incompetency on his first day speaks volumes about the quality of the organization and the people who run it. My advice... learn something from the mistake, but otherwise walk away from that joint and never look back. It was a lucky thing that you found out what a mess they are on day 1.

Several years back I worked as a DBA at a managed database services company, and something very similar happened to one of our customers who ran a fairly successful startup. When we first onboarded them I strongly recommended that the first thing we do is get their DB backups happening on a fixed schedule, rather than an ad-hoc basis, as their last backup was several months old. The CEO shuts me down, and instead insists that we focus on finding a subtle bug (you can't nest transactions in MySQL) in one of their massive stored procedures.

It turns out their production and QA database instances shared the same credentials, and one day somebody pointed a script that initializes the QA instances (truncate all tables, insert some dummy data) at the production master. Those TRUNCATE TABLE statements replicated to all their DB replicas, and within a few minutes their entire production DB cluster was completely hosed.

Their data thankfully still existed inside the InnoDB files on disk, but all the organizational metadata was gone. I spent a week of 12 hour days working with folks from Percona to recover the data from the ibdata files. The old backup was of no use to us since it was several months old, but it was helpful in that it provided us a mapping of the old table names to their InnoDB tablespace ids, a mapping destroyed by the TRUNCATE TABLE statements.

No disrespect to the OP but this sounds pretty fake. If the database in question was important enough to fire someone immediately over then there wouldn't have been the creds floating around on an onboarding pdf. And involving legal? Has anyone here heard of anything similar? I'm just 1 datapoint but I know I haven't.

Yeah, I thought it sounded fake as well. I mean things like this happen, but something about the story just doesn't ring true to me.

Realised the user account is 3 weeks old, which is a red flag for me since it has no posts and the events allegedly happened Friday

It's not the CTO's fault. It's the document's fault! We should never have documentation again, this is what it has done to us! We need to revert to tribal knowledge to protect ourselves. If we didn't document these values, people wouldn't be pasting them in places they shouldn't be!


For some years now I've stopped bothering with database passwords. If technically required I just make them the same as the username (or the database name, or all three the same if possible). Why? Because the security offered by such passwords is invariably a fiction in practice, I've never seen an org where they couldn't be dug out of docs or a wiki or test code. Instead database access should be enforced by network architecture: the production database can only be accessed by the production applications, running in the production LAN/VPC. With this setup no amount of accidental (or malicious) commands run by anyone from their local machine (or any other non production environment) could possibly damage the production data.

Side question, as a dev with zero previous ops experience, now the solo techie for a small company and learning ops on the fly, we're obviously in the situation where "all devs have direct, easy access to prod", since I'm the only dev. What steps should I take before bringing on a junior dev?

* Local env setup docs should have no production creds in it (EDIT: production creds should always be encrypted at rest)

* new dev should only have full access to local and dev envs, no write access to prod

* you're backing up all of your databases, right? Nightly is mandatory, hourly is better

* if you don't have a DBA, use RDS

That'll prevent the majority of weapons grade mistakes.

Source: 15 years in ops

Do the best you can to "find compute room" (laptop, desktop, spare servers on rack that aren't being used, ... cloud), and make a Stage.

Make changes to Stage after doing a "Change Management" process (effectively, document every change you plan to do, so that an average person typing them out would succeed). Test these changes. It's nicer if you have a test suite, but you won't at first.

Once testing is done and considered good, then make changes in accordance with the CM on prod. Make sure everything has a back-out procedure, even if it is "Drive to get backups, and restore". But most of these should be, "Copy config to /root/configs/$service/$date , then proceed to edit the live config". Backing out would entail restoring the backed-up config.
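The "copy config, then edit" back-out step can be sketched as a pair of small helpers. The function names and the date-based directory layout are my own reading of the `/root/configs/$service/$date` convention above, so treat them as assumptions.

```python
import shutil
from datetime import date
from pathlib import Path


def backup_config(live_config: Path, service: str,
                  backup_root: Path = Path("/root/configs")) -> Path:
    """Copy `live_config` to <backup_root>/<service>/<YYYY-MM-DD>/ and return the copy."""
    dest_dir = backup_root / service / date.today().isoformat()
    dest_dir.mkdir(parents=True, exist_ok=True)
    dest = dest_dir / live_config.name
    shutil.copy2(live_config, dest)  # copy2 also keeps mtime and permission bits
    return dest


def back_out(backup: Path, live_config: Path) -> None:
    """Restore the saved copy over the live config."""
    shutil.copy2(backup, live_config)
```

Note that a second backup on the same day overwrites the first with this layout; adding a timestamp to the directory name would avoid that.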


Edit: As an addendum, many places that are too small usually have insufficient, non-existent, or Schrödinger backups. Having a healthy living stage environment does 2 things:

1. You can stage changes so you don't get caught with your pants down on prod, and

2. It is a hot-swap for prod in the case Prod catches fire.

In all likelihood, "All" of prod wouldn't DIAF, but maybe a machine that houses the DB has power issues with their PSU's and fries the motherboard. You at least have a hot machine, even if it's stale data from yesterday's imported snapshot.

You missed one of the really nice points of having a stage there. You use it to test your backups by restoring from live every night/week. By doing that, you discourage developing on staging and you know for sure you have working backups!
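The restore-to-verify idea can be shown in miniature. This sketch uses SQLite purely as a stand-in for whatever database you actually run, and comparing per-table row counts is an assumed, fairly weak fingerprint; the point is only the shape: back up, restore somewhere else, then compare.

```python
import sqlite3


def table_counts(conn: sqlite3.Connection) -> dict[str, int]:
    """Row count per table, used as a cheap fingerprint of the data."""
    names = [row[0] for row in conn.execute(
        "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name")]
    return {name: conn.execute(f'SELECT COUNT(*) FROM "{name}"').fetchone()[0]
            for name in names}


def restore_and_verify(live: sqlite3.Connection, backup_path: str) -> bool:
    """Take a backup, restore it into a scratch database, and compare counts."""
    # 1. Back up the live database to a file.
    backup_file = sqlite3.connect(backup_path)
    live.backup(backup_file)
    backup_file.close()

    # 2. Restore that file into a fresh in-memory "staging" database.
    restored_from = sqlite3.connect(backup_path)
    staging = sqlite3.connect(":memory:")
    restored_from.backup(staging)
    restored_from.close()

    # 3. The backup only counts as working if the restored data matches.
    ok = table_counts(staging) == table_counts(live)
    staging.close()
    return ok
```

Run on a schedule, a `False` here is the early warning that your backups have quietly stopped being restorable.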

Indeed. But if it's just 1 guy who's the dev, I was trying to go for something that was rigorous yet still very maintainable.

Ideally, you want test->stage->prod , with puppet and ansible running on a VM fabric. Changes are made to stage and prod areas of puppet, with configuration management being backed by GIT or SVN or whatever for version control. Puppet manifests can be made and submitted to version control, with a guarantee that if you can code it, you know what you're doing. Ansible's there to run one-off commands, like reloading puppet (default is checkins every 30min)

And to be even more safe, you have hot backups in prod. Anything that runs in a critical environment can have hot backups or otherwise use HAproxy. For small instances, even things like DRBD can be a great help. Even MySQL, Postgres, Mongo and friends all support master/slave or sharding.

Generally, get more machines running the production dataset and tools, so if shit goes down, you have: machines ready to handle load, backup machines able to handle some load, full data backups onsite, and full data backups offsite. And the real ideal is that the data is available on 2 or 3 cloud compute platforms, so that if the absolute worst case scenario occurs, you can spin up VMs on AWS or GCE or Azure.

--Our solution for Mongo is ridiculous, but the best for backing up. The Mongo backup util doesn't guarantee any sort of consistency, so either you lock the whole DB (!) or you have the DB change underneath you while you back it up... So we do LVM snapshots on the filesystem layer and back those up. It's ridiculous that Mongo doesn't have this kind of transactional backup apparatus. But we needed time-series data storage. And mongodb was pretty much it.

As I said in another post, the least you can do is modify your hosts file so you can't access the production database from your local computer. Then you have to login to a remote computer to access production.

As advised elsewhere, before you have a DBA, you should consider buying a hosted service like RDS, which would provide at a minimum backups and restore points. Even have separate dev and prod accounts on RDS.

> before you have a DBA

You never don't have a DBA. If you don't know who it is, it's you! But there will always, always be someone who is held responsible for the security, integrity and availability of the company's asset.

Best rule of thumb whenever you're doing work as a solo dev/ops guy is to always think in terms of being two people: the normal you (with super user privs etc.) and the "junior dev/ops" you who just started his first day. Whatever you're working on needs to support both variants of you (with appropriate safeguards, checks and balances in place for junior you).

E.g. when deciding how to backup your prod database, if you're thinking as both "personas" you'll come up with a strategy that safely backs up the database but also makes it easy for a non-privileged user to securely suck down a (optionally sanitised) version of the latest snapshot for seeding their dev environment with [ and then dog food this by using the process to seed your own dev environment ].
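As a rough illustration of the "optionally sanitised" part, a per-row scrubber might look like this. The column names (`id`, `email`, `name`, `password_hash`) are hypothetical, and a truncated hash is only a pseudonym, not real anonymisation, so treat it as a starting point.

```python
import hashlib


def sanitise_user(row: dict) -> dict:
    """Return a copy of `row` that is safer to load into a dev database.

    Column names here are assumptions made for the example.
    """
    clean = dict(row)
    # A stable pseudonym keeps rows distinguishable without exposing the address.
    digest = hashlib.sha256(row["email"].encode("utf-8")).hexdigest()[:12]
    clean["email"] = f"user-{digest}@example.com"
    clean["name"] = f"User {row['id']}"
    clean["password_hash"] = ""  # nobody should log in as a real user in dev
    return clean
```

Applying this while exporting the snapshot means the unprivileged "junior you" never handles real customer data in the first place.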

Some other quick & easy things:

- Design your terraform/ansible/whatever scripts such that anything touching sensitive parts needs out of band ssh keys or credentials. E.g. if you have a terraform script that brings up your prod environment on AWS, make sure that the aws credentials file it needs isn't auto-provisioned alongside the script. Instead write down on a wiki somewhere that the team member (at the moment, you) who has authority to run that terraform script needs to manually drop his AWS credentials file in directory /x/y/z before running the script. Same goes for ansible: control and limit which ssh keys can login to different machines (don't use a single "devops" key that everyone shares and imports in to their keychains!). Think about what you'll need to do if a particular key gets compromised or a person leaves the team.

- Make sure your backups are working, taken regularly, stored in multiple places and encrypted before they leave the box being backed up. Borgbackup and rsync.net are a nice, easy solution for this.

- Make sure you test your backups!

- Don't check passwords/credentials in to source code without first encrypting them.

- Use sane, safe defaults in all scripts. Like another poster mentioned, don't do if env != "test"; do prod_stuff();

- RTFM and spend the extra 20 minutes to set things up correctly and securely rather than walking away the second you've got something "working" (thinking 'I'll come back later to tidy this up' - you never will).

- Follow the usual security guidelines: firewall machines (internally and externally), limit privileges, keep packages up to date, layer your defences, use a bastion machine for access to your hosted infrastructure

- Get in the habit of documenting shit so you can quickly put together a straight forward on-boarding/ops wiki in a few days if you suddenly do hire a junior dev (or just to help yourself when you're scratching your head 6 months later wondering why you did something a certain way)
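To illustrate the first point in the list above (credentials supplied out of band), here is a minimal sketch of a guard a deploy script could call before touching anything sensitive. The exception name, function name and permission check are my own assumptions, not part of any particular tool.

```python
import os
from pathlib import Path


class MissingCredentials(RuntimeError):
    """Raised when the operator has not supplied credentials by hand."""


def require_out_of_band_credentials(path: Path) -> Path:
    """Refuse to proceed unless `path` exists and is private to its owner."""
    if not path.is_file():
        raise MissingCredentials(
            f"{path} not found; it is never provisioned automatically, "
            "an authorised team member must place it there by hand")
    # On POSIX systems, reject files that group or others can access.
    if os.name == "posix" and path.stat().st_mode & 0o077:
        raise MissingCredentials(
            f"{path} is accessible by group or others; chmod 600 it")
    return path
```

The script then fails closed: someone who merely checked out the repo and ran it gets an error message instead of a production deployment.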

The author should get their own legal in line - does the contract even allow termination on the spot? If not, the employer is just adding to their own pile of ridiculous mistakes.

Probably. At will employment is pretty common in the US.

Even in Europe it's pretty lenient for the first period. Different countries obviously have different maximum probation periods but day 1 would fall into it in most (all?) of them

Are we sure it's the USA? Some of the language used in the poor guy's post on Reddit implies a non-native speaker. I am guessing India, which is known for treating employees "horrifically".

land of the uninsured, non unionized free

When America was more deservingly called the "land of the free" in the 40s and 50s, they were also heavily unionized.

(Well, then it was "land of the free, blacks excepted")

The ending with taking the laptop home though... He is a modern-day Dostoevsky.

One of the questions I asked my manager during the interview process was how did he feel about mistakes?

I knew I was being brought in to rearchitect the entire development process for an IT department and that I would make architectural mistakes no matter how careful I was and that I would probably make mistakes that would have to be explained to CxOs.

Whatever the answer he gave me, I remember being satisfied with it.

Reminds me of my first dev job, when I got a call during lunch:

"The server has been down all day, and you are the only one who hasn't noticed. What did you break?"

"Well, I saw that all the files were moved to `/var/www/`, and figured it was on purpose."

Suffice it to say, I got that business to go from Filezilla as root to BitBucket with git and some update scripts.

Something tells me their production password was nothing like a 20-char random string...

Am I the only one who is surprised that he can get the keys to the kingdom on day 1?

Day 1 is when you set up your desk and get your login. Then go back to HR to do the last hiring paperwork.

It should take a good week before a new employee is able to fuck up anything. Really.

How long do you want to adjust the height of your chair? Setting up the dev environment often takes ages. Why wouldn't it be the first thing to do? There will be enough progress bars while updating something like Visual Studio to find time to re-adjust the chair.

Hilarious. I wonder if it's true.

This happened to a friend at a new job a few weeks ago. He wasn't fired, though.

If the bit about no working backups is also true, he's likely to need a new job anyway. :)

almost certainly not.

I did the same thing early on in my career. Shut down several major ski-resorts in Sweden for an entire day during booking season by doing what we always did, running untested code in production to put out fires. Luckily, my company and our customers took that as a cue to tighten up the procedures instead of finding someone to blame. I hear this is how it works in aviation as well, no one gets blamed for mistakes since that only prevents them from being dealt with properly. Most of us are humans, humans make mistakes. The goal is to minimize the risk for mistakes.

I stopped believing reddit posts a long time ago

Exactly, the post is very clichéd. I have about 75% belief that it's fictional. I guess it could be sort of entertaining to see how easy it is to get a few hundred software engineers on reddit and hacker news worked up into a sympathetic and self-righteous frenzy with a simple and entirely fictional paragraph posted for free from a throwaway account.

I am about 101% it's fake. "Unfortunately apparently those values were actually for the production database (why they are documented in the dev setup guide i have no idea)" - yeah, no. Had you told me you were able to screw the production db up because it had no su password set, you might have got me. But this is bullshit.

this, it looks like a hoax.

Technical infrastructure is often the ultimate in hostile work environments. Every edge is sharp, and great fire-breathing dragons hide in the most innocuous of places. If it's your shop, then you are going to have a basic understanding of the safety pitfalls, but you're going to have no clue as to the true severity of the situation.

If you introduce a junior dev into this environment, then it's him that is going to discover those pitfalls, in the most catastrophic ways possible. But even experienced developers can blunder into pitfalls. At least twice I've accidentally deployed to production, or otherwise run a powerful command intended to be used in a development environment on production.

Each time, I carefully analyzed the steps that led up to running that command and implemented safety checks to keep that from happening again. I put all of my configuration into a single environment file so I can see at a glance the state of my shop. I make little tweaks to the project all the time to maintain this, which can be difficult because the three devs on the project work in slightly different ways and the codebase has to be able to accommodate all of us.

While this is all well and good, my project has a positively decadent level of funding. I can lavish all the time I want in making my shop nice and pretty and safe.

A growing business concern can not afford to hire junior devs fresh out of code school / college. That's the real problem here. Not the CTO's incompetence, any new-ish CTO in a startup is going to be incompetent.

The startup simply hired too fast.

The same thing could happen to a senior. In particular, to a tired, overworked senior. It is more likely to happen to a junior, because a junior is likely to be overwhelmed. However, mistakes like this happen to people of all ages and experience levels.

Seniority is what makes you not put the damn password into the setup document. That was an inexperience-level mistake. Forgetting to replace it while you are setting up your machine on day one is a mistake that can happen to anyone.

True, but a senior engineer, even if he is never able to make architecture decisions, can still be held accountable for knowing better. That is precisely why they are paying him the big bucks.

If a shop is being held together with duct tape and elbow grease, then you should have known that going in, and developed personal habits to avoid this sort of thing. Being overworked and tired isn't an excuse. Sure, the company and investors have to bear the real consequences, but you as an IC can't disclaim responsibility.

This company has a completely different problem: no separation of duties. Start with talking to the CTO how this could have happened in the first place, re-hire the junior dev.

After all, if the junior dev could do it, so can everybody else (and whoever manages to get their account credentials).

When it comes to backup, there are two types of people, ones who do backup and ones who will do backup.

This is purely the fault of the entire leadership stack.

From Sr dev/lead dev, dev manager, architect, ops stack, all the directors, A/S/VPs, and finally the CTO. You could even blame the CEO for not knowing how to manage or qualify a CTO. Even more embarrassing is if your company is a tech company.

I think a proper due diligence would find the fault in the existing company.

It is not secure to give production access and passwords to a junior dev. And if you do, you put controls in place. I think if there is insurance in place some of the requirements would have to be reasonable access controls.

This company might find itself sued by customers for their prior and obviously premeditated negligence from lack of access controls (the doc, the fact they told you 'how' to handle the doc).

The Junior dev does bear a small amount of blame, if you really want to go the blameful route.

But figuring out who to blame is toxic. You've got to go for a blameless culture and instead focus on post mortems and following new and better processes.

Things can absolutely always go to shit no matter where you work or how stupidly they went to shit. What differentiates good companies from bad ones is whether they try to maximize the learning from the incident or not.

Ahhhhh haaaa yeah.....I've done that.

It was the second day, and I only wiped out a column from a table, but it was enough to bring business for several hundred people down for a few hours. It was embarrassing all round really. Live and learn though - at least I didn't get fired!

Obviously this is mostly CTO's screw up.

But the junior dev is not fully innocent either: he should have been careful about following instructions.

For extra points (to prove that he is a good developer) - he should have caught that screw up with username/passwords in the instruction. Here's the approximate line of reasoning:

What is that username in the instruction responsible for? For production environment? Then what would happen if I actually run my setup script in production environment? Production database would be wiped? Shouldn't we update setup instruction and some other practices in order to make sure it would not happen by accident?

But it is very unlikely that this junior dev would be legally responsible for the screw up.

I destroyed an accounting database at a company during a high school summer job.

A mentor was supervising me and continually told me to work slower but I was doing great performing some minor maintenance on a Clipper application and didn't even need his "stupid" help ... until I typed 'del .db' instead of 'del .bak'. Oooops!

Luckily the woman whose computer we were working on clicked 'Backup my data' every single day before going home, bless her heart, and we could copy the database back from a backup folder. A 16 year old me was left utterly embarrassed and cured of flaunting his 1337 skillz.

Obviously not the new engineer's fault. Unfortunately, aspects of this are incredibly common. On three jobs I've had, I've had full production access on day one. By that, I mean everyone had it...

Story sounds fictional to me.

He's / she's better off not working at this place. So many things wrong. Not having a backup is the number 1 thing.

I could see having a backup that is hours old, and losing many hours of data, but not everything.

Even startups have contracts with their customers about protecting the customer's data. If it is consumer data, there are even stricter privacy laws. Leaving the production database password lying around in plain text is probably explicitly prohibited by the contracts, and certainly by privacy laws. The CTO should pay him for the rest of the year and give him a great reference for his next job, in return for him to never, ever, ever, tell anyone where he found the production password.

Here's why I think this is fake:

A company with 40 devs and over 100 employees that lost an entire production db would have surfaced here from the downtime. Other devs would corroborate the story.

I'm also skeptical, but this isn't necessarily true. There's plenty of software being written outside the HN bubble that's totally invisible to us. What if this was some shipping logistics company in Texas City? We'd never know about it; they wouldn't have a trendy dev blog on Medium.

Good point.

Assuming the CTO was honest about what happened.

I always wonder, why don't IT companies test their backups? Even if it's the prod db, it should be tested on a regular basis. No blame to the dev.

We were paying for RDS right from when we were a 2 man startup. Zero reasons to not have a DB service that is backed up frequently by a competent team.

He needs to return the laptop asap, like now. They are in full emotional mode and can overreact to what they might perceive as another bad act too.

I don't work in tech but I'm an avid HN reader.

I'm surprised a junior dev on his first day isn't buddied up with an existing team member.

In my line of work, an existing employee who transferred from another location would probably be thrown in at the deep end, but someone who is new would spend some time working alongside someone who is friendly and knowledgeable. This seems the decent thing to do as humans.

Yeah this infra/config management sounds like land-mine / time bomb incompetence territory. You just were the unlucky one to trigger it. Luckily this gives you an opportunity to work elsewhere and hopefully be in a better place to learn some good practices - which is really what you're after as a junior dev anyway.

Lucky junior dev! He has figured out a bad company to work for in his first work day. Good luck finding a new job!

Also, this is going to look great on their resumé, and be the perfect response to the "tell us a time when you made a mistake" interview question.

Everybody agrees that the instructions shouldn't have even had credentials for the production database, and the lion's share of the blame goes to whoever was responsible for that.

There is still a valuable lesson for the developer here though - double check everything, and don't screw up. Over the course of a programming career, there will be times when you're operating directly on a production environment, and one misstep can spell disaster - meaning you need to follow plans and instructions precisely.

Setting up your development environment on your first day shouldn't be one of those times, but those times do exist. Over the course of a job or career at a stable company, it's generally not the "rockstar" developers and risk-takers that get ahead, it's the slow and steady people that take the extra time and never mess up.

Although firing this guy seems really harsh, especially as he had just moved and everything, the thought process of the company was probably not so much that he messed up the database that day, but that they'd never be able to trust him with actual production work down the line.

No, sorry, and it's important to address this line of thinking because it goes strongly against what our top engineering cultures have learned about building robust systems.

> Over the course of a programming career, there will be times when you're operating directly on a production environment, and one misstep can spell disaster

These times should be extremely rare, and even in this case, they should've had backups that worked. The idea is to reduce the ability of anyone to destroy the system, not to "just be extra careful when doing something that could destroy the system."

> Although firing this guy seems really harsh, especially as he had just moved and everything, the thought process of the company was probably not so much that he messed up the database that day, but that they'd never be able to trust him with actual production work down the line.

Which tells me that this company will have issues again. Look at any high-functioning, high-risk environment and look at the way they handle accidents, especially in manufacturing. You need to look at the overarching system that enabled this behavior, not isolate it down to the single person who happened to be the guy to make the mistake today. If someone has a long track record of constantly fucking up, yeah sure, maybe it's time for them to move on, but it's very easy to see how anyone could make this mistake, and so the solution needs to be to fix the system, not the individual.

In fact I'd even thank the individual in this case for pointing out a disastrous flaw in the processes today rather than tomorrow, when it's one more day's worth of expensiveness to fix.

Take a look at this: https://codeascraft.com/2012/05/22/blameless-postmortems/

I violently agree with you.

All I'm saying is that there are times when it is vital to get things right. Maybe it's only once every 5 or 10 years in a DR scenario, but those times do exist. Definitely this company is incompetent, deserves to go out of business, and the developer did himself a favor by not working there long-term, although the mechanism wasn't ideal.

I'm just saying that the blame is about 99.9% with the company, and 0.1% for the developer - there is still a lesson here for the developer - i.e., take care when executing instructions, and don't rely on other people to have gotten everything right and to have made it impossible for you to mess up. I don't see it as 100% and 0%, and arguing that the developer is 0% responsible denies them a learning opportunity.

Well, sure... but you can't expect someone transitioning from intern status to first-real-job status to have the forethought of a 20-year veteran, nor should that intern/employee have to expect that the company that is ostensibly there to mentor him at the very beginning of his career would have such a poor security stance as to have literal prod creds in an onboarding document, let alone fail to relegate whatever he was onboarding with to a sandbox with absolutely no access to anything.

Not to be pedantic, but the fact that you are literally assigning percentage blames to entities means you do not, in fact, violently agree with me. Read the article I posted and you'll see why it is so important not to assign blame at all.

While working on AWS, we had data corruption caused by a new feature launch. Deployments took ~6 weeks so the solution was to use GDB to flip a feature flag in memory for about 120k servers.

> There is still a valuable lesson for the developer here though - double check everything, and don't screw up.

"Double check everything" is a good lesson, because we all can and should practice it.

"Don't screw up" is not useful advice because it's impossible. There's a reason we don't work like that... Who needs backups? Just don't screw anything up! Staging environment? Bah, just don't screw up deployments! Restricted root access? Nah, just never type the wrong command. No, we need systems that mitigate humans screwing up, because it will happen.

> the thought process of the company was probably not so much that he messed up the database that day, but that they'd never be able to trust him with actual production work down the line.

I think that they simply acted emotionally and out of fear, anger and stress. The vague legal threat and otherwise ignoring this dude both suggest it. The way events unfolded, it does not sound like much rational thinking was involved.

Cool story but I think this is fake. Since there are 40 people in the company, it seems like at least a few people before him followed the onboarding instructions. I just don't believe that there would be that many people who a) didn't do the same thing he did or b) didn't change the document.

Repeat after me, while clicking your heels together three times, "It is not my fault. It is not my fault. It is not my fault." It was obvious as I read your account that you would be fired. A company that allowed this scenario to unfold would not understand that it was their fault.

I was only granted read-only access to the Prod DB last week, after achieving 6 months of seniority.

I would assume this was mocked to test if the intern could follow simple instructions, to provide a lecture for the huge consequences of small mistakes and to have a viable reason to fire consequently; but I'm wearing my tin foil hat right now, too.

It is really unfair to have fired him. The OP is not the one that should have been fired. The guy in charge of the DB should be fired and the manager who fired the OP should be fired too. And, by the way, the guy in charge of the backups too.

Doesn't this new person deserve a peer bonus for discovering a production risk?

I would suggest that, once this is sorted out, you publicly name the company so no other engineer falls into this trap again. That would be a lesson for them to properly follow basic practices for data storage.

Unfortunately, software companies like that are everywhere. The guy is learning and screws up a terribly designed system; the blame is on the "senior" engineers that set up that environment.

My question is, why in the world did they publish someone's production credentials in an onboarding document? That has to be a SOX compliance violation at the very least.

The CTO should be fired immediately!

If I didn't read it wrong, they wrote production DB credentials in the first-day local dev env instructions! WTF.

This CTO sounds to me even worse than this junior developer.

So a script practically set up the machine with the nuclear football by default, and then you were expected to defuse it before using it. That is not your fault.

I have a feeling the CTO was actually one of those "I just graduated bootcamp and started a website, so I can inflate my title 10x" types.

Should have job title changed to Junior Penetration Tester and be rewarded for exposing an outfit of highly questionable competence.

Firing the guy seems drastic but understandable. Implying that they are going to take legal action against him is ridiculous!

So the company's fault. Embarrassing they tried to blame the new guy. So many things wrong with this.

Wow. What a train wreck. This is why the documentation I write contains database URIs like:


It was their fault, plain and simple.

How is it the FNG's fault that they have no backups, no DR plan, and production DB details freely available and in the setup guide? The company is entirely at fault.

I think the "their" was a plural referencing the company, not the dev.

Did you read the details? I disagree. The dev is probably better off not working in such a volatile environment. They'll be better off working somewhere where they can learn some best practices, possibly somewhere that doesn't have the possibility of wiping a production DB because you ran some tests from a developer's machine. That's incorrigible.

> It was their fault, plain and simple.

<off-topic>A wonderful example of the shortcoming of the singular "they"... :P

I really suggest that OP sends this thread to HR & others. And this isn't sarcasm

What company is this? CTO deserves some internet slapping.

Obviously not an SSAE 16 environment.

Either the CTO and his dev team are ridiculously stupid, or this was on purpose.

Lots of people in the thread are commenting how surprised they are that a junior dev has access to production db. Both jobs I've had since graduating gave me more or less complete access to production systems from day one. I think in startup land - where devops takes a back seat to product - it's probably very common.

I work for a large bank as an ops engineer. The idea that I could even read a production database without password approval from someone else is too crazy to consider. Updating or deleting takes a small committee and a sizeable "paper trail" to approve.

Sometimes when I read stories like these, I think it's no wonder a company like WhatsApp can have a billion customers with less than 100 employees. And then I make some backups to get that cozy safe feeling again.

Probably because your industry is regulated.

Which doesn't abdicate responsibility from the CTO of the company to have practices in place that could have prevented this. While I'm going to hold my breath on being threatened with legal claims, to be fired for something that any person in the building could have done doesn't sound like a conducive environment to work in.

absolve not abdicate. you abdicate a throne, you are absolved of responsibility

You also abdicate responsibilities; ie you leave them, as a king leaves a throne. You can abdicate a philosophical position, just means you reject it.

Eg https://en.m.wiktionary.org/wiki/abdicate

Nonetheless, "X doesn't abdicate responsibility from the CTO.." doesn't work - it would be the CTO that's abdicating his own responsibility.

People don't always language well.

I didn't look at comments until today, I love that the primary thread I spawned was the semantic differences between absolve and abdicate. I appreciate the criticism, I learned quite a bit about the differences :)

I think people are trying to say abrogate. I've seen this usage a few times lately.

That connotes a degree of misfeasance, though - to say one has abrogated a responsibility is to say he has failed to uphold it, where the sense intended here is more one of absolution or relief.

Correct, regarding his usage as a synonym for excuse, forgive or mitigate.

However, you can abdicate a responsibility. One can say for example that the CTO abdicated his responsibility to ensure the production database was protected and backed up.

Abdicate is correct here. It also means "to cast off", although that meaning is seldom used anywhere other than the phrase "abdicating responsibility".

Edit: not sure why I'm being down voted. I am correct: https://www.merriam-webster.com/dictionary/abdicate

"The CTO abdicated his responsibility" would be okay, but "This doesn't abdicate the CTO of his responsibility" doesn't work.

> I think in startup land - where devops takes a back seat to product - it's probably very common.

It is, but it need not be. It's pretty easy to set up at least some backup system in such a way that whoever can wipe the production systems can't also wipe the back-up.

In this particular case, the OP indicated his team size was 40 and the whole company was about 100. I'd argue that's beyond "everyone has prod access" size.


It's a pretty gross error to me to have direct DB access. Obviously in any stack you could push code that affects the DB catastrophically anyway, but in dev mode you should never connect directly to the production database, not only because of this error but for general data integrity.

The CTO needs to put on a bit of theatre to avoid getting fired, because he is ultimately responsible.

As a junior NOC analyst at a Fortune 500 company, I had root access to almost the entire corporate infrastructure from day one. Databases, front ends, provisioning tools, everything.

It's not about having access to the production database. It's about having an example script that can do catastrophic things and having the production username and password in the example docs.
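One way to make an example script less catastrophic is to have it refuse to run against anything that is not a local database. The sketch below is an assumption about how such a guard might look, not what the company in the story had; `LOCAL_HOSTS`, `assert_safe_to_wipe` and `reset_database` are hypothetical names, and the URLs are made up.

```python
from urllib.parse import urlparse

# Hypothetical allow-list: the only hosts a setup or test script may wipe.
LOCAL_HOSTS = {"localhost", "127.0.0.1", "::1"}


def assert_safe_to_wipe(db_url: str) -> None:
    """Raise unless db_url points at a local database host."""
    host = urlparse(db_url).hostname
    if host not in LOCAL_HOSTS:
        raise RuntimeError(f"refusing to wipe non-local database host: {host!r}")


def reset_database(db_url: str) -> str:
    """Placeholder for a destructive setup step, guarded by the check above."""
    assert_safe_to_wipe(db_url)
    # A real script would drop and recreate the schema here.
    return f"reset {urlparse(db_url).hostname}"


print(reset_database("postgres://dev:dev@localhost:5432/app"))
```

A guard like this is no substitute for keeping production credentials out of the docs, but it turns a copy-paste mistake into an error message instead of an outage.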

I also had production access to everything from day one. The first thing I did was set up the hosts files on the various dev servers, including my own computer, so I couldn't access the databases from them.
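A rough sketch of the hosts-file trick, assuming the production database hostnames are known and that pointing them at a dead-end address is acceptable on dev machines. The hostnames and the `hosts_block_lines` helper are hypothetical; the comment does not say which address was actually used.

```python
def hosts_block_lines(hostnames, sink="0.0.0.0"):
    """Return hosts-file lines mapping each hostname to a dead-end address."""
    return [f"{sink} {name}" for name in sorted(set(hostnames))]


# Hypothetical production hostnames, for illustration only.
PROD_DB_HOSTS = ["db-primary.prod.example.com", "db-replica.prod.example.com"]

# These lines would be appended to the hosts file on each dev machine.
print("\n".join(hosts_block_lines(PROD_DB_HOSTS)))
```

This only blocks access by name, so it is a seatbelt rather than a firewall; connecting by IP address would still work.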

I have to remote into another computer to access them.

In that case don't you think they should have informed him that this is a production environment, so don't fuck it up? Giving your junior dev a front seat on your production without proper communication is a disaster waiting to happen.

> I think in startup land - where devops takes a back seat to product - it's probably very common.

Perhaps with hipster databases like MongoDB that are insecure out of the box, but most grown-up, sensible DBs have the concept of read-only users, and also it is trivial to set up such that you can e.g. DELETE data in a table without being able to DROP that table.
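To illustrate the DELETE-without-DROP idea in runnable form, here is a sketch using SQLite's authorizer callback from Python's standard library. This is a stand-in: server databases would normally do this with per-user privileges (GRANT and friends) rather than a callback, and the `users` table and `deny_drop` function are made up for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT)")
conn.executemany("INSERT INTO users (name) VALUES (?)", [("ann",), ("bob",)])
conn.commit()


def deny_drop(action, arg1, arg2, db_name, source):
    """Authorizer callback: allow everything except DROP TABLE."""
    if action == sqlite3.SQLITE_DROP_TABLE:
        return sqlite3.SQLITE_DENY
    return sqlite3.SQLITE_OK


conn.set_authorizer(deny_drop)

# Deleting rows is still permitted.
conn.execute("DELETE FROM users WHERE name = 'bob'")
remaining = conn.execute("SELECT COUNT(*) FROM users").fetchone()[0]

# Dropping the table is rejected by the authorizer.
try:
    conn.execute("DROP TABLE users")
    drop_blocked = False
except sqlite3.Error:
    drop_blocked = True

print(remaining, drop_blocked)
```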

I'll wager any startup that does like you say has devs that do SELECT * FROM TABLE; on a 20-col, million+ row table only actually wanting one value from one row, but they filter in their own code... Yes I have seen this more times than I care to count.
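For anyone who has not seen that pattern, a small sketch of the difference, again using `sqlite3` with a made-up `accounts` table. On three rows it makes no difference; on a million-row, 20-column table the first version ships the whole table over the wire to read one value.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE accounts (id INTEGER PRIMARY KEY, email TEXT, plan TEXT)"
)
conn.executemany(
    "INSERT INTO accounts (id, email, plan) VALUES (?, ?, ?)",
    [
        (1, "a@example.com", "free"),
        (2, "b@example.com", "pro"),
        (3, "c@example.com", "free"),
    ],
)

# Anti-pattern: pull every row and column, then filter in application code.
all_rows = conn.execute("SELECT * FROM accounts").fetchall()
plan_slow = next(row[2] for row in all_rows if row[0] == 2)

# Better: let the database filter and return only the value that is needed.
plan_fast = conn.execute(
    "SELECT plan FROM accounts WHERE id = ?", (2,)
).fetchone()[0]

print(plan_slow == plan_fast)
```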

I agree.

But not in the commands to run in a local-dev-setup-guide that purges the db it points to.

If anyone should be fired for that, it's the CTO. He must suck at his job at the very least, and junior dev should get an apology.

> I think in startup land - where devops takes a back seat to product - it's probably very common.

Not focusing on devops and putting your prod DB credentials in plain sight are VERY different things. It's really really easy to do, especially if you are using Heroku or something like that. Same thing goes with database backups. I worked at multiple startups (YC and non) and they all had the basics nailed down, even when they were just 3-5 employees.

The problem isn't so much that a junior had access to the production DB. The problem is that the junior's dev setup had access to the production DB and could nuke the whole thing with a few misplaced keystrokes. I'm working on a product currently where I am the only dev. I have a pretty large production DB. I also have a smaller clone of that same DB on my local machine for development purposes. I can only access the production DB by directly shelling into the machine it's running on or performing management commands on one of the production worker machines (which I also need to shell into). This was not very difficult to set up and ensures that my development environment cannot in any way affect the production environment.

Also, why even distribute the production credentials at all? Only the most senior DBAs or devs should have access to production credentials.

I've done about 40 technology due diligence projects for investors of tech companies. You'd be amazed how many security flaws there are out there. One of the simplest: storing production credentials in the git repo.

Sure, there exist reasons to do that. It's still a bad idea, but, ok.

But there is no reason to write the production DB credentials in a document, especially as an "example". That is monumentally dumb. It amounts to asking for this to happen.

We give everyone access to production systems, but even if someone deleted everything from production, we can restore everything in ~20 minutes (this has happened), and if that process fails, we have backups on s3 that can be restored in a couple of hours (and this is tested regularly, but thankfully hasn't happened yet), and even if that fails...

There's a reason why it's called disaster recovery and prevention.

Why try to justify stupid behavior and absent security controls with the idea that your downtime is "only" 20 minutes? How silly.

I've only had one job after college, and I am still there almost a year after being hired. The first few months I only had access to my own local copy of the production DB. Though there's a reason or two why I wouldn't be outright trusted, one of them stemming from me being a junior.

Wow, really?!? I've never granted access to prod databases on day one, junior or not.

I thought that was just SOP.

Can you send me your password again, I forgot it. Also, please reply to those emails from emily - or just delete them... they are cluttering up your inbox, and I am tired of having to sort through your guys' personal crap.


Same. I had instructions from a competent developer however. I would still blame whoever allowed production access as part of application setup, as well as the fact there isn't a process to back up this production data.

It's not that this shouldn't happen, but that it does happen and has to be dealt with as the potential impact scales up. Having production creds on day 1 isn't the same as day 500.

Perhaps, but in 2017 it is gross fiduciary malpractice not to have backup systems in place for production data and code. It would be grounds for a shareholder suit against the principals.
