Hacker News

So storytime! I worked at Twitter as a contractor in 2008 (my job was to make internal hockey-stick graphs of usage to impress investors) during the Fail Whale era. The site would go down pretty much daily, and every time the ops team brought it back up, Twitter's VCs would send over a few bottles of really fancy imported Belgian beer (the kind with elaborate wire bottle caps that tell you it's expensive).

I would intercept these rewards and put them in my backpack for the bus ride home, in order to avoid creating perverse incentives for the operations team. But did anyone call me 'hero'?

Also at that time, I remember asking the head DB guy about a specific metric, and he ran a live query against the database in front of me. It took a while to return, so he used the time to explain how, in an ordinary setup, the query would have locked all the tables and brought down the entire site, but he was using special SQL-fu to make it run transparently.

We got so engrossed in the details of this topic that half an hour passed before we noticed that everyone had stopped working and was running around in a frenzy. Someone finally ran over and asked him if he was doing a query, he hit Control-C, and Twitter came back up.




I worked there at the time and ended up running the software infrastructure teams that fixed all these problems. The beer wasn't a reward; it was because people were stressed and morale was low. Nobody brought the site down on purpose.

What really made me mad was when we hired consultants and the contract would end, usually without much success because Twitter's problems were not normal problems, and then they would send us a fancy gift basket bought with our own wasted money.

Maciej, we are still waiting for you to ship the executive dashboard.


That dashboard supported something like a dozen people over its lifetime. One person would start writing it, then quit, and be replaced by another person who rewrote it in their preferred language, and then the cycle would repeat.

It was a VC-funded welfare program for slackers and I miss it greatly.


I lol'd at "welfare program for slackers" - that's the dream, really... find a chaotic workplace that lets you play with your favorite languages, with no tangible outcome expected.


To take the history of direct queries at Twitter even further back, I built a web interface at Odeo for the CEO to run direct queries against the database (and save them so he could re-run them). There were some basic security precautions, but this was totally cowboy.

That Odeo team was filled with best practices aficionados and the management (including me) was a bit cowardly about being clear that "WE ARE FAILING HARD AND FAST." Damn the practices.

So of course the engineering team freaked out, especially since the CEO managed to find lots of queries that did take the site down.

But I honestly credit that as one of the biggest things that I contributed to Twitter. Having easy SQL access let the CEO dig into the data for hours, ask any question he wanted, double check it, etc. He was able to really explore the bigger question, "Is Odeo working?"

The answer was no. And that's how he decided to fully staff Twitter (twttr then) as a side project, buy back the assets, and set Twitter up as its own thing.

I think that it really was very close--if we'd moved any slower we would have run out of money before anyone was ready to commit to Twitter. Same story about Rails--without being able to do rapid prototyping we never would have convinced ourselves that Twitter was a thing.


Just a quick note not directed at OP but for any other engineers that may be unaware, these days AWS makes provisioning a read replica painless, and you can point the CEO to up-to-the-minute data while essentially firewalling the queries from customer operations.


how?


First Google result for "aws read replicas": https://aws.amazon.com/rds/details/read-replicas/

> Using the AWS Management Console, you can easily add read replicas to existing DB Instances. Use the "Create Read Replica" option corresponding to your DB Instance in the AWS Management Console.
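For the CLI-inclined, the same thing is roughly a one-liner (the instance identifiers here are hypothetical):

```
# Spin up a read replica of a hypothetical production instance.
# Point the analytics/dashboard crowd at the replica's endpoint
# and their queries can't touch the primary.
aws rds create-db-read-replica \
    --db-instance-identifier prod-db-reporting-replica \
    --source-db-instance-identifier prod-db
```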


Why not have it run against a replicated copy? I did that in the past; it works amazingly, and they can f* up all they want without any consequences.


This was 2005. We had dedicated servers in our own cage. I can't remember if we already had replicas. It seems plausible. But actually spinning up a new one would have required more work and convincing than I wanted to do.


It's probably easy to do if you know it's an issue to begin with. I've run into this scenario before (running sql queries to read data that turned out to lock everything) and it caught me by surprise. Why would a read query cause the database to lock anything? I thought databases did stuff like multiversion concurrency control to make locks like that unnecessary.


Doing large queries on a Postgres standby had the potential to mess up the master, depending on configuration settings.
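For the curious, the trade-off lives in a couple of settings (a sketch; the values are illustrative):

```
# postgresql.conf on the standby -- illustrative values

# How long WAL replay will wait for a long read query before
# cancelling it ("canceling statement due to conflict with recovery"):
max_standby_streaming_delay = 5min

# Alternatively, report the standby's oldest snapshot back to the
# primary so vacuum keeps the rows it needs -- this is what can
# "mess up the master", via table bloat:
hot_standby_feedback = on
```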


Thanks for sharing. Out of curiosity, why was the answer no? Was the issue the downtime or something more subtle?


I think in the end he lost faith over retention. We got a lot of traffic and new users but didn't keep any of it. He was already suspicious that iTunes was going to kill us and so the stats were the nail in that coffin. He was right. We were ten years too early to podcasting.


This reminded me of something too!

I used to work (on the backend) on a popular app (in my country) which had a good number of users. One day I was asked to work with some infra/sysadmin folks who wanted to fix some issues with the servers in our inventory. We happily updated kernels and even rebooted servers a few times. I came back to my team and found them deeply engrossed in production logs. Turns out a few of the servers that were "fixed" were actually production servers. I almost shouted the F word when I listed all the IPs. The confusion happened because the server guys used data IPs and we used management IPs. This exposed serious miscommunication between our teams. But fun times indeed!


> It took a while to return, so he used the time to explain how, in an ordinary setup ...

This one was visible from such a great distance, it's a wonder neither of you spotted it as it happened! I love your post — reminds me of BOFH :)


The guy had an amazing beard, with streaks of white in it! He looked like a great wizard to me. I remember even as we noticed people were frantic, saying to one another "oh man, another outage, thank goodness it's not us!"


And now it's a full-blown sitcom scene


A true BOFH would have either disposed of any witness or made them the culprit.


A true BOFH works with what he’s got, and when what he’s got is a fool willing to do all his work for him, then it’s time to implement Plan A: sit back and enjoy the fireworks.


> The site would go down pretty much daily, and every time the ops team brought it back up, Twitter's VCs would send over a few bottles of really fancy imported Belgian beer

Never understood this mentality but have seen it at many companies. Rewarding someone or some team for heroically fixing something after a catastrophic failure. Talk about misaligned incentives! Reminds me of the Cobra Effect [1]. When you reward “fixing a bad thing” you will get more of the bad thing to be fixed.

1: https://en.wikipedia.org/wiki/Cobra_effect


Seems like maybe you want to reward fire fighters and also reward fire prevention?


boy, do i have a podcast episode for you: https://casefilepodcast.com/case-98-the-pillow-pyro/


from a complete rando: thanks for posting this — will listen to it later today.


This gives me hope that one day I will be able to run a startup. The big tech companies aren't too different than the rest of us after all...


Agreed, the only showstopper for me is the money and talent. It is still a struggle to find talented people who want to work for a startup.


Even harder to find ones that wish to remain working for a startup!


This is hilarious, thanks for sharing. I used to work at companies like this, except they weren't worth billions of dollars.


Neither was Twitter in 2008; it didn't reach a $1B valuation until the end of 2009.


The story is most probably not true. Love the taco tunnel though :)

Edit: apparently the stories actually are true.


This is the same group of folks who wrote the infamous ranty blog post shitting all over Rails back in...'11(?), when it was pretty clear that their workload wasn't suited to an RDBMS and ActiveRecord. They wrote their own message queue twice, despite suggestions to use known tools, before eventually giving up.


That’s hilarious. Reminds me of a clip from the show Silicon Valley.


Is that beer story satire?


No, it is true.


Is it actually really true? The second part, too? I thought this can't be true and must be a (good) story just to amuse the readers - I guess I was wrong.


I worked there for a bit. Sometime around 2014 I dropped a production DB table (via a fat finger, shouldn’t have even been possible in hindsight). It wasn’t consumer facing but the internal effect made it look like all of Twitter was simultaneously down. Mass hysteria there for 20 min or so.


Can someone explain the joke (about the beer) because I genuinely don't understand

edit: pretty please


Each time the ops team brought Twitter back up, they received good beer. So it would also mean that each time Twitter went down, they could expect to receive beer. Without idlewords' actions, they would have had an incentive (good beer) for letting Twitter keep going down and not doing the work to improve its stability.


Under the guise of preventing the ops team from being incentivized to create outages, he was selflessly stealing all the nice beer for himself.


He took the beer because he wanted it. "Perverse incentives" are an excuse, because nobody is going to kill their production servers and all the panic that entails for like $10 worth of beer.


Sounds like the guy was bragging about his SQL skills to avoid locking the database but ended up locking the database anyway (thus, people running around)


If the ops team got beer every time the servers went down (as a reward for fixing them) then they'd have an incentive for the servers to go down.


We all understand the perverse incentives joke, I think what's confusing people here is whether there's some other hidden joke they're missing that suggests not to take OP at his word that yes, he did make off with someone else's gift, which is generally considered a dick move.


What the hell are all of you smoking, some moderately expensive alcohol is nowhere near enough reward to take down a service.


If it was a sure thing that the ops engineers were doing that, then sure, it'd be kinda funny. Otherwise it just seems like a dick move.


The alcohol was an incentive to bring the service back up quickly, but not an incentive to prevent it going down in the first place. Twitter was going down often enough on its own that nobody needed to be motivated to help it crash (except that bringing it back up sooner gives it another opportunity to crash again sooner).


Operant conditioning is a thing and it works.

While I and you would not do this I’m afraid that it would somehow find a way to work in this case too.


Ops engineers don't get paid enough to fix dev fuck-ups as it is. No amount of beer is going to fix that.


He's taking home the special expensive beer and not telling them about it because he cares about the health and well being of his team so much, and yet they wouldn't even consider him a hero for this, how ungrateful they are!


If every time the site was brought back up (because it had gone down) the ops guys got free fancy beer, then the message pretty quickly turns into "if the site goes down, I get rewarded."


In other words, that beer gave you the motive to bring twitter down, which you inevitably did by asking that question.


The second story had me in tears. Especially given that I'm building a similarly scary query right now (thankfully not against live).



Woo startups.


> We got so engrossed in the details of this topic that half an hour passed before we noticed that everyone had stopped working and was running around in a frenzy. Someone finally ran over and asked him if he was doing a query, he hit Control-C, and Twitter came back up.

This would not be out of place as a scene in Silicon Valley


idlewords, the user you're replying to, was listed as a consultant on the show


For a later season. This was one of my favorite scenes on the show.


Completely unrelated, but I find myself reading your post about Argentinian steaks at least once a year. It's perfect. https://idlewords.com/2006/04/argentina_on_two_steaks_a_day....


No joke, this post was largely the reason I wanted to travel to Argentina.

The food lived up to the mental image I had after reading the post.


I just found and read that article yesterday. The writing is on another level.


As an Uruguayan, I loved it and found it entirely accurate :)


Not the best quality, but there is a scene just like that: https://www.youtube.com/watch?v=Dz7Niw29WlY


"I would intercept these rewards and put them in my backpack for the bus ride home, in order to avoid creating perverse incentives for the operations team. But did anyone call me 'hero'?"

Wait so you stole rewards for a team that was spending time (I assume extra or stressful) on something you didn't do or have any part in. And you want a cookie?

I mean, I get it, the company was probably not great in its infancy. But what?


I think OP is saying the rewards were confiscated so the team wouldn't begin breaking things on purpose to get a reward when they fixed it.


Yeah but does anybody believe that the engineers would deliberately break things so they could have to work in a stressful environment bringing things back up just to get some free beer?


If your incentives are aligned w/firefighting as opposed to fire prevention b/c management is not motivating and rewarding the extra work that goes into avoiding these scenarios in the first place, you're encouraging fire.


Indeed, the usual motivation to try and be called a hero for putting out the fire you started is much more valuable than free booze: a title promotion with a pay bump.


I don't want a cookie; I want more $24/bottle Belgian beer.


You should submit a request to the Pinboard CEO...


Wouldn't that have made you the one with a "perverse incentive"?


That explains why he walked over to the DB guy and asked him to run an expensive query on the live system ;)


That's usually called stealing, or something a little softer than that. It's interesting that you shared that experience expecting us to laugh at it. The rest of the comment was hilarious and I'm happy you shared it, but that bit is very odd. I do see where you're coming from, but your act was ethically questionable.


Just wanted to say that I enjoy reading your blog.


It's a joke. Laugh, it's funny.


It's one of those jokes where if the story isn't true then the entire basis for it being funny disappears. (And if it is true then the joke isn't good enough to make up for the actions.)


Having worked on a lot of ops teams in unstable environments, it's just really dickish.


I also have. idlewords' post is one of the funniest things I've read this week.


yea as an ops engineer that's probably the worst violation of trust i've ever heard of.


> Wait so you stole rewards for a team that was spending time (I assume extra or stressful) on something you didn't do or have any part in.

The HR department in my company does this, and then redistributes the gifts to everyone in a random drawing at the Christmas party.

One year some department got a bunch of PlayStations, and a couple of them ended up in my department. The only thing my department contributed to the kitty was candy. I bet some people in that other department were disappointed.


Finally we get the long awaited sequel to One Flew Over the Cuckoo's Nest...

One flew over the dubcanada's head.


Wait what did I miss something? lol


The joke.


Hero? You’re a villain who steps on teammates. The worst part is you thought it’d be okay to share that and think we’d be on your side. Have you no shame?


My job was to make growth graphs for investor slide decks, so by definition I had no shame.


Or, if you had any shame, its growth would be up and to the right!


>> he hit Control-C, and Twitter came back up.

Monolithic architecture. When I did security work I fought this every day. Moving away from it is a nightmare of technical debt and heated debate about who should control what. I'm reminded of a story from the early days of MSN. The legend goes that in the late 90s MSN ran out of one cabinet, a single server. The server had redundant power supplies, but only one physical plug.


> Monolithic architecture.

This particular problem had nothing to do with a monolithic architecture. Your app can be a monolith, but that still doesn't mean your BI team can't have a separate data warehouse or at least separate read replicas to run queries against.


It's not "nothing to do with". You're correct that a monolithic architecture does not imply that a single read query will lock the entire database. But it is a prerequisite.


Not really. I've seen more than one (admittedly poorly) microservice-architected system where, instead of the whole DB freezing up, just the one DB would freeze, but then all of the other microservices that talked to the frozen microservice didn't correctly handle the error responses, so now you had corruption strewn over multiple databases and services.

So, while true the failure mode would be different, "one bad query fucking up your entire system" is just as possible with microservices.


And of course this is standard practice. I've contracted on largish apps before (Rails! Shock!) and of course we provided read-only replicas for BI query purposes. I wouldn't have provided production access even if asked.

Anything else is simple incompetence and the macro-organisation of the code and/or services is irrelevant.


If your website crashes because a single person ran a query, your system is too monolithic. You can have thousands of little microservices running all over the place, but a single query causing a fault proves that a vital system is running without redundancy or load sharing and that other systems cannot handle the situation. You have too many aspects of your service all tied together within a single system. It is too monolithic.


I think "monolithic" and "fragile" are orthogonal concepts.


> I would intercept these rewards and put them in my backpack for the bus ride home, in order to avoid creating perverse incentives for the operations team. But did anyone call me 'hero'?

Wait, I don't understand.

Why would anyone call you hero?

Are you suggesting that the team would deliberately crash the app to receive beers and that by stealing them you stopped this from happening?

Free drinks and free food is the standard here to reward teams when they spend extra unpaid time away from their families.

All of the posts asking the same question are being downvoted. Am I missing something?

You said you were a contractor at the time. Unless you were on the management team I fail to see how this was your responsibility to choose what happened.


> Am I missing something?

That it is a joke.


The humor must be lost in translation then, I don't see anything resembling a joke.


> Are you suggesting that the team would deliberately crash the app to receive beers

https://en.wikipedia.org/wiki/Perverse_incentive


Yes, the cobra effect exists. Should this mean that everyone needs to stop all forms of positive reinforcement? I don't believe so.

I doubt anyone would risk a comfortable job at Twitter against a few bottles of beers. Even if they are really fancy, that's what... $20-50?

If this had been worded as a "Haha, I stole the bad team's beer" I would have laughed.

However, worded as "where is my reward for being smart and stopping the cobra effect?", it's just a humblebrag and plain unfunny.


He’s joking.



