Lucky Lotto, chaos engineering but for teams (danlebrero.com)
139 points by delebe on July 1, 2021 | 37 comments



I heard a good presentation about this. They did it for a team at Google. But it was daily, and you weren't allowed to tell anyone else. You just didn't respond to any requests for that day.

But even worse, you could also be assigned the "liar" task. In which case you were supposed to reply to emails but give intentionally wrong information some of the time. But in that case you would tell people that you were the liar and that your answers aren't to be trusted.

It seemed like a good way to make sure at least two people could do everything and that documentation was solid enough that you could recognize when someone was wrong.


> But it was daily, and you weren't allowed to tell anyone else. You just didn't respond to any requests for that day.

I'm 100% certain I've worked with people that did this. I never realized they were DR visionaries


I wonder if this encouraged people to be more open with their ideas also.

If I could say things knowing that if I made a mistake, someone might friendly correct me with "Ah, you're the liar today, S3 doesn't support that format", I'd be happier to make them.


When at Google, we did the simpler version (of not replying) for our DiRT week exercises. DiRT stands for disaster recovery testing.


How are people afterwards?

I refuse to use email at work anymore because of the constant phishing tests. Sick of the gotcha mentality.


I am genuinely fascinated how they managed to piss you off so much with phishing tests. (For me email is the backbone of all permanent office communication). Was the frequency too high (reducing the signal to noise ratio too much), or just the fact that they doubt your ability to avoid falling for such a thing?


2-3 a month. They often spoof other coworkers, so I need to be cautious with all emails.

I use it now only to see if notifications came in from calendar or such. Even then it’s just looking at subject lines.

I then go straight to the app and check for actual message / event.


I was wondering that too. I'm guessing it's frequency. We get them, but they are once every few months. The only annoying thing about them is I have to open outlook to actually report them as I normally use Airmail.


I agree. We have a security score at work and mine had been zero for a long time. I realized that you need to report the emails as spam or phishing and not just ignore/delete like I usually do.


I have used similar discussion points as in this article to argue that parental leaves, long vacations and other similar employee benefits build resilient teams and products. If you have to plan for people going missing (without the possibility for immediate replacement) you are forced to spread knowledge in the organization and you can't be too dependent on individuals.


Banks require a planned version of this: you have to take so many contiguous days off at least once a year. But it's an anti-fraud measure, as most of the frauds you can run internally require you to be there to juggle things, and you make the number of contiguous days long enough where such a scheme will come crashing down.


Some stopped in the last few years, or at least for certain job categories. It was a wonderful thing when it was around because you could take a vacation with absolutely no remote access.


Great idea, similar to what I understand is done in accounting. The idea is to force everyone in the actg dept to take regular vacations of several weeks, requiring others to cover their tasks. This is so that others can uncover financial sleight-of-hand to prevent embezzling (or at least structure it such that successful embezzling requires a larger conspiracy, increasing the likelihood of getting caught).


I like this - we’ve had “reduce our bus factor” on the to-do list for a while, but not found a way to do it yet.


The challenge there is that you need to hire and train more people than you have right now, and hiring is difficult. I mean probably not so much if you have a lot of money.

I'm currently in a high bus factor job, I'm the only developer on the UI - our CTO can do some small jobs here and there, but nobody's touched or even looked at the new UI I'm building (Go + React).

I want to be able to leave, but the way things are going - and the way recruiters are spinning up again - I'm afraid I'll have to bring them the bad news that I got an offer I can't refuse (and that they can't match; we're talking up to 150% pay rise / benefits).

What my company needs is a big bag of money so we can hire contractors to fast forward this project. Which will be a short term solution, but still. But this company isn't eager, it's run by mid-late career veterans who are happy with things bumbling along and a decent 15%/year growth. I can kinda respect that, but for my project that's not good enough. And it's an unsexy company, so they struggle to hire anyone.


They'll survive. It is common to overvalue yourself when leaving the company but the reality is the company will adapt just like it always has and if it doesn't it would have died even with you there from choices that didn't allow it to adapt.

Just accept the offer and move on. It is the best thing you can do for a company like this.


> I'm currently in a high bus factor job, [...]

That's actually a low bus factor, isn't it?

https://en.wikipedia.org/wiki/Bus_factor


From the article you linked: "There is a rare alternative definition for the bus factor, namely: the number of people who are indispensable for the project. In other words, it is the minimum number of people who are a single point of failure. If using this definition, then a high bus factor is considered a bad thing (since the loss of any person included destroys the project), and zero is considered the ideal bus factor."

Perhaps "bus risk" would be a better term for this usage?
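To make the two definitions concrete, here is a rough sketch. It assumes a made-up model (not from the article or Wikipedia) in which each task maps to the set of people who can do it; the names `classic_bus_factor`, `single_points_of_failure` and the sample data are purely illustrative.

```python
def classic_bus_factor(coverage):
    """Usual definition: the fewest people who must vanish before some
    task has nobody left to do it. Higher is better."""
    if not coverage:
        raise ValueError("coverage must not be empty")
    return min(len(people) for people in coverage.values())


def single_points_of_failure(coverage):
    """Alternative definition: the people who are the only one able to do
    at least one task. The size of this set is the 'bus risk'; zero is ideal."""
    return {next(iter(people)) for people in coverage.values() if len(people) == 1}


# Hypothetical team: two tasks each depend on a single person.
coverage = {
    "ui": {"ann"},
    "billing": {"bob"},
    "api": {"ann", "bob", "cy"},
}

print(classic_bus_factor(coverage))             # 1
print(len(single_points_of_failure(coverage)))  # 2
```

With this sample the classic number is low (bad) and the alternative number is high (also bad), which is why the same team can be described either way.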


I don't think this is a mechanism to reduce the bus factor, to me it seems like a mechanism to make people realize what the bus factor is for the team. The actual solutions were not described in the article but I guess that is very team specific.


Maybe you didn't read the whole article then. It totally describes how to do it. The bottom has a nice summary but I'll quote the solution parts of it here:

    The winner will work on some side project. Still work.
Not a solution to the bus-factor of people but a good solution to the "we never get to work on tech debt / platform work because features" problem. This alone is awesome about this.

    Everybody, including product managers, gets one ticket every week, even if you don’t want it.
This, if done right, will result in Product Managers providing a vision and consistent answers to similar type questions. Thus the team can learn to anticipate their answers for minor things (which helps even in week where the PM isn't the winner) and in weeks in which he is on vacation (or wins again) the team isn't stuck waiting for a week for an important question that blocks development.

    Team should avoid delaying the work for a week. 
    Try to bring one of your colleagues to do the task with you or under your supervision.
This is what to do, when you need to break "rule 3" (which states you have to be completely unavailable). It's a soft rule and I think the point is actually to break it a lot in the beginning. The "under supervision" part means, you are teaching another team member part of what made you the one with the bad bus factor. As time goes on, having to break this rule will become less frequent (which is why they wanted everyone to write down when they have to break rule 3 if you ask me)
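For what it's worth, the weekly draw itself is trivial to script. A minimal sketch, assuming one ticket per person and a uniform draw each week as the quoted rules suggest; the function names and the seed parameter are my own and not from the article.

```python
import random


def draw_winner(team, rng=None):
    """Pick this week's winner. Everyone holds exactly one ticket,
    so the draw is uniform over the whole team, PMs included."""
    if not team:
        raise ValueError("team must not be empty")
    if rng is None:
        rng = random.Random()
    return rng.choice(list(team))


def run_weeks(team, weeks, seed=None):
    """Simulate several weekly draws. The same person can win twice in a row,
    because tickets are re-issued every week."""
    rng = random.Random(seed)
    return [draw_winner(team, rng) for _ in range(weeks)]


if __name__ == "__main__":
    # Hypothetical team members.
    print(run_weeks(["ann", "bob", "cy", "pm_dee"], weeks=4))
```

Passing a seed only makes a simulation repeatable; for the real thing you would leave it unset so nobody can predict the winner.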


Keeping aside the insulting nature of suggesting someone did not read the article, here is my response.

I like to differentiate between a mechanism to make people realize a problem and actual solutions to the problem. It's very easy to say mentor someone to do your job, but if that was doable and easy they would already be doing it. The problem is precisely that there is no real good way to do KTs. Forcing someone to solve the problem is one way to KT, but is that the most efficient way? I would rather gather data about what breaks down and come up with a more efficient KT mechanism.


FWIW I operate under the assumption that HN is just as bad as Slashdot with regards to commenting without reading the article :) and the solutions were pretty clear to me, even without the nice summary at the end. Sorry if it wasn't worded softly enough, no insult meant, more an observation.

The parts I mentioned are not the 'realization' part. They are the solution parts. They aren't the 'implementation details' of the solution, I would agree, but they are the solution.

If you ask me the problem is not usually that there is no good way to do KT or that it just isn't doable at all. The problem is that in most businesses due to their 'culture' (for lack of a better word - not to start "that" culture discussion) you do not actually get to do it. A good way to mentor and transfer knowledge for example is to do pair programming. Not many places allow for that and will look at you funny for even suggesting it (and I'm not even personally on the extreme end of that like some companies, where everyone literally pair programs 100% of the time - I like a sort of hybrid model, where people pair for as long as it makes sense to them, which could be sitting there "designing" for a couple of hours together, maybe dividing things up after a basic structure is in place and then working on their own for the rest of the day w/ some quick 5 minute sync ups and questions going back and forth from time to time). Another very doable way to do knowledge transfer is to specifically not let the one guy that wrote that part of the code and knows it inside and out work on the next ticket that will need changes to that part. But many Product Managers/businesses/team leads will not allow that because it would mean that the task will be delivered slower.

The beauty of this approach is that you don't need to actively gather data, make a decision etc. Gathering this data is usually very error prone in that you can fill out forms and skills matrices and such all you want (been there done that), you always forget about something or it doesn't really tell you the whole story (skills matrices are particularly bad)

With this, it just happens! It's the self organizing way of dealing with the problem. I think you might be putting too much emphasis on the "completely unavailable" part, whereas I see the "it's a soft rule" part bigger. Instead of waiting for someone to go on vacation and then you find out that he was the only one that can do X (there's your data point that you missed during collection and analysis) and now you're in deep trouble, because he's touring the Amazon rain forest (i.e. definitely no cell reception there), you get to figure it out because he won the lottery and he's actually at work and can tutor someone through it all.


Apology accepted. Thanks.

> But many Product Managers/businesses/team leads will not allow that because it would mean that the task will be delivered slower.

This is precisely the kind of communication breakdown I was talking about. Managers are thinking short term, forgetting all software needs to be maintained.

It may be a team personality thing, but I think there would be way more resistance to this kind of chaos engineering than a survey.


Depending on the company I would agree that there might be more initial resistance. The thing with a survey is that in way too many cases it goes nowhere after that. We did the survey. We filled out a skills matrix. Never to be heard of again (been there done that).

If you can get the chaos engineering "allowed", I think it's easier to actually follow through.

There are multiple scenarios I can think of here.

* This might be something that a team lead or a bunch of team leads want to initiate/try out. You ideally need buy in from at least your boss, potentially a bit higher up.

* This could also be thought up/initiated by the development manager/VP of engineering/CTO level who now has to get his team leads to run this and might run into resistance from them actually or from Product. Sort of like with the technology based chaos engineering, where there might be a special SRE person/team responsible for running this, you could have a special team/person for this as well. If we were all in offices, it might be a bunch of tall dudes, physically escorting the chosen person from the team room :)


Aligning the pain with the people who can solve it is the first rule of getting things done. Make the developers who will have to take on the bus-factor workload realize what the bus factor is, and they'll figure out how to reduce their dependencies real quickly.

Another good take on this is the "Wheel of Misfortune": Take a real incident (or synthesize one in testing) that paged someone inconveniently. They've already solved it - They're running the exercise. Now have everyone else on the team, individually or as a group, figure out how to solve it. Not from the debugging steps or postmortem of the incident, but before that's been shared - Have them all suggest what they'd look for, and how to fix it. Their most important resource is their instructor/adversary for this, so give them all the data, but no map to it. Further, have the instructor figure out how to respond to all the unknowns - What else was broken because of the incident? What would have happened if they tried different debugging steps? Builds team knowledge real fast.


Yes, good point, I guess the first step is realising you have a problem and proving it to management!


The most practical solution: Hire new people. Let the low-bus factor people mentor the new people to become them 2.0 (long term).

Downside: Takes time (and removes mentor from tasks deemed more important short term), costs money, people might need to get used to becoming mentors (I found most people enjoy this even if they cannot imagine doing it the first time around)

Upsides: Long term thinking and investing in employees is rarely wrong (imo)


Ummm, reducing the bus factor is probably the opposite of what you would want.


Reducing the impact of the bus factor, then. Kind of like turning down the air conditioning... which way does the thermostat go?


I guess he/she is using the alteration of the definition of the bus factor:

> There is a rare alternative definition for the bus factor, namely: the number of people who are indispensable for the project. In other words, it is the minimum number of people who are a single point of failure. If using this definition, then a high bus factor is considered a bad thing (since the loss of any person included destroys the project), and zero is considered the ideal bus factor.

Source: https://en.wikipedia.org/wiki/Bus_factor


I have a different approach to create a similar outcome: everyone rotates projects on a periodic basis. The period of change is longer than a sprint duration so as to give people time to acclimate to the new-to-them project.

[0] https://graphthinking.blogspot.com/2021/06/periodic-rotation...


Your approach is lawful, the one in the article is chaotic.

Your approach is planned, with well defined onboarding and offboarding and with a report in the end. Personally, I hate it just for the idea of having to write a report.

The article's approach is unpredictable. At the last moment, you know you are not part of the team anymore, and all communications are cut. People have to take over even if you are in the middle of something. It also involves managers. I prefer this, if anything, just because there is no report to write except if things go wrong (i.e. you break the "no communication" rule).

The chaotic version of your solution would involve periodically drawing two team members, including managers and having them switch teams immediately. For each individual, the period of change would follow an average but be random. And no reports :)


This is a great idea! It sounds similar to controlled chaos and/or engineering serendipity, where you intentionally create chaos in order to uncover some sort of new skill or increase productivity


This is extremely clever, and something I've never even heard of before, bravo. The closest I've ever seen to this sort of practice in the wild is periodically shifting people around so that more than one person knows how a system works. But more often I've seen no attempts at this, and a mad scramble with KT meetings when someone gives their two weeks notice.

I'm expecting a similar mad scramble when I put in my two weeks at my current company (hopefully within the next two months, interviewing with various companies now).


Could you imagine being part of this. Told the company was running DiRT testing or whatever, and then getting fired or punished for listening and following instructions?

Could you imagine afterwards being lectured to use common sense. Also, we have secret corporate policies you need to know about and follow. Resolve that catch-22. Repeat for a decade.

You'd be justified in never trusting them with your life, or a loved one's, again.


... did you intend to reply here, or on some other post?


Oh, also, really clever! I love studying resiliency in complex systems!



