Move Fast without Breaking Things (pragmaticengineer.com)
86 points by devy on Mar 28, 2016 | 63 comments

Racing a car has a great analogy: In order to get faster lap times, you sometimes have to drive slower.

Specifically, you need to slow down for corners in order to maintain your control over the vehicle. Try to drive too fast and you end up wrecked. Even if you don't wreck, correcting from having entered a corner too fast loses far more time than just slowing down properly. Similarly to technical debt, if you start a straightaway from a slower starting speed because you botched the prior corner, that damages your lap time throughout the entire stretch.

In the overall scheme of being able to deliver software you're confident in, it is faster to consider stability inherently in your development process than try to blast out features and bolt on stability later once you're committed to the initial brittle implementation.

Racing a car applies in two ways I think: the talent is not evenly distributed, and the masters are defined by their ability to decide when to go fast, and when to go slow.

I agree. To be more explicit, the talent is indirectly correlated with how slow your team's "slow" is going to be. I think the degree of slow-ness is what management ultimately feels when they conduct retrospectives on technical debt reduction initiatives.

We recently did a major refactor of our Angular code (~1.5 months) and we've been pretty fast, releasing every 2 weeks. We've begun to outpace a lot of the backend development teams, so we now have more time to squash our backlog tickets. We have the prettiest Jira issue reports out of all teams.

I think this post misses the point.

Company values are only meaningful when they differentiate that company from others. Everyone wants to "Move Fast and Not Break Things".

"Move Fast and Break Things" was a value judgement. For Facebook, moving quickly was worth the penalty of sometimes breaking things.

This is why most companies have lame values like "Do your best work" or "Be honest". Great values, but nobody disagrees with them—so they're meaningless.

I totally agree with this but would add a bit of nuance. I'd love to see all companies do an "X at the cost of Y" value system, but there is another part of stating you have values like "Integrity" which is the same as the Google "Don't be evil" thing. It's a claim that you can be called publicly on, and that's important as well. You'd be surprised how many companies out there aren't willing to make a claim for something like integrity. But yeah, who wouldn't say they value integrity, for instance, when asked?

That being said, I'd also add that most values have those implicitly baked in. For instance, if you do say you value integrity, you're saying you're devaluing growth/money for the value/benefit of integrity.

Yep, I agree with your nuance.

FWIW I think integrity is a pretty okay company value. Depending on your industry, valuing integrity may be a trade-off that other companies are unwilling to make

I'd add: think before you code. Programming is hard because thinking is hard. If you don't use tools to help you think rigorously you're going to ship software with bugs regardless of how good your unit tests are or any other strategy you use to cope with poorly designed software.

I don't believe not thinking is going to help you move faster. You're just going to spend most of your time fixing the problems you hadn't thought of before you started shipping code. And yet this is what we're encouraged to do. We're asked to fix the bugs, do the least amount of work to get it to go, and fix the problems later. We're asked to not think and just code. There are probably a handful of people in the world who can write complex systems in code with consistently correct results. The rest of us need to be more mindful.

We have wonderful tools available to us that can exhaustively check our designs. Using a system like TLA+ allows you to model your system in some pseudo-code and comes with a program that will exhaustively check your model for deadlocks, invariant violations, termination, etc. I've seen it used to find errors in graduate students' binary search algorithms 9 out of 10 times. Imagine what it can do for non-trivial components of your application.

Part of moving fast is knowing how to avoid making errors in the first place. The only way I know how to do that is to think. You can't do this properly in code alone. You need higher-level tools to check your assumptions and find holes in your thinking.

You don't see structural engineers slapping together the first thing that works and patching the building later when parts of it fall down. You don't sign waivers before crossing a bridge that claim no responsibility on behalf of the designers if it falls down while you're on it. Nor do you see engineers accepting more money from people to build private bridges that are less likely to fall down. So why do software engineers have no such liability for their creations?

Why not take a few hours to write a proper specification for your distributed locking mechanism and save yourself days or weeks' worth of debugging, or worse, liability suits when your mistakes harm the interests of your customers or the public at large? Why not use tools that check your thinking so that you can chase more interesting optimization and performance problems?
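To make the idea of exhaustively checking a design concrete, here is a minimal sketch in plain Python. It is not TLA+ and not a distributed lock: it is a hypothetical two-process lock model (the state names `idle`, `saw_free`, `critical` and the functions `steps` and `find_violation` are all made up for illustration). It enumerates every reachable interleaving and reports a trace that breaks mutual exclusion when the "check the lock" and "take the lock" steps are separate.

```python
from collections import deque

def steps(state, atomic):
    """Yield every state reachable when one process takes one step."""
    pcs, lock = state
    for i, pc in enumerate(pcs):
        if pc == "idle" and not lock:
            if atomic:
                new_pc, new_lock = "critical", True   # test-and-set in one step
            else:
                new_pc, new_lock = "saw_free", lock   # only read the lock
        elif pc == "saw_free":
            new_pc, new_lock = "critical", True       # the write happens later
        elif pc == "critical":
            new_pc, new_lock = "idle", False          # release the lock
        else:
            continue                                  # idle while lock is held: blocked
        yield (pcs[:i] + (new_pc,) + pcs[i + 1:], new_lock)

def find_violation(initial, atomic, invariant):
    """Breadth-first search over all reachable states.

    Returns the shortest trace to a state that breaks the invariant,
    or None if every reachable state satisfies it.
    """
    parent = {initial: None}
    queue = deque([initial])
    while queue:
        state = queue.popleft()
        if not invariant(state):
            trace = []
            while state is not None:
                trace.append(state)
                state = parent[state]
            return trace[::-1]
        for nxt in steps(state, atomic):
            if nxt not in parent:
                parent[nxt] = state
                queue.append(nxt)
    return None

def mutual_exclusion(state):
    """Invariant: at most one process is in its critical section."""
    pcs, _lock = state
    return pcs.count("critical") <= 1

START = (("idle", "idle"), False)
racy_trace = find_violation(START, atomic=False, invariant=mutual_exclusion)
safe_trace = find_violation(START, atomic=True, invariant=mutual_exclusion)
```

In this toy model `racy_trace` is a step-by-step interleaving ending with both processes in `critical`, while `safe_trace` is `None`. A real model checker does the same kind of search over a far richer specification language.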

> You don't see structural engineers slapping together the first thing that works and patching the building later when parts of it fall down.

That's because if they mess up the first time people can die.

I'm guessing the likelihood of death due to a bug in most web applications is probably much less than that of a bridge :)

Software is great because it's possible to iterate quickly. But ultimately, you have a point. Developers should be willing to fix bugs they've created and/or come across in their projects. We should also try to limit the potential of bugs by releasing small chunks of functional work as often as possible. "Releasing" in this sense can mean to production or just to your qa/test/dev/stage environment. It's just important to have many eyes on the product before it's live.

Someone doesn't have to potentially die in order for software to be harmful. Neither should it be the deciding factor about whether we need to think critically about the software we create.

Besides it's actually not too hard to write simple specifications for the critical components of your system. It has practical benefits like helping you to write good, rock-solid software. It helps you to clearly and precisely communicate your ideas. You can shake out errors in your design long before you write your code. And there are some errors you will only find by modelling and checking your system.

Web applications have come a long way since simple CGI scripts. They often require sophisticated orchestration mechanisms for managing distributed state and processes. If you get that wrong you end up with corrupted data, deadlocks and race conditions, etc. If you had a tool which helped you sift out those potential problems that exist in your design, would you not use it? Or would you rather risk introducing those errors in your code and hope that your customers never encounter them?

The approach of throwing code into production and "iterating" is precisely the problem I was addressing in my OP. In this approach we're asked to not think and to fix our inevitable mistakes later on after users find them for us. We have software to find a reasonable majority of them for us instead so why not use it? It's already fairly common to use unit and integration tests (to make sure our code operates as expected). Why not have a system for testing our designs?

As I mentioned, it's shockingly easy to write an incorrect implementation of binary search that may appear to work under your unit tests. Writing a specification of the algorithm and checking it with a model checker will show you problems you hadn't thought to consider. This is an algorithm that is taught in the first algorithms class you take and yet even graduate students, who scoff at writing specifications because it's so boring, often get it wrong. That's because programming is hard and it is hard precisely because you have to think. And thinking is hard. We should be using tools to help us think.
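As a rough illustration of the point about binary search, the sketch below uses brute-force enumeration in Python as a stand-in for a model checker (it is not TLA+, and the function names are hypothetical). `buggy_search` contains a common off-by-one mistake that a hand-picked spot check can miss; `first_counterexample` tries every small sorted input and every target within the given bounds.

```python
from itertools import combinations_with_replacement

def buggy_search(a, target):
    """Binary search with a deliberate off-by-one in the loop condition."""
    lo, hi = 0, len(a) - 1
    while lo < hi:                 # bug: should be lo <= hi
        mid = (lo + hi) // 2
        if a[mid] == target:
            return mid
        if a[mid] < target:
            lo = mid + 1
        else:
            hi = mid - 1
    return -1

def correct_search(a, target):
    """Same algorithm with the loop condition fixed."""
    lo, hi = 0, len(a) - 1
    while lo <= hi:
        mid = (lo + hi) // 2
        if a[mid] == target:
            return mid
        if a[mid] < target:
            lo = mid + 1
        else:
            hi = mid - 1
    return -1

def first_counterexample(search, max_len=4, max_val=3):
    """Check `search` on every sorted tuple up to max_len over 0..max_val.

    Returns the first (array, target) pair where the result is wrong,
    or None if the search is correct on all inputs within the bounds.
    """
    for n in range(max_len + 1):
        for a in combinations_with_replacement(range(max_val + 1), n):
            for target in range(-1, max_val + 2):
                i = search(a, target)
                if i == -1:
                    ok = target not in a
                else:
                    ok = 0 <= i < len(a) and a[i] == target
                if not ok:
                    return a, target
    return None
```

A spot check such as `buggy_search([1, 3, 5, 7], 3)` finds the element, yet the exhaustive check immediately turns up a one-element array where the buggy version fails. The bounds are small on purpose: the check is only a guarantee within them.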

Maybe it's different for others, but for me the true nature of the problem sometimes only surfaces when you start coding. I can think forever and write the perfect spec, but when the rubber hits the road then things change.

I think the reality is a virtuous cycle of thinking, coding, rethinking, recoding that gets the best results.

There's a difference between exploratory programming and the final deliverable.

I don't think the GP is suggesting 100% of spec must be completed before the first line of program code is written. I think the implication is that there should still be a dividing line between the design portion (made up of think, code, rethink, recode) and the production development portion (also made up of think, code, rethink, recode). The dividing line does not need to be months, but design as a phase of a project should be considered differently (since it has different priorities) than delivery.

I actually find it easier to write a bit of the spec, try some code, write some more of the spec. They're often intertwined in my process.

I think it's weird that we've put stability and speed on the same axis. The reality is that you can ship faster with more confidence by focusing on stability from the beginning. Simply slapping some code together and going "it works!" is, in isolation, a recipe for disaster.

Only if the requirements are known. If you have firm requirements, you can test for them, you can plan for them, and you move through them much faster when you don't get surprised by random bugs that need to be tracked down.

If you don't have firm requirements, then each new fact you learn about your market potentially means you may have to invalidate large swaths of work you've already done. The more planning and testing you've done, and the more thoroughly you've covered your edge cases, the more there is to invalidate.

Most of the big business ideas in the last decade were in areas of extreme market uncertainty. Hence, companies that "move fast and break things" have been at a big advantage. This may change in the future - VCs today are enamored with Perez, and IMHO one of the signals that we've crossed over from the installation to deployment phase of the Perez model is when performance, stability, and security become valued more than features & speed of execution. But I'd guess that we have a few more years of moving fast and breaking things first.

Having tests can help you ensure that, when you do change one part of your application, unrelated parts don't start breaking. And I don't really buy the argument that, because something might be invalidated later, we shouldn't put in the time now to make sure we get it right based on the info we do have.

You should know which part of the software lifecycle you're in and adjust your development practice accordingly.

When I first started my current startup idea, the product design was changing literally multiple times a day. I didn't even bother writing any code, it was all in pencil & paper notes & diagrams. Things have slowed down to about a requirements change every 2-3 days now, and I write code but no tests. If you spend 20% of your time fixing bugs and 80% of your time writing or rewriting features, tests are not the bottleneck; it's not an improvement to spend 50% of your time writing & rewriting tests so you can avoid the 20% spent tracking down bugs.

When I was at Google Search, working on a mature product with billions of users, every change got 100% test coverage, and you were supposed to break a test with every change because that's how you know your test suite is comprehensive. And then it went out through a QA department and full release process. But adding a link to the page then took 2 weeks and 600 lines of code, while adding one to my startup takes 5 minutes and about 3 lines of code.

> If you spend 20% of your time fixing bugs and 80% of your time writing or rewriting features, tests are not the bottleneck; it's not an improvement to spend 50% of your time writing & rewriting tests so you can avoid the 20% spent tracking down bugs.

That is a good line of reasoning about when auto-tests are needed.

However, you did not take into account the time that is required to discover that something is broken. Without tests you do not know if your code change breaks important features. So you should also consider how expensive it is to test your product after every code change. You do not have to test it yourself; you can delegate testing to end users. But turning users into testers could be pretty expensive too (losing potential customers).

So auto-tests should be introduced much earlier than when you hit the "50% maintenance" threshold.

You don't need to worry about losing customers if you have none. At the stage I'm talking about, you'll know something is broken when you go to demo it and it doesn't work, and presumably you'll be walking through your demo manually before putting it in front of customers.

I'd challenge you to write some tests, even high level acceptance tests.

There must be a non-zero number of requirements that are not changing every 2-3 days. Every time you don't have a test to capture a desired behavior that will probably stick around for a while, that's a risk. Granted, everything is risky in an early stage startup, but I've found all my coding is so much better when I have tests that define the requirements, even vague ones, on which to hang my code.

Otherwise, why am I writing that function I just made? Is it really the most minimum thing required to meet the requirement change that just cropped up? If I shouldn't be writing the most minimum change possible, what other requirement am I silently signing up for?

> But adding a link to the page then took 2 weeks and 600 lines of code

That is interesting. Without giving away any proprietary info, what sort of changes did you have to make to add a link?

It's hard to really describe without giving away proprietary info, but some things that are public-but-not-commonly-thought-of include: how do you propagate "sticky" settings like SafeSearch, a user-defined current location, or multi-login? Is it an ad-click, in which case it needs to be tracked rigorously because real money is changing hands for it? Or is it a result click, where we want to do anything possible to avoid introducing latency in getting the user to where they need to be? Or a feature click, where the destination is our own servers? Should the browser do a full page refresh, or will it do an AJAX fetch for only part of the page? If it's an AJAX fetch, what needs to be refreshed? Does a history entry need to be pushed? Are we going to display something special if the user hits "back", like Facebook does for "More stories like this" (the feature in question was actually one of the first sites that did this, before Facebook)? What happens if we've changed the code on the server while the user has an outdated page open in their browser? What are all the browser differences in handling these?

It's because on the web patches can be pushed as quickly as the user hits refresh, or at least that seems to be the thinking.

The real problem comes when the same mentality is applied to software running locally, perhaps with patches never applied after initial install.

This is likely because developers are now so used to being always connected (thanks to things like Git that allow them to work from just about anywhere with a power outlet and a net connection) that they can't grasp that not everything else is.

A lot of the bugs in our code are completely avoidable, but people get in the mindset of, "I need to move fast, bugs don't matter!"

Even if you don't have QA, when you write a bug, at least ask yourself, "what could I have done differently to avoid that bug in the future?" In some cases, it's as simple as, "put repeated code into a function so it doesn't get written twice." That will cut your chances of getting a typo in half.

So, really, you actually see typo bugs in the wild? I haven't seen one of those for 20 years.

I do see lots of cases where people didn't test it or used the wrong mechanism.

I have to go slowly because somebody already went fast and broke it.

And a friend once told me "LZW can compress code faster than you can." :) Which does not go to the value of orthogonality of purpose in source code, but I enjoyed the irony of the comment very much.

Math-heavy code tends to involve a lot of very similar algorithms, leading to a copy-paste-rename situation that isn't really helped by using a function.

As such it's incredibly common to end up with bugs from assigning to "x, y, y" and not "x, y, z" etc. Sometimes a clever compiler will warn you that you have done something odd, other times you'll be left to discover the runtime error.

Yeah, there's that. I've printed out code like that and colored different variables with different color highlighters to deal with it.

Probably happens more frequently than you think in dynamic languages. Even in compiled languages I imagine it's reasonably common to compare something to a hard-coded string that could easily be mistyped.

I wasn't thinking of dynamic languages at all. I'd agree.

Here's one more recent than 20 years: http://www.theregister.co.uk/2009/07/30/typo_caused_massive_...

You're right though, with care and skill, typos really shouldn't be an issue

This seems like a really uncommon case.

The vast majority of bugs I've run across come from an incorrect mental model of how something works: your system, some library, or an external system you're interfacing with.

Aside from building a more understandable system (whatever that really means/entails), from my experience, the best way to counteract this is to, while writing code, always ask: "What if my understanding is wrong? What if someone else's understanding is wrong? What can I do to make the system fail in the loudest way possible if someone's understanding is wrong?"
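One way to read "fail in the loudest way possible" is to check the assumptions a caller might get wrong at the boundary and raise immediately. The sketch below is a hypothetical example, not taken from the thread: `apply_discount` and its units (integer cents, whole-number percent) are assumptions chosen for illustration.

```python
def apply_discount(price_cents, percent):
    """Return the discounted price in cents.

    Assumes (hypothetically) that prices are integer cents and that
    `percent` is a whole number from 0 to 100, not a fraction like 0.25.
    """
    if not isinstance(price_cents, int) or price_cents < 0:
        raise ValueError(f"price_cents must be a non-negative int, got {price_cents!r}")
    if not isinstance(percent, int) or not 0 <= percent <= 100:
        # A caller who believes this is a fraction (0.25) gets an error here
        # instead of a silently wrong price later.
        raise ValueError(f"percent must be an int from 0 to 100, got {percent!r}")
    return price_cents * (100 - percent) // 100
```

A caller with the wrong mental model (passing `0.25` meaning 25%) is stopped at the call site rather than shipping a 0.25% discount.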

Generally, have the interface to the module under utilization be as simple as heck.

As an example, I have a system for which almost all state is represented as XML/JSON/CSV text when it is outside the system. Inside the system it's tables of tuples with a master table of names cross references.

Each "object" has an instantiator, a "process" callback and ... that's it. It's all done in a timer-based polling cycle ( after a select()/epoll() loop ). Each "object" owns a timer of its own that specifies the minimum poll time. There basically are no parameters after object creation, so you can't get parameters wrong. You have to send XML/JSON/CSV to it to change state. XML/JSON/CSV for initial configuration is controlled and shipped with the executable. For all settable state in the tables, there is a formal validation method and a controlled script to test them.

For the client-writers, I provide correct scripts for each use case that they can then steal or modify.

This way, I just don't have problems.
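A heavily simplified sketch of that shape, under stated assumptions: the class and function names (`PolledObject`, `poll_once`) are hypothetical, only JSON is handled, the select()/epoll() loop is left out, and time is passed in explicitly so the behaviour is deterministic. It shows the two ideas from the description: each object exposes only a constructor and a `process` callback driven by its own minimum poll interval, and state can only be changed by sending validated text.

```python
import json

class PolledObject:
    """An object with a constructor, a process() callback, and nothing else to call."""

    def __init__(self, name, min_poll_seconds, validators):
        self.name = name
        self.min_poll_seconds = min_poll_seconds
        self.validators = validators   # field name -> predicate on the new value
        self.state = {}
        self.next_due = 0.0

    def apply_message(self, text):
        """Change state only through a JSON message; reject anything unvalidated."""
        update = json.loads(text)
        if not isinstance(update, dict):
            raise ValueError("message must be a JSON object")
        for key, value in update.items():
            check = self.validators.get(key)
            if check is None or not check(value):
                raise ValueError(f"rejected {key}={value!r} for {self.name}")
        self.state.update(update)

    def process(self, now):
        """The single callback invoked by the polling loop; override as needed."""

def poll_once(objects, now):
    """Run process() on every object whose timer has expired; return their names."""
    ran = []
    for obj in objects:
        if now >= obj.next_due:
            obj.process(now)
            obj.next_due = now + obj.min_poll_seconds
            ran.append(obj.name)
    return ran
```

Because nothing but `apply_message` can alter `state`, every state change passes through the same validation path that the shipped test scripts exercise.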

That goes with what Glenford Myers said, that all bugs are transcription bugs, either from the requirements to the programmer, or from the programmer's head to his fingers, or.....

I'm actually very happy to work somewhere that believes in "Move slow and don't break things".

I believe that churn isn't progress, things with lasting value generally take time, and some problems can't be solved quickly. I enjoy working on problems that benefit from careful, considered, time-consuming thought.

You might desire something different. That's OK, but nobody is obligated to accept as a fait accompli the universal necessity or value of moving "fast" (and hopefully not breaking things).

Moving slow isn't an option for many businesses.

Corollary: Breaking things isn't an option for some businesses.

Only if your industry is capital rather than labor intensive, in which case you'll have elaborate procedures in place for safeguarding and maintaining your infrastructure.

For everyone else, a certain amount of breakage is expected. When things break, there are generally manual processes that can be put into place to keep business moving.

From a Windows 10 automatic unintentional update story [1]:

'The action led to comments where life was actually being put at risk by the unilateral action: "I needed to set up my department's bronchoscopy cart quickly for someone with some sick lungs. I shit you not, when I turned on the computer it had to do a Windows update."'

Question 1: Labor-intensive business or capital-intensive business? Question 2: Whose elaborate safeguard procedure failed?

[1] - http://www.theinquirer.net/inquirer/news/2450852/updategate-...

Poorly-managed capital-intensive business. It's poorly-managed precisely because there weren't any safeguard procedures.

Why the hell wasn't someone checking to make sure such a critical piece of equipment is always at the ready?

> 2: Whose elaborate safeguard procedure failed?

Since Windows was never intended to serve a life support/saving function - it was the fault of whatever OEM chose to use it in that capacity. They took on the onus to have it work correctly when they chose it as a platform.

MS used to specifically include "Do not use this for safety critical stuff" in the EULA.

Does anyone know when they took it out? And why?

> “The Microsoft software was designed for systems that do not require fail-safe performance. You may not use the Microsoft software in any device or system in which a malfunction of the software would result in foreseeable risk of injury or death to any person.”

> Only if your industry is capital rather than labor intensive

Or if your business is safety-critical, or financial, or storing other people's data, or...

Those are the very definition of capital-intensive.

Some of them used to be. Some still are.

I believe it's an option far more often than people believe.

Customers do follow dependability and lasting value; nobody trying to solve their day-to-day problems actually wants a treadmill.

It's one of those cultural decisions you have to make early on in a company's formation and make sacrifices to hold on to because literally everyone else you're connected with will expect you to move at the same speed they're moving at.

Being able to separate speed for its own sake/speed as a fetish from speed when it is of the essence is critical.

Well, let's make sure we distinguish between what's viable for the company as a whole versus what's viable for the career-advancement of an individual executive :p

If the firm hasn't yet succumbed to mediocrity, then the career goals of each executive should be well-aligned with overall company strategy. Ensuring that this is the case is a primary goal of executive recruiting.

Middle managers and ICs, not necessarily, but a company generally can't afford to have its executives be out of step.

I really don't buy that argument. And generally, when we talk about "slower", we're not talking about a long period of time. Maybe a couple days.

You ever notice how the engineering departments at these move-fast-and-break-things companies grow exponentially? Thinking of Etsy, and Facebook, and a few others.

Part of it is if they survive, they thrived... But I think the other part is that the code starts to become sharded across people's brains. The technical debt piles up so high in some places that it becomes impossible to maintain productivity with the same number of people.

I think that's a non sequitur. Every startup engineering department grows exponentially. It's the only way for an engineering department to grow.

Move fast and break things for 0.001% of users.

In other words, try bold things, bold refactorings, etc. within bounds.

If a system is designed with good modularity, this can be done with very low risk.

When someone deploys new code at Facebook, I recall reading that it initially goes out to internal users, then to a small percentage of actual users, then eventually to the entire system.

This is simply a systems-based approach. The same applies when thinking about code in a testing environment, staging server, etc. At each level there are additional failure modes or adverse interactions.
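The staged rollout described above (internal users, then a small slice of real users, then everyone) can be sketched as a deterministic gate. This is an illustrative assumption about how such a gate might look, not a description of Facebook's actual system; `bucket`, `sees_new_code`, the stage names, and the bucket count are all hypothetical.

```python
import hashlib

BUCKETS = 100_000  # with this many buckets, one bucket is 0.001% of users

def bucket(user_id):
    """Map a user id to a stable bucket so the same user always gets the same answer."""
    digest = hashlib.sha256(user_id.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % BUCKETS

def sees_new_code(user_id, is_internal, stage, enabled_buckets=0):
    """Decide whether this user gets the new code at the given rollout stage."""
    if stage == "internal":
        return is_internal
    if stage == "partial":
        # Internal users stay enabled; outside users are let in bucket by bucket.
        return is_internal or bucket(user_id) < enabled_buckets
    if stage == "everyone":
        return True
    raise ValueError(f"unknown rollout stage: {stage!r}")
```

Hashing the user id rather than picking randomly per request keeps each user's experience consistent while `enabled_buckets` is raised, and setting it back to 0 is the rollback.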

When done stupidly, a cowboy approach can lead to downtime, poorly-reasoned quick fixes, and blame within an institution.

Often the same cowboy/girl who acted stupidly and caused the bug is lauded as a hero when he gets things working again a week later after the bug is discovered... when it was his/her bad judgment (or "move fast" mentality) that led the offending code to be shipped in the first place.

We must be honest with ourselves and our teams about bad decisions (even in hindsight) and build processes that let us be bold when we need to.

Well, in physics we compute the speed with half life of what we want to keep alive regarding the costs.

Mass transportation is an example of an industry where the half-life is decades or a century. "Fast" for this sector is slower than a coder's life expectancy.

Games made to support a product ad are ephemeral: one month. Faster than a new release of Firefox.

Some dangerous radioactive isotopes have half-lives of 10,000 years. So nuclear-plant-related automation should not be using any CPU.

A citizen's data should last for the government as long as the citizen lives. Governments, if they aim to remain stable, should consider the half-life of their automation to be long, very long. Hence slow changes. Very slow.

So do banks. But banks have to adapt with the speed of trades.

Fast and slow can be set by doing something agile hates: careful business analysis.

The problem is how we have coupled all businesses to an insanely costly rhythm that is forced down on every economic activity by means of noncompetitive business practices, like useless obsolescence driven by hardware monopolies, cheap energy, and cheap, regulated, poor education.

Do we really need 1 PB hard drives when entropy predicts that, with too much data, we will not be able to find any relevant information without costly increases in the means involved?

We don't need fast changing technologies, we need boring, slow changing technologies that are reliable.

I have found that using BDD with Gherkin syntax can really cut down on many concerns when it comes to software stability and ensuring business worth. Many companies do not put enough focus on automated testing, and even when they do they are only testing the surface level with simple functional tests. If you simply put the technical requirements into the BDD tests and have your system validate every affected integration prior to shipping, then you have less debt to pay off, as more bugs should be caught prior to shipping. Couple BDD with writing a negative functional test for every bug that is fixed and you should never trigger that specific bug again. In the end these tests will take more time, but if that time is bundled into the time required for the feature, and your code is already laid out in a testable and reusable way, then it is win-win.
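As a small sketch of "a negative functional test for every bug that is fixed", the example below uses plain Python test functions with Given/When/Then comments rather than a real Gherkin runner. The function `parse_quantity` and the bug it once had (accepting a blank field) are hypothetical, invented only to show the pattern.

```python
def parse_quantity(text):
    """Hypothetical function under test: parse a non-negative item count."""
    cleaned = text.strip()
    if not cleaned.isdigit():
        raise ValueError(f"not a quantity: {text!r}")
    return int(cleaned)

def test_blank_quantity_is_rejected():
    # Given a quantity field containing only whitespace (the hypothetical bug report)
    raw = "   "
    # When it is parsed
    try:
        parse_quantity(raw)
    except ValueError:
        return  # Then the input is rejected with a clear error
    raise AssertionError("blank quantity was accepted")

def test_padded_quantity_is_accepted():
    # Given a valid quantity with surrounding spaces
    raw = " 12 "
    # When it is parsed, Then the numeric value comes back
    assert parse_quantity(raw) == 12
```

The negative test stays in the suite after the fix, so the same bug cannot come back unnoticed.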

Possibly related: "Move fast and fix things" by GitHub engineering. https://news.ycombinator.com/item?id=10739129

I think this article misses what is meant by "Move Fast and Break Things". I never took it to mean 'physically (or programmatically) break things apart while moving quickly', I thought it was more about deconstructing things we considered "normal". Like another way of saying "think outside the box", and do that without hesitation or inhibition.

“Most companies mess up by moving too slowly and trying to be too precise. When you’re…moving quickly or doing anything like this you want to make mistakes evenly on both sides. We wanted to set up a culture so that we were equally messing up by moving too quickly and by moving too slowly some of the time. So that way, we’d know that we were in the middle.”


My interpretation of it is to say that web devs should not be afraid of introducing bugs while implementing new features, as patches can be pushed just as fast.

This may be true for a web site (or app as some like to call them these days). But the mentality has filtered down to actual software running locally, or even whole OSes (see Android for instance, where more than once Google has pushed a feature into the world before all the proverbial edges have been filed down).

Great article!

> Be Aware of What You Break

I work at Rollbar (www.rollbar.com), and that point is probably the key to our existence. You can't afford not to know what's going wrong. And the best way to know what's going on is to get a full stack trace with all the information you need to reproduce into a platform that will notify the appropriate people of the breakage.

Facebook/Zuckerberg have obviously been successful because this particular motto has been dissected countless times in blog posts and HN comments.

Try this search: https://hn.algolia.com/?q=fast+break

Von Braun's team launched over 600 V-2 rockets before any of them reached a militarily useful target.

I'm not worried about the "Move Fast and Break Things Often" mentality. This will only last until it kills a bunch of people. Think code in cars, planes, X-ray machines, pill dispensers, etc.

> In reality more often then not moving fast and breaking things will result in shipping scrappy software. This is because in the rush to get stuff out the fastest way time consuming things get skipped. Like user testing, automation, analytics, monitoring, manual testing - just to name a few.

It sounds like the author could use a dose of his own medicine. The first sentence contains a (common) grammatical error. The third sentence is a fragment. The second is nearly unparseable.
