Scaling Engineering Teams via Writing Things Down and Sharing – Aka RFCs (pragmaticengineer.com)
194 points by jbredeche 4 months ago | 51 comments

This is one of those things that sounds simple to do, but is rarely done. And there are lots of cultural forces that work against it.

I think it addresses a big problem with important tech decisions: how to weigh engineering value and focus energy on the technical merits, not on political buy-in.

I work within a decent size government org where leadership is almost exclusively non-tech (makes sense because we’re not an org with a tech mission), and almost all tech management is program mgmt, business analysis, strategy. And the building is almost exclusively contractors.

This means that a technical decision is hard to evaluate, because the tech manager doesn’t know whether MySQL or Postgres is better, and contractors have incentives not to invite review by other contractors and to make it “good enough” for contract acceptance.

This means if a developer picks MySQL, and the project manager is happy because it’s projected to be delivered on time, then there are two big forces that don’t want comment on the design.

We just started to try a request for comment process, but the extra effort to review is challenging. I think that on the surface it’s because letting large groups see your design before you start introduces organizational risk.

And then a bit deeper is that it requires a greater level of technical depth in decision makers.

Many people call this "Enterprise architecture" and, although painful, it tends to be worth doing.

You can think of it (simplified) as 4 steps:

1) figuring out and deciding which technologies to use for which domains and problems (these tend to be called capabilities)
2) getting rid of the ones you don't want to use
3a) enforcing use of existing, approved capabilities for new projects
3b) building out new capabilities

1) Tends to have the problem you pointed out: "it requires a greater level of technical depth in decision makers". Something that may work is to have some senior engineers in-house and embed them in the outsourced projects doing part of the real work.
2) This can be a multi-year program that needs to be separately funded. Think of it as enterprise-level technical debt.
3a) This can be the easiest part: your architecture should be part of the RFP process, with a well-defined escape hatch to 3b).
3b) Building out new capabilities needs to be i) funded separately but, best of all, ii) done as part of a real project. Otherwise you'll i) have crossed incentives with the projects, as you pointed out, or ii) end up with architecture astronauts.

HTH. There's plenty of literature on enterprise architecture, but there's no silver bullet, it's just hard work.

Yes, I agree, and that's why I'm trying to take this approach.

Interestingly I work in a world where enterprise architecture means something completely different. PM me for more details.

>because letting large groups see your design before you start introduces organizational risk.

What is the risk to be specific? Just curious.

There are a few; the most tangible is that someone not technically qualified deems the design inferior and communicates that to decision makers. So there's this Kafkaesque experience where someone two degrees removed from engineering asks you to defend against a misunderstanding of proper design.

Maybe the easiest to deflect is "You're developing in Ada, and we prefer Lisp, etc. etc."

This sounds like a description of a Design Document, which is very valuable when the problem space is well understood. It can serve as an alternative to sprints and sprint planning for a very small team, or when working with contractors.

When a problem is not as well understood, e.g. can't be solved just by engineering planning upfront and needs user input, I like Google's Design Sprints which uncover the critical features using a 1 week process.

With my experience leading engineering teams and growing startups, I agree completely that documentation is the #1 blocker in scaling engineering teams, all else being equal. If the design, planning, and execution process are well-documented then engineers can be onboarded as soon as they are hired without slowing down the team too much.

To me, a design document is something you write for yourself, once you’ve decided on the design; whereas an RFC is something you write for others, and others write for you, to communicate a potential design when a final design hasn’t yet been “selected.”

It’s effectively a way of brainstorming that doesn’t get quashed by “that’ll never work” half-way through the 1000-ft abstract, because all the details are already there on the page to prove it will work. It’s just a question of which of the proposed designs works best (and then of attempting to maybe incorporate some of the alternatives’ ideas into the winner, though not necessarily.)

Or, in short: it’s a debate where everyone is doing an “argument by constructive proof” for their POV.

The cryptographic-primitive standardization “competitions” for things like SHA are RFC processes under another name.

I think this has a similar purpose to a design document. But the examples they show for the RFCs are more tactical, and are distributed broadly across their company.

Design docs get a bad rap for “big design up front” because, I think, they get fetishized as risk management rather than accomplishing their functional task.

For this post, I think the best part is being able to gather good review and edits, and track changes as decisions are made and why.

Adding to that, I often find that trying to write documentation (spec or otherwise) tends to help reveal which things lack clarity or understanding. In this way it can be useful also in unknown territory.

The most successful company I ever worked for (late 90's, the founder was a billionaire for a few months after the IPO, 99.9% of HN readers use an evolution of what we built every day) had a policy of rejecting any proposal for a tech feature or significant improvement unless it was properly documented.

We had a collection of templates for the types of documents we used. I could create a feature proposal in an hour or two, and the template ensured that all first-round questions were answered out of the box, saving many hours of staff time.

The tech founders set this up and forced everyone else to use it and once we saw the value, peer pressure kept it going. It worked incredibly well.

We weren't writing RFCs, but we had a doc process that respected (and saved) everyone's time and attention. That helped us move fast together. It really worked.
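The template idea above can be partly automated. As a minimal sketch, assuming proposals are written in Markdown and that the template is just a list of required headings (the section names here are hypothetical; the original templates are not shown in this thread), a check like this flags a draft that skips a first-round question:

```python
import re

# Hypothetical section names, chosen for illustration only.
REQUIRED_SECTIONS = ["Problem", "Proposal", "Alternatives considered", "Open questions"]


def missing_sections(doc_text, required=REQUIRED_SECTIONS):
    """Return the required headings that do not appear in a Markdown document."""
    # Collect the text of every Markdown heading line ("# Title", "## Title", ...).
    headings = {
        match.group(1).strip().lower()
        for match in re.finditer(r"^#{1,6}\s+(.+?)\s*$", doc_text, flags=re.MULTILINE)
    }
    # Preserve the template order so the author sees gaps in reading order.
    return [name for name in required if name.lower() not in headings]


draft = "# Problem\nBuilds are slow.\n\n## Proposal\nCache dependencies.\n"
print(missing_sections(draft))
```

A check like this only confirms that a heading exists, not that the section under it answers the question, so it complements review rather than replacing it.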

Aren't the bug templates on GitHub the same approach? A template to ensure key questions are addressed and information is provided. The key is to find the sweet spot: asking just enough, without demanding too much time for something that might not go ahead.

Any chance you could share those?

I could probably find them, but we used Framemaker and I don't know whether I have anything that can open them now.

Framemaker? This brings back old memories (circa 1993) of the only editor I liked. All the other editors I've used since then feel like a downgrade (for technical documentation purposes at least).

> 99.9% of HN readers use an evolution of what we built every day


Browser? 0.1% use curl maybe.

ROI (Radio Over Internet)

I am sorry but I will be cynical here. Does the author know how many shits people give about clarifying and planning things these days? Let alone documenting. ZERO.

I have worked in many companies, from small to huge and in between, and generally the process is the exact opposite. Small-minded managers and insecure engineers want to build things as fast as possible, thinking this will be their big chance to leave a mark on history. So they implement some crap (it's not uncommon to even see different teams building the same thing at the same time!!!) at the speed of light. Then 6 months later they are still patching bugs every week! If you ever suggest that they slow down, try to find a working example before building, or at least organize the work, say by starting to explore the problem domain, they will just bully you into oblivion...

Case in point: Facebook. 'Move Fast and Break Things'

One thing I learned is that being successful in an IT engineering company in the 21st century means letting the idiots do things their idiotic way and concentrating on my own work, and if they propose stupid things then so be it. Politics are always stronger than engineering considerations.

Maybe Uber got it right from the beginning, I cannot confirm that claim. But I am very skeptical that any established organization would change in this direction proposed by the author.

As someone who runs a business with a co-founder, we both are constantly making trade-offs. How do we move quickly while remaining stable? How do we increase our feature set without getting over our skis? How do we grow our business quickly, but not take on the wrong customers?

Incentives can win, and should win, on both sides of the equation at different moments in time. The most mature organizations can feel the pain from one direction or another, and adjust accordingly.

Shortcuts are great when survival time is bought. This can turn into a self-defeating problem when there is extreme habituation to repeatedly and disproportionately pushing effort, cost and/or burdens onto others in the future for the sake of the present, such that the shortcut wastes vastly more net time and money than it could ever have saved. I've seen founders shoot themselves in both feet and become unable to move fast, or anywhere, because they knowingly stuck themselves with keeping 50 plates spinning while trying to swim through quicksand... especially solitary founders, because they lack the check and balance of a second or third founder to keep them honest with reality and on track.

Duplication of effort is common-place at scale because there are too many communication paths to even hope for very much collaboration, much less organize everyone/everything top-down (eg, impossible).
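To put a number on "too many communication paths": if every pair of people is a potential path, n people have n(n-1)/2 of them. The comment does not give a formula, so treating paths as pairwise links is an assumption here, and the function name is illustrative. A small sketch:

```python
def communication_paths(people):
    """Number of distinct pairs among `people` individuals: n * (n - 1) / 2."""
    return people * (people - 1) // 2


for size in (5, 50, 500):
    print(size, communication_paths(size))
```

Growing from 5 to 50 to 500 people takes the count from 10 to 1,225 to 124,750, which is why coordination that works in a small team stops working at scale.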

Technical debt is like the national debt: interest gets paid along the way, but sooner or later there will be a reckoning.

There is often too much ego-investment, NIH and sunk-cost bias for people to diverge much from whichever direction they were moving, even if it's not a good direction to go.

There is no perfect; in fact, the perfect is the enemy of the shipping, and there's an infinitely massive continuum from fragile to awesome.

This isn't bad advice, but I would add to it. If you're a large organization and building in such a way that large groups need to understand the details of what you're doing, then you're doing it wrong.

Ideally speaking, implementations can be worked through with small groups. Only the interfaces need to be exposed and documented. Unless you're doing something beyond CRUD, drowning teams in unnecessary details often results in distraction.

The arguments against this I usually hear are about things like security audits, architecture reviews, and other internal processes designed to ensure engineering quality. However, I'd encourage these teams to also think like platforms that have machine interfaces. People make terrible APIs.

> This isn't bad advice, but I would add to it. If you're a large organization and building in such a way that large groups need to understand the details of what you're doing, then you're doing it wrong.

> Ideally speaking, implementations can be worked through with small groups. Only the interfaces need to be exposed and documented. Unless you're doing something beyond CRUD, drowning teams in unnecessary details often results in distraction.

This is assuming that attrition is not a thing. The set of engineers that are on your team is likely smaller than the total number of engineers that will ever work on your project. When people want to know six months from now why you chose RabbitMQ over other tech at the time, having an RFC lets you point to an artifact versus conjecture on past motivations.

Do the reasons for the choice six months ago even matter today?

Today you have six months of history of it being a good choice or not. And six months of development assuming it was the choice, that may or may not apply to something else.

And the choices available today have likely changed from six months ago, too.

(I'm a little concerned about your team that has no connection to decisions from six months ago, but we can adjust the time frame and the rest still applies)

Of course they matter. Ideally the decision that was made 6 months ago under the conditions/constraints of the time was the right one back then. Since then, circumstances might have changed in ways that would have resulted in a different decision 6 months ago if you had known. Only if you have a record of some sort (ideally with background information and not just the time of the decision) will people be able to understand and re-evaluate the past.

Money and time have been invested since then, and the decision to stay the course or to change again should consider the history of the decision.

I've had this debate in the office. It is, potentially, one of the star examples of where "service oriented architecture" shines. Define the service interface between the teams, and let them do the rest.

Unfortunately, it is too easy to actually talk of a shared infrastructure design. As soon as you have a shared message queue/repository or some other piece of infrastructure, you no longer have a solid contract between teams where each can operate independently of the other.

Oh for sure, this is way easier to say than do. More often than not some team's interface isn't quite what you need, so then you have to have some way of working between teams to make sure the interfaces improve, and then what do you do while you wait for that to happen?

But I've seen large orgs that at least strive for this ideal vs orgs that have thrown their hands in the air and given up. One is definitely a lot easier to operate in.

An RFC says "heads up, I'm planning to make a change." If the tagline doesn't seem relevant to your world, you simply ignore it.

The difficulty is that quite often you don't know until you understand the change. Better to operate in a world where things that impact you don't come through RFCs.

One of the things I love about open source projects is they are almost forced to do this and it works great. Having designs written down and iterated on helps in planning and designing, and helps when it's time to make changes. Being able to see why something was done and the arguments at the time can really help prevent people from making the same mistakes twice as they make revisions on a project.

> Have a few, select people approve this plan before starting work.

How are these people selected?

This contradicts org structures where "architects" are supposed to give "guidance" and enforce a "uniform vision".

Your use of quotes suggests some grudge against architects. I am an architect and I consider my role as one to give guidance and enforce a uniform vision (although "enforce" sounds like I would actually have significant power over this chaos here). For example, last week I learned about a situation which had been causing our developers frustration for months. After a few hours I could confidently give them guidance on what they should do in the short term, what will happen in the long term, and how we should handle it.

I agree that architects sometimes solve too general problems instead of addressing the immediate problem. I'm guilty of this myself. An architect has the responsibility to think further and wider than a developer while also being able to consider the details. I sometimes switch to the wrong gear.

Back to the actual topic: The relevant architect is just one of the select people to approve the plan.

I would love it if I would hear about new plans in written form. Unfortunately, it is usually in some meeting. This modus operandi forces me to sit in lots of meetings which are rarely relevant to me. Even worse, sometimes it is only through rumors which means lots of followup emails to clarify.

We actually do have a similar process to the one in the article, but without step 4 "Send this planning document out to all engineers". I suggested something like this a few times but most people complain that they get too much mail anyways. I also did mail to everybody a few times and it was helpful. So I fully agree with the article, but so far I lack the power to implement step 5 "Have everyone follow the above steps".

We have something like this at work. Approvers are generally respected senior engineers adjacent to the author and representatives from consumers/dependencies of the new system.

One twist on this concept that we've used effectively at Mavenlink is that there is no approval step.

RFCs function more as a communication and participation process before an effort starts, and approval just hasn't felt like a necessary part of that.

Our org is around 50 engineers and has a very collaborative culture already, and maybe approval would be necessary for other environments.

Another benefit from the RFC process in general is that it's very easy for technical leadership and management (as well as everyone at the company, really) to see all the technical efforts underway.

I guess it depends on what you consider approval. Are all comments being addressed before the RFC is worked on, and are all RFCs receiving regular comments?

In a small team with decent communication, RFCs can be prioritized somewhat easily. But they still should be tracked so that you can see which RFCs are being written, which are currently receiving/addressing comments, and which are being implemented by the team.

Great question!

There's no enforcement mechanism, but I can't think of any examples of comments not being addressed before being worked on. Also, so far all RFCs receive comments (between 5 and 50).

Tracking RFCs is a good point, and we do it by putting them in a git repo. The document itself is markdown, and comments happen on a pull request (which also helps with notifying the entire team).

So RFCs with an "open" PR are the ones currently soliciting feedback.
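The convention above (one Markdown RFC per pull request, with the PR state doubling as the RFC state) can be sketched as a small status model. Only "open PR means soliciting feedback" comes from the comment; the other states, the `Rfc` type, and the function names are assumptions added for illustration, and the PR state is passed in as a plain string rather than fetched from any real API.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class RfcStatus(Enum):
    DRAFT = "draft"                      # assumed: no pull request opened yet
    SOLICITING_FEEDBACK = "feedback"     # from the thread: the PR is open
    ACCEPTED = "accepted"                # assumed: the PR was merged
    WITHDRAWN = "withdrawn"              # assumed: the PR was closed unmerged


@dataclass
class Rfc:
    title: str
    pr_state: Optional[str]  # "open", "merged", "closed", or None if no PR exists


def status_of(rfc):
    """Map a pull request state onto an RFC status."""
    if rfc.pr_state is None:
        return RfcStatus.DRAFT
    return {
        "open": RfcStatus.SOLICITING_FEEDBACK,
        "merged": RfcStatus.ACCEPTED,
        "closed": RfcStatus.WITHDRAWN,
    }[rfc.pr_state]


def soliciting_feedback(rfcs):
    """Titles of the RFCs that are currently open for comments."""
    return [r.title for r in rfcs if status_of(r) is RfcStatus.SOLICITING_FEEDBACK]


rfcs = [Rfc("Queue migration", "open"), Rfc("New auth flow", "merged"), Rfc("Cache layer", None)]
print(soliciting_feedback(rfcs))
```

Reusing the PR state means there is no separate tracker to keep in sync, which fits the "no enforcement mechanism" style described above.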

On the previous Platform Team I worked in, we called these Tech Docs and used a template for anything that was moderate-sized, required new software/service selection, or was judged to be complex.

In every case, we had one person responsible for putting together the doc, but the team or a subset of the team participated in brainstorming, whiteboarding, and submitting additional input prior to the first draft. The draft was shared and commented on/amended, with a target finalize date by which time it had been signed off by identified key people as well as others who reviewed it.

I'm curious, does anyone know of documented cases where RFCs have been used in non-software related engineering environments? E.g. manufacturing, equipment design, chemical process design etc.

Yes, but they're called ISO, ANSI, DIN, etc. standards

I'm aware that these standards exist and are being used. My question was more about the engineering process that uses them. E.g. a process design that is developed into flow sheets, then P&IDs, until it is built. Development of PFDs and P&IDs is covered by international standards and procedures. But the underlying knowledge of what is designed, and how, is part of the process knowledge the engineering company has. My question was related to that knowledge and whether there are examples of RFCs being used out there. The RFC or equivalent would then discuss the reasons behind e.g. the selection of a specific pressure vessel head for a certain application. While there are many head options that are code or standard compliant, the engineering company has reasons to choose one. The documentation of these reasons and the history behind it is what interests me.

What are the collaborative steps followed by other big companies to move fast - concept to seeing something in production?

Op of the post here. I’ve been talking to a few people at other large companies and here’s the information I have. Most of this is anecdotal so treat it as such. Appreciate input or corrections from people working at the companies.

- Facebook is the most lightweight on docs. Code is/was still king there and even planning docs might be written after the fact. The downsides I’ve heard are tech/architecture debt building up fast and lots of throwaway stuff built.

- Amazon is quite rigid and requires a concise planning doc. Depending on the org you work in, there might be a few levels of more formal approvals required.

- Google has a process similar to that described in this post, with planning docs being circulated. Due to the large size of the company, docs are routed to specific committees within orgs who give feedback on them.

- For smaller companies it will very much vary. Interesting that some do follow something similar, apparently Cockroachdb has a process close to this one: https://twitter.com/vivekmenezes/status/1047827698956079104?...

Note that the process I described works well when you have a clear idea of what you are building and have few dependencies. For prototyping or for large/complex projects, planning can get way too slow. That’s when a “war room” with a small team building a prototype, skipping all the docs part, will work a lot faster. All the bigger companies I know use this when it is a better fit.

It varies dramatically by team. Personally I don't think the goal should be to collaborate like the big guys, and instead try to emulate the successful smaller companies. Large companies have the resources to support huge amounts of redundancy, making effective collaboration less important.

tnolet 4 months ago [flagged]

It’s fascinating to see people reinvent some form of ITIL and genuinely think they stumbled on some unique or new way of doing things.

Please don't post shallow dismissals, especially of other people's work. A good critical comment teaches us something.


Do you feel like the process described in this post is a facsimile of ITIL? Do you feel like ITIL is generally associated with repeated shipped product?

I can see my tone being snarky. Sorry for that. But yes, I can draw a ton of parallels between how the change management and RFC process (request for change in this case) is used by more traditional companies and the OP’s post. ITIL is no holy grail, far from it, but some things invented in enterprise IT in the 1980’s have a tendency to resurface in some guise or form years later.

Just googled ITIL, it's an interesting word in Indonesia

Have to agree that this seems like a re-discovery of an already solved problem. Sometimes called architecture review, before new initiatives kick off.

I worked at a big, regulated company that had a very strict change control process. Too many RFCs for implementations a week out were met with a response like "what! why did you buy/build a system to do X? We already have three systems that do that!!", but it was too late by then to rethink X. The problem was that it's difficult to deny a plane in the air permission to land: it's coming down eventually, like it or not. So we implemented an architecture review board that was "permission to take off": hopefully before sinking a lot of money and time into something, you run the idea past some connected folks. It was default approve. Someone could raise an objection and ask for more information, but if there were collective shrugs, you were good to go.
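The "default approve" rule described above can be sketched in a few lines. The comment does not say how long the board waited before shrugs counted as approval, so the fixed review window here is an assumption, as are the `Proposal` type and the `may_take_off` name.

```python
from dataclasses import dataclass, field
from datetime import date, timedelta


@dataclass
class Proposal:
    title: str
    submitted: date
    objections: list = field(default_factory=list)  # open objections from reviewers


def may_take_off(proposal, today, review_days=7):
    """Default approve: cleared once the review window has passed with no objections.

    `review_days` is a hypothetical parameter; the thread gives no actual window.
    """
    window_closed = today >= proposal.submitted + timedelta(days=review_days)
    return window_closed and not proposal.objections


idea = Proposal("Buy a system to do X", submitted=date(2019, 1, 1))
print(may_take_off(idea, today=date(2019, 1, 5)))   # still inside the window
print(may_take_off(idea, today=date(2019, 1, 8)))   # window passed, nobody objected
```

The point of the design is that silence is consent: reviewers have to act to stop a proposal, so the board cannot become a bottleneck just by being slow.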

They just re-invented the waterfall method.


"Please don't post shallow dismissals, especially of other people's work. A good critical comment teaches us something."

