I'm going to be contrary and say "that's not possible". At least, it's impossible to maintain feature velocity, even if you maintain development velocity.
The reason is simple: in the beginning you're starting with no legacy code. Features are simple to add because there is no history to work around. However, even a month in, features are going to start blending together, and new features will have to work around existing features.
Your databases will get new tables, and the existing tables will get more columns. The response code is going to fork as the API versions increment. The API space is going to grow.
Technology which handled 1-1,000 requests per minute will fall over when it starts getting hit with 10,000 requests per minute, and will require optimizations, indexes, caches, and other complexities in the stack which aren't new features.
Startups are fun to work for early on, when complications like these don't exist. It requires more than "the best engineers" to keep up development velocity, let alone feature velocity.
Most of the time, yeah that's how it plays out; it's easy to start from a blank slate. But where I differ is that I'd claim that most early startups don't even have much velocity. They don't actually move with speed, but with haste. All possible corners are cut. As the team scales, the debt comes due and forward progress stalls. What strikes me about this article is the forethought that was put into the engineering from the start - this is a company that really does have velocity and wants to maintain as much of it as possible.
> They don't actually move with speed, but with haste. All possible corners are cut.
This sentence strikes me at my trauma center.
The company I'm in is at the critical phase of a startup. Four years down the road of hypergrowth, the knowledge gap and quality gap are apparent. But then I realize that corners are cut because of a lack of experts in the area whose authority is recognized.
Most experts and brilliant minds have the same idea about the corner cutting. When foundations are designed to be reliable, velocity may jump exponentially, they know. Reliable isolation, version gating, and enablement over service are how they should move, they know. Product design and physical constraints have at least one coupling point, they know.
But these people aren't given the authority, not invited into the room where the decisions are made. In the end, those who bring the immediate money prevail over those who bring mere potential of 10x more money.
As the article said, hire the best engineers. But what the article misses is that you should hire the best engineers who are invested, can thrive in the politics of startups, and will use the attained power to bring everyone to success.
One thing I think can really help is to have two different modes within the same team. You move fast, cut all corners, and rapidly iterate when you have an idea by building a prototype, with the only goal being to maximize the speed at which you learn if/what works and what doesn't. Then you basically throw away all that prototype code and build the feature properly, with careful attention to code quality and complexity, testing (++), since now you actually know what you're building and that it's likely a good idea.
Rinse and repeat, so that you only have prototype code where you're actively learning, to be immediately replaced by properly engineered solutions or wiped away (with the same haste you used to create it initially - no "rarely used, badly engineered" features should stay in your codebase due to priority / resources / attention constraints).
The biggest challenge I've had with this approach has been building and fostering the trust and culture needed to have everyone truly go "all in" for both aspects/modes of working. This breaks down quickly if you have managers or PMs drive engineering to keep the prototype code in for longer, or if engineering insists on not cutting corners during the exploration/prototyping phase...
England's King Charles II chartered the Hudson's Bay Company in 1670. It is considered by many to be the world's oldest continuing commercial enterprise. The company outfitted fur traders for expeditions out of Hudson Bay in Canada. In this part of Canada, referred to as "north of summer," having the right supplies in the correct amounts was literally a matter of survival. And, of course, the company's survival also depended on productive—and completed—expeditions. So the company provisioned the expedition's canoes with the necessary supplies for the trip. Then they sent the expedition team a short distance in their canoes to camp overnight. This was a test. The first camp was merely a "pull out," commonly called for many years a "Hudson's Bay Start" (HBS). Said Sir Samuel Benfield Steele in 1874, "[This was] very necessary so that before finally launching into the unknown one could see that nothing has been forgotten, or that if one had taken too much, being so near to the base, the mistake could be easily corrected." This risk reduction technique was not free. It cost the team precious time in a short trading season. They had shipping deadlines to meet. I can imagine them saying "Yes, but we really have to be on our way or we'll miss our shipping schedule."
This sounds great on paper but I have a hard enough time getting buy-in to finish the prototype. I've never seen anyone throw away the prototype and rebuild from the ground up before waiting years and then calling it "V2"
An alternative way to deal with an unreliable prototype is to make a reliable wrapper around it.
For example: paying time in exchange for reliability (e.g. TCP/IP, retry and backoff), disclaimers for unreliable behavior and law-related issues, version gating for file formats/protocols, redundant storage, and kill switches for machinery prone to going rogue.
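To make the retry-and-backoff flavour of that concrete, here's a minimal sketch (plain Python; flaky_call is a made-up stand-in for whatever unreliable thing the prototype does):

    import random
    import time

    def with_retries(fn, attempts=5, base_delay=0.5, max_delay=10.0):
        """Call fn(), retrying with exponential backoff plus jitter on failure."""
        for attempt in range(attempts):
            try:
                return fn()
            except Exception:
                if attempt == attempts - 1:
                    raise  # retry budget spent: surface the error to the caller
                delay = min(max_delay, base_delay * 2 ** attempt)
                time.sleep(delay * random.uniform(0.5, 1.5))

    # usage: callers see a stable interface, the unreliable prototype hides behind it
    # result = with_retries(lambda: flaky_call(payload))

You pay wall-clock time on every retry, which is exactly the trade being described.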
If companies took the concept of an 'error budget' (every SRE principle seems to be important to companies except this one) seriously and saw it as a signal instead of an annoyance there would be some ebb and flow in this realm instead of a continued compounding increase where stability sits on the backs of a few people with the tribal knowledge. Just my 2 cents.
Sure, that's solid policy, but I'm not sure it really addresses the "tribal knowledge" issue per se.
I mean, I'd like to think that fixing reliability issues increases knowledge, but decades on multithreaded codebases littered with "sleep(0); // just in case" point in a different direction.
All I've found to work so far is a deliberate attempt to shift to more active knowledge sharing. I was hoping to learn a few new tricks from OP, but "reliability freezes" are not it, based on my experience.
"Fixing the error" is at times subjective. A less nuanced approach is to freeze non-reliability-improving changes (i.e. merges) until the production meets SLO again. That is the canonical example policy given in Google's SRE Book.
I agree that you can't maintain the same velocity you started with as the company grows, but there are definitely ways to keep as much of it as you can. Some companies slow down a lot more than others.
In my experience, it ultimately comes down to SDLC. Teams need to be able to work (mostly) independently of each other. This means being strict about using API boundaries rather than sharing database tables between multiple apps, and avoiding breaking API changes that force you to deploy software written by multiple teams in lockstep. That is the real reason microservices are valuable: they decouple teams.
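A toy example of the "no breaking API changes" part (field names invented, Python just for illustration): evolve the response additively so consuming teams never have to deploy in lockstep with you.

    def order_to_json(order):
        """Serialize an order (a plain dict here) without breaking existing consumers."""
        return {
            "id": order["id"],
            "name": order["name"],           # old field: keep it until every client has migrated
            "display_name": order["name"],   # new, additive field for newer clients
        }

    # Old clients keep reading "name", new clients read "display_name";
    # the old field gets removed only after a deprecation period.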
To use Frederick Brooks's term, the optimizations, the special cases, the interactions with other parts of the system, start to become essential complexities as the system grows.
The skill of your developers is all that stands between the code base and the total chaos of insurmountable technical debt.
Only to the extent that presence of mind and ethics are skills instead of something you are. People don’t clever their way out of these holes. That requires wisdom, not cleverness.
If you treat performance as a feature, then you can do a lot better by this yardstick than most. The problem is that so many people don’t know what that feels like that they fight you on it. I’ve heard stories about managers trying to dismantle processes I “forced” onto people only to find out they didn’t want to give them up. They just didn’t understand why I was so vehement that we have them, until they had them.
Most people assume that suffering is the natural order of things instead of the default state. So it doesn’t really track for them that choosing your form of suffering today can limit the amount you’re doing later.
> Startups are fun to work for early on, when complications like these don't exist. It requires more than "the best engineers" to keep up development velocity, let alone feature velocity.
This blog post (which is pretty high quality for a content-marketing piece) talks about exactly these issues: https://archive.is/FQKJH (athenian.com), Scaling Your Team From 5 to 250 Engineers.
I completely agree that it is not possible to maintain the same velocity. I have always optimized to decrease velocity as little as possible as you scale the team. What you really want to avoid is doubling your team and having your velocity drop by 50% or more.
Ideally you double your team and velocity decreases by 10-15%.
Genuine question that isn't answered in the article: what is it that Faire is doing that suggests they're maintaining engineering velocity as they scale in a way that is equal to or better than anyone else out there? Alternatively, what is it about the engineering challenges of running Faire's storefront/marketplace that makes them more qualified to write about their experience scaling vs. other organizations?
This article opens with an unsupported assumption, "we are really good at this", and doesn't really elaborate on what that means. I'd genuinely like to know what great engineering at scale looks like, not just some suggested ways to do it.
From my experience, scaling to 100 is also pretty easy. You don't get consistent results across all your teams, but you're small enough to mask this with direct communication and ad hoc process. The foundational stuff really pays off when you go to 200, 300 plus, and you will definitely slow down.
We intentionally focused the article on 1 to 100.
Today we are over 400 engineers and we are still pretty happy with how the engineering velocity progressed as the team grew.
- Have devleads as the central crux for every major decision. Screw "senior" management - make sure that the people who are at the codeface every day are part of the major discussions - that those people need to be persuaded of the need for product X or pivot Y. Because they do whether you treat them like it or not.
- The above means you are "surfacing" politics. Do not keep the backbiting and infighting among a cabal of senior managers who talk to each other a lot. Make it public
- Have one place to have these discussions - an email list probably, and the only way to get your project signed off is to get agreement on this list (well, Ok the CTO can have some veto power - this is not a democracy of course)
- Analysis really works. Publish one-page analysis of the proposed project. Watch it get destroyed with evidence and laughter and then watch even better ideas get proposed.
In short scaling is a political problem - treat it like one. And engineer the horror out of politics. Democracy and transparency are pretty good at that.
Edit: Building a business IMO has strategic, operational and tactical levels. Strategy should be obvious, be PMF and be well known in the company (a PC on every desktop). Most of the article is tactics - hiring, metrics, stack etc. The hard part is often operational - and that is almost always organisation design, and that is about communication, alignment of resources, trade-offs, etc. That's hard. Dysfunctional organisations blow this. Open politics fights dysfunction
If your system relies on "hiring the best engineers", it is operating open loop. All open-loop systems will suffer catastrophic failure at some point.
Grit is a dog whistle for grind. You can be tough and resilient and flexible w/o being gritty.
1. Ensure your devs are making choices that do not mortgage tomorrow. Anything that gives O(1) value (one new feature) and requires O(n) investment (forever maintaining) is a bad deal.
2. Make as much of your code someone else's problem through FOSS. (This is kinda the same idea as innovation tokens)
3. Solve the generic problem, not the specific one. For example, I could write a program that reads a csv row by row, converts it to my internal model, then emits those data into an ndjson file. Or I could just write a program that converts csv to ndjson, agnostic of the data. The former costs you every time the model changes. The latter is useful for a variety of system architectures again in the future.
4. Let the computer do as much as possible for you. Automation is compounding returns: O(1) value from manual testing vs. O(n) value from automated testing - there's a clear winner in my mind. Typed languages prevent a class of bugs; so does TLA+.
5. Most customers won't tell you it's broken, and most engineers won't stay to clean up someone else's mess. Best to avoid those issues or resolve them immediately.
6. Remove broken incentives where individuals leverage everyone else's resources for their own gain. An example of the pattern is "pleased to announce" emails sent to all@ (along with the litany of reply-all posturing). These kinds of emails cost you for every employee, but only bring value to a very small subset of folks. At its worst you have N^2 cost for 0 value. Ensure individuals can be recognized and promoted without such emails. Disincentivize those who bring that behavior from previous broken cultures.
> 3. Solve the generic problem, not the specific one.
It's good that you provided the example...
> I could write a program that reads a csv row by row, converts it to my internal model, then emits those data into an ndjson file. Or I could just write a program that converts csv to ndjson, agnostic of the data. The former costs you every time the model changes. The latter is useful for a variety of system architectures again in the future.
...because in this case I can agree (although I'd need to drill into the details a bit more). Anyway, far more commonly (in my experience) unnecessary effort is spent on making generic code when a solution only targeting the specific problem at hand would suffice. Then later the requirements change or your original assumptions turn out to be wrong, and you need to spend time changing the generic solution to fit the new problem. This can require 10x the effort vs. just continuing from the specific and simple solution.
I guess the difficult art is to be able to recognize those exceptions where it is in fact reasonable to do the extra work for the generic solution. There was a recent submission [1] here about "YAGNI exceptions" which included a few common cases.
Yeah, I basically try to think something along the lines of "for 10% more effort, can I make it so that when the Product Manager changes their mind I have either zero or minimal additional work to do". In the case of converting the csv to ndjson, we can actually just treat each row as "just data" and not try to parse it into whatever the PM said.
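Which in code is about as small as it sounds - a minimal, data-agnostic sketch:

    import csv
    import json
    import sys

    def csv_to_ndjson(infile, outfile):
        """Convert CSV to NDJSON with no knowledge of the columns at all."""
        for row in csv.DictReader(infile):            # header row becomes the keys
            outfile.write(json.dumps(row) + "\n")     # one JSON object per line

    if __name__ == "__main__":
        csv_to_ndjson(sys.stdin, sys.stdout)          # python csv2ndjson.py < in.csv > out.ndjson

When the PM changes the model, this program doesn't change at all.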
> Make as much of your code someone else's problem through FOSS.
Not all FOSS is created equal. Can you rely on well-established FOSS projects like Linux, PostgreSQL, and Redis? Absolutely! Should you rely on a random library you found that happened to be the first to name itself after a trendy new thing and only has one active maintainer? Probably not, at least not in any important area of your business. If you do, be prepared to fork it or replace it at some point.
I know people reading this are going to say, "Duh, of course I wouldn't do that." Congratulations, you understand the concept of risk management. But I have seen firsthand a number of teams fail this basic test. I have spent far too long trying to chase down a bug in some multiselect dropdown jQuery UI library that hasn't been touched in four years. That doesn't sound like it would be that hard to fix, but if it's spread to every important page in your app and is breaking some key features, you're gonna have a bad time.
If you're going to make your code someone else's problem, they'd better be just as invested in it as you are.
This is a good chance to plug supporting FOSS through monetary donations (or feature/bug bounties etc programs). If you save $100k from not having to implement something crazy, then consider throwing them $100. Part of what makes FOSS so important to the world is we're amortizing capital costs across many people, and using the "Free" part to break first mover disadvantages and challenges in coordination.
Reminds me of the PAGNI posts from a couple of days ago.
2. This depends on how much you want to have control over your core business technicalities. Open source works wonders if we're dealing with complex standards, but not when the subject needs to change often.
4. Can't argue with that. Go for the extreme route of business object modelling the functional way.
I would say it's not necessarily ideal to keep shipping a bunch of features and changes every week as you scale. Once you have an established customer base, change needs to be managed and customers need a heads-up on things that will affect their workflow: you'll want to do deprecation periods, beta periods, have backwards-compatibility, etc. No customer is grumpier than the one whose critical workflow you just totally changed without warning!
I suppose as long as those sorts of tasks count as maintaining engineering velocity, it's all good though.
Every company can't have the best engineers. They cost too much and they're a finite resource. I would argue that the quality of engineers is just going down, too, as the hiring pool is flooded with people who've had zero tech experience, and then took a bootcamp and "churned on algorithms" to finally land a tech job. You probably will end up with poor to average engineers, and you'll have to deal with that.
"Building solid long-term foundations from day one"
You have average engineers. The foundations they make are not going to be solid. And even if they could make solid foundations, your founders won't care. They just want to ship ship ship ship ship. So your foundations aren't going to be solid... and you'll have to deal with that.
"Tracking metrics to guide decision-making"
Never seen it done well. If you have the time and staff to properly do this, you're either crazy lucky or are swimming in cash. Those days are probably gone... and you'll have to deal with that.
"Keeping teams small and independent"
Remember Conway's Law? The more teams you have, the more fractured your architecture becomes. Obviously you can't have one infinitely big team, so the key is to have as few teams as you can get away with. Integrate them tightly - same standards, same methods of communication and work. Eliminate monopolies of knowledge. Enforce workflows that naturally leads to everyone being more informed. (For example, every team having stand-up on a video call means none of the teams have any idea what the other teams are doing)
There's actually a lot of evidence-based research and practical experience out there, that's been around for decades, that shows how to maintain the productivity of organizations, regardless of whether they're growing or not. But they're not trendy, so nobody working today gives a crap. Actual productivity gets ignored while managers pat themselves on the back and start-ups churn out banal blog posts with the same lame advice that thousands of startups have used and still completely failed to maintain velocity with. We all know how startups today really get by: burning through staff and money, and luck.
> Every company can't have the best engineers. They cost too much and they're a finite resource.
Your argument assumes there is an efficient market for engineers, and it also assumes that an engineer’s productivity is independent of the company environment.
There are multiple counterexamples of startups that have built great teams for cheap money. There are many reasons why great engineers can be underpriced or under-productive, and many reasons why mediocre engineers can be overpriced.
You can get the best engineers if you have the smarts to do so. Currently on the front page are two relevant links:
It definitely is not easy - and it takes unusual skill to succeed - but that doesn’t make the advice wrong but only extremely difficult. If you lack the skill, then partner up with somebody who does, or find a mitigating strategy, or don’t become a C*O for a startup. Edit: I am an engineer and I hate rah-rah thinking, but if you don’t have enough belief, and you poohpooh all the advice of others, then perhaps you lack the temperament to be a founder. I was a successful founder, but only by joining with others with the skills I lacked as a cynical engineer-type. I think it is amazing that engineers are now regarded as critical - I am from a decade when MBAs were the hot thing!
> > Every company can't have the best engineers. They cost too much and they're a finite resource.
> Your argument assumes there is an efficient market for engineers, and it also assumes that an engineer’s productivity is independent of the company environment.
You're arguing something different than what you're replying to.
The original post asserts you should hire the "best" engineers. Best is a word with a strong connotation of there being some absolute quality scale independent of the company environment. And then you absolutely run into a problem where everybody can't hire the best, especially if you're not in a position to offer the best compensation or have the current state of your company look like the best possible destination for everyone. There's just not that many of them, after all, by definition.
A smart founder (or manager anywhere, for that matter) will try to hire great engineers for their situation. This requires more self-awareness and critical thinking than just locking in on a few people and spending a year wooing them, like the article describes. How many people you hadn't fixated on could've turned out great for them in less time than that year?
Ok maybe it was not clear but I was looking for pointers to the research
Edit:
https://study.com/academy/lesson/organizational-design-theor... is a surprisingly good coverage of the major parts that I remember from way back when. I have to admit that the research then, and what I know now, is a) reportage-like rather than predictive (i.e. describing org design as opposed to having good predictive power) and b) lacking some meaty meta-theory - why is this design bad?
So they value "grit" which is defined as the ability to code and push features in near real time, as told to the CEO at a multi day trade show, then follow that up with explaining the importance of building a solid foundation.
I can't express how much I dislike the advice to be "data-driven" and collect all the data you possibly can because it could be useful someday. While this may or may not be sound business advice, it's deeply unsettling to see such profound disregard for user privacy trumpeted as a key to scaling quickly.
Toxic data-driven cultures will say "You must have data to do X" while the status quo never had to face that bar. It becomes impossible to prove something, because you're never given the resources to even gather the data, let alone try an experiment to move it. So then it becomes "unless it's obviously broken, we don't want to know the truth".
I strongly disagree that being data-driven and collecting a lot of data is directly correlated with disregarding user privacy. You can collect a lot of data on how customers interact with your product without associating any of this data with a specific customer.
And also the goal of being data-driven is not to scale fast. The goal is to validate your features are meeting the assumptions and expectations while providing customers with as much value as possible.
Optus and Medibank in Australia have recently had major breaches, and the fallout in public opinion is still going on.
A panel on the news was interesting; one person likened data to uranium - dangerous to hold and difficult to dispose of. The other likened it to the new gold.
It'll be nice when legislation tightens up to minimize the latter feeling.
Huh, I fairly recently started working in healthcare records, and I actually told people that the way they talk about PHI (personal health information), you'd think the hard drives were radioactive. In fact, we had to come up with and follow some very detailed procedures to dispose of it.
It's interesting to see CI wait time as a core engineering metric. I 100% agree, so much so that I'm building a product specifically to speed up CI. We have redesigned how hosted CI works, focusing on speed as our north star. We don't have CI wait time: your test starts immediately and we run your tests super fast. How do we do it? We have many workers with your test environment pre-built, so we can split your tests and start running them in seconds. If anyone is interested you can check it out at https://brisktest.com/, I'd love to hear any feedback from the community.
Tell us more about your prebuilt environments. Are they reused between customers?
How do you protect your customers' code and secret tokens?
What does the customer have to do to parallelize their tests? Separate them into individual (or groups of) small test (rspec etc) files? Or is there some automation magic?
Pricing is reasonable, but there's a big gap between $0 (5x) and $129/mo (60x).
I suspect you want to drive customers to the "Wow that's great" performance level, at the "Yeah that's reasonable" price, because this optimizes your revenue to costs, while making customers happier than they expected to be. (This is a good strategy!)
But might there be some space for "Yeah that's really good" performance at a "Wow that's affordable" tier? 20x for $49/mo?
Nice service BTW. The SaaS CI market is ossified and needs a good shakeup!
So each prebuilt environment is basically a container that is exclusively accessed by your test runner. We set up the environment in the container and code and tests are run there. So there is no sharing of the build environment between customers.
We are pretty paranoid about protecting people's code (as you might expect). We do all the "normal things", encrypt connections, etc. One novel-ish idea we use is that different customers' code never shares a binary. What that means is that the CLI connects through a load balancer directly to its dedicated server (a binary running in its own container), which connects to its dedicated workers (more binaries in their own containers). Once we are finished with a server or worker, we destroy the entire container. We have multiple layers of security; for example, once a customer has connected to a container, only a customer with the correct key can connect again (so it should be the same customer). We also offer on-prem for people who need more, security-wise.
I feel we are a long way from getting the pricing right. At the moment most people are on the free tier just trying things out and getting started. I think as we grow we'll probably fill the space between the two tiers as you suggest. But to be honest, the details of exactly what this will cost us at scale is still very much open. Usage rates and test suite size are very variable and until we have a good idea on that we are just trying to make the best guesses we can and try not to shoot ourselves in the foot too much :)
Just saw the question about testing splitting - yes there is automation magic. You give us a command to list your tests and we split them as optimally as we can.
These are really good answers. I couldn't find any of this info in the docs on your site though!
And BTW, it's been a while since I last checked the numbers, plus we've moved to GitHub Actions recently. Our current setup time is almost 5 minutes for each parallel runner. Saving that time by using prebuilt runner images starts to sound like a great idea.
I'm curious, do you keep an image for each customer? And you must have some heuristic for determining when their setup environment changes and requires rebuilding?
Yes, I'll need to update the site. I'm always torn between putting too much info there or not enough, but I guess I need more.
We don't keep an image per customer. But we do have general images, so something like ruby-2.8-node-14 that contain a lot of the setup. Then building is just running the customers build commands, "yarn install" etc.
The rebuilding heuristic is a list of relevant files. We hash the contents and if the hash doesn't match we know we need to rebuild.
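In sketch form it's basically this (simplified, not the production code, and the file names are just examples):

    import hashlib

    def environment_hash(paths):
        """Hash the files that define the environment; if the hash changes, rebuild."""
        h = hashlib.sha256()
        for path in sorted(paths):               # sort so ordering can't change the hash
            h.update(path.encode())
            with open(path, "rb") as f:
                h.update(f.read())
        return h.hexdigest()

    # e.g. rebuild = environment_hash(["Gemfile.lock", "package.json", "yarn.lock"]) != cached_hash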
I sympathize with the tension. These are the big questions I still had after reading your site docs. Thanks for taking the time to answer them!
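For anyone else wondering about the splitting "magic": my guess is it's some flavor of greedy bin-packing on historical timings. A toy sketch of the general idea (definitely not claiming this is what Brisk actually does):

    import heapq

    def split_tests(durations, workers):
        """Assign test files to workers, longest first, to balance total runtime.
        durations: {test_file: seconds from a previous run}."""
        heap = [(0.0, i, []) for i in range(workers)]      # (assigned_seconds, worker_id, files)
        heapq.heapify(heap)
        for test, secs in sorted(durations.items(), key=lambda kv: -kv[1]):
            total, wid, files = heapq.heappop(heap)        # worker with the least work so far
            files.append(test)
            heapq.heappush(heap, (total + secs, wid, files))
        return {wid: files for _, wid, files in heap}

    # split_tests({"a_spec.rb": 120, "b_spec.rb": 45, "c_spec.rb": 90}, workers=2)

You'd presumably need some fallback guess for tests that have never run before.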
We use RSpec, Capybara, Selenium, Chrome Driver. We parallelize our tests semi-manually, which is a minor inconvenience.
Recently moved to GitHub Actions, which makes it easy to cache dependencies (mostly gems), although it's not quite a pre-built image. Startup cost with fully warmed cache is about 55 seconds, almost all of which appears to be spinning up the container. Not sure what the underlying tech is, but that seems slow.
I think you'd be surprised how long some tests queue waiting for a runner - but I do agree with you that it's not the longest delay running tests, and it is relatively easy to fix with budget.
I would say our solution mostly breaks down to massive parallelization and in order to support this pre-built environments. The problem with massive parallelization with traditional hosted CI is that the first 5-6 minutes (maybe as many as 10 minutes) of your build is just creating your environment. So it is extremely inefficient to spend such a long time building for say 2 minutes of tests. Your total test time is 6 minutes setup + 2 minutes run time. Scaling that up by adding 100 machines is only going to bring the 2 minutes down, you are stuck with the 6 minutes build time.
By maintaining the environment between builds we can add more workers and get that test run time down to the 10 seconds range (with zero build time on most builds). Obviously we add complexity around knowing when to rebuild - but that is the price we pay for getting super fast tests.
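To put rough numbers on it, using the same 6-minute/2-minute example (a toy model that ignores per-worker overhead):

    def ci_time(setup, test_minutes, workers, prebuilt=False):
        """Rough wall-clock minutes for a CI run: per-runner setup plus the split test time."""
        per_worker_setup = 0 if prebuilt else setup   # prebuilt environments skip the setup cost
        return per_worker_setup + test_minutes / workers

    print(ci_time(6, 2, 1))                    # 8.0  -> classic hosted CI, one runner
    print(ci_time(6, 2, 100))                  # 6.02 -> more machines barely help: setup dominates
    print(ci_time(6, 2, 100, prebuilt=True))   # 0.02 -> with prebuilt envs, parallelism actually pays off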
> I think you'd be surprised how long some tests queue waiting for a runner
Why would you ever queue? Our runners take about 40 seconds to come up from nothing. If 50 jobs come in we start 50 runners and then kill them again afterwards.
It’s not like 50 x 1 is more expensive than 5 x 10 x 1.
Yea, that sounds somewhat normal. I see builds getting bigger when you add in a bunch of yarn packages, building assets and webpacker etc. As projects get larger these tend to bloat up.
You might be interested in looking at this demo of a Rails Rspec/Capybara test suite (https://brisktest.com/demos#rails) it's over 60 minutes locally on my macbook, we get it down to 90 seconds on brisk. It's got a lot of Selenium Chrome tests which is why it's so long (90 seconds is long for us), and I think fixing their flakiness would shave a lot off those seconds (lots of rspec retry going on) but it's a good example of a real-world test suite. Often fixing flaky tests is not the priority and getting your tests run in 90 seconds is huge.
Python is actually on the roadmap; it's next up. Is it something you are interested in? We are in the process of trialing our solution with one other company, and I'd love to add you to our beta test if that's something you'd be interested in.
Naïve and unavoidable. As you add more people, the people issues become more complex and require more sophisticated approaches. A 10,000-employee company is very, very different from a 2,000-employee company. A 100,000-employee company is again very different.
The art then is how do you evolve your culture to adapt, with intent, based on solid principles but that might express themselves differently at these different sizes.
This only works if a pod is equal to a domain that's fairly independent of the other domains, technically, and business-wise.
If there are N pods per domain, each acting as its own startup without additional coordination, the result is chaos and duplicated work. Business complexity not included (two pods from different domains can unknowingly work against each other due to having conflicting goals/targets).
I wish I'd see advice where they prioritize having a roadmap, milestones, an actual plan of execution, after Product Market Fit (PMF) has been found. And before you find PMF, any concern for hiring the 'best engineers', building a 'solid foundation' is moot.
I feel I'm going insane expecting product people to put the time in to define the requirements and context; I get weird looks asking startups about the plan for the next week. "That's how startups do it" is just the most bullshit excuse I keep hearing constantly regarding lack of planning.
Scaling engineering velocity is also dependent on the domain and strategy. If the strategy is to throw darts at the wall and see what sticks, one can scale independent teams. If the strategy is to leverage what we have to build new features, then teams have to communicate with each other, and this doesn't scale linearly.
lol this feels like exactly the opposite of what's going on at my company now:
1. Hire the best engineers: fire half the dev team and replace them with offshore devs for pennies on the dollar
2. build solid foundations: cut every corner possible to get whatever crazy deal our sales team made yesterday
3. tracking metrics: uptime? who cares, CI taking too long? whatever
well, I suppose our teams are small when they literally fired everyone and made the "team" so small it literally couldn't be smaller or it wouldn't be a team (there are 2 of us now). Hardly independent though, since we're shackled to the whims of clueless sales drones that have zero clue how building software works.
> We use Redshift to store data. As user events are happening, our relational database (MySQL) replicates them in Redshift. Within minutes, the data is available for queries and reports.
Honestly, I'm so tired of arbitrary UI and behavior changes, I'd actually say more companies need to back off their engineering velocity as they scale. Get it right, then leave it alone.
Do you feel the same way about the many, many startups with all-Indian or all-East Asian rosters? That they seem to have “avoided” the various ethnicities not currently employed by them?
Here is what I think are several root causes of poor velocity
1. too much focus on hiring
2. lack of clear responsibilities
3. lack of management <-> line worker interaction
4. bad mentor <-> new grad ratios
5. bad product development to infra (build infra/infra infra/dev tools etc) ratios
6. mistaking prolific senior engineers for good senior engineers
7. letting senior engineers off the hook for maintenance
8. lack of some process
9. hiring specialists
One can ask what sacrifices you make to hire good engineers. You might choose to make exciting infrastructure investments rather than a necessary investment. You might promise that the "good engineer" won't have to do incredibly boring work. You might hire people who have made a career out of avoiding the real high risk pain centers of a company and instead working on high visibility low risk problems. How much of which engineer's days will be sacrificed to interviews? The engineering concessions made towards the goal of hiring are likely an underrated root cause of poor velocity.
I watched the most festering pile of code at a company be hot-potato'd between the VP of infra and the VP of product. The CTO was checked out and not in touch enough with what was happening to know this was a problem. Neither VP brought it up as a problem, because neither wanted responsibility, and therefore the likely black mark by their name for the uphill battle that would result. The company deeply suffered because there was no advocate for the company's highest pain area; everyone with power, clout, or authority avoided responsibility for it.
When management gets insular and fails to solicit direct feedback from line workers, they can't be sure the picture they have in their head matches reality. This creates management-level delusions about the state of their engineering org. We can see this played out in US vs. Russian military structure. Management sends goals down and expects them to be adhered to. Failure results in punishment. This creates rigid planning and low agility. The US military instead gives lower levels large leeway to achieve higher-level goals. It is the lower levels' responsibility to communicate feasibility or feedback, and more importantly it is upper management's responsibility to adapt plans based on that feedback. I was absolutely part of an "e-4 mafia" (https://www.urbandictionary.com/define.php?term=E4-Mafia) and I knew much better than my superiors what was happening, why it was happening, who was doing it, who could help doing it, and its likelihood of success, because I was in the weeds. When I laughed directly at managers who told me their plans, they thought it was something wrong with me, not something wrong with their plans. That was half management failing, and half my inexperience in leading upwards.
Every new grad needs one mentor to prevent them from doing absolutely insane overly complicated things. If you do not have a level of oversight, complexity will bloom until it festers. A good mentor preventing new grad over complications can save an incredible amount of headaches. New grads should not be doing other new grads code reviews (for substantial work). Teams should not be comprised entirely of new grads and an inexperienced tech lead. New grads are consistently the largest generators of complexity.
I worked at a place where there was 1 person working on build infra. .2% of the company was devoted to making sure we had clean reliable builds. I estimate 5-15% of the engineering org quit due to pain developing software, which meant there was a lot of time spent interviewing people and catching them up rather than fixing problems. I don't know what the right ratio is, but I can say for sure that if you don't invest in dev tools/build infra etc, early enough, you will hit a wall and it will be damaging if not a mortal wound.
There are lots of engineers who code things to be interesting. They write overly complex code. They lay down traps in their code. It's rare for there to be a senior engineer who writes boring, effective, and, most importantly, simple code. Some of the code I've seen senior engineers write violates every principle of low coupling, no global state, being easy to test, etc. These people are then given new grads who learn to cargo cult their complexity until it gets to the point where someone says 'we have to re-write this from scratch.'
There is an anti-pattern where senior engineers get to create a service with no oversight, then give it to other people to maintain and build upon or "finish." Those teams seem to have low morale and high turnover. The people left on those teams aren't impressive and so it gets harder to hire good engineers for those teams. If a team is the lowest rung on the ladder, clearly evidenced by being given problems and being told to "deal with it," that will show to new hires only exacerbating the problem.
Some people hate process; it slows them down. Bureaucracy is (debatably) terrible. But one design doc with a review can save quarters of work. Some process slows progress down now in exchange for fewer roadblocks later. If process isn't growing at roughly O(log(n)) - whether it lags behind that or outpaces it - there are probably going to be some problems.
Lastly, while it's important to hire good people, it's also important to hire some specialists. Databases, infra, dev tools, build infra, platform/framework infra, various front-end things, traffic infrastructure. There are all types of specializations, and if you have a good "full stack" product engineer in the room without say a platform/framework specialist, you will get the product development perspective without the product maintenance perspective, and that has exactly the consequences you might expect. The earlier you get an advocate for say "build infrastructure," the more you are able to address future problems before they are major problems.
This article dances around some of the important stuff.
Maintaining engineering velocity is about adapting your culture to focus on the right things. At the beginning, engineering velocity is driven by low communication barriers and fast decision making and lack of existing tech debt. But this doesn't scale.
For engineering velocity at scale you need the following:
- Maintaining a high quality bar: You need to properly manage and prioritize tech debt and infrastructure complexity. This also means keeping dependencies low. The more engineers you have, the more code you're going to have. If this catches up with you, your velocity can be severely impacted.
- You need to have good change management: things like database migrations or big code changes should carry the least amount of risk possible. Documentation is important. You quickly get into hot water if you get stuck here.
- A culture of continuous improvement: teams should be data driven in measuring their performance and motivated to maintain and improve that performance over time. That means, for example, tracking sprint completions, bugs, etc. Each team needs to own this. The goal: ship high quality code faster.
- A close connection with the business and customers: When you focus on what the business needs and what the customer needs, it prioritizes things in such a way that your teams are dialed in to work on only what is needed. You can waste a lot of engineering time on things which don't matter.
- A culture of coaching and personal improvement: hiring good people is key, but ultimately it's how they play together as a team which is the most important. People need guidance on how they can be the most useful in that context. Sometimes people don't work out, so having fast feedback and showing people the door when they don't work out is necessary. Not doing this is a great way to demotivate other team members, at the very least.
- High amounts of ownership. This has sort of become a trendy topic in management philosophy, but ultimately it comes down to: can teams make autonomous decisions and own outcomes in the greatest way possible. This is all aimed at reducing decision making and communications overhead. It is also about making sure people with the greatest context can make decisions rather than somebody higher up with less context.
I'm sure I'm missing other stuff, but if you apply these principles things start aligning themselves in a direction where velocity is constantly improving.
However, I think talking about engineering velocity can sometimes miss the mark. What you really want is consistency.
Engineering lives within a broader organizational structure and other parts of the business need to be able to rely on you for certain things. When you can say: "Yes we can ship this feature by this date" and then hit that, you enable a lot of things. So yes, being fast is great. But you won't get there without being consistent.
This is bullshit advice. Hire the best engineers? Might as well say "don't make mistakes" in a list of advice about how to minimize mistakes. Same thing with building solid foundations from day one, maybe you do it, but more likely is that the person that comes in to make sure velocity can keep up is different from the core group that was just iterating to find PMF.
The other two points are more relevant, but I feel like half of this advice is not really useful. There's way more to learn from descriptions of turning around a codebase whose CI/CD pipelines were struggling with slowness, flakiness, and contention, and how to dig yourself out. Those stories, at least, you can learn from and adapt to your situation.
If "hire the best engineers" is your advice for anything, that is only a tenable strategy for VC backed startups willing to burn 200k/person from day 0, but I guess this is on a VC blog, so what can you expect. More useful advice is "how to do X with normal people".
"Hire the best engineers" is somewhat simplistic, and I don't think the article offered great insights on that, but ... all business where I worked that had problems actually Getting Stuff Done were due to either bad hires or excessively bad code (which is often the same as "bad hires", or rather, the result of it).
I've never seen tooling really help. I mean, it's nice, saves time and effort, ensures some degree of quality, and all of that, but it doesn't really move the needle all that much in getting stuff done compared to having a team that works well together. Actually, I've seen desperate adoption of tooling in futile attempts to fix these more fundamental issues, which is how some companies end up with 40 linters that moan and whine about every little thing so everyone is adding //nolint statements every 20 lines.
In short, I do think that it's good advice, although personally I'd phrase it as "hire the right people". Someone who is a good engineer – maybe even among the best – will not necessarily fit in well with any team, or any job, and there are other important qualities too aside from engineering skill. A junior can be a good hire for your team, even in a startup scenario.
How you actually hire the "right people" is of course not easy.
I was the first backend engineer hired at Faire joining 2 of the cofounders in early 2018. I can assure you I wasn't getting paid anywhere close to $200k when I started, and neither were my extremely talented colleagues.
This is going to sound harsh, but based on this, you weren't the "best engineers", I'm sorry to say. If you think the best engineering talent in the world is working on some random ecommerce website for less than 200k/yr, I have news for you. And you know what, that's fine, because people who care about "best engineers" usually suck. What a vapid thing to waste time wondering about. I'd bet you have hardcore impostor syndrome in all your new hires if this is the messaging you put out.
What you want is hard working people that are cool to work alongside and fix problems without much drama or bike shedding, not "the best". This circle jerk of "best engineers" is only useful to pay you less and keep you happy.
Instead of calling each other best, focus on the work and show the work, if it's good, people will judge your organisation's engineering skills appropriately, otherwise you're just patting your own back for no reason.
All good! I agree with you, at the time I wasn't one of "the best". But I think there's a couple key things, one of which you touched on.
The first is that "the best" is subjective. To me, the best engineers _are_ the ones who you enjoy working with and don't cause drama or bikeshedding. Some 10x programmer who everyone hates working with isn't "the best". So perhaps it all depends on what traits others are looking for.
The other is that as a startup you can try to hire "the best" engineers today, _or_ hire people who you see potential in and can grow with the company. I'd put myself in that latter group, along some of my extremely talented coworkers I mentioned. Having that growth opportunity is part of what attracted me to Faire in the first place, and is a way to compete for talent on more than just comp like the original comment implied.
Thanks for sharing! And sorry again for being harsh. I'm keen on reading more about the actual strategies - engineering tooling or otherwise - that you worked on to keep things chugging along. The focus on metrics I think is very under-appreciated in our industry, and the more we can standardize to compare velocity and easily identify companies that don't give a shit about developer experience and velocity, the better in my opinion.
I would say that I believe it's a much happier life the less you think about how good you or your colleagues are, and the more you think about the systems and the tools you develop. I really meant what I said about impostor syndrome; I'd be curious what answers you'd get from your engineering team regarding that. If so far you've focused a lot of your internal messaging/culture on "being good", you might find some opportunities for honest feedback and sharing, which might get you a more tight-knit, empowered team - and that does wonders for velocity.
"Hire the best engineers?" doesn't read like bullshit to me. A lot of startuos start out with scrappy teams who only half know what they're doing in their domain. That's partly because the domain changes every couple of weeks.
When you scale, that changes. It's crucial that you hire competent people for whatever your startup ended up doing. Have a lot of frontend? Hire UX designers and senior frontend engineers with 10+ years experience. Doing advanced AI? Hire someone who's been in that space for 5-10+ years. Etc. They're worth every penny.
The logic of fake-it-till-you-make it doesn't scale, but many don't realize it early enough.
What is missing is the tradeoff. Hire the best engineers at the cost of what? It’s not a strategy until that’s articulated.
For example you say “they’re worth every penny” but are they? Would you borrow from your 401k to hire them? Pull your kids out of college to afford them? They might be worth a lot- what are you suggesting gets traded off?
The thing is I’ve met founders who implicitly don’t want to hire the best anything. The best are driven, intelligent and experienced which to some founders comes off as a threat. So they hire yes people instead who won’t push back.
This is like a poorly written mission statement like "Win the enterprise market". Of course we want to win the enterprise market and hire the best engineers. Any statement whose converse would be an absurd business tactic should be treated as not particularly useful. "Hire the worst engineers!" This is like a college football coach saying "Recruit every 5-star player in the U.S."
We all want to hire the best engineers from day one. The execution of that strategy requires a combination of early wins, interesting problems, technology stack, compensation, leadership, culture, vision and geography to attract top talent. Faire may be able to check most of the boxes, yet some rock star is going to find the business model boring and the leadership uninspiring and go work for an org that is more aligned with their values and ambition.
Hiring the best engineers is table stakes for a company centered around a love and deep appreciation for building and shipping magical software.
Even from day one, no competent founder says "I should deliberately not hire the best engineer I can find based on the comp I can offer". You may not be able to get them, but that shouldn't change your hiring strategy.
My reaction to the table of contents was similar to yours, and I think it was a poor choice of words to summarize the first section.
Maybe something like "Hire customer-oriented engineers" would've been a better title for section one.
I think the "grit" and expertise in core technology are also important points, but it doesn't hook a reader the same way as saying "don't hire an engineer that can't adapt to the customer" (gloss over the double negative)
> From the beginning, we built a simple but solid foundation that allowed us to maintain both velocity and quality. When we found product-market fit later that year and started bringing on lots of new customers, instead of spending engineering resources on re-architecturing our platform to scale, we were able to double down on product engineering to accelerate the growth.
This is the gold standard. It takes exceptional talent to put together an organization like this while searching for product-market fit. Bravo!
Step 1. Set up data pipelines that feed into a data warehouse.
Amendment if you are using financial data...
Step 1a. Rather than building this stuff yourself, go to rose.ai and use our financial data warehouse, pipelines*, and pre-built models to save yourself months.
*If you are in tradfi, we have Bloomberg, Refinitiv, FRED, CapIQ, FactSet etc.
If you are in crypto, we have integrations with the blockchain, Dune, coinmarketcap, coingecko etc.