>That said, features and bug fixes were oftentimes gated by feature flags
Sorry for maybe a silly question, but how do feature flags work with migrations? If your migrations run automatically on deploy, then feature flags can't prevent badly tested migrations from corrupting the DB, locking tables and other sorts of regressions. If you run your migrations manually each time, then there's a chance that someone enables a feature toggle without running the required migrations, which can result in all sorts of downtime.
Another concern I have is that if a feature toggle isn't enabled in production for a long time (for us, several days is already a long time due to a tight release schedule) new changes to the codebase by another team can conflict with the disabled feature and, since it's disabled, you probably won't know there's a problem until it's too late?
> Sorry for maybe a silly question, but how do feature flags work with migrations? If your migrations run automatically on deploy
Basically, they don't. Database migrations driven by frontend deploys don't really make sense at Facebook scale, because deploys are nowhere near synchronous; even feature flag changes aren't synchronous. I didn't work on FB databases while I was employed by them, but when you've got a lot of frontends and a lot of sharded databases, you don't have much choice; if your schema is changing, you've got to have a multiphase push (sketched below):
a) push frontend that can deal with either schema
b) migrate schema
c) push frontend that uses new schema for new feature (with the understanding that the old frontend code will be running on some nodes) --- this part could be feature flagged
d) data cleanup if necessary
e) push code that can safely assume all frontends are new feature aware and all rows are new feature ready
IMHO, this multiphase push is really needed regardless of scale, but if you're small, you can cross your fingers and hope. Or if you're willing to take downtime, you can bring down the service, make the database changes without concurrent access, and bring the service back with code assuming the changes; most people don't like downtime though.
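A rough, Postgres-flavored sketch of how the database phases might map to SQL for a hypothetical new column (all table and column names here are made up; phases (a), (c), and (e) are frontend code deploys, not SQL):

```sql
-- (b) expand the schema in a backward-compatible way:
--     old frontends simply ignore the new, nullable column.
ALTER TABLE user_settings
    ADD COLUMN notification_prefs text;

-- (d) backfill / clean up existing rows in small batches so the
--     table is never locked for long.
UPDATE user_settings
SET    notification_prefs = '{}'
WHERE  notification_prefs IS NULL
  AND  user_id BETWEEN 1 AND 100000;   -- ...repeat per id range...

-- (e) only once every frontend writes the column and every row is
--     backfilled, tighten the schema.
ALTER TABLE user_settings
    ALTER COLUMN notification_prefs SET NOT NULL;
```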
>Basically, they don't. Database migrations driven by frontend deploys don't really make sense at Facebook scale, because deploys are nowhere near synchronous; even feature flag changes aren't synchronous.
Our deployments aren't strictly "synchronous" either. We have thousands of database shards which are all migrated one by one (with some degree of parallelism), and new code is deployed only after all the shards have been migrated. So there's a large window (sometimes up to an hour) when some shards see the new schema and others still see the old schema (while still running old code). It's one click of a button, however, and one logical release; we don't split it into separate releases (so I view them as "automatic"). The problem remains, though, that you can only guard code with feature flags; migrations can't be conditionally disabled. With this setup, if a poorly tested migration goes awry, it's even more difficult to roll back, because it will take another hour to roll back all the shards.
Serious question: are you going to catch "corrupt data"-style migrations in staging in general?
There are of course "locks up the DB"-style migrations where you can then go in and fix it, so staging helps with that. But "oh this data is now wrong"-style errors seem to not really bubble up when you are just working off of test data.
Not to dismiss staging testing that much, but it feels like a tricky class of error where the answer is "be careful and don't delete data if you can avoid it"...
We don't have a staging environment (for the backend) at work either. However, depending on the size of the tables in question, a migration might take days. Thus, we usually ask the DBAs for a migration days or weeks before any code goes live. There's usually quite a bit of discussion, and sometimes suggestions for an entirely different table with a join and/or an application-side (in code, multiple queries) join.
Sorry for the silly question, perhaps, but what is the purpose of a db migration? Do schemas in production change that often?
For context, the last couple of services I wrote all have a fixed but implicit schema (built on key-value stores). That is, the DB has no types. So instead, the type system is enforced by the API layer. Any field changes so far are gated via API access, and APIs have backwards-compatibility contracts with API callers.
I’m not saying that the way I do it currently is “correct” - far from it. I strongly suspect it’s influenced by my lack of familiarity with relational databases.
There is a lot to be said about enforcing the schema in the database vs doing it in application code, but not doing migrations comes with an additional tradeoff.
If you never change the shape of existing data, you are accumulating obsolete data representations that you have to code around for all eternity. The latest version of your API has to know about every single ancient data model going back years. And any analytics related code that may bypass the API for performance reasons has to do the same.
So I think never migrating data accumulates too much technical debt. An approach that many take in order to get the operational benefits of schemaless without incurring technical debt is to have migrations lag by one version. The API only has to deal with the latest two schema versions rather than every old data representation since the beginning of time.
Variations of this approach can be used regardless of whether or not the schema is enforced by the database or in application code.
Relational databases can be very strict; for example, if you use foreign key references then the database enforces that a row exists in the referenced table for every foreign key in the referring table. This strict enforcement makes it difficult to change schema.
The way you handle things with API level enforcement is actually a good architecture and it would probably make schema changes easier to deal with even on a relational database backend.
A fairly recent example is a couple of tables for users who are “tagged” for marketing purposes (such as: we sent them an email and want to display the same messaging in the app). These tags have an expiration date at the tag level, but we wanted the expiration date per-user too. This enables marketing to create static tags. That required a data migration so it could be supported.
Schemas don’t change that often, in my experience.
For minor changes there's a simpler path, where you add a new field to the database, default the value to some reasonable value, then add it to the workflow in stages.
Depending on how your database feels about new columns and default values, there may be additional intermediate steps to keep it happy.
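A minimal Postgres-flavored sketch of that staged path, with hypothetical names (older databases may rewrite the whole table when a default is added together with the column, hence the intermediate steps):

```sql
-- 1. add the column as nullable: cheap, no table rewrite
ALTER TABLE orders ADD COLUMN priority integer;

-- 2. new rows get a reasonable default from now on
ALTER TABLE orders ALTER COLUMN priority SET DEFAULT 0;

-- 3. backfill old rows in batches, then optionally enforce NOT NULL
UPDATE orders SET priority = 0 WHERE priority IS NULL AND id <= 100000;
-- ...repeat for each id range, then:
ALTER TABLE orders ALTER COLUMN priority SET NOT NULL;
```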
The idea is to have migrations that are backward compatible so that the current version of your code can use the db and so can the new version. Part of the reason people started breaking up monoliths is that continuous deployment with a db-backed monolith can be brittle. And making it work well requires a whole bunch of brain power that could go into things like making the product better for customers.
> another concern
Avoiding "feature flag hell" is a valid concern. It has to be managed. The big problem with conflict is underlying tightly coupled code, though. That should be fixed. Note this is also solved by breaking up monoliths.
> tight release schedule
If a release in this sense is something product-led, then feature flags almost create an API boundary (a good thing!) between product and dev. Product can determine when their release (meaning set of feature flags to be flipped) is ready and ideally toggle themselves instead of roping devs into release management roles.
>The idea is to have migrations that are backward compatible so that the current version of your code can use the db and so can the new version
Well, any migration has to be backward-compatible with the old code because old code is still running when a migration is taking place.
As an example of what I'm talking about: a few months ago we had a migration that passed all code reviews and worked great in the dev environment, but in production it led to request timeouts for the duration of the migration for large clients (our application is sharded per tenant), because the table was very large for some of them and the migration locked it. The staging environment helped us find the problem before hitting production, because we routinely clone production data (deanonymized) of the largest tenants to find problems like this. It's not practical (and maybe not very legal either) to force every developer to have an up-to-date copy of that database on every VM/laptop, and load tests in an environment very similar to production show more meaningful results overall. And feature flags wouldn't help either, because they only guard code. So far I'm unconvinced; it sounds pretty risky to me to go straight to prod.
I agree however that the concern about conflicts between feature toggles is largely a monolith problem, it's a communication problem when many teams make changes to the same codebase and are unaware of what the other teams are doing.
> Well, any migration has to be backward-compatible with the old code because old code is still running when a migration is taking place.
This is definitely best practice, but it's not strictly necessary if a small amount of downtime is acceptable. We only have customers in one timezone and minimal traffic overnight, so we have quite a lot of leeway with this. Frankly even during business hours small amounts of downtime (e.g. 5 minutes) would be well tolerated: it's a lot better than most of the other services they are used to using anyway.
> Well, any migration has to be backward-compatible with the old code because old code is still running when a migration is taking place.
This doesn't have to be true. You can create an entirely separate table with the new data. New code knows how to join on this table, old code doesn't and thus ignores the new data. It doesn't work for every kind of migration, but in my experience, it's preferred by some DBAs if you have billions and billions of rows.
Example: `select user_id, coalesce(new_col2, old_col2) as maybe_new_data, new_col3 as new_data from old_table left join new_table using (user_id) limit 1`
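A hypothetical sketch of the side table that query assumes (column names borrowed from the example above, and assuming user_id is the primary key of old_table):

```sql
-- New data lives in its own table, keyed by the same user_id.
-- Old code never joins to it, so it keeps working untouched;
-- new code does the LEFT JOIN shown above.
CREATE TABLE new_table (
    user_id  bigint PRIMARY KEY REFERENCES old_table (user_id),
    new_col2 text,           -- overrides old_table.old_col2 when present
    new_col3 text NOT NULL   -- entirely new data
);
```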
I think their question was more "if I wrote a migration that accidentally drops the users table, how does your system prevent that from running on production"? That's a pretty extreme case, but the tldr is how are you testing migrations if you don't have a staging environment.
Put the DB in Docker (or provide some other one-touch way to install a clean database). Run all migration scripts to get a current schema, insert sample data, and do your testing. Then make this part of the build process, and make sure that regression-test failures after the migrations are detected and block the merge. The key is having a DB that can be recreated as a nearly atomic operation.
I'd think they create "append-only" migrations, that can only add columns or tables. Otherwise it wouldn't be possible to have migrations that work with both old and new code.
> Otherwise it wouldn't be possible to have migrations that work with both old and new code.
Sure you can. Say that you've changed the type of a column in an incompatible way. You can, within a migration that executes as an SQL transaction:
1. rename the original table "out of the way" of the old code
2. add a new column of the new type
3. run an "INSERT ... SELECT ..." to populate the new column from a transformation of existing data
4. drop the old column of the old type
5. rename the new column to the old column's name
6. define a view with the name of the original table, that just queries through to the new (original + renamed + modified) table for most of the original columns, but which continues to serve the no-longer-existing column with its previous value, by computing its old-type value from its new-type value (+ data in other columns, if necessary.)
Then either make sure that the new code is reading directly from the new table; or create a trivial passthrough view for the new version to use as well.
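A Postgres-style sketch of those steps with hypothetical names (step 3 is written as an UPDATE here, since the new column lives in the same table):

```sql
BEGIN;

-- 1. rename the original table out of the way of the old code
ALTER TABLE accounts RENAME TO accounts_v2;

-- 2. add a new column of the new type (cents instead of fractional dollars)
ALTER TABLE accounts_v2 ADD COLUMN balance_v2 bigint;

-- 3. populate the new column from the existing data
UPDATE accounts_v2 SET balance_v2 = round(balance * 100);

-- 4. drop the old column of the old type
ALTER TABLE accounts_v2 DROP COLUMN balance;

-- 5. rename the new column to the old column's name
ALTER TABLE accounts_v2 RENAME COLUMN balance_v2 TO balance;

-- 6. a view with the original table's name keeps serving the old code,
--    computing the old-type value back from the new-type value
CREATE VIEW accounts AS
SELECT account_id,
       balance / 100.0 AS balance
FROM   accounts_v2;

COMMIT;
```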
(IMHO, as long as you've got writable-view support, every application-visible "table" should really just be a view, with its name suffixed with the ABI-major-compatibility-version of the application using it. Then the infrastructure team — and more specifically, a DBA, if you've got one — can do whatever they like with the underlying tables: refactoring them, partitioning them, moving them to other shards and forwarding them, etc. As long as all the views still work, and still produce the same query results, it doesn't matter what's underneath them.)
Not to be rude but this isn't how this works at all. Things like 'run an "INSERT ... SELECT ..."' can't happen at scale due to locking. How they actually do it is super rad:
tl;dr: they set up a system of triggers (on updates, inserts, etc.), copy the data over, then apply all the changes the triggers captured in the meantime. Percona developed all these fancy features as well, to monitor replica data etc. Another way, with cloud VMs (terabyte+ tables): you image a replica, do the ALTER, let the replica catch up, image it, build replicas off that, and promote it to master.
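Not the actual Percona tooling, but a rough MySQL-flavored sketch of the trigger-based idea (a hypothetical users table with just id and email, gaining a new column):

```sql
-- 1. shadow table with the new schema
CREATE TABLE users_new LIKE users;
ALTER TABLE users_new ADD COLUMN last_seen_at datetime NULL;

-- 2. triggers keep the shadow table in sync with ongoing writes
CREATE TRIGGER users_ins AFTER INSERT ON users FOR EACH ROW
    REPLACE INTO users_new (id, email) VALUES (NEW.id, NEW.email);
CREATE TRIGGER users_upd AFTER UPDATE ON users FOR EACH ROW
    REPLACE INTO users_new (id, email) VALUES (NEW.id, NEW.email);
CREATE TRIGGER users_del AFTER DELETE ON users FOR EACH ROW
    DELETE FROM users_new WHERE id = OLD.id;

-- 3. copy the existing rows over in small chunks to limit locking
INSERT IGNORE INTO users_new (id, email)
    SELECT id, email FROM users WHERE id BETWEEN 1 AND 100000;
-- ...repeat per id range until caught up...

-- 4. atomic swap; drop users_old once you're confident
RENAME TABLE users TO users_old, users_new TO users;
```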
"This isn't how things work at all" implies that more than 0.0000001% of DBs are running "at scale." Most DBs people will ever deal with — including within big enterprises! — are things like customer databases that have maybe 100k records in them. The tables lock waiting for lower-xact-ID read traffic to clear, yes — for about 400ms. Because queries running on the DB are all point queries that don't run any longer than that. Or they're scheduled batch jobs that happen during a maintenance window.
At scale, you're hopefully not using an RDBMS as a source of truth in the first place; but rather, using it as a CQRS/ES aggregate, downstream of a primary event store (itself likely some combination of a durable message-queue, and archival object storage for compacted event segments.) In that kind of setup, data migrations aren't the job of the RDBMS itself, but rather the job of the CQRS/ES framework — which can simply be taught a new aggregate that computes a single new column, and will start catching that aggregate up and making the data from it available. If your RDBMS is columnar (again, hopefully), you can just begin loading that new column in, no locking on the rest of the table required.
IMHO, the trigger-based approach is a weak approximation of having a pre-normalization primary source. It's fine if you want to avoid rearchitecture, but in a HOLAP system (which is usually inevitable for these sorts of systems when their data architects have tried to focus on simplicity, as this leads to eschewing denormalized secondary representations in OLAP-focused stores) it will cause your write perf to degrade + bottleneck, which is actually the worst thing for locking.
(I should know; I'm dealing with a HOLAP system right now with large numbers of computed indices, append-only OLTP inserts, and random OLAP reads; where the reads hold hundreds of locks each due to partitioning. Any time a write tx stays open for more than a few hundred milliseconds, the whole system degrades due to read locks piling up to the point that the DB begins to choke just on allocating and synchronizing them. The DB is only a few TB large, and the instance it's on has 1TB of memory and as many cores as one can get... but locking is locking.)
My impression is that once you're at Facebook scale, most of your migrations are massive undertakings that need to take into account things like "How many terabytes more space do we need?", "How do we control the load on our DB nodes while the migration is going on?", "How do we get data from cluster A to cluster B?", "Is adding this index going to take hours and break everything?", and so on. Some of the time you'll be spinning up an entirely new cluster rather than changing the schema of an old one, and when you do migrate an existing cluster there's some five-page document specifying a week-long plan with different phases.
Then internally, they work around the inflexible db schemas by using offline batch processing tools or generic systems that can handle arbitrary data, for tasks that would be handled by the one DB in smaller systems.
> Another concern I have is that if a feature toggle isn't enabled in production for a long time (for us, several days is already a long time due to a tight release schedule) new changes to the codebase by another team can conflict with the disabled feature and, since it's disabled, you probably won't know there's a problem until it's too late?
What if you insisted that database downtime for migrations is not acceptable and that your code needs to work with different versions of the database (this can be done with adapters)?
It's common, but it's more like "this is common at young companies where the cost of maintaining staging can't pay for itself in improved productivity, because there aren't enough engineers for a 5% productivity improvement to be worth hiring multiple engineers for."
I'm sure that at FB at one point there wasn't a staging env. Today at FB there are multiple layers of staging, checked automatically as part of the deployment process. I'm ex-FB as well, and we definitely used staging environments every single day as part of the ordinary deploy pipeline. You probably worked there when it was younger, and smaller, and the tooling was less advanced.
Large tech companies have advanced dev tooling; eventually, the cost of paying people to make the tools is paid for in productivity gained per engineer, with large enough eng team sizes.
I think it's a difference in relationship with your users. For software I'm currently working on, we require verification that the changes we made were correct before they make it out to production, so there's a requirement for a staging environment they can access. That's a software > business relationship, where there's a known good outcome. This was also true in the ecommerce agency environments I worked in, the business owners want the opportunity to verify and correct things before they go out to production.
If it were a product > user relationship, where you're the product owner and you are trying to improve your product without an explicit request from your users, I can see how no staging environment makes sense. You have no responsibility of proof of correctness to your users, what you put out is what they get, and breakages can be handled as fixes after the fact.
This is how my current place does it. The only issue we are having is library / dependency updates have a tendency to work perfectly fine locally and then fail in production due to either some minor difference in environment or scale.
It's a problem to the point that we have 5 year old ruby gems which have no listed breaking changes because no one is brave enough to bump them. I had a go at it and caused a major production incident because the datadog gem decided to kill Kubernetes with too many processes.
Do you have a replica of the production environment codified somehow, like in a VM? It's rarely perfect, but I usually try to develop locally on the same stack I deploy to, which can help with the environment differences.
It's also why I think it's smart to rebuild the environment on deploy if it makes sense for your pipeline, so that you wipe any minor differences that have been accruing over time. Working on a long running product, you quickly find yourself with disparities building up, and they're not codified, so they're essentially unknown to the team until they cause an issue.
Yes, there is a docker compose file which is usually sufficient. The issue in this case was the problem only shows up if there is sufficient load on the background workers.
Look at it the other way around - you could have a different outage every single day and as long as that outage only impacted 4.3m users and they were different users each day, it would look like a once-a-year event to the average user.
They’re saying there’s a lot of leeway to break things (in a small way) at scale.
Was this true for the systems that related to revenue and ad sales as well? While I can believe that a lot of code at Facebook goes into production without first going through a staging environment, I would be extremely surprised if the same were true for their ads systems or anything that dealt with payment flows.
I worked on ad systems at Facebook, and yes it's (approximately) true for those as well.
The thing to realize is that "in production" almost never means "rolled out to 100% of users from 0%". Instead you'd do very slow rollouts to, say, 1% of people in Hungary (or whatever) and use a ton of automatic measurements over time, as well as lots and lots of tests, to validate that things were working as expected before rolling out a little more, then a little more after that. By the time the code is actually being hit by the majority of users, it's often been run billions of times already.
I don't know about Facebook, but at other companies without similar tooling, each git branch gets deployed to its own subdomain, so manual testing etc. can happen prior to a merge. Dangerous changes are feature flagged or gated as much as possible to allow prod feedback after merge before enabling the changes for everyone.
Problem I always seem to run into is that these optional features always seem to be added at a rate that's a bit higher than the rate at which the flags are retired. It doesn't take much of a multiplier for the number to become untenable pretty quickly.
I am leading a small team working on a social audio product. We follow the same process. The vast majority of our content is live audio conversations, and so we need live/production data. If our stakeholders have to test the product it means they have to join those conversations, or set up live conversations in a parallel universe. Feature flags in production are the simplest way forward, but they carry a fair amount of risk. This is offset by automated tests wherever possible.
What about third party integrations? Don’t you need some non-production environment to test them in until both parties are satisfied with the integration and it’s impact on users?
You can see that there's a common backend ("configerator") that a lot of other systems ("sitevars", "gatekeeper", ...) build on top of.
Just imagine that these systems have been further developed over the last decade :)
In general, there's 'configuration change at runtime' systems that the deployed code usually has access to and that can switch things on and off in very short time (or slowly roll it out). Most of these are coupled with a variety of health checks.
More seriously, at my old company they just never got removed. So it wasn’t really about control. You just forgot about the ones that didn’t matter after awhile.
If that sounds horrible, that’s probably the correct reaction. But it’s also common.
Namespacing helps too. It’s easier to forget a bunch of flags when they all start with foofeature-.
I’ve seen those old flags come in handy once. Someone accidentally deleted a production database (typo) and we needed to stop all writes to restore from a backup. For most of it, it was just turning off the original feature flag, even though the feature was several years old.
At a previous workplace we managed flags with Launch Darkly. We asked developers not to create flags in LD directly but used Jira web hooks to generate flags from any Jira issues of type Feature Flag. This issue type had a workflow that ensured you couldn't close off an epic without having rolled out and then removed every feature flag. Flags should not significantly outlast their 100% rollout.
I work at a different company. Typically feature flags are short-lived (on the order of days or weeks), and only control one feature. When I deploy, I only care about my one feature flag because that is the only thing gating the new functionality being deployed.
There may be other feature flags, owned by other teams, but it's rare to have flags that cross team/service boundaries in a fashion that they need to be coordinated for rollout.
You have automated tools that yell at you to clean up feature flags and you force people to include sensible expiration dates as part of your PR process. Flags past the date result in increased yelling. If your team has too much crap in the codebase eventually someone politely tells you to clean it up.
You also have tooling that measures how many times a flag was encountered vs. how many times it actually triggered etc. Once it looks like it's at 100% of traffic, again you have automations that tell people to clean up their crap.
They can't spin up test environments quickly, so they have windows when they cannot merge code due to release timing. They can't maintain parity of their staging environments with prod, so they forswear staging environments. These seem like infrastructure problems that aren't addressing the same problem as the staging environment eo ipso.
They're not arguing that testing or staging environments are bad, they're just saying their organization couldn't manage to get them working. If they didn't hit those roadblocks in managing their staging environments, presumably they would be using them.
> They're not arguing that testing or staging environments are bad, they're just saying their organization couldn't manage to get them working.
That is exactly what I got from reading this article. Their staging process was poorly set up and they simply abandoned ship. Additionally, I was getting poor software culture vibes.
Indeed, I found the "We only merge code that is ready to go live" part odd. It seems unrelated to the presence or absence of a staging environment. Where I work, we use staging and also only merge code that is ready to go live.
Similarly, "Poor ownership of changes" and "People mistakenly let process replace accountability" just don't seem staging-related to me. I've been in environments where people throw code over the fence straight into production.
The way I read that is, if you have a staging environment, it is someone else's job to test things before they deploy (Product Manager etc.) which allows them to merge it and forget it. I agree if it is the dev's job then it wouldn't make sense.
The other issue that is fair is the potential lag between dev and production if you have a gate at staging. This way, a developer is likely to move onto something else instead of watching their baby swim into production with all the errors that could cause!
I actually empathize with the statements in the article about the challenges of managing a staging environment. For a lot of systems I worked on, short of just copying customer data into the staging environment (which, depending on your industry or the contracts with your customers may be a big no-no, and the more-distributed your system, the harder that also was to do) it was extremely hard to populate the staging environment with representative data. Or representative state in third-party systems.
(Not to say that testing on a developer's machine wouldn't have these same problems, of course, which I also find the article glosses over.)
I don't represent the original author. But the way I read that is: the staging environment is said to be a production clone, but it turned out it wasn't, so let's not pretend it is (and fool ourselves) and instead embrace a different test strategy altogether. Thinking of how I test things locally and how I write my own unit and integration tests, I guess that it means doing very isolated, functional tests. An investment in which, to some teams, may be more valuable than throwing it into staging and hoping that the state you found your staging environment in is going to surface some insight about your code change that isn't covered by your test suite.
I agree that that is what they are trying to do, but they don't appear to be testing with all that different of a strategy; something has been removed, but it doesn't seem like anything has taken its place.
Staging is useful because you cannot predict how your changes will impact the entire application, or how it will interact with configuration. This is precisely where isolated tests fall short. Differences between production and staging are real, but differences between production and local are much more profound.
Staging is an imperfect strategy, but this approach doesn't appear to be sound. It seems to shrug its shoulders and settle for something even worse. I'm baffled, frankly.
Having staging always encourages this. It’s really difficult to replicate prod in any non trivial way that exceeds what can be created on a workstation.
Eg. Even if you buy the same hardware you can’t replicate production load anyway because it’s not being used by 5 million people concurrently. Your cache access patterns aren’t the same, etc.
It’s far better to have a fast path to prod than a staging environment in my opinion.
I think it's too much to expect staging to match the load and access patterns of your prod system.
I find staging to be very useful. In various teams I have been a part of, I have seen the following productive use cases for staging
1. Extended development environment - If you use a micro-services or serverless architecture, it becomes really useful to do end-to-end tests of your code on staging. Docker helps locally, but unless you have a $4,000 laptop, the dev experience becomes very poor.
2. User acceptance testing - Generally performed by QAs, PMs or some other businessy folks. This becomes very important for teams that serve a small number of customers who write big checks.
3. Legacy enterprise teams - Very large corporations in which software does not drive revenue directly, but high quality software drives a competitive advantage. Insurance companies are an example. These folks have a much lower tolerance for shipping software that doesn't work exactly right for customers.
> I think it's too much to expect staging to match the load and access patterns of your prod system.
For a lot of things, this makes staging useless, or worse. When production falls over, but it worked in staging, then staging gave unwarranted confidence. When you push to production without staging, you know there's danger.
That said, for changes that don't affect stability (which can sometimes be hard to tell), staging can be useful. And I don't disagree with a staging environment for your usecases.
> For a lot of things, this makes staging useless, or worse.
That depends on what staging is used for: if it's used to run e2e tests, give a demo to PMs, etc., you can use staging. For performance testing you can set up a prod-like env, run your perf tests and then kill the perf env, or you can scale up the staging env, not let anyone use it except for performance testing, and then scale it down.
It’s crazy sometimes how big of a difference it is. One recent example - I had to build a custom Docker image of some OSS project. Not even a huge one - only what I would call small-mid size. Just clone the repo and run the makefile, super simple. It took 35 minutes to build on my 2020 Mac Mini (Intel) and would have been probably half that if I had the most recent machine.
Why would I build on a local machine vs running the build on a server in a datacenter? Per your own arguments, server grade hardware is going to compile much faster than any local workstation.
Ah, good old "compiling" [0]. When a worker needs a $4000 machine to actually do his work then it's unavoidable. The slow machine? $2000 ought to be enough™ for everyone else.
When I worked for a big corp, the reason we in engineering were given for getting $1,000 laptops was that it wasn't fair to accounting, HR, etc. for us to have better machines. In the past, people from those departments had complained quite a bit.
The official reason (which was BS) was "to simplify IT's job by only having to support one model"
Who cares what is "fair"? A decision like that should be based on an elementary productivity calculation. If not the inmates have taken over the asylum.
Perhaps we have different ideas about what a staging environment is for. I wouldn't expect a staging environment to give accurate performance numbers for a change, the only solution to that is instrumenting the production environment.
If you're choosing to pay large sums of money for SQL Server instead of the open source alternatives, you should also factor in the large sums of money to have good development/staging environments too.
All the more reason to just use Postgres or MySQL.
EDIT: as someone else hinted at, it does look like the free Developer version of SQL Server is fully featured and licensed for use in any non-prod environment, which seems reasonable.
Sure different planning 20 years ago would have made a big difference. Or the will/resources to transition. I am just saying that this scenario exists.
It's pretty easy to create your own AMIs with developer versions. It makes sense why AWS doesn't necessarily provide this out of the box. But it still stands that for fully managed versions of licensed software, you'll pay for the license even if it's non-prod.
Yes, that's not to say it is not possible to create a similar env, but I thought the debate was how precisely you are replicating your production env.
Sure, it may be "good enough", but I thought the debate was about precision. How an AMI you build yourself from the developer version differs from the AWS-provided AMI? I don't know.
Trying for an identical setup in staging is expensive, this is just a scenario I am familiar with. I am sure there are a lot like this.
> More often than not, each environment uses different hardware, configurations, and software versions.
They can't even deploy the same software versions to their staging environment. We're a long way off talking about precisely replicating load characteristics
> Depending on your tech, staging environments can be very expensive
For our business & customers, a new staging environment means another phone call to IBM and a ~6 month wait before someone even begins to talk about how much money it's gonna cost.
I worked at a financial company that had an exact copy of its largest datacenter deployment in a lab for full-system testing. They recorded packets from the exchange at nanosecond precision, and had the equipment to do exact playback of those packets in the lab.
It was an amazing engineering tool. You could test all sorts of end-to-end performance tweaks there, and have great confidence that you were right.
I also worked at a financial company that did not have one of these. The computers ran substantially slower, and lots of end-to-end performance improvements were left on the table and mired in debate. The mock datacenter cost about $10 million, and was worth every penny.
I think we can establish that the database is the biggest culprit in making this difficult.
As an independent developer, I have seen several teams that either back sync the prod db into the staging db OR capture known edge cases through diligent use of fixtures.
I am not trying to counter your point necessarily, but just trying to understand your POV. Very possible that, in my limited experience, I haven't come across all the problems around this domain.
The variety of requests and load in any pre-prod environment never matches production, along with all the messiness and jitter you get from requests coming from across the planet and not just from your own LAN. And you'll probably never build it out to the same scale as production and have half your capex dedicated to it, so you'll miss issues which depend on your own internal scaling factors.
There's a certain amount of "best practices" effort you can go through in order to make your preprod environments sufficiently prod like but scaled down, with real data in their databases, running all the correct services, you can have a load testing environment where you hit one front end with a replay of real load taken from prod logs to look for perf regressions, etc. But ultimately time is better spent using feature flags and one box tests in prod rather than going down the rabbit hole of trying to simulate packet-level network failures in your preprod environment to try to make it look as prodlike as possible (although if you're writing your own distributed database you should probably be doing that kind of fault injection, but then you probably work somewhere FAANG scale, or you've made a potentially fatal NIH/DIY mistake).
The article doesn't talk about any of that though. The article says staging diffs prod because of:
> different hardware, configurations, and software versions
The hardware might be hard or expensive to get an exact match for in staging (but also, your stack shouldn't be hyper fragile to hardware changes). The latter two are totally solvable problems
With modern cloud computing and containerization, it feels like it has never been easier to get this right. Start up exactly the same container/config you use for production on the same cloud service. It should run acceptably similar to the real thing. Real problem is the lack of users/usage.
I was responding to other commentors not really the title article.
The stuff you cite there is pretty simple to deal with: configuration management is basically a solved problem, and IDK why you couldn't just fix the different hardware.
The more universal problem of making preprod look just like prod so that you have 100% confidence in a rollout without any of the testing-in-prod patterns (feature flags, smoke tests, odd/even rollouts, etc) is not very solvable though.
A lot of things seem like they shouldn’t be, until you’ve debugged a weird kernel bug or driver issue that causes the kind of one-off flakiness that becomes a huge issue at scale.
IME, when you are not webscale, the issues you will miss from not testing in staging are bigger than the other way round. But that doesn't mean that all the extra efforts you have to put in the "test in prod only" scenario should not be put even when you do have a staging env.
But you need infrastructure and careful attention to this problem. It is hard to define exactly what replicating prod means. And sometimes it might be difficult, e.g. prod might have an access-controlled customer data store that has its own problems, or it might be about cost. But that doesn't necessarily mean that if you can't replicate it perfectly it is useless; you can still catch problems with the things you can replicate and that do go wrong.
Ofc it is impossible to catch bugs 100% with staging, however, that argument goes either way.
I've worked at a number of places where we had multiple products with replicated staging/production environments. One approach is to have your environments codified, prod gets built from the same pipeline staging does, and to automate database refreshes from prod > staging so they don't fall behind. It isn't rocket science, but of course it doesn't come at zero cost. Some production environments are pretty hard to replicate too, like anything with third party integrations that don't offer a staging environment of their own.
If you're a small shop I can understand it, but bigger companies with infrastructure teams, there's no excuse really, the technologies are all there.
Tech debt can definitely accrue that makes it difficult. For instance recently I had an odd difference between prod and staging, and when I looked into it I realized there was a legacy behavior for old customers, and our testing users in different environments were on different sides of the cutoff.
Naturally that's a nasty tech debt and we should find a way to clean it up, but it was a pragmatic solution to a business problem that occurred before any of my team mates started working there, and it meant we could continue to serve our customers as we made a transition. These things happen.
For many, many cases, it is incredibly unrealistic, if not literally impossible. Doing away with it eliminates all of the problems associated with it and allows you to gain more confidence in your deploys. It simplifies while adding reliability and velocity. This is why abandoning staging is the best strategy.
I don't see how this can scale beyond a single service.
Complex systems are made of several services and infrastructure, all interconnected. Things that are impossible to run locally. And even if you can run them locally, the setup is most likely very different from production. The fact that things work locally gives little to zero guarantee that they will work in prod.
If you have a fully automated infrastructure setup (e.g. Terraform and friends), then it is not that hard to maintain a staging environment that is identical to production. Create a new feature branch from main, run unit tests and integration tests. Changes are automatically merged into the main branch. From there a release is cut and deployed to staging. Run tests in staging; if all good, promote the release to production.
The problem with staging environments is that replicating the functionality is easy but replicating the data, interactions, and behavior of people in a real environment is not. It's better to think in terms of early access releases and some kind of controlled roll out of new software so you catch bugs and issues before they impact most of your users.
I've seen many projects where the staging environment is a bad joke and where most real testing happens in production anyway. These days alternative strategies are being more clever about how you work with rolling out software to your production environments. There are various ways of doing this but it always boils down to having both the old and the new software running in the same environments and controlling who gets to see what using feature flags, dns, routing, etc. Also, if you run any kind of AB tests, this is what you would need. I've seen some companies do that but mostly this is more of an aspirational thing than an actual thing of course.
For the SaaS company I'm the CTO of, I actually stumbled on a nice mechanism when I realized that our customers' desire for dedicated setups led us to a natural state where we update those last, thus making our multi-tenant environment a natural place to test / provide early access. Likewise, our webapp rolls out immediately from our master branch, but we package it up for Android/iOS less frequently because of the release bureaucracy Apple and Google impose. So that branch effectively is our stable release. And we have a matching web server for that branch as well that updates only when we merge to our production branch. The other server uses the same infrastructure (database, redis, etc.) but updates straight from our master branch. So our staging server is part of our production environment and serves the same data, is exposed to the same user behavior, etc.
That also makes it easier to verify that old and new client software needs to work with both our latest server as well as the production servers for our dedicated setup.
You need both. In my experience working in SaaS, enterprises expect reliable and stable platforms. A staging environment is that extra safety net that can help prevent shipping a completely broken product. In a staging env you can turn experiments and feature flags on/off before doing that in production.
That said, you should also build the product so that you can run experiments and only turn on a feature for a small subset of production customers, usually the free tier, and then gradually roll it out to everybody else.
Last, the staging env should be considered a production-grade env; thus if it breaks there should be SRE/DEV on call ready to jump in and fix it.
Note that this also necessarily requires that critical variables and configuration options are stored in version control, rather than database. Bootstrapping staging databases with values necessary to run the application is a constant challenge.
Otherwise, your production environment would have massively different feature flags and other config than staging.
There are tools that can perform a diff of your databases and generate a change script. So we just diff local vs staging, capture the changes, and check that in along with the code changes. Every change to the database creates a schema change record, so it's easy to apply only the latest changes if they're versioned.
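A minimal sketch of what such a schema change record could look like, assuming a simple versioned changelog table (names are hypothetical):

```sql
-- one row per applied change script
CREATE TABLE schema_changes (
    version     integer PRIMARY KEY,   -- monotonically increasing
    description text      NOT NULL,
    applied_at  timestamp NOT NULL DEFAULT CURRENT_TIMESTAMP
);

-- an environment applies every script newer than this
SELECT COALESCE(MAX(version), 0) AS current_version FROM schema_changes;
```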
I think the problems they have with managing non prod environments is actually a symptom of having many systems.
Staging environments are easy to maintain when it’s one system, when you have a complicated service oriented architecture, it becomes much more difficult and expensive to maintain non prod environments.
This is good insofar as it forces you to make local development possible. In my experience: it's a big red flag if your systems are so complex or interdependent that it's impossible to run or test any of them locally.
That leads to people only testing in staging envs, causing staging to constantly break and discouraging automated tests that prevent regression bugs. It also leads to increasing complexity and interconnectedness over time, since people are never encouraged to get code running in isolation.
Ehh... once your systems use more than a few pieces of cloud infrastructure / SaaS / PaaS / external dependencies / etc, purely local development of the system is just not possible.
There are some (limited) simulators / emulators / etc available and whatnot for some services, but running a full platform that has cloud dependencies on a local machine is often just not possible.
The answer (IMHO) is to not use services that make it impossible to develop locally, unless you can trivially mock them; the benefits of such services aren't worth it if they result in a system that is inherently untestable with an environment that's inherently unreproducible.
(I can go on a rant about AWS Lambda, and how if they'd used a standardized interface like FastCGI it would make local testing trivial, but they won't do that because they need vendor lock-in...)
Awesome. you just cost your company $500K in salaries for people to maintain databases, networks, storage, servers and a bunch of other stuff Google/AWS already do much better than you.
How lucky you are that management pays you to pursue your hobbies!
Agreed. And stay away from proprietary cloud services that lock you into a specific cloud provider. Otherwise, you'll end up like one of those companies that still does everything on MS SQL Server and various Oracle byproducts despite rising costs because of decisions made many years ago.
Forcing developers to deal with mocks right from the beginning is critical in my opinion. Unit testing as part of your CI/CD flow needs to be a first priority rather than something that gets thought of later on. Testing locally should be synonymous with running your unit test suite.
Doing your integration testing deployed to a non-production cloud environment is always necessary but should never be a requirement for doing development locally.
> In my experience: it's a big red flag if your systems are so complex or interdependent that it's impossible to run or test any of them locally
At one time this was a huge blocker for our productivity. Access to a reliable test environment was only possible by way of a specific customer's production environment. The vendor does maintain a shared 3rd-party integration test system, but it's so far away from a realistic customer configuration that any result from that environment is more distracting than helpful.
In order to get this sort of thing out of the way, we wrote a simulator for the vendor's system which approximates behavior across 3-4 of our customers' live configurations. It's a totally fake piece of shit, but it's a consistent one. Our simulated environment testing will get us about 90% of the way there now. There are still things we simply have to test in customer prod though.
This is what we do as well. We just stub out the 3rd party integration and inject a dynamic configuration to generate whatever type of response we need.
I don't think you can take infrastructure seriously without a staging environment. For many companies that is fine - they don't have significant infrastructure to maintain (or just don't maintain the infrastructure they have).
I work on a team that maintains our database layer and the lack of a staging environment is incredibly painful. Every test has to be done in production and massive effort needs to be taken to proceed safely. With a staging environment you can be more aggressive and come up with a solid benchmark and test suite to gain confidence rather than having to data collect in prod
The short answer appears to be "we are cheap and nobody cares yet."
It's easy to damn the torpedoes and deploy straight into production if there's nobody to care about, or your paying customers (to the extent you have any) don't care either.
Once you start gaining paying customers who really care about your service being reliable, your tune changes pretty quickly. If your customers rely on data fidelity, they're going to get pretty steamed when your deployment irreversibly alters or irrevocably loses it.
Also, "staging never looks like production" looks like a cost that tradeoff that the author made, not a Fundamental Law of DevOps. If you want it to look like production, you can do the work and develop the discipline to make it so. The cloud makes this easier than ever, if you're willing to pay for it.
Ooof, I think I have to agree with "we are cheap and nobody cares yet.". If we had a bad release go out that blocked nightly processing, for example, it was amazing how fast it became a ticket and the CEOs started calling.
One of the things that we did really well is we had tooling that spun up environments. The same tooling DevOps used to stand up production environments also stood up environments for PRs and UAT. Anyone within the company could spin up an environment for whichever reason, be it from master or to apply a PR. When it works it works great; if it doesn't work, fix it and don't throw out the entire concept.
This is a pretty weird article. Their "how we do it" section lists:
- "We only merge code that is ready to go live"
- "We have a flat branching strategy"
- "High risk features are always feature flagged"
- "Hands-on deployments" (which, from their description, seems to be just a weird way of saying "we have good monitoring and observability tooling")
...absolutely none of which conflict with or replace having a staging environment. Three of my last four gigs have had all four of those and found value in a staging environment. In fact, they often help make staging useful: having feature-flagged features and ready-to-merge code means that multiple people can validate their features on staging without stepping on each other's toes.
FWIW I don't think it is weird at all. Maybe a little short on details of what ready really means for example. While I don't think going completely staging-less makes a lot of sense, going without a shared staging environment is a good thing.
It is absolutely awesome to be able to have your own "staging" environment for testing that is independent of everyone else. With the Cloud this is absolutely possible. Shared staging environments are really bad. Things that should take a day at most turn into a coordination and waiting game of weeks. And as pressure mounts to get things tested and out you might have people trying to deploy parts that "won't affect the other tests" going on at the same time. And then they do and you have no idea if it's your changes or their changes that made the tests fail. And since it's been 2 weeks since the change was made and you finally got time on that environment your devs have already finished working on two or more other changes in the meantime.
FWIW we have a similar set up where devs and QA can spin up a complete environment that is almost the exact same as prod and do so independently. They can turn on and off feature flags individually without affecting each other. Since we don't need to wait (except for the few minutes to deploy or a bit longer to create a new env from scratch) any bugs found can be fixed rather quickly as devs at most have started working on another task. The environment can be torn down once finished but probably will just be reused until the end of the day.
(while it's almost the same as prod it isn't completely like it for cost reasons meaning less nodes by default and such but honestly for most changes that is completely irrelevant and when it might be relevant it's easy to spin up more nodes temporarily through the exact same means as one would use to handle load spikes in prod).
YMMV as always, i.e. this might be faster/easier to implement for some use cases/companies than others.
I would argue that for most changes in most companies it does not matter if you have a full data set equivalent of Prod. Especially since it's an ever growing target (hopefully for you :)). As we can see from your use case, that can create challenges. If we were testing every little change with a complete replica of our Prod environment we'd spend a lot more money on this and would also have to wait way too long for these environments to come up. This might then drive us towards keeping a small set of staging environments running all the time and share them.
Now I don't know your specific field, what kind of system you provide, what the cost of failures in Prod would be, what kind of guards you have in your systems against a complete outage like that (e.g. can a bad query running for one customer take out your entire product or will it be an isolated incident to that one customer or a small group of customers?) etc. If so maybe you first want to isolate failures between customers more.
FWIW, I also see devs write these kinds of queries and code and we catch most of them in PRs and most people learn from this and next time they don't write this kind of query any longer. But let's say they do and we don't catch them. Can you analyze your data set and figure out a small enough example that is not simply an ever growing anonymized version of actual Prod data that would exhibit the same catastrophic characteristics for most queries these devs might write? I'm pretty sure the kinds of queries that fail on 10TB of data w/ 20M files fail just as spectacularly on 9TB and 19M files? Have you tried if they also fail 'badly enough' on 500GB of data and 1M files? How fast would it be to restore a fixed data set like that? Maybe refresh it every month from Prod if that's needed for some reason or another.
A lot of what you might actually want to try out in the end will depend on your exact system set up I would argue and it's hard to give general advice that will definitely fit.
We have a cut-down but “production shaped” dataset for fast environment creation but even with code reviews it hasn’t been successful at catching the O(n^2) unindexed query or accidental O(scary) at the app tier caused by a refactor. Or the data migration which wasn’t batched but needs to be. Developer turnover is only accelerating, with predictable results.
At this point we are looking into volume snapshots as a potential speedup for creation of a full-sized per-release staging environments but that still leaves the problem of generating a realistic multi-tenant customer load to solve and maintain over time.
Catch the 90% of stuff that works with your cut-down dataset. The other 10% includes things you might only catch with O(n^2) queries over a 10TB dataset, or only when under prod load like you say.
Just either gradually roll out those high-risk changes with feature flags, etc. or make sure your monitoring is up-to-scratch so you can catch issues and fix them.
Totally seconding the 80/20 rule here. It's probably less frustrating and faster for everyone if you work on compartmentalizing failures that do slip through, or that only happen for those two customers with very special data sets, which you most probably wouldn't catch in testing either, because your test suite won't exercise the full 10TB data set anyway (assuming here, because we don't know your product).
> "staging" environment for testing that is independent of everyone else
That's not usually what people mean by staging. Staging is a type of pre-production test environment where several different features that are continuously developed by different teams can be tested together. Third party integrations such as logistics, ordering and payment systems can also have their integration testing here.
> Shared staging environments are really bad
That sounds dangerously close to "testing is hard, let's go shopping". That it can be a logistical challenge to test code that touches many parts of a multi stakeholder system does not mean we shouldn't do it.
Having to wait weeks to test a feature sounds like the process has broken down, not that the process is unnecessary.
> Staging is a type of pre-production test environment where several different features that are continuously developed by different teams can be tested together
Yes, that is exactly what we can do on these environments that any dev or QA has at their fingertips. We have multiple teams that work on several different features/parts of the application. Sometimes those live in the same services other teams work on; sometimes they are smaller, more dedicated services where they're the only team working on them, but those still usually interact with other services in the overall system. Some of those have third party integrations. FWIW, in our case we are usually the ones testing integrations with a third party and not the other way around.
> That it can be a logistical challenge to test code that touches many parts of a multi stakeholder system does not mean we shouldn't do it.
As mentioned earlier, these individually deployable environments are fully functional and fully integrated. Now I understand that in some situations it can be hard to do a full integration with a third party, because the third party is not able to accommodate the numerous environments that you are able to provide. In those cases compromises might need to be made; that is, overall, a bad thing though. E.g. to take your "third party ordering system" example again: it would be best if you could have a separate account/tenant/instance (whatever makes sense for the exact circumstances and nature of the system) in said third party system, but sometimes that isn't possible and a third party system might need to be shared somehow between all your own staging/dev environments.
> That sounds dangerously close to "testing is hard, let's go shopping". Having to wait weeks to test a feature sounds like the process has broken down, not that the process is unnecessary.
That is never what I said. I said that we can create such staging environments that are a fully integrated set of services on our end really easily and that that is awesome to have. The "having to wait weeks on end" is something I have experienced at previous clients/employers and I absolutely agree with you that it's a broken system. I all too well remember the "yes, we can have INT-3 for 2 hours next Tuesday, do you think we can get all our tests done in that time frame? After that they need it for extensive performance testing for a week and INT-1 won't be available until Thursday at 11". And then you gotta answer "Sorry 2 hours is barely enough to do the deploy and re-configuring of the environment because we need to manually restore that special data set for the third party logistics system mock and then adjust the configuration and that alone takes those 2 hours if the issues we had last time are any indication".
Sooo much better to click the deploy button (or in the dev case we usually use the command line ;) ) and 10 minutes later you have your code and everything else deployed.
There's a difference between permanent staging environments that need maintenance and disposable "staging" environments that are literally a clone of what's on your laptop that you trash once UAT/smoke is done.
The former costs money and can lie to you; the latter is literally prod, but smaller.
This makes it sound so easy, but in my experience, permanent staging environments exist because setting up disposable staging environments is too complex.
How do you deal with setting up complex infrastructure for your disposable staging environment when your system is more complex than a monolithic backend, some frontend and a (small) database? If your system consists of multiple components with complex interactions, and you can only meaningfully test features if there is enough data in the staging database and it's _the right_ data, then setting up disposable staging environments is not that easy.
Sibling here but I can talk a bit about how we do it.
Through infrastructure as code. We do not have a monolithic backend; we have a bunch of services, some smaller, some bigger. Yes, there's "some frontend", but it's not just one frontend: we have multiple different "frontend services" serving different parts of it. As for databases, we use multiple technologies, depending on the service. Some services use only one of them, while others use a mix suited to a particular use case. For one of those we use sharding, and while a staging or dev environment doesn't need the sharding, those environments still go through the same shard-lookup mechanism against the single shard we create in dev/staging. For data it depends. We have a data generator that can be loaded with different scenarios, either generator parameters or full fledged "db backup style" definitions that you can use but don't have to. We deploy to Prod multiple times per day (basically relatively shortly after something hits the main branch).
Through the exact same means we could also re-create prod at any time and in fact DR exercises are held for that regularly.
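To make the data-generator part concrete, here's a rough sketch of the kind of scenario-driven seeding I mean (the scenario knobs and the db.insert helper are made up for illustration, not our actual tooling):

    import random
    from dataclasses import dataclass

    @dataclass
    class Scenario:
        # hypothetical knobs a dev can tweak per environment
        tenants: int = 3
        users_per_tenant: int = 50
        orders_per_user: int = 20
        seed: int = 42

    def seed_environment(db, scenario: Scenario) -> None:
        """Populate a freshly provisioned dev/staging database."""
        rng = random.Random(scenario.seed)  # deterministic, so two environments seeded the same way match
        for t in range(scenario.tenants):
            tenant_id = db.insert("tenants", name=f"tenant-{t}")
            for u in range(scenario.users_per_tenant):
                user_id = db.insert("users", tenant_id=tenant_id, email=f"user{u}@tenant{t}.test")
                for _ in range(rng.randint(0, scenario.orders_per_user)):
                    # db.insert is a stand-in for whatever persistence layer you actually use
                    db.insert("orders", user_id=user_id, amount_cents=rng.randint(100, 100_000))

The "db backup style" definitions are just the same idea with the rows spelled out instead of generated.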
Absolutely. The answer is better integration boundaries but then you’re paying the abstraction cost which might be higher.
It’s particularly difficult when the system under test includes an application that isn’t designed to be set up ephemerally such as application-level managed services with only ClickOps configuration, proprietary systems where such a request is atypical and prevented by egregious licensing costs, or those that contain a physical component (e.g. a POS with physical peripherals).
it's actually "pretty easy" to do when you start from first principles.
I usually ask "can I build your code on my laptop? is this the same as what's in prod?" usually the answer is no, so I work to turn that into a yes.
often times, I find that much of the complexity that you speak of is due to shared services that few have invested time into running locally precisely because of long-lived dev/staging envs, like access to data (databases, filesystems, secrets managers, etc) or tight dependencies (config services, databases, and other APIs come to mind).
example. i once worked with a team where we tried to get their app running locally in docker. (they used pcf back when it was called that; it's called tas now.) their app needed to use a dev instance of a db when it was not in a prod env. we asked if we could get a mocked schema. they said yes, but it would take three days.
it took three days because another team would manually produce the dataset from querying prod and modifying values. since they loaded it into the dev/staging environments, teams just used that. leadership also had no way of knowing whether devs were using data with real values on their workstations (because lack of automation and auditing), so politics were involved in producing a local schema that we could load into Postgres on Compose. (this was a financial company, so any environment with PII is fair game for auditors, which costs time and money.)
we landed up reverse-engineering the tables they needed so we could produce fake data good enough for integration to pass, but of course that introduces environment stratification of another kind since this team didn't own the data.
honestly, now that i wrote this, if every CTO forced their teams to make their core applications 12-factor, then staging environments would go away naturally while improving code quality and platform safety.
If at all possible your entire infrastructure should be defined as code.
At my workplace we use aws cdk for infrastructure and standing up a new environment is as easy as calling ‘cdk deploy’ and then we have a script which runs after the provision to copy in data.
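For a flavour of what that looks like, a heavily simplified sketch (construct and context names are made up for the example):

    # app.py -- a minimal CDK app; `cdk deploy --context env=pr-123` stands up a copy
    import aws_cdk as cdk
    from aws_cdk import Stack, aws_s3 as s3
    from constructs import Construct

    class AppStack(Stack):
        def __init__(self, scope: Construct, construct_id: str, **kwargs) -> None:
            super().__init__(scope, construct_id, **kwargs)
            # ...queues, databases and services would be declared here too
            s3.Bucket(self, "UploadsBucket", removal_policy=cdk.RemovalPolicy.DESTROY)

    app = cdk.App()
    env_name = app.node.try_get_context("env") or "dev"
    AppStack(app, f"myapp-{env_name}")
    app.synth()

The post-provision data-copy script then runs against whatever stack name was just created.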
Yeah, it sounds to me like OP had the former, which they've dropped, and haven't yet found a need for the latter.
I work for a tiny company that, when I joined, had a "pet" prod server and a "pet" staging server. The config between them varied in subtle but significant ways, since both had been running for 5 years.
I helped make the transition the article described and it was huge for our productivity. We went from releasing once a quarter to releasing multiple times a week. We used to plan on fixing bugs for weeks after a release, now they're rare.
We've since added staging back as a disposable system, but I understand where the author is coming from. "Pet" staging servers are nightmarish.
The company does some analytics on highly redundant data (user behavior on websites). They run a system with low requirements for availability, correctness, and feature churn. Their product is nice to have but not mission-critical on a daily basis. If their entire system went down for a day, or even 3 days a week, their customers would be only mildly inconvenienced.
They aren't Amazon or Google. So they test in prod.
Those bullets together explain how they can avoid having a staging environment.
There's a whole section of the article entitled "What’s wrong with staging environments?" that explains why they don't want staging.
They even presented their "why" before going into their "how." There is absolutely nothing weird about this.
Well, ok, it's weird that not all so-called "software engineers" follow this pattern of problem-solving. But that's not Squeaky's fault. They're showing us how to do it better.
I'm assuming this is not an April Fools' joke, and my comments are targeted at the discussion it sparked here anyway.
A flat branching model simplifies things, and the strategy they describe surely enables them to ship features to production faster. But these are the risks I see:
- who decides when a feature is ready to go to production? The programmer who developed them? The automated tests?
- features toggleable by a flag must, at least ideally, be double-tested -- both when turned on and off (see the sketch after this list). Being in a hurry to deploy to production wouldn't help with that;
- OK, staging environments aren't at parity with production. But wouldn't they still be better than the CI/CD pipeline or a developer's laptop, which only test new features in isolation?
- Talking about features in isolation: what about bugs caused by spurious interactions between two or more features? No amount of testing would find them if it only exercises features in isolation.
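To be clear about what double-testing would even look like: the cheapest version I know of is parametrizing the suite over the flag state, something like this sketch (the flag helper and module names are hypothetical):

    import pytest

    @pytest.mark.parametrize("new_checkout_enabled", [True, False])
    def test_checkout_total(new_checkout_enabled, monkeypatch):
        # `myapp.flags.is_enabled` is a stand-in for whatever flag client you use
        monkeypatch.setattr(
            "myapp.flags.is_enabled",
            lambda name: new_checkout_enabled if name == "new-checkout" else False,
        )
        from myapp.checkout import total  # hypothetical module under test
        # the behaviour we rely on must hold with the flag on *and* off
        assert total(items=[100, 250]) == 350

That doubles the combinations you care about for every flag, which is exactly why being in a hurry works against you.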
> - who decides when a feature is ready to go to production? The programmer who developed them? The automated tests?
Exactly. That's the standout claim from the whole article. "We only ship when we're sure code is ready for prod". What, after running a few tests on your laptop? That's a good one :D
If you can, provide on-demand environments for PRs. It's mostly helpful to test frontend changes, but also database migrations and just demoing changes to colleagues.
If you have that, you will see people's behaviour change. We have a CTO who creates "demo" PRs with features they want to show to customers. All the contention around staging identified in the article is mostly gone.
if you have a relatively self-contained system with few or zero external dependencies, so the system can be meaningfully tested in isolation, then i agree that standing up an ephemeral test environment can be a great idea. i've done this in the past to spin up SQL DBs using AWS RDS to ensure each heavyweight batch of integration tests that runs in CI gets its own DB, isolated from any other concurrent CI runs. amusingly, this alarmed people in the org's platform team ("why are you creating so many databases?!") until we were able to explain our motivation.
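roughly what that per-CI-run provisioning looked like, from memory (identifiers are made up and error handling omitted):

    import os
    import boto3

    rds = boto3.client("rds")
    db_id = f"ci-{os.environ['CI_JOB_ID']}"  # one throwaway instance per CI job

    rds.create_db_instance(
        DBInstanceIdentifier=db_id,
        Engine="postgres",
        DBInstanceClass="db.t3.micro",
        AllocatedStorage=20,
        MasterUsername="ci",
        MasterUserPassword=os.environ["CI_DB_PASSWORD"],
    )
    rds.get_waiter("db_instance_available").wait(DBInstanceIdentifier=db_id)

    # ...run the integration test batch against db_id...

    rds.delete_db_instance(DBInstanceIdentifier=db_id, SkipFinalSnapshot=True)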
in contrast, if the system your team works on has a lot of external integrations, and those integrations in turn have transitive dependencies throughout some twisty enterprise macroservice distributed monolith, then you might find yourself in a situation where you'd need to sort out on-demand provisioning of many services maintained by other teams before you could do nontrivial integration testing.
an inability to test a system meaningfully in isolation is likely a symptom of architectural problems, but good to understand the context where a given pattern may or may not be helpful.
You point out another kind of use of staging I've seen. "Don't touch staging until tomorrow after <some time> because SoAndSo is giving a demo to What'sTheirFace" so a bunch of engineering activity gets backed up.
in enterprisey environments with large numbers of integrated services, it's even worse if a single staging environment is used to do end-to-end integration testing involving many systems: lots of resource contention for access to the staging environment.
Not endorsing this point blank but.. One positive side effect of this is that it becomes much easier to rally folks into improving the fidelity of the dev environment, which has compound positive impact on productivity (and mental health of your engineers).
In my experience at Big Tech Corp, dev environments were reduced to low unit test fidelity over years, then as a result you need to iterate (ie develop) in a staging environment that is orders of magnitude slower (and more expensive if you're paying for it). It isn't unusual that waiting for integration tests is the majority of your day.
Now, you might say that it's too complex so there's no other way, and yes sometimes that's the case, but there's nuance! Engineers have no incentive to fix dev if staging/integration works at all (even if super slow) so it's impossible to tell. If you think slow is a mild annoyance, I will tell you that I had senior engineers on my team that committed around 2-3 (often small) PRs per month.
They're not mutually exclusive. You can achieve local + staging environments at the same time. Stable local env + staging. Local is almost always the most comfortable option due to fast iteration times, so nobody would bother with staging by default. Make it good, people will come.
In their perception, is the rest of the tech industry gambling with every pull request that some untested code will work in production?
I work at a large company. We extensively test code on local machines. Then dev test environments. Then small roll out to just a few data centers in prod bed. Run small scale online flight experiments. Then roll out to the rest of prod bed.
And I've seen code fail in each of the stages, no matter how extensively we tested and robustly code ran in prior stages.
How many of the failures caught in dev would have been legitimate problems in production? How about the ones in staging?
If your environments are that different are you even testing the right things?
And if yes, if you need all of those, then why not add a couple more environments? Because more pre-prod environments means more bugs caught in those, right? /s
The whole point of a dev environment is to catch errors before they reach production, and keeping it as close as possible to production is how you do that. The errors you catch in dev are exactly the ones you would otherwise have seen in production.
More environments don't catch more bugs; where does that corollary come from? More testing catches more bugs. A dev environment allows free integration testing without actual production users being affected.
Good monitoring, logs, metrics, feature flagging (allowing for opening a branch of code for a % of users), blue/green deployment (allowing a release to handle a % of the user's traffic) and good tooling for quick builds/releases/rollback, in my experience, are far better tools than intermediate staging environments.
I've had great success in the past with a custom feature flags system + Google's App Engine % based traffic shifting, where you can send just a small % of traffic to a new service, and rollback to your previous version quickly without even needing to redeploy.
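The core of a percentage-based flag is tiny; ours boiled down to something like this (a sketch, not the real code):

    import hashlib

    def is_enabled(flag_name: str, user_id: str, rollout_percent: int) -> bool:
        """Deterministically bucket a user into 0-99 so the same user
        always gets the same answer for a given flag."""
        digest = hashlib.sha256(f"{flag_name}:{user_id}".encode()).hexdigest()
        bucket = int(digest, 16) % 100
        return bucket < rollout_percent

    # e.g. roll the new dashboard out to 5% of users first
    if is_enabled("new-dashboard", user_id="u_42", rollout_percent=5):
        pass  # serve the new code path

The deterministic hashing matters: bumping the percentage only ever adds users, it never flip-flops anyone between old and new behaviour.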
Now, not having those tools as a minimum, and not having either staging environment is just reckless. No unit/integration/whatever tests are going to make me feel safe about a deploy.
And yes, you need blue/green deployments in addition to feature flags, as it is not easy to feature flag certain things, such as a language runtime version update or a third party library upgrade, among many other things.
The trick is to be able to route users traffic to different deployments. You can run two versions of your application concurrently, and have a dial to progressively shift traffic to the new version, as soon as you notice anything wrong you shift it back to the previous version which wasn't stopped at all.
After 100% of the traffic is in the new version, and no customer complaints for 1h then you can shut down the old version.
Google App Engine had all of this at least 5 or 6 years ago.
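In pseudocode the dial is roughly this (set_traffic_split and error_rate are stand-ins for whatever your platform exposes; App Engine gives you the split natively):

    import time

    STEPS = [1, 5, 25, 50, 100]   # percent of traffic on the new version
    BAKE_SECONDS = 300            # watch each step before moving on
    ERROR_BUDGET = 0.01           # tolerated error rate for the new version

    def progressive_rollout(new_version: str) -> bool:
        for pct in STEPS:
            set_traffic_split(new_version, pct)         # hypothetical platform call
            time.sleep(BAKE_SECONDS)
            if error_rate(new_version) > ERROR_BUDGET:  # hypothetical metrics query
                set_traffic_split(new_version, 0)       # shift back; the old version was never stopped
                return False
        return True  # safe to shut the old version down after a final soak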
We used to believe staging environments are not important enough. If you believe that then I would argue that you have not crossed a threshold as an org where your product is critical enough for you consumers. The staging environment or any for that matter just acts as a gating mechanism to not ship crappy stuff to customers.
You cannot have too many gates, or you end up shipping late; but with fewer gates you end up shipping a lower-quality product.
A staging environment saves you from unnecessary midnight alerts and from easy-to-catch issues that would have a huge impact if a customer had to face them. I wouldn't be surprised if, in a few quarters or a year or so, they publish an article about why they decided to introduce a staging environment.
This reminds me of the "bake time" arguments I've had. There's some magical idea that if software "bakes" in an environment for some unknowable amount of time, it will be done and ready to deploy. Very superstitious.
what is the actual value gained from staging specifically? Once you have a list of those, a specific list, figure out why only staging could do that and not testing before or after. And "it's caught bugs before" is not good enough.
Firstly, there is no magical idea of software "baking" in an environment. It is about the risk appetite of the org: how willing is the org to push a feature that is "half-baked" to their customers?
Secondly, I believe modern-day testing infrastructure looks very different. I have seen products like ReleaseHub that provide on-demand environments for devs to test their changes, which eliminates the need for a common testing env. That naturally means you need at least one "pre-release" environment where all the changes land, which eventually becomes the next release. If you don't have this "pre-release" environment you will never be able to capture the side effects of all the parallel changes that are happening to the codebase.
Thirdly, you have to see the context. When you have a microservice architecture, having a staging environment matters less, since fault tolerance, circuit breaking and other mechanisms make sure that a failed deployment of one service does not impact others. However, when you have a monolithic architecture you will never know what the side effects of changes are unless you have a staging environment that gets promoted to production.
If you value customers, you should have a staging environment as a guardrail. The cost of not adhering or having a process like this is huge and possibly company-ending.
In my company it’s that it gives an opportunity for business side to experience the change before it goes live as a final check re business requirements of new functionality. There can be misalignment between the product person who specced the product, the engineer who implemented it, and the ultimate stakeholders.
This sounds like something I would write if a hypothetical gun was pointed at my head in a company where the most prominent customer complaint was that time spent in QA and testing was too expensive.
I have zero trust in any company that deploys directly from a developer's laptop to production, not in the least starting with how much do you trust that developer. There has to be some process right?
> company that deploys directly from a developer's laptop to production
Luckily, there's no sign of doing that here. There's no mention of how their CI/CD works, probably because it's out of scope for an already long article, but that's clearly happening.
"We only have two environments: our laptops, and production. Once we merge into the main branch, it will be immediately deployed to production."
Maybe my reading skills have completely vanished but to me, this exactly says they deploy directly from their developers' laptops to production. Those are literally the words used. The rest of the article goes on to defend not having a pre production environment.
They literally detail how they deploy from their laptops to production with no other environments and make arguments for why that's a good thing.
It says they "merge into the main branch" and it will be immediately deployed to production presumably via CI/CD system that detects code changes and does the necessary dirty dance.
Yup. If you have 4 production data centers, I imagine they're different sizes (autoscaling groups, Kubernetes deployment scale, perhaps even database instance sizes). So just build a staging environment that's like those, except smaller and not public. If you can't do that, then I'm willing to bet you can't deploy a new data center very quickly either, and your DR looks like ass.
Easier said than done, obviously. And even with docker images and Infra as Code and pinned builds and virtual environments, it is difficult to be absolutely sure about the last 1% of the environment, and it requires a ton of effort and engineering discipline to properly maintain.
Reducing the number of environments the team has to maintain means by definition more time for each environment.
Is it possible to make staging 100% identical with prod? Load is one thing I can think of that is difficult to make identical; even if you artificially generate it, user behaviour will likely be different.
I don't have experience with the true CI he describes, but I do have experience with pre-production environments.
> "People mistakenly let process replace accountability"
I find this to be mostly true. When the code goes somewhere else before it goes to prod, much of the burden of responsibility goes along with it. Other people find the bugs and spoon feed them back to the developers. I'm sure as a developer this is nice, but as a process I hate it.
> "People mistakenly let process replace accountability"
Who would do this? If a bug goes into production, the one responsible for the deployment is the one who rolls it back and fixes it. Even it it becomes a sev-3 later down the line, they're usually the one who gets looped back in thanks to Git commits.
I would say that a pre-prod environment allows teams to incorporate a larger set of accountability, such as UX validation, dedicated QA, translation teams (think intl ecom) even verifying third party integrations in their pre-prod environments.
You can have both process and accountability. Process for the things that can be automated or subject to business rules; accountability for when the process fails (either by design or in its implementation) or after lapses in judgment.
What I infer from the article is that this company does not handle sensitive private data, or they do but are unaware of it, or they are aware of it and just handle it sloppily. I infer that because one of the biggest advantages of a pre-prod environment is that you can let your devs play around in a quasi-production environment that gets real traffic, but no traffic from outside customers. This is helpful because when you take privacy seriously there is no way for devs to just look at the production database, gain interactive shells in prod, or attach debuggers to production services without invoking glass-breaking emergency procedures. In the pre-prod environment they can do whatever they want.
Most of the rest of the article is not about the disadvantages of pre-prod, but the drawbacks of the "git flow" branching model compared to "trunk based development". The latter is clearly superior and I agree with those parts of the article.
> People mistakenly let process replace accountability
> We only merge code that is ready to go live.
This is one of the most off-putting things I have read on HN lately. Having worked on several large SaaS where leadership claimed similar stuff, I simply refuse to believe it.
It really depends on the product and what you work on. For the front end this makes a ton of sense, for backend systems I’m less confident that this is reality.
We duplicate the production environment and sanitize all the data to be anonymous. We run our automated tests on this production-like data to smoke test. Our tests are driven by pytest and Playwright. God bless, I have to say how much I love Playwright. It just makes sense.
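For the curious, the tests are mostly plain pytest functions using the pytest-playwright page fixture, something in this shape (the URL and selectors are illustrative, not our real ones):

    from playwright.sync_api import Page, expect

    def test_dashboard_renders_for_seeded_account(page: Page):
        # runs against the sanitized, production-like environment
        page.goto("https://smoke.example.internal/login")
        page.get_by_label("Email").fill("smoke-user@example.com")
        page.get_by_label("Password").fill("not-a-real-password")
        page.get_by_role("button", name="Sign in").click()
        expect(page.get_by_role("heading", name="Dashboard")).to_be_visible()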
This is my first time hearing about Playwright. Curious to know what you like about it over other frameworks? I didn't glean a whole lot from the website.
How big is your production dataset? Are you duplicating this for each deploy? Asking this because I work on a medium size app with only about 80k users and the production data is already in the tens of terabytes.
We are in tens of gigabytes and not tens of terabytes. I don't think our approach would work well for that dataset size unless you are able to shed some historical data that you don't need to assert functionality.
Cool story, but you don't _know_ if it's ready until after.
Look, staging environments are not great, for the reasons described. But just killing staging and having done with it isn't the answer either. You need to _know_ when your service is fucked or not performing correctly.
The only way that this kind of deployment is practical _at scale_ is to have comprehensive end-to-end testing constantly running on prod. This was the only real way we could be sure that our service was fully working within acceptable parameters. We ran captured real life queries constantly in a random order, at a random time (caching can give you a false sense of security, go on, ask me how I know)
At no point is monitoring strategy discussed.
Unless you know how your service is supposed to behave, and you can describe that state using metrics, your system isn't monitored. Logging is too shit, slow and expensive to get meaningful near-realtime results. Some companies spend billions taming logs into metrics. Don't do that: make metrics first.
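Concretely, "metrics first" can be as small as instrumenting the hot path directly; a sketch with prometheus_client (any metrics library works the same way):

    from prometheus_client import Counter, Histogram, start_http_server

    REQUESTS = Counter("app_requests_total", "Requests handled", ["route", "status"])
    LATENCY = Histogram("app_request_seconds", "Request latency", ["route"])

    def handle(route: str, fn):
        with LATENCY.labels(route).time():      # records the duration into the histogram
            try:
                result = fn()
                REQUESTS.labels(route, "ok").inc()
                return result
            except Exception:
                REQUESTS.labels(route, "error").inc()
                raise

    start_http_server(9100)  # exposes /metrics for your scraper to alert on

Once those exist, "how is the service supposed to behave" becomes a concrete alert threshold instead of a grep through logs.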
> You’ll reduce cost and complexity in your infrastructure
I mean possibly, but you'll need to spend a lot more on making sure that your backups work. I have had a rule for a while that all instances must be younger than a month in prod. This means that you should be able to re-build _from scratch_ all instances and datastores. Instances are trivial to rebuild; databases should also be, but often aren't. If you're going to fuck around and find out in prod, then you need good, well-practised recovery procedures.
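If you want to enforce that "younger than a month" rule rather than just trust it, a periodic audit is cheap; a sketch assuming EC2 and boto3 (swap in your own provider's API):

    from datetime import datetime, timedelta, timezone
    import boto3

    MAX_AGE = timedelta(days=30)
    ec2 = boto3.client("ec2")
    now = datetime.now(timezone.utc)

    pages = ec2.describe_instances(
        Filters=[{"Name": "instance-state-name", "Values": ["running"]}]
    )
    for reservation in pages["Reservations"]:
        for inst in reservation["Instances"]:
            age = now - inst["LaunchTime"]
            if age > MAX_AGE:
                # flag it for replacement rather than terminating it blindly
                print(f"{inst['InstanceId']} is {age.days} days old: schedule for rebuild")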
> If we ever have an issue in production, we always roll forward.
I mean that cute and all, but not being able to back out means that you're fucked, you might not think you're fucked, but that's because you've not been fucked yet.
It's like the old adage: there are two kinds of sysadmin, those who are about to have data loss, and those who have had data loss.
All good advice, but do you also have a rule that your DBs have to be less than a month old in prod? That doesn't look very practical if your DB has 100s of TBs.
> Doesn't look very practical if your DB has >100s of TBs
If that's in one shard, then you've got big issues. With larger DBs you need to be practising rolling replacement of replicas, because as you scale, the chance of one of your shards cocking up approaches 1.
Again, it depends on your use case. RDS solves 95% of your problems (barring high scale and expense)
If you're running your own DBs then you _must_ be replacing part or all of the cluster regularly to make sure that your backup mechanisms are working.
For us, when we were using Cassandra (hint: don't), we used to spin up a "b cluster" for large-scale performance testing of prod. That allowed us to do one-touch deploys from hot snapshots. Eventually. This saved us from a drive-by malware infection which caused our instances to OOM.
I work in VFX and we have 1 primary and 1 replica for the render farm (MySQL), and another one for an asset system. They both have 100s of TBs, many cores and a lot of RAM. We treat them a bit like unicorn machines (they're bare metal), which isn't ideal, but yeah.. our failover and whatnot is to make the primary the replica and vice versa.
I cannot imagine reprovisioning it very often, when I worked in startups and used rds and other managed DBs it was easier to not have to think about it.
We run a lot of workflows in our systems. We keep a rolling window of 3 months for the farm data; the rest goes to a warehouse system. For the asset system we keep data for the duration of a show, and we have many shows being worked on from many studios. We don't store binary data in them.
This makes sense. With a high-enough release velocity to trunk, a super safe release pipeline with lots of automated checks, a well-tested rolling update/rollback process in production, and aggressive observability, it is totally possible to remove staging in many environments. This is one of the popular talking points touted by advocates of trunk-based development.
(Note that you can do a lot of exploratory testing in disposable environments that get spun up during CI. Since the code in prod is the same as the code in main, there's no reason to keep them around. That's probably how they get around what's traditionally called UAT.)
The problem for larger companies that tend to have lots of staging environments is that the risk of testing in production vastly exceeds the benefits gained from this approach. Between the learning curve required to make this happen, the investment required to get people off of dev, the significantly larger amounts of money at stake, and, in many cases, stockholder responsibilities, it is an uphill battle to get companies to this point.
Also, many (MANY) development teams at BigCo's don't even "own" their code once it leaves staging.
I've found it easier to employ a more grassroots approach towards moving people towards laptop-to-production. Every dev wants to work like Squeaky does (many hate dev/staging environments for the reasons they've outlined); they just don't feel empowered to do so. Work with a single team that ships something important but won't blow up the company if they push a bad build into prod. Let them be advocates internally to promote (hopefully) pseudo-viral spread.
This is good practice, except that blue/green is not exactly what you want. You want a smart load balancer that can shuffle an exact amount of traffic to a new service with your new deploy version. It must then evaluate the new service for errors and metrics, and then do an increase in shuffling traffic, etc, until you reach 100% shuffled traffic, at which time the old services can be decommissioned.
If at any time the monitoring of logs or metrics becomes unusual, it must shuffle all traffic away from the new service, alert devs, and halt all deploys (because someone needs to identify the bad code and unmerge it, thus requiring rework for all the subsequent work about to be merged). This is called "pulling the andon cord".
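A sketch of that evaluation step (the metric queries here are placeholders for whatever monitoring you have):

    def canary_verdict(baseline: str, canary: str) -> str:
        """Compare the canary against the currently serving version and decide what to do."""
        base_err, canary_err = error_rate(baseline), error_rate(canary)       # hypothetical queries
        base_p99, canary_p99 = latency_p99(baseline), latency_p99(canary)

        if canary_err > max(2 * base_err, 0.01) or canary_p99 > 1.5 * base_p99:
            return "rollback-and-halt"   # pull the andon cord: page a human, freeze the pipeline
        return "increase-traffic"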
It is sad that there's all these comments saying this doesn't work. This has been the best practice established by Etsy, Martin Fowler, and others in the DevOps community for... 10 years? I guess until you see it for yourself it seems unbelievable. It requires a radical shift in design, development and operation, but it works great.
We have review environments, so there is an easy way to have a fairly persistent config to QA features, but the environment named staging is more of a historical artifact. It's basically the same as a review environment, because we recognize that after testing that a feature works as intended, it's going out, and we may still be surprised by real production use. Our test suite, which is kicked off after you hit the merge button, takes about 10-15 minutes, and build/deploy to Amazon ECS is 8 to 10 minutes, so there is pretty quick feedback. We also use feature flags when possible, but most deploys are very granular and we generally don't worry if something passes our test suite, which is currently about 6k tests. Once we decided that merges to main get deployed automatically, our staging environment became just another environment, our velocity increased, security patches are deployed almost immediately, and we mostly don't worry about launches.
Disclaimer: I worked for a major feature flagging company, but these opinions are my own.
This article makes a lot of valid points regarding staging environments, but their reasoning to not use them is dubious. None of their reasons are good enough to take staging environments out of the equation.
I'd be willing to bet that the likelihood of anyone knowingly merging code that isn't ready to go live is close to zero. You still need to validate the code. Their branching strategy is (in my opinion) the ideal branching strategy, but again, that isn't a good enough reason to take staging away.
Using feature flags is probably the only reason they give that comes close to justifying getting rid of staging, but even then, you can't always be sure that the code you've built works as expected. So you still need a staging environment to validate some things.
Having hands-on deployments should always be happening anyway. It's not a reason to not have a staging environment.
If you truly want to get rid of your staging environment, the minimum you need is feature flagging of _everything_, and I do mean everything. That is honestly near impossible. You also need live preview environments for each PR/branch. This somewhat eliminates the need for staging because reviewers can test the changes on a live environment. Even these two things aren't a good enough reason to get rid of your staging environment; there are still many things that can go wrong.
The reason we have layered deployment systems (CI, staging etc) is to increase confidence that your deployment will be good. You can never be 100% sure. But I'll bet you, removing a staging environment lowers that confidence further.
Having said all of this, if it works for you, then great. But the reasons I've read on this post, don't feel good enough to me to get rid of any staging environments.
> If you truly want to get rid of your staging environment, the minimum you need is feature flagging of _everything_, and I do mean everything. That is honestly near impossible. You also need live preview environments for each PR/branch. This somewhat eliminates the need for staging because reviewers can test the changes on a live environment. Even these two things aren't a good enough reason to get rid of your staging environment; there are still many things that can go wrong.
This can be done very easily with many modern PaaS services. I had this 6 or 7 years ago with Google App Engine, and we didn't have a staging environment, as each branch would be deployed and tested as if it were its own environment.
Point is, staging environment is there to increase the confidence that what you are deploying won't fail. Removing that is doable, but I wouldn't recommend it.
I’ve worked with multiple teams where QA tests in prod behind feature flags, canary deploys, etc. Staging environments and QA don’t always go hand in hand.
I've seen this before at very large companies. All testing done in local and very little manual smoke testing in QA by either the PM or other engineers.
There are big tech companies that don't have QA people.
But you don't need to have a single staging env shared by all QA testers. Why not create individual QA environments on an as-needed basis for testing specific features? Of course this requires you to invest in making it easy to create new environments, but it allows QA teams to test different things without interfering with each other.
This worked reasonably well as v-hosts per engineer, though it did share some production resources. QA members would then run through test plans against those hosts to exercise the code. I prefer it to a single monolithic env. Though branches had to be kept up to date and bigger features tested as whole.
No, it says that if you have a QA process, it doesn't need to include a staging environment.
A QA process is just a process - it doesn't have necessary parts - as long as it's finding the right balance between cost, velocity, and risk for your needs, it's working. Some parts like CI are nearly universal now that they're so cheap; some like feature flags managed in a distributed control plane are expensive; some like staging deployments are somewhere in the middle.
Speaking as the guy who pushed for and built our staging environments, neither do staging environments. (Speaking also as the guy who has taken the whole site down a few times.)
> We only merge code that is ready to go live
> If we’re not confident that changes are ready to be in production, then we don’t merge them. This usually means we've written sufficient tests and have validated our changes in development.
Yeah I don't trust even myself with this one. Your database migration can fuck up your data big time in ways you didn't even predict. Just use staging with a copy of prod. https://render.com/docs/pull-request-previews
Sounds like OP could benefit from review apps, he's at the point where one staging environment for the entire tech org slows everybody down.
They mention the database as a reason not to have a staging env, due to the difference in size, but they don't mention how they test schema migrations or any feature that touches the data, which usually produce multiple issues, or even data loss.
This makes some sense for a single application environment. In our system, however, there are dozens of interacting systems, and we need an integration environment to ensure that new code works with all the other systems.
This probably also depends on your core business. If your product does not deal with real money, crypto, or other financial instruments and it is not serious if something goes wrong with a small number of people in production, this may work for you. It is probably cheaper and simpler.
Lots of products are not like that. I built a bank and work on stock exchanges. Probably not a good idea to save money by not testing as people get quite annoyed when their money goes missing.
I think we are missing some contexts here. I have been trying to find more information about them. From what I found [1] (hopefully accurate) it looks like they are a new team - Beta in August 2021 and just incorporated in this February.
The founder/CTO is a full stack developer. I speculate they are a very small team (1-2 developers at most) with a relatively straightforward architecture. In that context I suspect it is quite feasible to go from local to production without going through staging: they are likely to have a self-sustained stack that can be packaged; they don't have a huge database or collection of edge cases; they have few customers and low expectations in terms of service level; they don't have stakeholders to review and approve finished features (they are their own bosses). I empathize with where they are; I have been in the same place at some point. It will be interesting to see whether this is sustainable without staging, or for how long, as they grow in team and offering.
An important piece of context missing from the article is the size of their team. LinkedIn shows 0 employees and their about page lists the two cofounders so I assume they have a team of 2. It's odd that the article talks about the problems with large codebases and multiple people working on a codebase when it doesn't look like they have those problems. With only 2 people, of course they can ship like that.
How do you do QA? I mean, staging in our case is accessible by a lot of non technical people that test things automated test cannot test (did I say test?).
It seems like an April 1st troll (based on the publication date), but I am assuming it's not.
I can only say that this is a fairly poor decision from someone who appears knowledgeable enough to know better.
They could do everything they are doing as-is in terms of process, and just add a rudimentary test on a Staging environment as it passes to Production.
Over a long enough timeline it will catch enough critical issues to justify itself.
Isn’t the concept of a single staging environment becoming a bit dated? Every recent project I’ve worked on uses preview branches or deploy previews, eg what Netlify offers https://docs.netlify.com/site-deploys/deploy-previews/
no you're right, "staging" is gradually being replaced with per-commit "preview". but at enterprise scale when you have distributed services and data, and strict financial controls, and uncompromising compliance standards, it can often be unrealistic to transition to that until a new program group manager comes in with permission to blow everything up
When they lose all their most important customers’ data because the feature flags got too confusing… they can take this same article and say:
“BECAUSE WE xxxx that led to YYYY.
In future we will use a Staging or UAT environment to mitigate against YYYY and avoid xxxx”
Saving time on authoring a Post Mortem by pre-describing your folly seems like an odd way to spend precious dev time
I use a somewhat similar approach for Pirsch [0]. It's built so that I can run it locally, basically as a fully fledged staging environment. Databases run in Docker, everything else is started using modd [1]. This has proven to be a good setup for quick iterations and testing. I can quickly run all tests on my laptop (Go and TypeScript) and even import data from production to see if the statistics are correct for real data. Of course, there are some things that need to be mocked, like automated backups, but so far it has turned out to work really well.
You can find more details on our blog [2].
> Pre-live environments are never at parity with production
Same with your laptops... and this is only true if you make it that way. Using things like Docker containers eliminates some of the problem with this too.
> There’s always a queue
This has never been a problem for any of the teams I've been on (teams as large as ~80 people). Almost never do they "not want your code on there too". Eventually it's all got to run together anyway.
> Releases are too large
This has nothing to do with how many environments you have, and everything to do with your release practices. We try to do a release per week at a minimum, but have done multiple releases in a single day as well.
> Poor ownership of changes
Code ownership is a bad practice anyway. It allows people to throw their hands up and claim they're not responsible for a given part of the system. A down system is everyone's problem.
> People mistakenly let process replace accountability
Again - nothing to do with your environments here, just bad development practices.
> Code ownership is a bad practice anyway. It allows people to throw their hands up and claim they're not responsible for a given part of the system. A down system is everyone's problem.
Agreed with a lot of what you said up until this - this is, frankly, just completely wrong. If nobody has any ownership over anything, nobody is compelled to fix anything - I've experienced this first-hand on multiple occasions.
There have also been several studies done to refute your point - higher ownership correlates with higher quality. A particularly well-known one is from Microsoft, which had a follow up study later that attempted to refute the original findings but failed to do so. Granted, these were conducted from the perspective of code quality, but it is trivial to apply the findings to other scenarios that demand accountability.
Whoever sold you on the idea that ownership of _any and all kinds_ is bad would likely rather you be a replaceable cog than someone of free thought. I don't know about you, but I take pride in the things I'm responsible for. Most people are that way. I also don't give two shits about anything that I don't own, because there's not enough time in the day for everyone to care about everything. This is why we have teams in the first place.
There is a mile of difference between toxic and productive ownership - Gatekeepers are bad, custodians are good.
This sounds like an organisational issue, not a technical one, and I predict that this simply won't scale organisation-wise. It sounds like they have given no thought to their platform architecture, deploy pipelines, testing strategies, ... It's probably not yet causing issues because they're working in a small team, but rectifying this later will be an absolute pita.
That said, at scale, having one big staging/test/... environment can be impossible, but then things are split up organisationally, with each team/service group/... managing their own environments and being responsible for their reliability, stability and availability towards other teams.
Also, with service meshes it has become feasible to actually test in production so you can let select users end up on specific (test) versions of a certain backend service.
This works for services that are growing slowly in features or have few other services integrating with them. I'm not sure how this scales when there are multiple services across multiple teams with dependencies on one another, where services are being rapidly developed with new features. At work, we have staging/pre-prod environments across most teams that my team works with, so new features can be tested in staging and other teams can test integrating with them. This is also possible to do with just a production environment, but it requires some engineering effort to add feature flags and special headers indicating a request is from a team looking to try a new API.
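The special-header part is not much code once a flag client exists; a rough sketch (the header name and flag client are invented for the example):

    def use_new_pricing_api(request, flags) -> bool:
        # Partner teams opt in per-request while the feature stays dark for real users.
        if request.headers.get("X-Try-Feature") == "new-pricing-api":
            return True
        # Otherwise fall back to the normal flag evaluation (percentage, allowlist, etc.).
        return flags.is_enabled("new-pricing-api", user_id=request.user_id)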
All of their “problems” with staging are fixable bathwater that doesn't require baby ejection.
I avoid staging for solo projects but it does feel a bit dirty.
For team work or complex solo projects (such as anything commercial) I would never!
On the cloud it is too easy to stage.
To the point where I have torn down and recreated the staging environment to save a bit of money at times, because it is so easy to bring back.
The article says to me they're not using modern devops practices.
It is rare a tech practice “hot take” post is on the money, and this post follows the rule not the exception.
Have a staging environment!
Just the work / thinking / tech-debt payoff of making one is worth it for other reasons, including streamlining your deployment processes, both human and in code.
I have a lot of questions, but one above all the others. How do you preview changes to non-technical stakeholders in the company? Do you make sales people and CEOs and everyone else boot up a local development environment?
Also my main thought. Among other things, we sometimes use UAT as the place for broad QA on UX behavior a member of eng or data might not think to test. For quickly developed features that don’t go through a more formal design process, we’ll also review copy and styling.
They already said they use feature flags. Those usually allow betas or demos for certain groups. Just have whomever owns the flag system add them to the right group.
I guess that makes sense, but it means you would have rough versions of your feature sitting on production, hidden by flags. I could certainly be wrong about the potential for issues there, but it would definitely make me nervous.
I’m working at megacorp at the moment as contractor. The local dev, cloud dev, cloud stage, cloud prod pipeline is truly glacial in velocity even with automation like Jenkins, kubernetes, etc. it takes weeks to move from dev complete to production. It’s a middle manager’s wet dream.
I used to wonder why isn’t megacorp being murdered by competitors delivering features faster, but actually, everyone is moving glacially for the same reason, so it doesn’t matter.
I’m kinda reminded by pg’s essay on which competitors to worry about. I might be a worried competitor if these guys are pulling off merging to master as production.
A previous client was paying roughly 50% of their AWS budget (more than a million per year) just to keep up development and staging.
They were roughly 3x machines for live, 2x for staging and 1x for development.
Trying to get rid of it didn't work politically, because we had a cyclical contract with AWS where we were committing to spend X amount in exchange for discounts. Also, a healthy amount of ego and managers of managers BS.
In terms of what that company was doing, I'm pretty sure I could have exceeded their environment for 2k per month on hetzner (using auction).
It might be enough for this company, but if you are a big corporate, it's definitely not something to do. You cannot expect millions of consumers to just be OK with the fact that the mobile app is down because it's too hard to keep staging and prod in sync.
I am maintaining the infra for a big mobile app, and our staging environment allowed us to have only two production incidents in the last year, neither of which was caused by source code (they were networking issues).
I really recommend that any serious business at least try it and see the advantages for themselves.
When I started at Twitter, a guy was proposing building a staging environment. At that point it would have cost about $2M, so it was within the realm of conceivable. I was a pretty immediate “no”, for all sorts of reasons. Keeping the data in reasonable shape would have been a big project all by itself, getting reasonable load on it would have been another, and then of course it’s a large environment you need to be on call for but that doesn’t have the priority production has. It’s just an all around bad idea.
No disrespect but you can do this for an analytics dashboard or a content web site with canaries (Facebook). May not be the best for high liability sites like financial systems.
People hold up banking as the pinnacle of serious and responsible high-quality high-reliability software engineering and operations, but my bank is the only web property or mobile app I use that's routinely unavailable for hours at a time.
Compared to a system with maintenance windows, yeah, maybe? Banking seems pretty forgiving honestly. Customers accept 3-5 business days latency per transaction. And even if you get it wrong the first time, money is fungible. It seems like it should call for more relaxed practices than systems that need to process non-fungible data in tens of milliseconds 24x7.
This appears to be just a naming convention issue. All the potential problems of staging environments can occur, for the same underlying reasons, in the approach advocated here, but they don't happen in staging merely because there isn't anything called that.
Personally, I think the approach advocated here is feasible, and even necessary if you are operating at global scale, but I am skeptical of tendentious stories about how it makes a number of problems just disappear.
Without a staging environment, your chance of finding critical bugs relies on offline testing. Not all bugs can be found in unit tests; you need load tests to detect the bugs that don't break your program from a correctness perspective but do hurt it on the latency/memory-leak front. And such tests might take a longer time to run.
Staging slows things down, but that is intended: it creates a buffer in which to observe behavior. Depending on the nature of your service, that can be quite critical.
In one of my very first jobs in the mid '90s there was also an incoming team we took over from a major competitor, who made the processes I had introduced much simpler by removing dev testing, staging and CVS.
They preferred to work as root on the live servers with 80,000 customers. Development was apparently so much easier with immediate feedback.
I liked that so much, that I resigned, and found a much better job 2 years later.
I guess you could say bad cultural fit.
> Pre-live environments are never at parity with production
As a B2B vendor, this is a conclusion we have been forced to reach across the board. We have since learned how to convince our customers to test in production.
Testing in prod is usually really easy if you are willing to have a conversation with the other non-technical humans in the business. Simple measures like a restricted prod test group are about 80% of the solution for us.
If they're not at parity then you are doing CI/CD wrong and aren't forcing deploys to staging before production. If you set up the pipelines correctly then you *can't* get to production without being at parity with pre-production.
> they don’t want your changes to interfere with their validation.
Almost like those are issues you want to catch. That's the whole point of continuous integration!
The only way that you can create stable and safe systems is by introducing processes to ensure that your systems are stable and safe. It doesn't matter how much personal responsibility you claim to take, you are going to make mistakes, and processes are the mechanism for limiting the damage of those mistakes. This is core to the best practices behind any safety critical industry, and is embedded in functional safety. The logic of this article appears to be "We just concentrate very hard to make up for not having a decent staging environment". Which is fine if no one cares if your stuff breaks.
>When there is no buffer for changes before they go live, you need to be confident that your changes are fit for production.
This is just completely wrong-headed. It's like saying you should learn to tightrope walk 100 metres off the ground because it will make you concentrate harder on not falling. The solution to making mistakes isn't to increase the fallout of those mistakes. You can absolutely build a culture where you put the onus on the developer to feel responsible for keeping master clean and working, without abandoning the processes that help mitigate when you fail to do that.
The funny thing is that when you see articles saying the opposite of this, they almost always also say "over the course of X months, our new staging environment caught Y additional bugs that would have impacted production". I'd love to see the same here: some actual data on how their "we're just going to concentrate harder" approach impacts production.
One approach I’m experimenting with is that all services communicate via a message channel (e.g. NATS or Pub/Sub).
By doing this, I can run a service locally but connect it to the production pub/sub server, and then see how it affects the system if I publish events to it locally.
I could also subscribe to events and see real production events hitting my local machine.
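With nats-py the local-subscriber half is only a few lines; a sketch (the server URL and subject are examples):

    import asyncio
    import nats

    async def main():
        # connect the locally running service to the shared NATS server
        nc = await nats.connect("nats://nats.internal.example.com:4222")

        async def handler(msg):
            # real production events, observed from the local machine
            print(f"[{msg.subject}] {msg.data.decode()}")

        await nc.subscribe("orders.created", cb=handler)
        await asyncio.sleep(3600)   # keep consuming for a while
        await nc.drain()

    asyncio.run(main())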
Yea that sounds like a nice way to do things. I could see there being security concerns that devs can directly access data streams from their local setup. For places with data controls I could see this being a no-go.
I guess you could have an anonymizer which consumes the production pub-sub and then anonymizes the data for consumption by non-prod environments.
At my previous job we had a single staging environment, which was used by dozens of teams to test independent releases as well as to test our public mobile app before release. That said, it never matched production, so releases were always a crapshoot as things suddenly happened no one ever tested. Yes, it was dumb.
This is how we work at fastcomments... soon we will have a shard on each major continent and will just deploy changes to one shard, run e2e tests, and then roll out to the rest of the shards.
But if you have a high risk system or a business that values absolute quality over iteration speed, then yeah you want dev/staging envs...
I currently have this with a client. When I was the only backend developer, the staging server setup worked perfectly because it was basically the master branch + 1 or 2 queued changes. Now that other people have joined the mix it's become more of a nuisance.
different business or organisational contexts have different deployment patterns and different negative impacts of failure.
in some contexts, failures can be isolated to small numbers of users, the negative impacts of failures are low, and rollback is quick and easy. in this kind of environment, provided you have good observability & deployment, it might be more reasonable to eliminate staging and focus more on being able to run experiments safely and efficiently in production.
in other contexts, the negative impacts of failure are very high. e.g. medical devices, mars landers, software governing large single systems (markets, industrial machinery). in these situations you might prefer to put more emphasis on QA before production.
What are some useful tools for running a development environment for each dev?
I have a pretty common setup of AWS services, Terraform, Docker, etc. Deploying this to a fresh AWS account is largely automatic but it takes about 20 minutes and it’s also expensive.
What works for you works for you. If you can't have a staging environment, you've obviously found a workaround. There are many ways to deploy. Basically, you decide what risk you want to accept when you define a lifecycle.
I use my staging environment to let prospective clients or colleagues create and play with accounts without touching "real" data, and in the past used it to let a pentester test the non-prod site
> If we’re not confident that changes are ready to be in production, then we don’t merge them. This usually means we've written sufficient tests and have validated our changes in development.
This isn’t very uncommon. In fact, it's exactly what the article claims it isn't: a staging/pre-live environment. Only instead of deploying it somewhere, you keep it local.
How do you define a large scale database migration? If you're just updating data or schema, that can be done locally via integration test. No need for a separate environment.
Say you want to extract one of the columns in that table out to a new, separate table - a job that will take several hours to complete.
You want to do this without any visible downtime or breakage to your end-users - likely with some kind of complex dual-write and/or dual/read mechanism during that operation.
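A minimal sketch of what the dual-write phase can look like, assuming a hypothetical `db` helper and Postgres-style SQL, with the backfill job, verification, and final cutover left out:

```python
# Old readers still use users.address; new readers use the addresses table.
# The `db` handle, its transaction()/execute()/query_one() methods, and the
# table names are all hypothetical - stand-ins for whatever data layer you
# actually have.

def update_address(db, user_id: int, address: str) -> None:
    with db.transaction():
        # Write to the old location so not-yet-migrated readers stay correct.
        db.execute(
            "UPDATE users SET address = %s WHERE id = %s",
            (address, user_id),
        )
        # Write to the new location so the backfill never goes stale.
        db.execute(
            "INSERT INTO addresses (user_id, address) VALUES (%s, %s) "
            "ON CONFLICT (user_id) DO UPDATE SET address = EXCLUDED.address",
            (user_id, address),
        )

def read_address(db, user_id: int, use_new_table: bool):
    # Reads flip to the new table only after the backfill has finished and
    # been verified; the flag keeps the cutover reversible.
    if use_new_table:
        return db.query_one(
            "SELECT address FROM addresses WHERE user_id = %s", (user_id,)
        )
    return db.query_one(
        "SELECT address FROM users WHERE id = %s", (user_id,)
    )
```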
Does QA just pull and test against a dev instance? Do they test against prod? Do engineers get prod API keys if they have to test an integration with a 3rd party?
Solution TL;DR: "Test your code, and push to production."
They completely misunderstood the problem and their solution literally changed nothing other than making devs test their code now. Staging could stay as is and would provide some significant risk mitigation with zero additional effort.
"Whenever we deploy changes, we monitor the situation continuously until we are certain there are no issues."
I'm sure customers would stay on the site, monitoring the situation too. Good luck with that strategy.
or they could maybe use a specific OS as their golden image, use ansible or chef or puppet or any of the hundreds of tools that configure machines, and keep their staging and prod in sync. Bonus points for introducing a service that produces mock data for staging.
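The mock data part can start out embarrassingly simple - a sketch assuming the Faker library and a made-up customer record shape:

```python
from faker import Faker  # pip install faker

fake = Faker()

# Hypothetical record shape; the idea is just that staging gets
# realistic-looking but entirely synthetic data instead of a prod copy.
def fake_customer() -> dict:
    return {
        "name": fake.name(),
        "email": fake.email(),
        "address": fake.address(),
        "signed_up_at": fake.date_time_this_year().isoformat(),
    }

if __name__ == "__main__":
    for _ in range(100):
        print(fake_customer())
```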
Instead of going back to a known good version, they release a hotfix to prod. This will probably backfire once they encounter a bug which is hard to fix.
> Pre-live environments are never at parity with production
My experience is that it is fairly trivial to have parity with production. Whatever you do for production, just do it again for staging. That's what it is meant to be.
> Most companies are not prepared to pay for a staging environment identical to production
Au contraire. Every company I've worked at has been more than willing to pay for this. And secondly, it is pennies compared to production environment costs, because staging isn't expected to handle any significant load. The article does mention load handling as one of the things that differ, but I have not yet found the need to use staging to verify load scaling capabilities.
> There’s always a queue
I don't understand this paragraph at all. It seems like an artificial problem created by how they handle repository changes, and has little to do with the purpose of a staging environment. It smells fishy to have local changes rely on a staging environment. The infrastructure I set up spun up a separate development environment for the development testing pipeline; it doesn't, and shouldn't, need to rely on staging.
> Releases are too large
Well... one of the main benefits of having a staging environment is to safely do frequent small deployments. So this just seems like the exact wrong conclusion.
> Poor ownership of changes
This, again, is not at all how I understand code should be shipped to a staging environment. "I’ve seen people merge, and then forget that their changes are on staging" - what does this even mean? Surely staging is only ever deployed to from the latest release branch, which in turn comes from main/master? The follow-up, "and now there are multiple sets of changes waiting to be released", also suggests a fundamental misunderstanding. *Releases* are what are meant to end up in staging. Multiple sets of changes should be *a* release.
> People mistakenly let process replace accountability
> "By utilising a pre-production environment, you’re creating a situation where developers often merge code and “throw it over the fence"
Again: a staging environment isn't a place where you dump your shit. Staging is a place where releases are verified in an environment that is as close to production as possible. So, again, this seems like entirely missing the point.
----
It seems to me that they don't use a staging environment, because they don't understand what such a thing should be used for. I'd be completely OK with someone rationalizing this as "too much of a hassle". But to try and justify something so poorly...
From their conclusion:
> Dropping your staging environment in favour of true continuous integration and deployment can create a different mindset for shipping software. When there is no buffer for changes before they go live, you need to be confident that your changes are fit for production. You also need to be alert and take full ownership of any changes you make.
Well... of course there is a shift in mindset when you'll be shitting your pants every time you make a change in production, since that's when you'll get to see if you broke something. The whole point of a staging environment is to have a buffer.... so that you don't have to be "confident". So that you don't have to be on high alert, because you have alerts that can trigger without anything important going offline. So that ownership isn't crucial in a post-fuck-up blame game.
I struggle with a lot of the arguments made here. I think one key thing is that staging can mean different things. In the author's case, they say you "can’t merge your code because someone else is testing code on staging." It is important to differentiate between that kind of staging, used for testing development branches, and a staging environment where only code that has already been merged for deployment is automatically deployed.
Many of the problems are organizational/infrastructure challenges, not inherent to staging environments/setups. Straightening out dev processes and investing in the infrastructure solves most of the challenges discussed.
Their points:
What's wrong with staging environments?
* "Pre-live environments are never at parity with production" - resolved with proper investment in infrastructure.
* "There’s always a queue [for staging]" - is staging the only place to test pre-production code? If you need a place to test code that isn't in master, consider investing in disposable staging environments or better infrastructure so your team has more confidence for what they merge.
* "Releases are too large" - reduced queues reduces deployment times. Manage releases so they're smaller.
* "Poor ownership of changes" Of course this happens with all that queued code. address earlier challenges and this will be massively mitigated. Once there, good mangers's job is to ensure this doesn't happen.
* "People mistakenly let process replace accountability" - this is a management problem.
Solving some of the above challenges with the right investments creates a virtuous cycle of improvements.
How we ship changes at Squeaky?
* "We only merge code that is ready to go live" - This is quite arbitrary. How do you define/ensure this?
* "We have a flat branching strategy" - Great. It then surprises me that they have so much queued code and such large releases. I find it surprising they say, "We always roll forward." I wonder how this impacts their recovery time.
* "High risk features are always feature flagged" - do low risk features never cause problems?
* "Hands-on deployments" - I'm not sure this is good practice. How much focus does it take away from your team? Would a hands-off deployment with high confidence pre-deploy, automated deployment, automated monitoring and alerting, while ensuring the team is available to respond and recover quickly?
* "Allows a subset of users to receive traffic from the new services while we validate" is fantastic. Surprised they don't break this into its own thing.
This sounds horrible unless they have a super reliable way to roll back changes to a consistent working state, both in their deployments and their databases.
Agreed, this sounds crazy. One argument raised is that staging is often different from prod, but their laptops are even more different. It seems the main goal was to save money. All of this makes sense only for a very small team and codebase.
Or, on the flip side, how much do you lose by deploying an 'oops', resulting in customers having a bad experience and posting "This thing sux!" on social media?
I can sympathize with the costs in both time and money to maintain a staging environment, but you're going to pay for those bugs somehow - either in staging or in customer satisfaction.
You really need to use canary deployments/feature flags with this style, i.e. release to production but only for a group of users, or be able to turn a feature off without another deployment.
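For example, something like this - a sketch with a made-up in-process flag table (in practice you'd use LaunchDarkly, Unleash, a config service, etc.) - lets you route a percentage of users to the new path and turn it off with a config change rather than another deploy:

```python
import hashlib

# Hypothetical flag store; in a real system this would be fetched from a
# flag service or config store so it can change without a deploy.
ROLLOUT_PERCENTAGES = {
    "new-checkout-flow": 5,   # serve the new code path to ~5% of users
    "beta-dashboard": 0,      # merged but dark: nobody sees it yet
}

def is_enabled(flag: str, user_id: str) -> bool:
    """Deterministically bucket users so each one gets a stable experience."""
    percent = ROLLOUT_PERCENTAGES.get(flag, 0)
    if percent <= 0:
        return False
    digest = hashlib.sha256(f"{flag}:{user_id}".encode()).hexdigest()
    return int(digest, 16) % 100 < percent

def render_checkout(user_id: str) -> str:
    # The old path stays as the fallback, so turning the feature off is a
    # config change, not another deployment.
    if is_enabled("new-checkout-flow", user_id):
        return "new checkout"
    return "old checkout"
```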
Staging, tests, previews and even running code locally is for people who make mistakes. It's dumb and a total waste of time if you don't make any mistakes.
No testing at all, that's what I call optimizing for success!
On a more serious note:
Sometimes staging is the same as local, and in those situations there is very limited use for staging.
We often deploy to production directly because a customer wants a feature right now. I was thinking of changing the staging server to be called beta. Customers can use new features directly, but at their own risk.
Staging environments should be separate from production environments. If the Beta is expected to persist data in the long term, then it's not staging. Staging environments should be nukable. You don't want a messy Beta release to corrupt production data or to have customers trying to sue you if you reset staging.
I don't know about your customer, but wanting a feature yesterday may be a sign of some dysfunctional operating practices. Shortening your already short deployment pipeline shouldn't be the answer, unless it's currently part of the problem. Otherwise, this should be solved by setting better expectations.
What I found with customers is that they really like it if they talk to you about a feature, and next week it's there, although it's a preview version of the feature. After that they forget about it a bit and you've got plenty of time to perfect it.
It's mostly front-end features that change a lot, so there is not much danger in running them on the prod api and db. Our api is very stable because it uses event streaming. Mostly the front-end is different for different customers.
it's a good idea to be crystal clear about which environments are running production workloads. if you end up with "non-production" environments running production workloads then it becomes much easier to accidentally blow away customer data, let alone communicate coherently. "beta" is fine provided it is regarded as a production environment. you may still want a non-production staging environment!
i worked somewhere that had fallen into this kind of mess, where 80% of the business' production workloads were done in the Production environment, and 20% of the business' production workloads (with slightly different requirements) were done in a non-production test environment. it took calendar years to dig out of that hole.
I like to go even farther, I advocate only merging code that won't break anything. If you're feature flagging as many changes as possible then you can merge code that doesn't even work, as long as you can gate users away from it using feature flags. The sooner and more often you can integrate unfinished code (safely) into master the better.
Imagine writing this entire blog post and being completely wrong about every topic you discuss. This is the most amateur content I've seen make it to the front page, let alone top post.
At Facebook too there was no staging environment. Engineers had their dev VM and then after PR review things just went into prod
That said features and bug fixes were often times gated by feature flags and rolled out slowly to understand the product/perf impact better
This is how we do it at my current team too…for all the same reasons that OP states