
Ask HN: Testing in production, do you do it, and if so, how? - JeanMertz
At our company, we've neglected our "staging environment" for quite some time, and are feeling the pain. It is hurting our iteration speed, and even though we do rigorous unit and integration testing, undefined (or badly defined) business logic still ends up causing problems in production. It has resulted in a culture where (client-)developers and our support team create "fake" accounts in our production environment to validate customer complaints, or to test newly built customer flows.

It has to be said that our systems are pretty stable, and the number of bugs we ship is minimal, but the real pain is in the time it takes to validate whatever you are currently working on in an end-to-end environment, outside of your unit-tested world.

We are designing a plan to make real-world testing easier for our different teams, and asked them what their biggest pain points were:

  - Missing real-world data to mimic production usage
  - Missing (or broken) key systems in staging, causing missing functionality
  - Difficult to start testing from a particular customer state onwards
  - Sometimes you really want to test the actual payment flow of customers, without using the sandbox environment of a PSP

We've been thinking about possible solutions, but wanted to survey the land first, and see how other companies (of different sizes) tackle these problems.

I'm interested in your thoughts on this topic: how do you handle testing functionality outside of using unit and integration tests? Do you maintain two (or more) environments? Do you allow testing on your production environment, and if so, how do you model such a system to keep garbage data from impacting other systems?
======
cimmanom
I’ve had good results with a script that sanitizes a subset of production data
for use in a staging environment.
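
As a rough illustration of what such a script might look like, here is a minimal Python sketch. It is not the actual script referred to above: the field names (`user_id`, `email`, `name`, `phone`), the salt, and the percentage-based sampling are all assumptions made for the example. It picks users deterministically by hashing their id, so the same users are selected in every table, and replaces personal fields with stable pseudonyms.

```python
import hashlib

# Hypothetical column names holding personal data; adjust to the real schema.
PII_FIELDS = ("email", "name", "phone")


def in_sample(user_id, percent: int) -> bool:
    """Deterministically keep roughly `percent`% of users.

    Hashing the id (instead of random sampling) means the same users are
    selected in every table, so related rows stay together.
    """
    digest = hashlib.sha256(str(user_id).encode("utf-8")).hexdigest()
    return int(digest, 16) % 100 < percent


def pseudonymize(value: str, salt: str) -> str:
    """Replace a value with a stable token derived from a salted hash."""
    digest = hashlib.sha256((salt + value).encode("utf-8")).hexdigest()
    return digest[:12]


def sanitize_row(row: dict, salt: str) -> dict:
    """Return a copy of `row` with personal fields replaced by pseudonyms."""
    clean = dict(row)
    for field in PII_FIELDS:
        if clean.get(field) is not None:
            token = pseudonymize(str(clean[field]), salt)
            # Keep emails syntactically valid but undeliverable.
            clean[field] = f"{token}@example.invalid" if field == "email" else token
    return clean


def sample_and_sanitize(rows, percent: int, salt: str) -> list:
    """Keep only rows of sampled users, with personal fields sanitized."""
    return [sanitize_row(r, salt) for r in rows if in_sample(r["user_id"], percent)]
```

Because the pseudonyms are stable for a given salt, joins on sanitized values still line up across tables. The salt should be kept out of the staging environment, otherwise known values could be re-hashed and matched.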

------
JeanMertz
We've been discussing potential solutions, and came up with three avenues,
listing their pros and cons. We're leaning towards a "Sandbox accounts in
production" approach at the moment, but recognize that none of the solutions
are free, and whatever we decide on will have a big impact going forward.

Production-like environment

    
    
      + Clear, separate environment
      + No impact on production data / users (events, emails)
      + Not a problem if we mess up data without cleaning up afterwards
      + Runs the same code as on production
      + Requires very few changes to existing code
    
      - non-production code can (and will!) be deployed, making it no longer "production-like"
      - data needs to be kept in sync, to be usable by clients
      - every service needs to run both a production instance, and a production-like instance
      - all external services need to work with this production-like setup (payments, emails, etc...)
      - higher ongoing maintenance overhead
      - requires "manual agreements" on how to manage/manipulate this environment
    

Individual temporary environments

    
    
      + all the pros of the "Production-like environment"
      + even more isolated, higher guarantee of your expected state of the environment
    
      - requires a lot of CI/operational changes
      - requires ongoing maintenance to keep working
      - requires "operational" knowledge to make new changes work with this setup
      - requires more syncing of data
      - higher costs of running
    

Sandbox accounts with special "capability" flags, in production

    
    
      + All test-data is scoped to a (sandbox) user account, single "source of truth" on whether some piece of data is test-data
      + One single environment to maintain, no divergence
      + Data is always the same as production
      + Whatever code runs on production, is what you test
      + Because you want to test your changes, you automatically make sure your new code works with sandbox accounts / capability flags
      + You deploy your pre-production changes behind a capability flag, for testing
      + A preference page allows you to enable/disable certain capabilities to enable real/fake PSP environments, pre-release functionality, etc...
    
      - Much more complex to realise
      - Only works (well) for user-scoped test-data
      - Might result in a steeper learning curve to ship a feature that works with sandbox+capability flags
      - Production data can be changed accidentally, there's nothing but our own code between test and production data
      - Other systems need to be able to handle (and/or ignore) test data
      - Only suitable for the specific use-case of testing user-flows, not for testing e.g. whether a library upgrade broke anything
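
To make the sandbox idea a bit more concrete, here is a minimal Python sketch of how the account flag and capability checks could fit together. Everything in it is hypothetical: the `Account` shape, the capability names, the `RealPSP`/`FakePSP` stand-ins, and the event format are assumptions for illustration, not our actual design.

```python
from dataclasses import dataclass
from enum import Enum


class Capability(Enum):
    # Hypothetical capability names, toggled from a preference page.
    FAKE_PSP = "fake_psp"
    PRE_RELEASE_CHECKOUT = "pre_release_checkout"


@dataclass(frozen=True)
class Account:
    id: int
    sandbox: bool = False  # single source of truth for "is this test data?"
    capabilities: frozenset = frozenset()

    def can(self, capability: Capability) -> bool:
        # Capabilities only take effect on sandbox accounts, so a stray
        # flag on a real customer account changes nothing.
        return self.sandbox and capability in self.capabilities


class RealPSP:
    name = "real"


class FakePSP:
    name = "fake"


def psp_for(account: Account):
    """Route sandbox accounts with the FAKE_PSP capability to the fake PSP."""
    return FakePSP() if account.can(Capability.FAKE_PSP) else RealPSP()


def exclude_test_data(events: list, accounts_by_id: dict) -> list:
    """What a downstream system (reporting, emails) could do to ignore test data."""
    return [e for e in events if not accounts_by_id[e["account_id"]].sandbox]
```

A sandbox account without the `FAKE_PSP` capability still gets the real PSP here, which would cover the "test the actual payment flow" case, while downstream systems filter on the one `sandbox` flag rather than on individual capabilities.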

