I agree that the popular load testing tools leave plenty to be desired, but have you given k6[1] a try? (Full disclosure: I'm one of the maintainers.)
Tests are written in JavaScript and there's support for HTTP, WebSockets and (unary) gRPC. You can easily script a combination of these protocols to mimic real world traffic.
Furthermore, you can record a user flow with a browser extension[2] and convert the generated HAR file to a k6 script[3], which gets you even closer to a real-world scenario. The conversion is not perfect, and depending on the service you might need to modify the script manually, but it will get you 90% of the way there.
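To give a feel for it, a minimal test script looks something like this (the target URL and numbers are just placeholders):

    import http from 'k6/http';
    import { check, sleep } from 'k6';

    // 50 virtual users for 2 minutes; adjust to taste.
    export const options = {
      vus: 50,
      duration: '2m',
    };

    export default function () {
      const res = http.get('https://test.k6.io/'); // placeholder target
      check(res, {
        'status is 200': (r) => r.status === 200,
        'body is not empty': (r) => r.body.length > 0,
      });
      sleep(1); // think time between iterations
    }

Running it with "k6 run script.js" prints request rates, latency percentiles and check results at the end.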
You might get a less biased response from someone else, and I don't have much experience with JMeter, but:
- k6 prioritizes a modern developer experience. Tests are written in JavaScript, run from the CLI and can be committed to the same repository as your unit and integration tests. As such, it blurs the line between the responsibilities of a traditional QA team and application developers, and pushes for load/performance/stress testing to be done by the developers themselves. This makes it easy to integrate k6 into existing CI pipelines and run load tests early in the development cycle rather than as an afterthought (see the threshold sketch at the end of this comment).
- k6 is written in Go, so it's much easier to deploy and use as a static binary, which also brings considerable performance benefits over JMeter. You can see a detailed breakdown and comparison in this article[1], written by another commenter here and one of the original k6 authors: rlonn. :)
OTOH JMeter is a much more mature tool, with many more integrations and much broader protocol support than k6. We recently launched an extension system for k6[2] that allows developers to add support for other protocols or features via native Go libraries, but that ecosystem is still in its infancy. So if you don't mind JMeter's UX and performance, or you need some of its features, by all means stick with it. But I would encourage you to give k6 a try for yourself, as it was written precisely out of the same frustrations mentioned in the article.
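To make the CI point above concrete: you can attach pass/fail thresholds to a test, and k6 exits with a non-zero code when a threshold is breached, which is what fails the pipeline. A minimal sketch (the endpoint and numbers are placeholders):

    import http from 'k6/http';
    import { check } from 'k6';

    export const options = {
      vus: 20,
      duration: '1m',
      // If any threshold fails, "k6 run" exits non-zero and the CI job fails.
      thresholds: {
        http_req_duration: ['p(95)<500', 'p(99)<1500'], // milliseconds
        checks: ['rate>0.99'],                          // >99% of checks must pass
      },
    };

    export default function () {
      const res = http.get('https://staging.example.com/health'); // hypothetical endpoint
      check(res, { 'status is 200': (r) => r.status === 200 });
    }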
It's easier to kick off in a Jenkins pipeline since it's standalone, but it's otherwise less powerful and less general-purpose, and the architecture that supports the commercial/cloud offering causes some weird trade-offs when you're crafting test cases.
As a url-blaster it's working fine for me, though.
IMHO, how you load test is almost as important as where you load test and why. There are three separate reasons to do load testing:
1) Performance testing. Confirm the system does not degrade under a specified load and find out what performance can be expected under these circumstances. This is basically ensuring your system can handle X amount of traffic without issues and knowing your baseline performance. You should get the same kind of response times as you are getting from live server telemetry.
2) Stress testing. Finding out what happens when the system is stressed beyond its specs. How does it degrade, and where?
3) Reliability testing. Find out how your system breaks and when. The goal here is to try to break the system and test things like failover, and to make sure you don't lose or corrupt data. Better to die gracefully than abruptly.
If that's new to you, you've probably been doing it wrong, because those three require very different approaches.
I tend to avoid having to do load testing as it sucks up time without telling me much of interest. I instead opt for having decent telemetry on the live system. It tells me how the system performs and where the bottlenecks are. I can set alerts and take action when things degrade (e.g. because of a bad change). Besides, there is no substitute for having real users doing real things with real data. And in any case, having telemetry is crucial for doing any meaningful stress or reliability testing; otherwise you just know the system degrades without understanding why.
There are still valid reasons to do separate load testing of course, but I seem to get away with mostly not doing it. When I do, vegeta is what I reach for first. I find most of the need for load testing comes from SLA requirements or otherwise nervous product owners who need to be reassured. They tend to be more interested in performance tests (i.e. the happy case), whereas stress testing is where you actually learn things about your system.
Thanks for breaking out the three sub-types. I would add that if someone is not careful, they may not be testing what they think they are testing. For example, when testing response latency, beware of caching. If you hit the same data a million times, your average can look really good, but that's not what your users will experience.
At Mattermost we went for the do-it-yourself option and wrote a custom tool for the job [1]. After a lot of research on all the existing open-source frameworks, we couldn't really find anything that would fit our use case. We are quite happy with the result, although, as the OP mentioned, there's a significant maintenance cost attached. As new features get implemented and more API calls are added, you need to go back and make sure the logic that defines user behaviour stays in sync with the real world.
If I were to do it all over again, I'd probably give k6 [2] a chance but I am still convinced a tailored solution was the best choice.
Last time I had to do load testing in a professional context, we required 10+ million long-lived TCP and websocket connections transacting multiple times a second. There weren't any off-the-shelf solutions at that time that could come within an order of magnitude of that at a reasonable cost - the most viable solution we tested required a thousand EC2 instances to sustain that traffic.
Ended up hand-coding a load test framework in a few hundred lines of highly async epoll-based C. Scenarios controlled by a simple configuration syntax in a text file. Trivially generated enough load to break the load balancer in front, our service, and all the services and databases behind it - all this running on just a handful of instances.
Pre-existing tools are great, but often don't solve the problem at hand. If load test frameworks aren't doing it for you, don't be afraid to write your own.
I've used JMeter quite a bit over the last year, and what seems to be the largest issue is that it doesn't "break" apps, because it works at the protocol level and isn't a full "browser". As you increase load and response times start to increase, the time between requests also increases, since it runs sequentially through the test plan. But when actual humans with browsers use apps, there are loads of AJAX requests being fired which don't necessarily go in order or wait for others to complete first.
What I found is to really bring things to the breaking point you actually need two tests at the same time, one to generate static load, and another to spank it. Of course, most people don't need to do this - we had to bring our app to its knees to find a particularly elusive (possibly concurrency related) bug that only appeared with high load & many active sessions.
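For what it's worth, some tools can express that "two tests at once" pattern in a single run. A rough k6 sketch, just to illustrate the idea (the target, names and numbers are made up): a constant background scenario with a spike layered on top after five minutes.

    import http from 'k6/http';
    import { sleep } from 'k6';

    export const options = {
      scenarios: {
        // steady "static" load for the whole run
        steady_background: {
          executor: 'constant-vus',
          vus: 100,
          duration: '15m',
        },
        // a second wave that hammers the app while the background load is active
        spike: {
          executor: 'ramping-vus',
          startTime: '5m',
          startVUs: 0,
          stages: [
            { duration: '1m', target: 400 },
            { duration: '3m', target: 400 },
            { duration: '1m', target: 0 },
          ],
        },
      },
    };

    export default function () {
      http.get('https://staging.example.com/'); // hypothetical target
      sleep(1);
    }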
At the end of the day, I think load testing will only get you to ~80% accuracy. I completely agree with the author about the conclusion.
That is – gradually increase load on your service with concurrent requests until it is saturated.
Measure how latency and errors progress with more concurrent requests, and understand at what point, and how, your service starts to break down under heavy load.
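For example, with a tool like k6 that ramp can be written as a staircase of stages, and you read the latency/error trend off each step (the endpoint and step sizes are placeholders):

    import http from 'k6/http';
    import { check, sleep } from 'k6';

    // Step the concurrency up until the service saturates, then ramp down
    // to see how it recovers.
    export const options = {
      stages: [
        { duration: '2m', target: 50 },
        { duration: '2m', target: 100 },
        { duration: '2m', target: 200 },
        { duration: '2m', target: 400 },
        { duration: '2m', target: 0 },
      ],
    };

    export default function () {
      const res = http.get('https://staging.example.com/api/resource'); // hypothetical
      check(res, { 'status is 200': (r) => r.status === 200 });
      sleep(0.5);
    }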
Based on this you may have to do many things –
1. Can you optimize your service, its downstream dependencies, or the application calling your service?
2. Can you build graceful degradation into your service – functionality reduction to get more useful throughput out of the same resources?
3. Build circuit breakers and throttlers in front of your downstream dependencies, so that you don't overload them and cause them to fail, and so that you don't fail completely when they do fail.
4. If you do get overloaded for some reason (say your server pool suddenly became half its size), are you able to recover quickly?
5. Do you have monitoring and alerts for these load scenarios?
If you do this for each of the services in the service call graph, then your distributed service application would be robust.
Load testing has been both a blessing and a bane in my career.
Most managers only want to see how traffic will perform in what-if scenarios (can we handle Black Friday, what will happen if our traffic goes up 10x during a special event, etc.). For these, JMeter plus whatever service you're using for analytics/metrics works perfectly fine (New Relic is the current favorite). Then we can compare average latency, error rates, etc. with what we get from JMeter.
The issue is that most managers or other PHBs want more than just a one-off for load testing. And this article covers that well.
A lot of the time people want a Swiss Army knife of tools for "automation", and load testing falls under that category for them.
This magical tool should be able to:
* Test your API calls from a functional, integration, and performance perspective (not load testing; just making sure we're under a specific latency)
* Test your websites from the same testing perspectives, as well as being modular (i.e. Selenium's POM)
* Integrate with CI so specific tests, test suites/flows can be tested for every new build, and we can run specific workflows by clicking a button
* Be used during load testing so we can measure latency and run tests while recreating customer experience from the Web side of things
For each of these steps, there is at least one good tool or library to handle things. However, I've found it difficult to find something that handles all of these well and where (most importantly) the other devs are willing to pick it up.
Selenium/Web Driver handles the UI tests. Postman and a handful of others can perform the API tests. JMeter is the de facto for load testing at every place I've been to. And you can write any of these into your CI.
Is there something that can do all of this well, and has a low enough overhead where devs on your team are willing enough to actually use it as more than just a toy or experiment?
My startup, BrowserUp has something that does this. We're looking for early-beta users and feedback. I'd love to chat when you have a chance. On gmail, I'm ebeland.
In an era before cloud, one of the major project risks was under-provisioning hardware. Load testing made sense then. It makes less sense now.
I've given up on load testing altogether. Most applications or services quickly grow too complex for load tests to be maintained cost-effectively, given the pace of change demanded of them.
Instead, with every team I talk to about load testing or performance, I shift the conversation to observability. If the team can't understand current performance and load in production, any load testing results from another environment will be poorly understood and hold little value.
This approach positions the team much better to react to regressions in production vs holding up work trying to create or pass a load test.
The exception I make is load testing to validate technology choices, as risk mitigation for the possibility that the technology can't perform (i.e. will this query work when moving from SQL to Elasticsearch? What happens if I have 100x the amount of data in that table?). Targeted, specific scenarios, to confirm the behavior of things too expensive to try in production.
I'm sure there are a few performance critical apps that need these tools, but the vast majority of software doesn't. Don't burn 100s of hours like I did to validate performance before release. Start with gaining a deep understanding of your production behavior and push for production experimentation. It is significantly less time-consuming and pays way more dividends.
Prioritising observability and load stress mitigation mechanisms/playbooks is the right thing to do for modern cloud native application service clusters.
In such an environment, you could fairly easily make it possible to subject your prod deployment to synthetically generated end-to-end scenario traffic in a safe manner (through request tainting) so that you can create controlled stress situations and see how your application behaves at breaking conditions.
This is important for understanding if the observability and mitigation mechanisms/playbooks are in good shape or not.
It is especially important to be able to do this periodically, because the functionality is continuously evolving. Cloud can lull you into a false sense of safety – not everything in the cloud scales or fails over the way you expect it to at all operating points.
So it is important to continuously do controlled assessments at loads 2x higher than your current peak 6 months before that becomes your reality.
Of course, there is always the live-and-learn-by-fire approach, where you wait until traffic growth (from that flash-sale frenzy your business team sprung on you) topples your stack.
You can create user scenarios in JMeter. It is manual work to set up, but it works fine. You can add think times, randomization, scaling of the number of users, use scripts to generate input data, etc. It's not a simple tool, but it's not a simple problem to solve either.
I never needed WebSocket testing in JMeter but my quick Google shows that it exists.
Can confirm, I've made some pretty extensive scenarios with JMeter and while not exactly fun to do it works great. We integrated "normal" requests with WebSockets in the scenarios.
>Great tool that, when implemented as part of a larger tool stack, makes performance testing very doable.
In the past, I've leveraged JMeter along with other tools at the same time. i.e. Using JMeter to create a baseline level of requests and then using something else to fill in any gaps to make the load feel a little more "real" or to test specific scenarios that were difficult to accomplish w/JMeter.
I've tested a few of these, and even set up multiple VPSs to do load testing at distributed scale, but it's expensive, hard to orchestrate, and the tests take a lot of time to set up. Plus, we don't test often enough for it to make sense...
Now we just use cloud platforms for load testing, there's quite a few and the one we use the most is even free [0]
Good article. It brings up the biggest challenges with load testing. I think the "Maybe don't load test everything?" idea is the solution.
We have to realise that there is no silver bullet. You can spend 101% of your hours on load testing only, and still not achieve a realistic load test scenario. The trick is to determine how much effort to spend creating (and maintaining) your tests, and/or what parts of your app to focus on. Limit the scope. The 80/20 rule applies here, like in many other situations; you get 80% of the positive results (i.e. insights into the performance of your app) by spending 20% of the effort. Simple, unrealistic load tests are a million times better than no load tests at all.
I would like to see load testing done in some kind of 100% deterministic environment. Make a system emulator where everything is fully deterministic, including perfectly timed IO and network characteristics.
Then your load test won't have any random noise from things like an SSD with inconsistent performance, or a CPU that throttles down depending on room temperature.
A zero-noise load test can be much shorter while still giving good results, and means you won't be investigating spurious performance regressions. You can compare traces between tested software versions down to clock-cycle accuracy to really locate issues, rather than just guesstimating.
What is the state of the art in deterministic computation environments?
Naively: I would start with the basics like dedicated heavy-duty hardware with all the ECC and radiation hardening available. Then I would make sure to compute literally everything multiple times and check that the results match. At that point, the execution characteristics of the computation are so dramatically altered from the real world that the test results are unlikely to be useful.
You suggest tactics that stop short of redundant computation. Would that really work? Like, how do you gain perfect confidence in your system emulator? I don’t see how you would fully account for bitflips...
love the Gatling focus on CI/CD (though I'd imagine it's not that simple) -- repeatability and automatic regression testing are as important as testing new architecture
there's value in testing with a live network because your prod system has a live network, but a network conditioner can simulate some crappiness, and it forces you to understand your system; when you can repeat a failure with a conditioner, it's much more likely you've understood the failure
re noise, 100%. Many LTs have unclear acceptance criteria that generally record a 'fail', can only be interpreted by the original author, and may not always predict prod
I'm actually curious about why it makes sense to load test an entire web UI (which is what this post seems to be suggesting).
In my experience, it was enough to load test individual back-end APIs to figure out the load profile for each API and use that to scale your h/w; serving HTML is rarely a bottleneck. This gets complicated if you have a big monolith with a diverse traffic pattern - knowing your auth API can support up to 1k QPS on your existing h/w is not super helpful in that case, for example. There are ways to work around this, but it's kind of complicated to enumerate all the scenarios in a comment.
This still needs some sort of state building, but this should be a lot easier than setting up state using a UI / clicking buttons.
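To illustrate, the kind of per-API test I mean is a fixed request rate against a single endpoint, e.g. with k6's arrival-rate executor; the endpoint, rate and SLA numbers below are made up:

    import http from 'k6/http';
    import { check } from 'k6';

    export const options = {
      scenarios: {
        auth_api_1k_qps: {
          // open model: keep pushing 1000 req/s regardless of response times
          executor: 'constant-arrival-rate',
          rate: 1000,
          timeUnit: '1s',
          duration: '5m',
          preAllocatedVUs: 200,
          maxVUs: 1000,
        },
      },
      thresholds: {
        http_req_duration: ['p(95)<300'], // hypothetical SLA
      },
    };

    export default function () {
      const res = http.post(
        'https://staging.example.com/auth/login', // hypothetical endpoint
        JSON.stringify({ user: 'loadtest', password: 'secret' }),
        { headers: { 'Content-Type': 'application/json' } }
      );
      check(res, { 'status is 200': (r) => r.status === 200 });
    }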
It really depends on what you're trying to test. You can hit the backend to prove stability of those elements, but you can't hit the backend to show what the user will experience while using your website.
Does it change meaningfully under load? As long as the API p50 / p95 latencies are what you expect, the HTML generation / rendering is rarely a problem.
That's a bit handwavey. Experience plays a part here and I've at least never seen a scaling issue where the root cause is not some form of API latency.
Most of the raw HTML generation and rendering is done by frameworks and browsers respectively, and they're pretty standard, efficient pieces of software.
> I've at least never seen a scaling issue where the root cause is not some form of API latency.
I guess we all come at this with our own assumptions. When I think about testing, I'm also thinking about the base experience, not just adding resources and adding load.
At my workplace we're using a modified version of ShadowReader - https://github.com/edmunds/shadowreader; note it is AWS-specific. It works by re-running the load balancer logs, either in near-real time or over a replay window. We use it to test changes to our Solr search engine. It complements our Locust testing, which is more about red-lining our servers and isn't as user-specific.
“You have a few options for getting a great picture of how your system performs:
• Good old analysis.”
I’ve been trying to do that, because it seemed the only realistic method for figuring out how to cost-optimize a cloud setup (should we have more smaller API servers or a few big ones? One or two databases? (When) should we scale down instances in less busy times? ...), but I didn’t get far.
Being mathematically schooled, I went for queueing theory. That isn’t simple, but it's not the main stumbling block; the complexity of cloud infrastructure is. Figuring out how many simultaneous requests AWS load balancers/Kubernetes can handle, what delays they introduce, how they route calls (round-robin/random/...), how large their queues are, and how Kubernetes scales instances up/down is quite a challenge.
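For reference, the textbook pieces themselves are simple; it's mapping them onto load balancers, Kubernetes and autoscaling that's hard. E.g. Little's law and the single-server M/M/1 results I started from:

    \begin{align}
      L    &= \lambda W               && \text{(items in system = arrival rate } \times \text{ time in system)} \\
      \rho &= \frac{\lambda}{\mu}     && \text{(utilization = arrival rate / service rate)} \\
      W    &= \frac{1}{\mu - \lambda} && \text{(mean response time, valid only while } \rho < 1\text{)}
    \end{align}

These tell you roughly how response time blows up as utilization approaches 1, but nothing about queue caps, routing policies or scale-up lag, which is exactly the information I couldn't find.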
So, has anybody collected ballpark numbers for these kinds of things, or hints on what can be ignored in the setup? How do people handle cost optimization of cloud instances?
Or do people just bring up variations on the infrastructure, benchmark them, and, when getting bored/running out of time/having found something that isn’t too expensive, stick to the best configuration found?
I don't think load testing is hard, it just requires time to do properly. Maybe some people think they can get a tool, record a script and hit play. In some cases that might be enough. Once you are concerned about performance testing, you have to do work to get it right.
I usually have a test from the 'front' of the app and separate tests developed for each component (services, DBs, etc.). I do load and stress testing. Stress testing is very valuable as a preventative measure. Depending on what is available, I use different hardware monitoring tools to gather data; sometimes it's OS-specific or a vendor tool like PRTG. On the software side I run profilers for the language the app is using.
I run focused tests - what the 'expected' use will be and general tests that hammer as much as possible.
I have found and addressed problems at all levels. Investigation is the hard part. I find it fun, and you learn a lot about infrastructure configuration. Most hardware problems that show up can usually be remedied by throwing more resources at the system, and often the cost of doing so is lower than diving into a full investigation or software changes. Server configuration is another inexpensive option, though it requires more investigation: if the bottleneck is the number of connections, you have to determine whether it's one of the many web server settings or one of the many lower-level services. I once discovered that a cap on TCP connections in a low-level OS service was limiting performance.
After those you hit software. Modern profilers will catch most issues. With code, I find network latency is the biggest offender. I had an app that took 1.5 seconds round trip; the client wanted it at 500ms or less. The fix was simply being smarter with database requests and keeping a local cache of some data. In the end a cold request took 300ms, with subsequent requests taking approximately 70-100ms.
I could go on but performance testing is a discipline like any other. It's not hard if you put in the time.
No, when I read the article 'hard' seemed to equate to 'took longer than expected'. It takes time, but I think the tools that are out there are easy enough to use.
If you wrote the software you already have a good idea on how it should be tested so it's just learning how to use the tool.
The fun part is the part I said was hard, which is investigating and interpreting the results. One other thing, which you don't have to do, was setting up and coordinating multiple servers to launch the tests and automating the data processing and report generation.
I didn't say that it's easy because it's fun. I said load testing isn't hard - it's just a lot of work. Maybe having to do a lot of work is hard.
This is one of those areas where I will almost always opt for DIY.
And in regards to simulating users, that's not always necessary: if you know what type of load users generate, you can simply simulate that load directly.
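When I do roll my own, a few dozen lines is usually enough. A rough sketch of the closed-loop version (Node 18+ for built-in fetch; the endpoint and numbers are placeholders):

    // Minimal DIY load generator: N concurrent loops hammering one endpoint
    // for a fixed duration, then print rough latency percentiles.
    const TARGET = 'https://staging.example.com/api/resource';
    const CONCURRENCY = 50;
    const DURATION_MS = 60_000;

    async function worker(latencies) {
      const end = Date.now() + DURATION_MS;
      while (Date.now() < end) {
        const start = Date.now();
        try {
          await fetch(TARGET);
          latencies.push(Date.now() - start);
        } catch {
          latencies.push(-1); // mark errors
        }
      }
    }

    (async () => {
      const latencies = [];
      await Promise.all(Array.from({ length: CONCURRENCY }, () => worker(latencies)));
      const ok = latencies.filter((l) => l >= 0).sort((a, b) => a - b);
      console.log(`requests: ${latencies.length}, errors: ${latencies.length - ok.length}`);
      console.log(`p50: ${ok[Math.floor(ok.length * 0.5)]} ms, p95: ${ok[Math.floor(ok.length * 0.95)]} ms`);
    })();

It's a closed loop (each worker waits for the previous response before sending the next), so it understates what an open stream of arrivals does to a slow server, but it's plenty for reproducing the shape of load you already know your users generate.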
A significant problem with traffic replay, in a few cases I've considered in the past, is HTTP POSTs and the use of headers. For most, logging that extra info is not realistic, which rules out traffic replay as a way of simulating realistic load.
Quite often, everyone on the team will already know which endpoint(s) are slow or unscalable. In that case, it is possible to set up a much less realistic load test just to generate some flame charts of the problematic area. That can be more pragmatic than having to maintain what is essentially a new integration test suite, as the author mentions.
Exactly what I just wrote in another comment :) Start with simple, "unrealistic" load tests of suspected hot paths and you're likely to get very good ROI on the hours you spend.
the adage from service level perf testing (you don't know what's really slow/brittle until you measure...) holds equally well at the systems level. Yes, you can probably identify decent areas to improve by guessing, but you'll never find your blind spots...
I've found this to be a significant problem in testing a side project I've been working on for making music together over the Internet (https://echo.jefftk.com). I can load test the backend by simulating requests reasonably easily, but the hard part is all the cases where my simplistic testing is unrealistic.
Repeating what a few others here have said, but k6 has been a really easy tool to use, with great open source tools, and a managed service with pretty competitive pricing. Creating "real world traffic" is a little more tricky, but you can easily spin up a load test for just about anything in 15min.
I remember years and years ago I had a specific application to test, and wrote a small test harness for it.
...and as the application got better I ran out of TCP port numbers (because instead of requests coming from multiple IP addresses, everything came from the single test harness machine, which only has around 64k ephemeral source ports per destination).
I work in the corporate performance test / performance tuning space in a brownfield environment with a lot of co-dependent systems supported by a lot of different teams. My background prior to this was a programmer / DBA.
It's all hard - but it's not any harder than building and testing an end-to-end solution in an equivalent environment - there are a lot of similarities, benefits and also disadvantages.
Overall, "performance testing" provides value in, and often covers, quite a few different areas:
- User Experience (more difficult with SPA and Mobile applications)
- Server Response
- Capacity Planning / Utilization
- Performance Tuning
- Enterprise / Systems Architecture
- Production Support for Performance Problems
- Monitoring (both for testing purposes and planning monitoring for production)
Like programming, most of the time the actual coding/scripting isn't the challenge. Some of the bigger challenges I deal with are:
- Test Data Setup, Availability, Size, Distribution, Consistency, Security, Backup/Restore over multiple systems
- Understanding / Confirming Scenarios
- Complex Application Authentication (more secure authentication methods can be harder to test or work-around)
- Data integrity / security (difficult to use SaaS solutions for regulatory reasons)
- Environments, Sizes, Accuracy (Prod Like), Support (Application and Infrastructure), Monitoring, Limitations, Architecture
- Load Planning/Estimation/Reporting (both in Performance and Production)
- Deployment Planning / Management
- Investigating Performance Problems
- Working with lots of different development teams.
A lot of the recent trends in DevOps / Agile / IaaS / PaaS are helping improve the situation with significant improvements to deployments and scaling.
Good commercial tools and OSS tools are available. The move to web / HTTP-based systems has actually made testing much simpler, because there's really only one protocol to target these days (HTTP); older application platforms using non-HTTP protocols often make things more difficult. There are still some smaller niche areas for industries such as call centres, where there are physical requirements (such as generating call traffic), and for Desktop Virtualization / Remote Access (due to COVID), that don't fit the more common testing patterns. Mobile App performance testing (which is closer to User Experience testing) is also relatively immature compared to SPA and HTTP testing. There are options out there, though.
I also view there being (at least) two different traditional paths into performance testing - Functional Testing (Non Technical) and Developers/Tech Admin (Technical) who bring different strengths and weaknesses to large performance testing efforts.
I'm a fairly big fan of trying to push most performance testing down into individual application teams as they know the technical details of the applications and often have better technical skills than a lot of performance testers. However, you still need cross-system / whole environment perf testing where you have complex interconnected systems and dependencies.
It's difficult trying to balance everything between purely scientific / methodological and pragmatic (just prove it's capable of meeting the volumes) approaches.
[1]: https://github.com/loadimpact/k6
[2]: https://k6.io/docs/test-authoring/recording-a-session/browse...
[3]: https://github.com/loadimpact/har-to-k6