Web Performance Days, New York: "Load Testing" Open Spaces Session

One of the best hours I spent in New York was in an open spaces session of Web Performance Days, which followed on after Velocity NY. It was on "Load Testing", was led by Alexander Podelko, and had a few other contributors, including myself.

I'm writing this partly for my own benefit because it was one of the few sessions where I didn't take any notes straight after, and I want to brain dump whilst it's still relatively fresh. Below, what I remember us saying is in normal text whilst my follow-on thoughts and commentary are in [italics].

Load Testing can be broken down into three kinds: Smoke Testing, Regression Testing, and Scalability Testing [aka: 'Did I just break the build?', 'Will this affect production?', and 'Oh frack, we're on Reddit, what now?'].

[We didn't talk much about Smoke Testing, but ...] Smoke Testing is all about doing enough to sanity-check that you haven't done something incredibly stupid. [In the context of Load Testing, this means hitting the system under test with enough load and enough variety of requests to trigger obviously bad behaviour. For example, for a search service this could be submitting a mix of query difficulties over a short period of time, and asserting that the response time is always below some threshold. The simplest way to derive a synthetic set of tests is to run your integration tests repeatedly.]
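
[As a very rough sketch of what that search-service smoke test might look like: the /search endpoint, the query mix, the latency threshold, and the concurrency below are all placeholders I've made up for illustration, not anything agreed in the session.]

```python
# Rough smoke-test sketch: hit a (hypothetical) search endpoint with a mix of
# easy and hard queries, concurrently, and fail if any response is slower than
# an "obviously bad" threshold. URL, queries, and threshold are placeholders.
import concurrent.futures
import time
import urllib.parse
import urllib.request

SEARCH_URL = "http://localhost:8080/search"  # hypothetical system under test
THRESHOLD_SECONDS = 0.5                      # the "obviously bad" cut-off
QUERIES = ["cat", "cat AND dog", '"exact phrase" -excluded']  # easy -> hard

def timed_query(query):
    """Issue one search request and return (query, elapsed seconds)."""
    url = SEARCH_URL + "?" + urllib.parse.urlencode({"q": query})
    start = time.monotonic()
    with urllib.request.urlopen(url, timeout=THRESHOLD_SECONDS * 4) as resp:
        resp.read()
    return query, time.monotonic() - start

def smoke_test(repeats=20, workers=5):
    """Fire the query mix concurrently and fail loudly on any slow response."""
    with concurrent.futures.ThreadPoolExecutor(max_workers=workers) as pool:
        for query, elapsed in pool.map(timed_query, QUERIES * repeats):
            assert elapsed < THRESHOLD_SECONDS, (
                "query %r took %.3fs (threshold %.3fs)"
                % (query, elapsed, THRESHOLD_SECONDS))

if __name__ == "__main__":
    smoke_test()
    print("smoke test passed")
```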

How you go about Regression Testing depends a lot on how you do releases. If you have big-bang releases where the time between them is large, e.g. weeks, then Regression Testing is more about modelling and/or testing your system under expected inputs, and seeing how it responds.

If you have releases that take hours or days, then you have the luxury of up-to-date samples of system behaviour, but very little time to understand the model that underlies it all. [It wasn't stated explicitly, but I think there was also general skepticism that straightforward models even exist.] Given this context, it's possible to think of your system's performance as one of the many measures in an A/B test, and to skip Regression Testing. As long as you have a way to quickly roll back your changes and have bounded your possible losses as part of normal A/B test procedure, this can be a safe way to test in production. If full A/B test rigour is not required, then an even simpler version is to have a canary (aka staging) box in production, which is examined for performance degradations over time.
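
[A minimal sketch of what the canary check might look like, assuming you can pull recent latency samples for the canary box and for the rest of the fleet out of your metrics system; the percentile, the margin, and the numbers below are all made up.]

```python
# Minimal canary-check sketch: compare the canary's tail latency against the
# rest of the fleet and flag the release if it has drifted by more than an
# agreed margin. The samples here are invented; in reality they'd come from
# your metrics system.

def percentile(samples, pct):
    """Nearest-rank percentile of a list of latency samples (seconds)."""
    ordered = sorted(samples)
    index = max(0, int(round(pct / 100.0 * len(ordered))) - 1)
    return ordered[index]

def canary_degraded(canary_latencies, fleet_latencies, pct=95, margin=1.2):
    """True if the canary's p95 is more than `margin` times the fleet's p95."""
    return percentile(canary_latencies, pct) > margin * percentile(fleet_latencies, pct)

# Made-up numbers: the canary is noticeably slower at the tail.
fleet = [0.10, 0.12, 0.11, 0.15, 0.13, 0.14, 0.12, 0.16, 0.11, 0.13]
canary = [0.11, 0.13, 0.12, 0.30, 0.14, 0.28, 0.13, 0.35, 0.12, 0.14]
if canary_degraded(canary, fleet):
    print("performance degradation on the canary -- hold the rollout")
```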

Another technique blurs the Smoke/Regression test distinction: replaying a subset of your real traffic against a pre-production system and checking for performance regressions against what production did with the same input. [This can be automated to an extent by comparing summary statistics, e.g. differences in standard deviations.] If your system under test can't be fully isolated from outside influence, which is typical, then you'll have to allow for some skew between production and test. [For example, if you're storing information about a customer, then there will typically be more of it later on, so the performance of anything based on it will be different.]
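
[A rough sketch of that automated comparison, assuming you've recorded response times for the same replayed requests from both production and the pre-production system; the samples and the slack allowed for prod/test skew are invented for illustration.]

```python
# Rough sketch of the replay comparison: summarise production's response times
# and the pre-production replay's, and complain if the replay is slower or more
# variable than an agreed allowance for prod/test skew. Numbers are made up.
from statistics import mean, stdev

def summarise(samples):
    """Summary statistics for a list of response times (seconds)."""
    return {"mean": mean(samples), "stdev": stdev(samples)}

def regression_suspected(prod_samples, test_samples, mean_slack=1.15, stdev_slack=1.5):
    """Flag the replay if its mean or spread grows beyond the agreed slack."""
    prod, test = summarise(prod_samples), summarise(test_samples)
    return (test["mean"] > mean_slack * prod["mean"]
            or test["stdev"] > stdev_slack * prod["stdev"])

# Made-up timings for the same replayed requests.
prod_times = [0.20, 0.22, 0.21, 0.25, 0.23, 0.24, 0.22, 0.26]
test_times = [0.21, 0.24, 0.23, 0.40, 0.26, 0.38, 0.24, 0.45]
if regression_suspected(prod_times, test_times):
    print("replay looks slower than production did with the same input")
```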

[If you want to minimise the effect of the skew, then you could replay the traffic against two non-prod systems, one running production code and the other running the new code. However, this has a lot of setup cost and isn't too dissimilar from an A/B test, or a canary in production, so why not just do that?] Additionally, each call replayed may have a real monetary cost, over and above the hardware required, which could become significant if calls are replayed in bulk.

[Replay testing may be justified if what you're testing isn't ready for production for non-performance reasons. For example, if you're re-implementing an API using a different back-end, but it isn't functionally complete yet. In these cases, it's desirable to treat the performance of the existing system as a feature that has to be maintained in parallel with any features added.]

[We didn't talk much about Scalability Testing, which is what people normally associate with Load Testing, unless I missed something?]

[Here endeth the braindump. If you were at the meet-up and any of the above is incomplete, misleading, or just plain wrong, then please let me know and I'll update it.]