Cost of Delay vs Cost of Failure: a balance of forces

This article presents my understanding of how the Cost of Delay model, in combination with Cost of Failure, can be used to choose between and decompose alternatives.

TLDR: A DevOps team can reap the long-term rewards of minimising Cost of Delay by balancing it against the stabilising force of Cost of Failure.

I went to a presentation on Cost of Delay many months ago; thanks to Ozlem Yuce for presenting it, and to Seb Rose and BCS for organising it.

There is a lack of clarity below over whether we are measuring possible cost or actual benefit, but you’ll just have to live with this until I integrate/understand this further myself! Think of this as a progress report on my understanding. Feedback appreciated, as always.

An aside on numerical interpretations (can be skipped):

I’m currently interested in using Cost of Delay as a qualitative method, partly because I don’t yet fully understand the implications of the quantitative side. However, it is mostly because I am wary of numerical interpretations of methods that affect people (where “people” includes both “employees” and “customers”). As soon as a qualitative measure is represented as a number there is a strong tendency to start doing number-like things with it, e.g. summing, averaging, comparing. This is often done very naively and can lead to pain for both the producer of the number and ultimately the consumer. It’s not that it is impossible to do correctly, or even necessarily hard, but it requires a bit of nuance to keep the number attached to its history, e.g. some measure of uncertainty, like error bounds. To some this may come across as elitist or an over-complication, but to me it’s about having respect for the numbers and the people they ultimately affect.

Cost of Delay

The basic idea behind Cost of Delay is to consider the loss you make as a result of not making an investment. That loss accumulates, and the way it accumulates is some function of time. This is a very bland way of putting it, so I think it deserves a sketch:

The shapes on the right illustrate some of the different ways that the cost function can behave. To pick one example (green rectangle): say you currently have an account on a cloud provider. They have a deal where you can use their services for free for 12 months, but thereafter you have to pay a yearly fee. This highlighted graph shows the cost over the next 12 months (it would look like a step function if extended indefinitely). Another example (blue ellipse) is the classic compound interest on debt: you incur some debt, the interest on the debt becomes part of the debt itself, and so the cost increases exponentially.
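If it helps to make those shapes concrete, here is a small Python sketch of the two examples. All of the numbers (the yearly fee, the interest rate) are invented purely for illustration; only the shape of each curve matters.

```python
# Two of the cost-of-delay shapes sketched above, as simple functions of time.
# Every number here (the fee, the rate) is made up purely for illustration.

def step_cost(months_elapsed, free_months=12, yearly_fee=1000.0):
    """Cloud-provider example: zero cost during the free period, then the cost
    jumps by a yearly fee for each further year of delay (a step function)."""
    if months_elapsed <= free_months:
        return 0.0
    years_charged = -(-(months_elapsed - free_months) // 12)  # round up to whole years
    return years_charged * yearly_fee

def compound_debt_cost(months_elapsed, principal=1000.0, monthly_rate=0.02):
    """Compound-interest example: interest is folded back into the debt,
    so the accumulated cost grows exponentially with time."""
    return principal * ((1 + monthly_rate) ** months_elapsed - 1)

for month in (0, 6, 12, 18, 24, 36):
    print(f"month {month:2d}: step = {step_cost(month):6.0f}, compound = {compound_debt_cost(month):7.2f}")
```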

I hope you get the gist, but let’s try an example closer to home. You have a great new idea you are working on, but you know it could take months to make it perfect, and you are not totally sure of its eventual worth. You think there may be a simpler idea you can ship pretty-much immediately, but it’s a bit crap in comparison. Let’s call these choices A (“definitely ok but available now”) vs B (“probably great, but not available for a while”):

Maybe I just think more in terms of functions, but to me the diagram on the left is already helpful in forcing us to think about choices. So, what are some of the possible futures, based on this formulation? Note that now we are talking about possible futures, so these costs are turned into predicted benefits. From left to right:

  1. B2 is a tweak on B which has higher value than A (a larger gradient), though not as much as B, but which is shippable now. We never come back and do B, because B2 was good enough.
  2. We go ahead and ship A, and let it make some money in production, whilst we work on B. B replaces A when it is ready, and it turns out to be great.
  3. This is like 2, except another more important priority comes up and we don’t do B. Regardless, we’ll have made money on A in the meantime.
  4. We decide to immediately do an experiment on B, but on a cut-down version which will only work for two weeks. It turns out it wasn’t so great (perhaps customers don’t like it). We end up going with A after all.

There are many more choices or combinations than these, but I’ll leave you to come up with those yourself. My point is that thinking this way, even just diagrammatically, allows us to compare these various choices against each other, and provides a means to communicate these choices to others.
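To show what such a comparison might look like with numbers attached, here is a toy sketch of futures 2 and 3: A earns value immediately at a modest rate, B earns at a higher rate but only once it ships. The rates and the delay are assumptions I have plucked out of the air, not measurements.

```python
# A toy comparison of accrued value for 'A only', 'B only', and 'ship A, then
# replace it with B' (futures 2 and 3 above). Rates and delay are invented.

def cumulative_value(weeks, rate_per_week, available_from_week=0):
    """Value accrued by a given week, assuming a constant rate once shipped."""
    return max(0, weeks - available_from_week) * rate_per_week

A_RATE = 1.0        # A is "definitely ok": modest value per week, available now
B_RATE = 3.0        # B is "probably great": three times the value per week...
B_DELAY_WEEKS = 12  # ...but it only starts earning after a 12-week build

for week in (4, 8, 12, 16, 20, 24):
    a_only = cumulative_value(week, A_RATE)
    b_only = cumulative_value(week, B_RATE, available_from_week=B_DELAY_WEEKS)
    a_then_b = cumulative_value(min(week, B_DELAY_WEEKS), A_RATE) + b_only
    print(f"week {week:2d}: A only = {a_only:5.1f}  B only = {b_only:5.1f}  A then B = {a_then_b:5.1f}")
```

The particular numbers don’t matter; the point is that “A then B” never does worse than either option on its own, which is future 2 in a nutshell.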

Great, is this all we need to consider? No, but before we go further, let’s call out one aspect I am explicitly not emphasising here: waste. One of the things that can hold you back from shipping an A followed by a B (future 2 above) is that option A is work that you plan to discard later. There is no way round this; it’s definitely the plan. However, we are not optimising to minimise total work done. If anything, we are optimising for expected total value accrued over time. It’s not that efficiency is unimportant, but more that it should not be considered first. I could write more on this but not here; perhaps another blog post :-)

Ok, back to Cost of Delay: there is some risk in these choices which we are not considering, which brings us to Cost of Failure.

Cost of Failure

A very simple model of the Cost of Failure consists of:

  • “probability”, or “time till failure”: how long from now until it fails
  • “impact”: when it does fail, how bad is it?
  • “time to recover”: once you’ve identified it has failed, how quickly can it be rectified or limited in cost?

Variation in each of these produces different models of cost, which dictate how the cost accrues over time.

Here, cost may be measured in units of ‘downtime’, in actual money lost, or some other measure. These formulations are still qualitative, and the ‘cost’ in ‘Cost of Delay’ is not exactly the same as the ‘cost’ above.
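As a rough sketch, those three ingredients can be turned into an accrued-cost function. The version below assumes a single failure with a flat impact rate until recovery; the units and numbers are invented for illustration.

```python
# A minimal sketch of the simple Cost of Failure model: nothing accrues before
# the failure, then the impact accumulates until recovery, then it levels off.
# Units are deliberately vague: they could be downtime, money, or anything else.

def accrued_failure_cost(days_elapsed, time_to_failure, impact_per_day, time_to_recover):
    """Cost accrued by a given day under the three-part model above."""
    if days_elapsed <= time_to_failure:
        return 0.0
    days_failing = min(days_elapsed - time_to_failure, time_to_recover)
    return days_failing * impact_per_day

# Example: a failure expected around day 10, costing 500/day, taking 3 days to fix.
for day in (5, 10, 11, 12, 13, 14, 20):
    cost = accrued_failure_cost(day, time_to_failure=10, impact_per_day=500, time_to_recover=3)
    print(f"day {day:2d}: accrued cost = {cost:6.0f}")
```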

This becomes relevant when you see it as the flip-side of a choice that you want to make for Cost of Delay reasons. For example, consider option 4 above, where we want to do an experiment over two weeks on B to measure its value. If the failure curve looks like an escalation curve (red rectangle above, for example a small increase in site latency suddenly turns into an outage) then we won’t take the risk. However, if we can make it auto-recover (green rectangle above, for example an auto-restart of a service to clear memory leaks) then we may be happy with that for the duration of the experiment. The key aspect here is that the same team, or group of decision makers, must be willing and able to take both types of cost into account.
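To make that trade-off concrete, here is a back-of-the-envelope sketch of the two-week experiment: compare the expected failure cost over the experiment against the value of learning about B early. Every figure here (the failure probability, the impact, the value of the learning) is an assumption made up for this sketch, not something the model hands you.

```python
# Back-of-the-envelope check for the two-week experiment on B: run it only if
# the expected cost of failures is small compared with the value of the learning.
# All figures below are assumptions made up for this sketch.

EXPERIMENT_DAYS = 14
VALUE_OF_LEARNING = 10_000  # assumed worth of knowing B's real value two weeks sooner

def expected_failure_cost(prob_failure_per_day, impact_per_hour, hours_to_recover):
    """Expected cost over the experiment, assuming a flat impact rate until recovery."""
    expected_failures = prob_failure_per_day * EXPERIMENT_DAYS
    return expected_failures * impact_per_hour * hours_to_recover

escalating = expected_failure_cost(prob_failure_per_day=0.1, impact_per_hour=2_000, hours_to_recover=24)
auto_recover = expected_failure_cost(prob_failure_per_day=0.1, impact_per_hour=2_000, hours_to_recover=0.25)

print(f"escalating outage: {escalating:8.0f} -> run the experiment? {escalating < VALUE_OF_LEARNING}")
print(f"auto-recovery:     {auto_recover:8.0f} -> run the experiment? {auto_recover < VALUE_OF_LEARNING}")
```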

The great thing about DevOps teams, if you’ve set them up properly, is that you have awareness of both sides, and can choose the best compromise at all times; you see the pain and the gain. If you only see one side, then there is an imbalance.

In summary: A DevOps team can reap the long-term rewards of minimising Cost of Delay by balancing it against the stabilising force of Cost of Failure.