Software failures are systematic, not random

Posted May 15, 2006 in misc

Just read an article by Steven R. Rakitin in the latest issue of IEEE Computer (April 2006), titled “Coping with Defective Software in Medical Devices”.

The article gives a good overview of how to use fault tree analysis and a tool called “failure modes, effects and criticality analysis”. The latter is a table with events like “memory leak” and, for each event, information about the risk and possible mitigations to reduce its probability.
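To make the shape of such a table concrete, here is a minimal sketch in Python; the rows and field names are my own illustrations, not taken from the article:

```python
from dataclasses import dataclass

@dataclass
class FailureMode:
    event: str        # what can go wrong
    severity: str     # worst-case effect if it does
    mitigation: str   # measure that reduces the probability

# Illustrative rows, in the spirit of the table described above.
fmeca = [
    FailureMode("memory leak", "device lockup during use",
                "watchdog reset plus periodic heap audit"),
    FailureMode("sensor value out of range", "wrong output computed",
                "range-check inputs and fall back to a safe state"),
]

for row in fmeca:
    print(f"{row.event}: {row.severity} -> {row.mitigation}")
```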

But what I like most is something taken from the AAMI TIR32 notes that I’ve used as the title of this entry. The author draws the conclusion that software risk management should “focus on severity, not probability of occurrence”.

In my experience, trying to attach a probability to a bug doesn’t work very well: “how likely is this crash to happen?” is not an easy question to answer meaningfully in real life. You could also try to quantify the risk as revenue lost due to the bug, but that again rests on a difficult-to-guess probability of occurrence.

I’ve found it much more useful to use the existence or absence of a workaround as the criterion. This is less fine-grained, but more helpful. As an example, a crash that causes data loss has no clear workaround (beyond the customer keeping regular backups) and is therefore worse than a non-lossy crash that merely causes some user annoyance.
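The triage rule above can be sketched as a small function; the severity labels and parameter names are hypothetical, just to show the idea of ranking by workaround rather than by guessed probability:

```python
# Hypothetical triage helper: severity is driven by whether a
# workaround exists, not by an estimated probability of occurrence.

def triage(causes_data_loss: bool, has_workaround: bool) -> str:
    """Rank a defect by severity using the workaround criterion."""
    if causes_data_loss and not has_workaround:
        return "critical"   # no recovery path beyond backups
    if not has_workaround:
        return "major"      # users are blocked with nothing they can do
    return "minor"          # annoying, but a workaround exists

# A data-losing crash outranks a non-lossy crash that has a workaround.
print(triage(causes_data_loss=True, has_workaround=False))   # critical
print(triage(causes_data_loss=False, has_workaround=True))   # minor
```

The coarseness is deliberate: three buckets are easy to agree on in a bug triage meeting, whereas probability estimates invite endless debate.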