Monday 2 April 2007

Handling Failure Gracefully

Prevention is better than cure, but face facts: it's going to happen at some point. The severity and the causes will vary, but you need to know about an incident before the important people (your customers) do, and you need a plan to handle it. That plan might be a software solution or a business-process solution, but to plan effectively you must know what can fail, so that you can prepare for it and, ideally, prevent it in the first place.

Commonly ‘Combustible’ Infrastructure components

We should briefly consider what could go wrong to get an idea of the breadth of planning we are talking about here:

- Web site connectivity
  - Server outage
  - Hardware failure
  - Malicious activity
- Software errors
  - Presentation layer
    - Dropped images
    - Missing links
  - Back end
    - Bugs

You will mostly know about these issues when they occur, although you may not be aware of malicious activity while it is happening. The 'best' intrusion is one carried out without the knowledge of the system owner. When did you last have a penetration test on your system, by the way?

Having started the list above, to formalise our planning we should explore exhaustively how our website can fail, what those failures mean, and how to mitigate the risk through planning and systems design. This is not a new concept, as we shall now see.

Failure Mode and Effects Analysis

Wikipedia’s definition of FMEA gives an adequate introduction to the exercise we need to embark on. Our aim is to consider what can go wrong from the major to the trivial, how likely the failure is and how likely the detection. This will require input from all stakeholders in the website from designers to DBAs. The input from these parties should read very much like a test plan:

Given input 'A', a system actor should see behaviour 'B' but may see behaviours 'C', 'D' or 'E', where the list of potential 'other' outcomes is what we are interested in.
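Stakeholder input in that shape can be captured as simple structured records and then flattened into the universe of failures we want to score. A minimal sketch, with entirely hypothetical inputs and outcomes invented for illustration:

```python
# Hypothetical sketch: record "given input A, expect B, but may see
# C, D or E" statements from each stakeholder as structured data.
failure_modes = {
    "submit_order_form": {
        "expected": "order confirmation page",
        "other_outcomes": [
            "validation error lost on redirect",
            "duplicate order on double-submit",
            "timeout with no confirmation email",
        ],
    },
}

def universe_of_failures(modes):
    """Flatten every 'other' outcome into one list ready for FMEA scoring."""
    return [(action, outcome)
            for action, spec in modes.items()
            for outcome in spec["other_outcomes"]]

for action, outcome in universe_of_failures(failure_modes):
    print(f"{action}: {outcome}")
```

The point of the flat list is that each (action, outcome) pair becomes one row to score in the FMEA exercise, regardless of which stakeholder contributed it.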

Deriving the universe of failures from a global test plan is possible during the initial stages, but be aware that a test plan guides you towards what should happen rather than what might happen. Bear in mind that we are not dealing with multiple-subsystem failure modes here; perhaps a future article, though?

In their best-selling book Freakonomics, Steven Levitt and Stephen Dubner discuss human risk perception, highlighting situations where the perceived hazard is low but the outrage at a failure is high, resulting in a negative overreaction. In short, people react worst to failures that should not have happened. Failures people expect (because they happen often) provoke an under-reaction instead, since desensitisation diminishes the perceived hazard of them. Conversely, success does not always produce excessive delight and joy for users.

The key is to normalise the actual and perceived severity of failures so that we can accurately calculate the Risk Priority Number (RPN). An unlikely event such as hacking, given the difficulty of detecting it and its likely disastrous results, scores highly in our risk profile. A significant but visible event such as the company logo being absent scores fairly low given the ease of detection, despite the strong reaction such a failure would provoke. Frequent failures such as images not loading score lower still, as they are also easily detected and fixed.
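The standard FMEA scoring can be sketched in a few lines. RPN is the product of severity, occurrence and detection ratings, each conventionally on a 1–10 scale, where detection is scored high when a failure is hard to detect. The scores below are invented purely to illustrate the ranking described above:

```python
# Illustrative FMEA Risk Priority Number scoring.
# RPN = severity x occurrence x detection; each rating is 1-10,
# and detection is HIGH when the failure is hard to detect.
# All scores here are made up for illustration.
failure_modes = [
    # (name, severity, occurrence, detection)
    ("site hacked",         9, 2, 9),  # rare, disastrous, hard to spot
    ("company logo absent", 6, 3, 2),  # dramatic but quickly noticed
    ("images not loading",  2, 6, 2),  # frequent, trivial to detect and fix
]

def rpn(severity, occurrence, detection):
    """Risk Priority Number for one failure mode."""
    return severity * occurrence * detection

ranked = sorted(failure_modes, key=lambda m: rpn(*m[1:]), reverse=True)
for name, s, o, d in ranked:
    print(f"{name}: RPN = {rpn(s, o, d)}")
```

With these example ratings, hacking dominates the risk profile (RPN 162) while the cosmetic failures trail well behind, matching the intuition above.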

Avoiding failures – even graceful ones

We could stop here and build plans to reduce the severity and frequency of failure modes and to increase our ability to detect them, but we are striving for prevention rather than cure; any fire is a bad one:

“A large safety factor does not necessarily translate into a reliable product. Instead, it often leads to an over-designed product with reliability problems."

Failure Analysis Beats Murphy's Law
Mechanical Engineering, September 1993

We will explore four avenues in searching for a solution:

- Self-aware systems reduce the effects of failure by increasing the likelihood of detection.
- Redundant systems reduce the likelihood of failure through 'belt and braces' style precautions.
- Fully tested systems reduce the likelihood of failure by catching faults before they go live.
- Modularity, high cohesion and loose coupling reduce the number of failure modes by reducing system interdependencies.
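The first avenue, self-awareness, is the easiest to sketch: a system that periodically checks its own components will usually find a failure before a customer does. A minimal example under assumed conditions, where `check_disk_space` is a real check and the failing lambda is a hypothetical stand-in for any broken component:

```python
# Sketch of a "self-aware" system: named health checks that raise
# the likelihood of detecting a failure before customers notice it.
import shutil

def check_disk_space(path="/", min_free_bytes=100 * 1024 * 1024):
    """True if the filesystem holding `path` has enough free space."""
    return shutil.disk_usage(path).free >= min_free_bytes

def run_checks(checks):
    """Run every named check and collect pass/fail results."""
    return {name: fn() for name, fn in checks.items()}

results = run_checks({
    "disk_space": check_disk_space,
    "broken_component_demo": lambda: False,  # hypothetical failing check
})
for name, ok in results.items():
    print(f"{name}: {'OK' if ok else 'FAILED - alert the on-call team'}")
```

In practice the results would feed an alerting channel rather than stdout, so that the people who need to act hear about the incident first.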
