Monday 23 April 2007

Kate is my hero

I am going to run the Flora London Marathon. It is without doubt the biggest feat for any amateur athlete, and my friend Kate did it this weekend. The event was widely recognised as one of the hardest of recent years - if not ever - so all the more credit to Kate and anyone else who did it.

From a spectating point of view, the atmosphere is awesome - a great day out. I recommend that budding supporters take some comfy shoes, buckets of patience and some throat-soothing medication... Be prepared for lousy transport, idiot tourist types (if you are used to weekday travel in London, watch out!) and lots of vociferous support for all the worthy athletes - it makes a huge difference. Don't just cheer your friends who take part though - cheer everyone. Random cheering of strangers may seem odd but it's all the more uplifting to receive support from someone you have never met. Believe me, when you see these folk at the finish and say 'saw you running, well done!' it means a huge amount.

To all those who took part - WELL DONE!!

I seriously hope to be there next year.

Wednesday 11 April 2007

More on avoiding failure

Self-aware systems

Homo sapiens is a species capable of self-awareness through, for example, constant health status checking. Indeed, as Google offers more and more information on the myriad things that can go wrong with the Homo sapiens ‘infrastructure’, hypochondriacs are turning into cyberchondriacs. We monitor ourselves and correct our own failure modes. Computer systems are a long way from replicating this particular human trait, but a facsimile of the process is possible. The starting point is, of course, the output from an FMEA exercise.

Leaping ahead slightly, in order to introduce the concept of self-awareness we will assume our system is modular in design. We can then utilise a centralised ‘health’ module to receive health reports from all other modules. This health module or ‘subsystem’ can tell administrators what is happening system-wide with respect to the various potential failure modes of the other sub-systems.

Quis custodiet ipsos custodes?

“Who watches the watchers?” or, more correctly, “Who is set to protect those who are themselves protectors?” The health system itself needs to be healthy, but we could end up tying ourselves in knots. A system to watch the health system? Maybe not, but we can utilise the operating system or third-party monitoring tools to ensure the bare minimum of key system components are alive. At some point either a human needs to be notified of a failure mode in order to act on it, or a third-party monitoring system can restart a process. The process restart obviously has a time limit and a retry limit, and breaching either invokes human intervention.
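To make the restart-with-limits idea concrete, here is a rough sketch in Python (the process path, retry limit and time window are invented for illustration) of a watchdog that restarts a dead process a limited number of times before giving up and shouting for a human:

    # Minimal watchdog sketch: restart a monitored process a limited number of
    # times within a time window; beyond that, escalate to a human.
    import subprocess
    import time

    MAX_RETRIES = 3          # hypothetical retry limit
    RETRY_WINDOW = 600       # seconds; hypothetical time limit
    CHECK_INTERVAL = 30      # how often we poll the process

    def notify_admin(message):
        # Placeholder: in practice this would page or email an administrator.
        print("ADMIN ALERT:", message)

    def start_process():
        # Hypothetical command for the process being watched.
        return subprocess.Popen(["/usr/local/bin/health-subsystem"])

    proc = start_process()
    restarts = []            # timestamps of recent restarts

    while True:
        time.sleep(CHECK_INTERVAL)
        if proc.poll() is None:
            continue         # still running, nothing to do
        now = time.time()
        restarts = [t for t in restarts if now - t < RETRY_WINDOW]
        if len(restarts) >= MAX_RETRIES:
            notify_admin("process keeps dying; human intervention required")
            break
        restarts.append(now)
        proc = start_process()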

To illustrate the monitoring point we shall assume each module exists on a separate system node. This is more for clarity than realism: such a layout is perfectly possible, but efficiency and speed may not be optimal.



Notice that the lines of communication are bidirectional, carrying both the health status of each subsystem and configuration information from the health system back to the sub-system(s). Administrators can alter system environmental factors in real time and know that all subsystems will receive this information the next time they check in with their health status.
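As an illustration of that bidirectional exchange - and this is only a sketch, with the endpoint and field names made up - a subsystem check-in might send its status and pick up any new configuration from the reply:

    # Sketch of a subsystem check-in: health status flows up to the health
    # system, configuration flows back down in the reply. The endpoint and
    # field names are illustrative assumptions, not a prescribed format.
    import json
    import urllib.request

    HEALTH_URL = "http://health.example.internal/checkin"   # hypothetical
    current_config = {"version": 0}                          # last config seen

    def check_in(node_name, status):
        report = {"node": node_name,
                  "status": status,
                  "config_version": current_config["version"]}
        request = urllib.request.Request(
            HEALTH_URL,
            data=json.dumps(report).encode("utf-8"),
            headers={"Content-Type": "application/json"})
        with urllib.request.urlopen(request) as response:
            reply = json.loads(response.read().decode("utf-8"))
        # If administrators have changed anything, the new settings ride back
        # on the reply and take effect at the next opportunity.
        if reply.get("config_version", 0) > current_config["version"]:
            current_config.update(reply["config"])
            current_config["version"] = reply["config_version"]
        return reply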

So, consider the UML sequence diagram below. It describes how a system starts up and discovers its environment, becoming aware of all nodes; how, over time, health reports are sent; how the search process is restarted; and how new configurations are disseminated and used:



Health reports can take a number of forms and can vary in sophistication. At the most basic level the health report is a simple ‘ping’: just getting in touch with the health system is enough to signify that all is well. The server can log the sender, date and time of each ping to track who is late and who is okay.
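On the health-system side, the 'ping is enough' approach needs little more than a timestamp per sender. A minimal sketch (the node names and lateness threshold are illustrative):

    # Sketch: record the time of each ping and report which senders are overdue.
    import time

    EXPECTED_INTERVAL = 60        # seconds between pings; illustrative value
    last_seen = {}                # node name -> timestamp of last ping

    def record_ping(node_name):
        last_seen[node_name] = time.time()

    def late_nodes():
        now = time.time()
        return [node for node, seen in last_seen.items()
                if now - seen > EXPECTED_INTERVAL]

    # Example usage
    record_ping("web-01")
    record_ping("db-01")
    print(late_nodes())           # [] while everyone has pinged recently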

The ping payload can carry further data from the sender, flagging failure modes such as full or nearly full disks, heavy resource utilisation or stopped processes. These messages are flagged to administrators via dashboards for their attention, or dealt with by the health system immediately.
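On the sending side, a slightly richer ping might flag a couple of failure modes directly. A sketch, assuming a Unix-like node and entirely arbitrary thresholds:

    # Sketch: build a ping payload that flags a few common failure modes.
    import os
    import shutil

    def build_payload(node_name):
        total, used, free = shutil.disk_usage("/")
        load_1min = os.getloadavg()[0]               # Unix only
        return {
            "node": node_name,
            "flags": {
                "disk_near_full": free / total < 0.10,   # under 10% free (arbitrary)
                "high_load": load_1min > 4.0,            # arbitrary threshold
            },
        }

    print(build_payload("web-01"))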

At the opposite end of the spectrum is the full and complete health report at each ping. Including full diagnostics and logs ensures administrators are aware of system behaviour, but such a payload can carry a heavy price: flooding systems with large amounts of data can impact performance and exacerbate the very issues being reported. Common sense dictates that this kind of diagnostic should only be turned on, via the health system, when it is absolutely necessary.

Redundant Systems

From Wikipedia: redundancy in engineering is the duplication of critical components of a system with the intention of increasing the reliability of the system. This is exactly what we need to achieve, but quickly, easily and cost-effectively. Ideally we can combine increased redundancy with extra power through horizontal scaling - that's adding more machines rather than adding more power to existing machines - and design redundancy into our systems through good modular system design.

Horizontal Scaling

Expanding on the idea of nodes being able to discover their environment and talk to centralised resources, we can harness that idea for enhanced system scalability. Ideally we would wheel in a new server (or set of servers), turn it on and let the new node(s) find out who they are and what they should do from a brief conversation with a centralised configuration server. As a useful aside, building new test environments becomes a total no-brainer...
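To sketch the 'wheel it in and switch it on' idea (the configuration server address, fields and role names are all invented for illustration), a freshly cloned node might bootstrap itself like this:

    # Sketch: a new node asks a central configuration server who it is and what
    # it should run. Server address, fields and role names are illustrative.
    import json
    import socket
    import urllib.request

    CONFIG_URL = "http://config.example.internal/register"   # hypothetical

    def bootstrap():
        identity = {"hostname": socket.gethostname()}
        request = urllib.request.Request(
            CONFIG_URL,
            data=json.dumps(identity).encode("utf-8"),
            headers={"Content-Type": "application/json"})
        with urllib.request.urlopen(request) as response:
            assignment = json.loads(response.read().decode("utf-8"))
        # e.g. {"role": "web", "peers": ["web-01", "web-02"], "health_url": "..."}
        return assignment

    # assignment = bootstrap()   # called once at start-up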

As far as business owners are concerned, horizontal scalability can answer a number of thorny issues around the ease of performance tuning, handling seasonal load and the associated costs.

As far as systems administrators and software architects are concerned, there are further gnarly issues to consider as scaling occurs. With an increase in nodes comes an increase in inter-node communication. There is a finite amount of bandwidth and processing power available to each node, and a threshold will be crossed at some point... hopefully a long way away!
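To put a rough number on that growth: if every node chats to every other node, the number of communication paths grows quadratically, as this quick calculation shows (a worst case, assuming full-mesh chatter):

    # Pairwise links in a full mesh of n nodes: n * (n - 1) / 2
    for n in (4, 10, 50, 100):
        print(n, "nodes ->", n * (n - 1) // 2, "links")
    # 4 nodes -> 6 links, 10 -> 45, 50 -> 1225, 100 -> 4950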

As far as systems designers are concerned, systems need to be happy running on cloned nodes. A system that requires some processes on some nodes and other processes elsewhere is going to be a major headache to scale and support.

Modularity

Good systems can be built such that the main components can be separated across cloned hardware nodes. System components or modules can be plugged together easily without creating interdependencies - a quality known as high cohesion with low coupling. Cohesive modules keep working when decoupled and can indeed report when decoupling happens. Modular systems are easily replicable (scaling horizontally) and exhibit the self-awareness we require, supporting a buffet-style architecture in which system owners build their system's behaviour by coupling whichever of the available components are appropriate.
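As a loose sketch of that buffet idea (the module names and interface are invented for illustration), modules can share a common interface so that system owners compose behaviour from whichever components are appropriate, without the modules referencing one another directly:

    # Sketch: loosely coupled modules behind a common interface. Owners pick the
    # components they need; modules do not reference one another directly.
    class Module:
        name = "base"
        def health(self):
            return {"module": self.name, "ok": True}

    class SearchModule(Module):
        name = "search"

    class BasketModule(Module):
        name = "basket"

    class HealthSystem:
        def __init__(self, modules):
            self.modules = modules           # composed, not hard-wired
        def report(self):
            return [m.health() for m in self.modules]

    # Build the system from the chosen components.
    system = HealthSystem([SearchModule(), BasketModule()])
    print(system.report())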

Tested Systems

It seems obvious to suggest that a properly tested system is more resilient, but testing is often overlooked for a number of reasons. Of course, knowing what to test is as much of a challenge as an FMEA exercise, but FMEA and test generation are complementary: forcing stakeholders to consider what can go wrong, and ensuring measures are taken to prevent failure modes, means a comprehensive failure matrix is built beforehand.

Considering when testing should take place is a key question. Avoiding the ‘boring’ task for as long as possible is more than just delaying donkey work: testing just before go-live is asking for a bad day at the office. Later testing significantly increases the overall effort required to achieve a reliable system, as well as increasing the risk associated with a new roll-out. Dr. Khaled El Emam estimates that reworking software can account for as much as 80% of total development costs.

Testing is an integral part of building a system. Test Driven Development requires that tests are written before the software; indeed, the software is written to pass the tests. This way each unit of software can be demonstrated to meet its requirements, reducing the risk and severity of failure modes before integration.
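A minimal illustration of that order of work, using Python's unittest and a made-up function: the tests are written first and fail, then just enough code is written to make them pass.

    # Tests written first; the price calculation below exists only to satisfy them.
    import unittest

    def basket_total(prices, discount=0.0):
        return round(sum(prices) * (1.0 - discount), 2)

    class BasketTotalTest(unittest.TestCase):
        def test_total_with_discount(self):
            self.assertEqual(basket_total([10.0, 5.0], discount=0.1), 13.5)

        def test_empty_basket(self):
            self.assertEqual(basket_total([]), 0.0)

    if __name__ == "__main__":
        unittest.main()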

Integration is also an ongoing exercise. We advocate the employment of Continuous Integration, as described by Martin Fowler on his wiki-cum-blog, in order to prevent ‘blind spots’ in the system.

Considering FMEA and test creation, ALAC (Act Like A Customer) is a suitable methodology for finding a high-risk subset of failure modes. Customers generally won't find all potential failure modes, but they will find the most common. A useful set of tools from SiteConfidence can be employed to analyse your site from a customer's perspective; these tools can indicate quality issues as part of testing exercises as well as acting as an ongoing quality measure.

Ongoing testing and regression testing ensure failure modes do not creep in unexpectedly and unseen over time as software is refactored and systems grow. Reuse of test scripts and procedural releases help maintain quality, and of course the right tools for the job help in a big way. Continuous Integration is made possible through CruiseControl and Subversion, but without discipline and rigorous procedures quality cannot be ensured; risk will increase, as will failure mode likelihood and severity.
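As a toy illustration only - real installations would use CruiseControl's own configuration rather than a hand-rolled script - the essence of the loop is: watch Subversion, run the tests on every change, shout when the build breaks:

    # A toy stand-in for a continuous integration loop: poll Subversion, run the
    # test suite whenever the working copy changes, and report when it breaks.
    import subprocess
    import time

    def working_copy_revision():
        result = subprocess.run(["svnversion"], capture_output=True, text=True)
        return result.stdout.strip()

    last_built = None
    while True:
        subprocess.run(["svn", "update"])              # pull the latest changes
        revision = working_copy_revision()
        if revision != last_built:
            tests = subprocess.run(["python", "-m", "unittest", "discover"])
            if tests.returncode != 0:
                print("BUILD BROKEN at revision", revision)   # notify the team
            last_built = revision
        time.sleep(300)                                # poll interval (arbitrary)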

Conclusion

Our first goal in handling ‘fires’ is to understand what can go wrong; put as simply as possible, if it can go wrong then, when it does, it will matter. Use Failure Mode and Effects Analysis to understand what can go wrong and to generate comprehensive and accurate test cases. Use these test cases to test your system continuously through development and on an ongoing basis. Never use users to test your systems unless you are conducting an A/B split or multivariate test programme.

Don’t ignore the risk of fire, no matter how small - it will matter to customers. For example, Akamai Technologies and Jupiter Research determined, through a survey of 1,058 experienced online shoppers, that 75% of those who experienced a site freezing or crashing, rendering too slowly, or presenting a convoluted checkout process would no longer buy from that site.

What we can do with our knowledge of the failure modes is reduce severity and mitigate risk, with a view to reducing the cost of ownership and enhancing the customer experience.

With this level of risk in mind we can build systems that degrade gracefully and handle failure modes before users detect them, informing systems administrators that remedial action has taken place in order to maintain a good level of service.

Again, Jupiter Research and Akamai state: ‘The consequences for an online retailer whose site underperforms include diminished goodwill, negative brand perception, and, most important, significant loss in overall sales.’

Monday 9 April 2007

Hold the agile essays for a moment

Today has rescued my bank holiday (a public holiday in the UK - 4 whole days!). After a galactic-sized moron drove Mike Whisky into a BIG SHINY METAL BUILDING, causing damage that prevented my planned two days of flying joy, I thought I was ready to jack in the flying thing and become a spotter... heaven forbid!

But no - I have pounded around my new fast 5k course and blown my personal 5k target asunder. At the start of the fourth kilometre I thought I was on for a sub-26 minute time, but sub-25 - that's a major target for me. Chuffed? I nearly wet myself... but that's an aspect of exertion that athletes learn to deal with - innit?

For a change, and to prove the times are real, I have decided to share the data from my Garmin 305 GPS via the links in the pb section (that's personal best - not lead...) on the right. MotionBased is one nifty site. Always act on data - never gut feel.

Going to focus on 10k now...and back to the technical/business articles for a while (sorry Kate!).

Monday 2 April 2007

Handling Failure Gracefully

Prevention is better than cure, but face facts - failure is going to happen at some point. The degree of severity may change, as will the causes, but you need to be in a position to know about an incident before the important people (your customers) do, and to have a plan to handle it. This might be a software solution or a business process solution, but in order to plan effectively you will need to know what can fail, so you can plan for it and ideally prevent it in the first place.

Commonly ‘Combustible’ Infrastructure components

We should briefly consider what could go wrong to get an idea of the breadth of planning we are talking about here:

- Web site connectivity
  - Server outage
  - Hardware failure
  - Malicious activity
- Software errors
  - Presentation layer
    - Dropped images
    - Missing links
  - Back end
    - Bugs

You will mostly know about these issues already, although you may not be aware of malicious activity as it is happening - the ‘best’ intrusion is one accomplished without the knowledge of the system owner... When did you last have a penetration test on your system, by the way?

Having started the list above, in order to formalise our planning we should explore exhaustively how our website can fail, what those failures mean and how to mitigate the risk through planning and systems design. This is not a new concept, as we shall now see.

Failure Mode and Effects Analysis

Wikipedia’s definition of FMEA gives an adequate introduction to the exercise we need to embark on. Our aim is to consider what can go wrong, from the major to the trivial, how likely each failure is and how likely its detection. This will require input from all stakeholders in the website, from designers to DBAs. The input from these parties should read very much like a test plan:

Given input ‘A’, a system actor should see behaviour ‘B’ but may see behaviours ‘C’, ‘D’ or ‘E’ - and the list of potential ‘other’ outcomes is what we are interested in.

Deriving the universe of failures from a global test plan is possible during the initial stages, but users should be wary of being guided towards what should happen rather than what might happen. Bear in mind that we are not dealing with multiple-subsystem failure modes here - maybe a future article, though?

In their best-selling book Freakonomics, Steven Levitt and Stephen Dubner discuss human risk perception, highlighting situations where the perceived hazard may be low but the outrage at a failure is high, resulting in a negative overreaction. Basically, people react worst to things going wrong that shouldn't have. Likewise, things that people expect to go wrong (because they often do) cause an under-reaction, because the perceived hazard of the failure has been diminished by desensitisation. Conversely, reaction to success does not always result in excessive delight and joy for users. The key is to normalise the actual and perceived severity of failures in order to plot an accurate Risk Priority Number. An unlikely event such as hacking, given its difficulty of detection and likely disastrous results, scores highly in our risk profile. A significant event such as the company logo being absent scores fairly low, given the ease of detection, despite the high level of reaction such a failure would cause. Frequent failures such as images not loading score lower still, as they are also easily detected and fixed.
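To make the scoring concrete: FMEA typically multiplies severity, occurrence and detection ratings (each on a 1-10 scale, with a higher detection score meaning harder to detect) to produce the number. The ratings below are illustrative guesses, not measured values:

    # Illustrative Risk Priority Number calculation:
    # RPN = severity x occurrence x detection. All ratings are guesses.
    failure_modes = [
        # (description,               severity, occurrence, detection)
        ("site hacked",                     10,          2,         9),
        ("company logo missing",             6,          3,         2),
        ("product images not loading",       3,          5,         2),
    ]

    for description, severity, occurrence, detection in failure_modes:
        rpn = severity * occurrence * detection
        print(f"{description}: RPN = {rpn}")
    # site hacked: 180; logo missing: 36; images not loading: 30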

Avoiding failures – even graceful ones

We could stop here and build our plans to reduce the severity and frequency of failure modes and increase our ability to detect them, but we are striving for prevention rather than cure - any fire is a bad one:

“A large safety factor does not necessarily translate into a reliable product. Instead, it often leads to an over-designed product with reliability problems."

Failure Analysis Beats Murphy's Law
Mechanical Engineering, September 1993

We will consider four avenues for exploration in searching for our solution:

- Self-aware systems reduce the effects of failure through increased detection likelihood.
- Redundant systems reduce the likelihood of failure through ‘belt and braces’ style precautions.
- Fully tested systems reduce the likelihood of failure by catching failure modes before they go live.
- Modularity, high cohesion and loose coupling reduce the number of failure modes by reducing system interdependencies.