Wednesday 11 April 2007

More on avoiding failure

Self-aware systems

Homo sapiens is a species capable of self-awareness through, for example, constant health-status checking. Indeed, as Google offers ever more information on the myriad things that can go wrong with the Homo sapiens ‘infrastructure’, hypochondriacs are turning into cyberchondriacs. We monitor ourselves and correct our own failure modes. Computer systems are a long way from replicating this particular human trait, but a facsimile of the process is possible. The starting point is, of course, the output from an FMEA exercise.

Leaping ahead slightly, in order to introduce the concept of self-awareness we will assume our system is modular in design. We can use a centralised ‘health’ module to receive health reports from all other modules. This central health module, or subsystem, can tell administrators what is happening system-wide with respect to the various potential failure modes of the other subsystems.

Quis custodiet ipsos custodes?

“Who watches the watchers?” or, more correctly, “Who is set to protect those who are themselves protectors?” The health system itself needs to be healthy, but we could end up tying ourselves in knots. A system to watch the health system? Maybe not, but we can use the operating system or third-party monitoring tools to ensure the bare minimum of key system components are alive. At some point either a human needs to be notified of a failure mode in order to act on it, or a third-party monitoring system can restart a process. The process restart obviously has a time and retry limit; breaching either invokes human intervention.
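As a minimal sketch of that watchdog idea (the module binary path, retry limit and alerting hook below are hypothetical, not part of any particular monitoring product):

import subprocess
import time

MAX_RETRIES = 3          # hypothetical retry limit before a human is paged
RETRY_WINDOW = 600       # seconds; restarts older than this no longer count

def notify_human(message):
    # Placeholder for an email or pager call to the administrators.
    print("ALERT:", message)

def watch(command):
    """Restart `command` when it dies, up to MAX_RETRIES times within RETRY_WINDOW."""
    restarts = []
    while True:
        process = subprocess.Popen(command)
        process.wait()                          # block until the process exits
        now = time.time()
        restarts = [t for t in restarts if now - t < RETRY_WINDOW]
        restarts.append(now)
        if len(restarts) > MAX_RETRIES:
            notify_human("%s keeps dying; human intervention needed" % " ".join(command))
            break
        time.sleep(5)                           # brief pause before restarting

if __name__ == "__main__":
    watch(["/usr/local/bin/health-subsystem"])  # hypothetical module binary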

To illustrate the monitoring point we shall assume each module lives on a separate system node. This is more for clarity than anything else; such a layout is perfectly possible, although efficiency and speed may not be optimal.



Notice that the lines of communication are bidirectional, carrying both health status from the subsystems and configuration information from the health system back to the subsystem(s). Administrators can alter system environmental factors in real time, knowing that all subsystems will receive the new information the next time they check in with their health status.
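One way to picture that bidirectional exchange is a check-in call whose reply carries the current configuration; the URL, field names and HTTP transport below are assumptions for illustration, and any RPC or message-queue mechanism would do just as well:

import json
import socket
import time
import urllib.request

HEALTH_URL = "http://health.example.local/checkin"   # assumed endpoint

def check_in(status):
    """Send this node's health status; the reply carries the current configuration."""
    payload = json.dumps({
        "node": socket.gethostname(),
        "timestamp": time.time(),
        "status": status,
    }).encode("utf-8")
    request = urllib.request.Request(
        HEALTH_URL, data=payload,
        headers={"Content-Type": "application/json"})
    with urllib.request.urlopen(request) as reply:
        return json.load(reply)                # e.g. {"log_level": "info", ...}

if __name__ == "__main__":
    config = check_in({"state": "ok"})
    print("new configuration:", config)        # a real module would apply this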

So, consider the UML sequence diagram below, which describes how a system starts up, discovers its environment and becomes aware of all nodes; how, over time, health reports are sent; and how the search process is restarted and new configurations are disseminated and used:



Health reports can take a number of forms and vary in sophistication. At the most basic level the health report is a simple ‘ping’: just getting in touch with the health system is enough to signify that goodness is afoot. The server can log the sender, date and time of each ping to track who is late and who is okay.
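On the server side, tracking who is late can be as simple as remembering the last ping from each node and comparing it against a threshold; the names and check interval below are illustrative:

import time

PING_INTERVAL = 60          # seconds each node is expected to ping within
last_seen = {}              # node name -> timestamp of its most recent ping

def record_ping(node):
    last_seen[node] = time.time()

def late_nodes():
    """Return the nodes that have missed their expected check-in window."""
    cutoff = time.time() - 2 * PING_INTERVAL
    return [node for node, seen in last_seen.items() if seen < cutoff]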

The ping payload can contain further data from the sender, flagging failure modes such as full or near-full disk space, heavy resource utilisation or stopped processes. These messages are flagged to administrators via dashboards for their attention, or dealt with by the health system immediately.
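A slightly richer payload might gather a few local figures before each ping; the disk threshold below is made up and the stopped-process check is left as a stub:

import os
import shutil
import socket
import time

DISK_WARN_FRACTION = 0.90       # arbitrary example threshold

def build_health_payload():
    usage = shutil.disk_usage("/")
    load1, load5, load15 = os.getloadavg()      # POSIX only
    return {
        "node": socket.gethostname(),
        "timestamp": time.time(),
        "disk_nearly_full": usage.used / usage.total > DISK_WARN_FRACTION,
        "load_average": load1,
        "stopped_processes": [],    # a real module would inspect its own processes
    }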

At the opposite end of the spectrum is the full and complete health report at each ping. Including full diagnostics and logs ensures administrators are aware of system behaviour, but such a payload can carry a heavy price: flooding systems with large amounts of data can impact performance and exacerbate the very issues being reported. Common sense dictates that one should only turn on this kind of diagnostic, via the health system, when it’s absolutely necessary.

Redundant Systems

From Wikipedia: ‘Redundancy in engineering is the duplication of critical components of a system with the intention of increasing the reliability of the system.’ This is exactly what we need to achieve, but quickly, easily and cost-effectively. Ideally we can combine increased redundancy with extra capacity through horizontal scaling (adding more machines rather than adding more power to individual machines) and design redundancy into our systems through good modular system design.

Horizontal Scaling

Expanding on the idea of nodes discovering their environment and talking to centralised resources lets us harness the same idea for enhanced system scalability. Ideally we would want to wheel in a new server (or set of servers), switch it on and let the new node(s) find out who they are and what they should do from a brief conversation with a centralised configuration server. As a useful aside, building new test environments becomes a total no-brainer…
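That first conversation might look something like this; the configuration server address and the shape of its reply are assumptions for the sake of the sketch:

import json
import socket
import urllib.request

CONFIG_URL = "http://config.example.local/bootstrap"   # assumed endpoint

def discover_role():
    """Ask the central configuration server who this node is and what it should run."""
    query = CONFIG_URL + "?node=" + socket.gethostname()
    with urllib.request.urlopen(query) as reply:
        return json.load(reply)     # e.g. {"role": "web", "peers": [...], ...}

if __name__ == "__main__":
    node_config = discover_role()
    print("assigned role:", node_config.get("role"))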

As far as business owners are concerned, horizontal scalability can answer a number of thorny issues around the ease of performance tuning, handling seasonal load and the associated costs.

As far as systems administrators and software architects are concerned, there are further gnarly issues to consider as scaling occurs. With an increase in nodes comes an increase in inter-node communication, and there is only a finite amount of bandwidth and processing power available to each node; a threshold will be crossed at some point, hopefully a long way away!
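A quick back-of-envelope calculation shows where that threshold might lie for the central health system; the figures are purely illustrative:

# Rough capacity estimate for the central health system (illustrative figures).
nodes = 200                     # number of reporting nodes
payload_bytes = 2000            # average health report size in bytes
interval_seconds = 60           # how often each node checks in

reports_per_second = nodes / interval_seconds
bandwidth = reports_per_second * payload_bytes
print("reports per second: %.1f" % reports_per_second)        # ~3.3
print("bandwidth: %.1f KB/s" % (bandwidth / 1024.0))          # ~6.5 KB/s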

As far as systems designers are concerned, systems need to be happy running on cloned nodes. A system that requires some processes on some nodes and others elsewhere is going to be a major headache to scale and support.

Modularity

Good systems can be built such that the main components can be separated across cloned hardware nodes. System components, or modules, can be easily plugged together without causing interdependencies; this quality is known as high cohesion with low coupling. Cohesive systems can keep working when decoupled, and indeed report when decoupling happens. Modular systems are easily replicable (scaling horizontally) and exhibit the self-awareness that we require, supporting a buffet-style architecture where system owners can build their system’s behaviour by coupling together whichever of the available components are appropriate.
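A minimal illustration of that buffet idea, with components registered under short names and composed from per-node configuration (all of the names here are invented for the example):

# Each module exposes the same tiny interface, so system owners can pick and
# mix components from configuration without creating interdependencies.
REGISTRY = {}

def component(name):
    """Decorator registering a component class under a short name."""
    def register(cls):
        REGISTRY[name] = cls
        return cls
    return register

@component("health")
class HealthReporter:
    def start(self):
        print("health reporter started")

@component("search")
class SearchIndexer:
    def start(self):
        print("search indexer started")

def build_system(wanted):
    """Instantiate only the components this node's configuration asks for."""
    return [REGISTRY[name]() for name in wanted]

if __name__ == "__main__":
    for module in build_system(["health", "search"]):    # per-node configuration
        module.start()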

Tested Systems

It seems obvious to suggest that a properly tested system is more resilient, yet testing is often overlooked for a number of reasons. Of course, knowing what to test is as much of a challenge as an FMEA exercise, but FMEA and test generation are complementary: forcing stakeholders to consider what can go wrong, and ensuring measures are taken to prevent failure modes, means a comprehensive failure matrix is built beforehand.

When testing should take place is a key question. Avoiding the ‘boring’ task for as long as possible is more than just delaying donkey work: leaving testing until just before go-live is asking for a bad day at the office. Later testing significantly increases the overall effort required to achieve a reliable system, as well as increasing the level of risk associated with a new roll-out.
Dr. Khaled El Emam estimates that reworking software can account for as much as 80% of total development costs.

Testing is an integral part of building a system. Test-Driven Development requires that tests are written before the software; indeed, the software is written to pass the test. This way each unit of software can be demonstrated to meet its requirements, reducing the risk and severity of failure modes before integration.
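In Test-Driven Development terms, the test below would be written (and would fail) before the function it exercises; the disk-threshold check is a made-up unit purely for the sake of the example:

import unittest

def disk_nearly_full(used, total, threshold=0.90):
    """Unit under test: flag the failure mode before the disk actually fills up."""
    return total > 0 and used / total > threshold

class DiskCheckTest(unittest.TestCase):
    # Written first; the implementation above exists only to make these pass.
    def test_flags_disk_above_threshold(self):
        self.assertTrue(disk_nearly_full(95, 100))

    def test_ignores_disk_below_threshold(self):
        self.assertFalse(disk_nearly_full(50, 100))

if __name__ == "__main__":
    unittest.main()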

Integration is also an ongoing exercise. We advocate Continuous Integration, as described by Martin Fowler on his wiki-cum-blog, in order to prevent ‘blind spots’ in the system.

Considering FMEA and test creation, ALAC (Act Like A Customer) is a suitable methodology for finding a high-risk subset of failure modes. Customers generally don’t find all potential failure modes, but they will find the most common. A useful set of tools from SiteConfidence can analyse your site from a customer’s perspective; these tools can indicate quality issues as part of testing exercises as well as serving as an ongoing quality measure.

Ongoing testing and regression testing ensure failure modes do not creep in unexpectedly and unseen over time as software is refactored and systems grow. Reuse of test scripts and procedural releases help maintain quality, and of course the right tools for the job help in a big way. Continuous Integration is made practical by tools such as CruiseControl and Subversion, but without discipline and rigorous procedures quality cannot be ensured: risk will increase, as will failure mode likelihood and severity.

Conclusion

Our first goal in handling ‘fires’ is to understand what can go wrong; put as simply as possible, if it can go wrong, then when it does, it is important. Use Failure Mode and Effects Analysis to understand what can go wrong and to generate comprehensive and accurate test cases. Use these test cases to test your system continuously through development and on an ongoing basis. Never use users to test your systems unless you are conducting an A/B split or multivariate test programme.

Don’t ignore the risk of fire, no matter how small. It will matter to customers. For example, Akamai Technologies and Jupiter Research determined, through a survey of 1058 experienced online shoppers, that 75% of those who experienced a site freezing or crashing, rendering too slowly, or involving a convoluted checkout process would no longer buy from that site.

What we can do with our knowledge of the failure modes is reduce their severity and mitigate their risk, with a view to reducing cost of ownership and enhancing the customer experience.

With this level of risk in mind we can build systems that degrade gracefully and handle failure modes before users detect them, informing systems administrators that remedial action has taken place in order to maintain a good level of service.

Again, Jupiter Research and Akamai state: ‘The consequences for an online retailer whose site underperforms include diminished goodwill, negative brand perception, and, most important, significant loss in overall sales.’
