Layered Defenses for Safety Critical Systems

Even if designers mitigate every single-point fault in a design, there is always the possibility that some unexpected fault, or some combination of correlated faults, will cause the system to fail in an unexpected way. Such failures should ideally never happen, but no design analysis is perfectly comprehensive, especially when unconsidered correlations make seemingly independent faults occur together, so in practice they do happen. To mitigate such problems, system designers use multiple layers of mitigation, a practice sometimes referred to as “defense in depth.”

Consequences: 
If a layered defensive strategy is defective, a failure can bypass intended mitigation strategies and result in a mishap.

Accepted Practices:
  • The accepted practice for layered systems is to ensure that no single point of failure, nor any plausible combination of failures, exists that permits a mishap to occur. For layered defense purposes, a single point of failure includes even a redundant subsystem (e.g., a 2oo2 redundant self-checking CPU pair might fail due to a software defect present on both modules, so a layered defense provides an alternate way to recover from such a failure).
  • The existence of multiple layers of protection is only effective if the net result gives complete, non-single-point-of-failure coverage of all relevant faults.
  • The goal of layered defenses should be to maximize the fraction of problems caught at each layer of defense, reducing the residual probability of a mishap.
Discussion:

A layered defense system typically rests on the principle of fault containment, in which a fault or its effects are contained and isolated so as to have the least possible effect on the system. The starting point is the use of fault containment regions built on design patterns such as 2oo2 systems. But a prudent designer admits that software faults or correlated hardware faults might still occur, and therefore provides additional layers of protection.
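The 2oo2 self-checking pattern mentioned above can be sketched as follows. This is a minimal illustration, not any particular system's implementation; the channel functions, safe output value, and tolerance are all assumptions for the example:

```python
SAFE_OUTPUT = 0.0  # the de-energized / shut-down command (assumed safe state)

def two_oo_two(channel_a, channel_b, sensor_value, tolerance=1e-6):
    """Run both redundant channels; any disagreement forces the safe state."""
    a = channel_a(sensor_value)
    b = channel_b(sensor_value)
    if abs(a - b) <= tolerance:
        return a            # channels agree: output is trusted
    return SAFE_OUTPUT      # disagreement detected: fail to the safe state

# Illustrative use with two (hypothetical) redundant control laws:
print(two_oo_two(lambda v: v * 0.5, lambda v: v * 0.5, 10.0))  # channels agree
print(two_oo_two(lambda v: v * 0.5, lambda v: v * 0.6, 10.0))  # mismatch: safe state
```

Note the limitation the text warns about: a common software defect shared by both channels would make them agree on a wrong answer, which is exactly why additional layers beyond the 2oo2 pair are needed.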


[Figure: Layered defenses attempt to prevent escalation of fault effects.]

The figure above shows the general idea of layered defenses from a fault tolerant computing perspective. First, it is ideal to avoid both design and run-time faults. But faults do crop up, so mechanisms and architectural patterns should be in place to detect and contain them using fault containment regions. If a fault is not contained as intended, the system experiences a hazard: its primary fault tolerance approach has not worked, and the system has become potentially unsafe. In other words, some fraction of faults might not be contained, and those will result in hazards.

Once a hazard has manifested, a "fail-safe" mitigation strategy can help reduce the chance of a bigger problem occurring. A fail-safe might, for example, be an independent safety system triggered by an electro-mechanical monitor (for example, a pressure relief valve on a water heater that releases pressure if steam forms inside the tank). In general, the system is already in an unsafe operating condition by the time the fail-safe activates. But successful activation of a fail-safe may prevent a worse event much of the time. In other words, the hope is that most hazards will be mitigated by a fail-safe, but a few hazards may not be, and those will result in incidents.
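The pressure relief valve example can be sketched as an independent monitor that knows nothing about the primary controller and simply forces a mitigation when an unsafe physical condition is observed. The threshold, units, and action names below are assumptions for illustration:

```python
PRESSURE_LIMIT_PSI = 150.0  # assumed trip threshold (illustrative units)

def fail_safe_monitor(measured_pressure_psi):
    """Trigger mitigation on an unsafe physical state, independently of the
    primary controller's health, software, or failure mode."""
    if measured_pressure_psi > PRESSURE_LIMIT_PSI:
        return "open_relief_valve"  # system is already unsafe; prevent worse
    return "normal"
```

The key design property is independence: because the monitor senses the physical hazard directly, it can still act even when the fault that caused the hazard has also compromised the primary control computer.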

If the fail-safe is not activated, then an incident occurs. An incident is a situation in which the system remains unsafe long enough for an accident to happen, but a loss event is avoided due to some combination of operator intervention and plain luck. In many systems it is common to catch a lucky break when the system fails, especially if well-trained operators are able to find creative ways to recover the system, such as shifting a car's transmission to neutral or turning off the ignition when the engine over-speeds. (It is important to note that recovering such a system doesn't mean the system was safe; it just means that the driver had the time and training to recover the situation, or got lucky.) On the other hand, if the operator doesn't manage to recover the system, or the failure happens in a situation that is unrecoverable even by the best operator, a mishap will occur, resulting in property damage, personal injury, death, or another safety loss event. (The general description of these points is based on Leveson 1986, pp. 149-150.)

A well-known principle of creating safety critical systems is that hazardous behavior displayed by individual components is likely to result in an eventual accident. In other words, with a layered defense approach, components that act in a hazardous way might lead to no actual mishap most of the time, because a higher level safety mechanism takes over, or just because the system gets “lucky.” However, the occurrence of such hazards can be expected to eventually result in an actual mishap, when some circumstance arises in which the safety net mechanism fails.

For example, fault containment might work 99.9% of the time, and fail-safes might also work 99.9% of the time. Thousands of tests might show that one or another of these safety layers saves the day. But, assuming the effectiveness of each layer is independent of the others, eventually both will fail for some infrequent situation, causing a mishap. (Two layers at 99.9% give an unmitigated fault rate of 0.1% * 0.1% = 0.0001%, which is unlikely to be seen in testing, but still isn't zero.) The safety concept of avoiding single point failures only works if each failure is infrequent enough that double failures are unlikely to ever happen in the entire lifetime of the operational fleet, which can be millions or even billions of hours of exposure for some systems. Doing this in practice for large deployed fleets requires identifying and correcting every situation detected in which a single point failure is not immediately and completely mitigated. You need multiple layers to catch infrequent problems, but you should always design the system so that the layers never need to be exercised in situations that occur in practice.
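The arithmetic above generalizes to any number of layers. A short sketch, assuming each layer's coverage is independent of the others (the same assumption the paragraph makes):

```python
def residual_probability(coverages):
    """Fraction of faults escaping every layer, assuming independence.

    coverages: per-layer fraction of faults caught (e.g., 0.999 = 99.9%).
    """
    residual = 1.0
    for coverage in coverages:
        residual *= (1.0 - coverage)  # multiply the layers' escape rates
    return residual

# Two layers, each 99.9% effective: about 1e-6 of faults are unmitigated,
# i.e., 0.0001% -- rare enough to miss in testing, but not zero.
print(residual_probability([0.999, 0.999]))
```

If the layers' failures are correlated rather than independent, the true residual can be far worse than this product suggests, which is the point of the warning about unconsidered correlations earlier in this article.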

Selected Sources:

Most NASA space systems employ failure tolerance (as opposed to fault tolerance) to achieve an acceptable degree of safety. Failure tolerance means not only are faults tolerated within a particular component or subsystem, but the failure of an entire subsystem is tolerated. (NASA 2004 pg. 114) These are the famous NASA backup systems. “This is primarily achieved via hardware, but software is also important, because improper software design can defeat the hardware failure tolerance and vice versa.” (NASA 2004 pg. 114, emphasis added)

Some of the layered defenses might be considered to be forms of graceful degradation (e.g., as described by Nace 2001 and Shelton 2002). For example, a system might revert to simple mechanical controls if a 2oo2 computer controller does a safety shut-down. A key challenge for graceful degradation approaches is ensuring that safety is maintained for each possible degraded configuration.
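The fallback described above can be sketched as a configuration selector. The mode names are hypothetical, and as the text notes, each degraded mode would still need its own safety verification:

```python
def select_control_mode(computer_ok, backup_ok):
    """Pick the most capable control configuration that is still available."""
    if computer_ok:
        return "full_2oo2_computer_control"
    if backup_ok:
        return "simple_mechanical_backup"   # degraded but intended-safe mode
    return "emergency_shutdown"             # no safe configuration remains
```

The design choice here is an explicit ordering of configurations from most to least capable, so that a shutdown of the primary controller degrades service rather than abandoning control entirely.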

See also previous blog posting on: Safety Requires No Single Points of Failure

References:
  • Leveson, N., Software safety: why, what, how, Computing Surveys, Vol. 18, No. 2, June 1986, pp. 125-163.
  • Nace, W. & Koopman, P., "A Graceful Degradation Framework for Distributed Embedded Systems," Workshop on Reliability in Embedded Systems (in conjunction with Symposium on Reliable Distributed Systems/SRDS-2001), October 2001.
  • NASA-GB-8719.13, NASA Software Safety Guidebook, NASA Technical Standard, March 31, 2004.
  • Shelton, C., & Koopman, P., "Using Architectural Properties to Model and Measure System-Wide Graceful Degradation," Workshop on Architecting Dependable Systems (affiliated with ICSE 2002), May 25 2002.