Self-Monitoring and Single Points of Failure


A previous post discussed single points of failure in general. Creating a safety-critical embedded system requires avoiding single points of failure in both hardware and software. This post is the first part of a discussion of examples of single points of failure in safety critical embedded systems.

Consequences: When a critical single point of failure occurs, the system becomes unsafe, either by taking an unsafe action or by ceasing to perform critical functions.

Accepted Practices: The following are accepted practices for avoiding single point failures in safety critical systems:
  • A safety critical system must not have any single point of failure that results in a significant unsafe condition if that failure can reasonably be expected to occur during the operational life of the deployed fleet of systems. Because of the high production volumes and usage hours of automobiles, aircraft, and similar systems, it must be expected that any single microcontroller hardware chip, and the software on any single chip, will fail in an arbitrarily unsafe manner.
  • Properly implemented monitor-actuator pairs, redundant input processing, and a comprehensive defense-in-depth strategy are all accepted practices for mitigating single point faults (see future blog entries for postings on those topics).
  • Multiple points of failure that can fail at the same time due to the same cause, can accumulate without being detected and mitigated during system operation, or are otherwise likely to fail concurrently, must be treated as having the same severity as a single point of failure.
Discussion:
MISRA Report 2 states that the objective of risk assessment is to “show that no single point of failure within the system can lead to a potentially unsafe state, in particular for the higher Integrity Levels.” (MISRA Report 2, 1995, pg. 17). In this context, “higher Integrity levels” are those functions that could cause significant unsafe behavior, typically involving passenger deaths. That report also says that the risk from multiple faults must be sufficiently low to be acceptable.

Mauser reports on a Siemens Automotive study of electronic throttle control for automobiles (Mauser 1999). The study specifically accounted for random faults (id., p. 732), as well as considering the probability of “runaway” incidents (id., p. 734) in which an open throttle fault could cause a mishap. It found that single point failures were possible, and in particular identified dual redundant throttle electrical signals being read by a single shared (multiplexed) analog to digital converter in the CPU (id., p. 739) as a critical flaw.

Ademaj says that “independent fault containment regions must be implemented in separate silicon dies.” (Ademaj 2003, p. 5) In other words, any two functions on the same silicon die are subject to arbitrary faults and constitute a single point of failure.

But Ademaj didn’t just say it – he proved it via experimentation on a communication chip specifically designed for safety critical automotive drive-by-wire applications (id., pg. 9, conclusions). Those results required the designers of the TTP protocol chip (based on the work of Prof. Kopetz) to change their approach to achieving fault tolerance to a star topology, because combining a network CPU with the network monitor on the same silicon die was shown to be susceptible to single point failures even though the die had been specifically designed to physically isolate the network monitor from the main CPU. In other words, even though every attempt had been made at on-chip isolation, two completely independent circuits sharing the same chip were observed to fail together from a single fault in a safety-critical automotive drive-by-wire design.

A fallacy in designing safety critical systems is thinking that partial redundancy in the form of "fail-safe" hardware or software will catch all problems, without taking into account the need for complete isolation between the potentially faulty component and the mitigation component. If both the fault and the mitigation are in the same Fault Containment Region (FCR), then the system can't be made entirely safe.

To give a more concrete example, consider a single CPU with a self-monitoring feature: hardware and/or software that detects faults within that same CPU. One could envision such a system signaling a self-health report to an outside device. Such a design pattern is sometimes called a "simplex system with disengagement monitor" and uses "Built-In Test" (BIT) to do the self-checking. (Note that BIT is a generic term for self-checks, and does not necessarily mean a manufacturing gate-level test or other specific diagnostic.) If a self-health check fails, the system falls back to a safe state by, for example, shutting down (if shutting down is safe). To be sure, doing this is better than doing nothing. But it can never achieve complete coverage. What if the self-health check is compromised by the very fault in the chip it is supposed to detect?
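
To make the pattern concrete, here is a minimal sketch (in C) of what a simplex-with-BIT control loop might look like. All of the hardware-access functions and the BIT routine are hypothetical placeholders invented for illustration, not from any cited source. The point to notice is that the self-check runs on the very CPU it is supposed to be checking.

```c
/* Sketch of a simplex "CPU with Built-In Test" control loop.
 * All hardware-access functions are hypothetical stubs for illustration.
 * The key weakness: run_built_in_test() executes on the SAME CPU it is
 * checking, so a fault in that CPU can also corrupt the check itself. */
#include <stdbool.h>
#include <stdint.h>

extern uint16_t read_sensor(void);          /* hypothetical input driver      */
extern void drive_actuator(uint16_t cmd);   /* hypothetical output driver     */
extern void report_health(bool healthy);    /* heartbeat to external monitor  */
extern void enter_safe_state(void);         /* e.g., de-energize outputs      */

static bool run_built_in_test(void)
{
    /* Typical BIT contents: RAM pattern test, ROM CRC, ALU sanity checks.
     * Coverage is inherently incomplete; some CPU faults will slip through. */
    return true; /* placeholder result */
}

void control_loop(void)
{
    for (;;) {
        bool healthy = run_built_in_test();
        report_health(healthy);
        if (!healthy) {
            enter_safe_state();   /* fail silent only if BIT caught the fault */
            continue;
        }
        uint16_t sensor = read_sensor();
        drive_actuator(sensor);   /* an undetected fault here => fail active  */
    }
}
```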

A look at a research paper on aerospace fault tolerant architectures explains why a simplex (single-FCR) system with BIT is inadequate for high-integrity safety-critical systems. Hammett (2001) figure 5 shows a simplex computer with BIT disengagement features, and says that they “increase the likelihood the computer will fail passive rather than fail active. But it is important to realize that it is impossible to design BIT that can detect all types of computer failures and very difficult to accurately estimate BIT effectiveness.” (id., pg. 1.C.5-4, emphasis added) Such an architecture is said to “Fail Active” after some failures (id., Table 1, p. 1.C.5-7), where “A fail active condition is when the outputs to actuators are active, but uncontrolled. … A fail active condition is a system malfunction rather than a loss of function.” (id., pg. 1.C.5-2, emphasis per original) “For some systems, an annunciated loss of function is an acceptable fail-safe, but a malfunction could be catastrophic.” (id., p. 1.C.5-3, emphasis per original) In particular, with such an architecture, depending on the fraction of failures caught (which is not 100%), some “failures will be undetected and the system may fail to a potentially hazardous fail active condition.” (id., p. 1.C.5-4, emphasis added).



Table 1 from Hammett 2001, below, shows where Simplex with BIT stands in terms of fault tolerance capability. It will fail active (i.e., fail dangerously) for some single point failures, and that's a problem for safety critical systems.



Note that dual standby redundancy is also inadequate, even though it has two copies of the same computer running the same software. This is because the primary has to self-diagnose that it has a problem before it switches to the backup computer (Hammett Fig. 6, below). If the primary doesn't properly self-diagnose, it never switches over, resulting in a fail-active (dangerous) system.
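
As a minimal sketch of why this is so, the failover decision in a dual standby arrangement might look something like the following. The function names are hypothetical, and the sketch assumes the primary's own BIT verdict is the only trigger for switching to the backup.

```c
/* Sketch of dual standby redundancy (primary + backup, per Hammett Fig. 6).
 * Hypothetical stubs for illustration only. The failover decision depends
 * entirely on the primary's own BIT verdict: if the primary's fault also
 * defeats its BIT, switch_to_backup() is never called and the system can
 * remain fail active on the faulty primary. */
#include <stdbool.h>

extern bool primary_bit_passed(void);   /* BIT running ON the primary CPU */
extern void run_primary_control(void);  /* normal control on the primary  */
extern void switch_to_backup(void);     /* hand control to the standby CPU */

void dual_standby_step(void)
{
    if (primary_bit_passed()) {
        run_primary_control();
    } else {
        switch_to_backup();  /* only reached if the primary detects its own fault */
    }
}
```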


On the other hand, a self-checking pair (Hammett figure 7, above), sometimes known as a "2 out of 2" or 2oo2 system, can tolerate all single point faults in the following way. Each of the computers in a 2oo2 pair runs the same software on identical hardware, usually operating in lockstep. If the outputs don't agree, then the system disables its outputs. Any single failure that affects the computation will, by definition, cause the outputs to disagree (because it can only affect one of the two computers, and if it doesn't change the output then it is not affecting the result of the computation). Most dual-point failures will also be detected, except for dual-point failures that happen to affect both computers in exactly the same way. Because the two computers are separate FCRs, this is unlikely unless there is a correlated fault such as a software defect or hardware design defect. In practice, the inputs are also replicated so that a bad sensor does not become a single point of failure as well (Hammett's figure is non-specific about inputs, because the focus is on computing patterns). 2oo2 is not a free lunch in many regards, and I'll queue a discussion of the gory details for a future blog post if there is interest. Suffice it to say that you have to pay attention to many details to get this right. But it is definitely possible to build such a system.
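
Here is a minimal sketch, in C, of the output stage of such a self-checking pair. The function names are hypothetical, and the sketch assumes the two CPUs exchange their independently computed commands before anything is allowed to drive the actuator.

```c
/* Sketch of a 2oo2 (self-checking pair) output stage. Hypothetical stubs.
 * Both CPUs compute the same command from the same inputs; outputs are
 * enabled only when the two independently computed commands agree. Any
 * single fault that changes one CPU's computation makes the commands
 * disagree, so the pair fails silent instead of failing active. */
#include <stdint.h>

extern uint16_t my_command(void);         /* command computed on this CPU      */
extern uint16_t peer_command(void);       /* command exchanged with the other CPU */
extern void enable_outputs(uint16_t cmd); /* drive the actuator                */
extern void disable_outputs(void);        /* fail-silent safe state            */

void two_oo_two_output_stage(void)
{
    uint16_t mine  = my_command();
    uint16_t peers = peer_command();

    if (mine == peers) {
        enable_outputs(mine);   /* agreement: drive the actuator              */
    } else {
        disable_outputs();      /* disagreement: fail silent, not fail active */
    }
}
```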

With a 2oo2 system, the second CPU does not improve availability; in fact it reduces availability because there are twice as many computers that can fail. To attain availability, a redundant failover set of 2oo2 computers can be used (Hammett Fig. 9 -- dual self-checking pair), and in fact this is a commonly used architecture in railway switching equipment. Each 2oo2 pair self-checks, and if it detects an error it shuts down, swapping in the other 2oo2 pair. So a single 2oo2 pair is there for safety; the second 2oo2 pair is there to prevent outages (see Hammett figure 9, below).
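
A minimal sketch of the pair-selection logic in such a dual self-checking pair might look like this (again with hypothetical function names): safety comes from each pair's internal 2oo2 check, while availability comes from having a second pair to swap in.

```c
/* Sketch of a dual self-checking pair (Hammett Fig. 9): two 2oo2 pairs,
 * where the standby pair takes over only when the active pair has shut
 * itself down. Hypothetical stubs for illustration only. */
#include <stdbool.h>

extern bool pair_a_healthy(void);       /* true while pair A has not shut itself down */
extern bool pair_b_healthy(void);
extern void select_pair_a(void);        /* route actuator control to pair A */
extern void select_pair_b(void);
extern void de_energize_outputs(void);  /* both pairs down: fail silent */

void select_active_pair(void)
{
    if (pair_a_healthy()) {
        select_pair_a();        /* safety comes from the 2oo2 check inside the pair */
    } else if (pair_b_healthy()) {
        select_pair_b();        /* availability comes from the second pair          */
    } else {
        de_energize_outputs();  /* no healthy pair left: go to the safe state       */
    }
}
```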


From the above we can see that avoiding single points of failure requires at least two CPUs, with care taken to ensure that each CPU is a separate fault containment region. If you need a fail-operational system, then 4 CPUs arranged per figure 9 above will give you that, but at a cost of 4 CPUs.

Note that we have not at any point attempted to identify some "realistic" way in which a computer can both produce a dangerous output and cause its BIT to fail. Such analysis is not required when building a safe system. Rather, the effects of failure modes in electronics are more subtle and complex than can be readily understood (and some would argue that many real but infrequent failure modes are too complex for anyone to understand). It is folly to try to guess all possible failures and somehow ensure that the BIT will never fail. But even if we tried to do this, the price for getting it wrong in terms of death and destruction with a safety critical system is simply too high to take that chance. Instead, we simply assert that Murphy will find a way to make a simplex system with BIT fail active, and take that as a given.

By way of analogy, there is no point doing analysis down to individual lines of code or bolt tensile strengths in high-vibration environments within a jet engine to know that flying across the Pacific Ocean in a jet airliner with only one engine working at takeoff is a bad idea. Even perfectly designed jet engines break, and any single copy of perfectly designed jet engine software will eventually fail (due to a single event upset within the CPU it is running on, if for no other reason). The only way to achieve safety is to have true redundancy, with no single point of failure whatsoever that can possibly keep the system from entering a safe state.

In practice the "output if agreement" block shown in these figures can itself be a single point of failure. This is resolved in practical systems by, for example, having each of the computers in a 2oo2 pair control the reset/shutdown line of the other computer in the pair. If either computer detects a mismatch, it both shuts down the other CPU and commits suicide itself, taking down the pair. This system reset also causes the switch in a dual 2oo2 system to change over to the backup pair of computers. And yes, that switch can also be a single point of failure, which can be resolved by, for example, having redundant actuators that are de-energized when the owning 2oo2 pair shuts down. And we have to make sure our software doesn't cause correlated faults between pairs by ensuring it is of sufficiently high integrity as well.
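
As a sketch of that cross-reset arrangement (with hypothetical function names), each CPU's mismatch handler might look something like this:

```c
/* Sketch of the cross-reset arrangement described above: each CPU in a
 * 2oo2 pair can kill the other, so a detected mismatch takes down the whole
 * pair rather than leaving one side free to fail active. Hypothetical stubs. */
#include <stdbool.h>

extern bool commands_agree(void);     /* result of the 2oo2 comparison   */
extern void assert_peer_reset(void);  /* hold the OTHER CPU in reset     */
extern void halt_self(void);          /* stop this CPU's own outputs     */

void on_comparison_result(void)
{
    if (!commands_agree()) {
        assert_peer_reset();  /* shut down the partner CPU                    */
        halt_self();          /* and take this CPU down as well               */
        /* The pair going silent is also what triggers failover to the backup
         * 2oo2 pair in a dual self-checking pair architecture.               */
    }
}
```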

As you can see, flushing out single points of failure is no small thing. But if you want to build a safety critical system, getting rid of single points of failure is the price of admission to the game. And that price includes truly redundant CPUs for performing safety critical computations.

References:
  • Hammett, Design by extrapolation: an evaluation of fault-tolerant avionics, 20th Conference on Digital Avionics Systems, IEEE, 2001.
  • MISRA, Report 2: Integrity, February 1995.
  • Mauser, Electronic throttle control – a dependability case study, J. Univ. Computer Science, 5(10), 1999, pp. 730-741.
