Proper Watchdog Timer Use

Watchdog timers are a prevalent mechanism for helping to ensure embedded system reliability. But they only work if you use them properly. Effective watchdog timer use requires that the failure of any periodic task in the system result in a watchdog timer reset.

Consequences: Improper use of a watchdog timer leads to a false sense of security in which software task deaths and software task time overruns are not detected, causing possible missed deadlines or partial software failures.

Accepted Practices: 
  • If multiple periodic tasks are in the system, each and every such task must contribute directly to the watchdog being kicked to ensure every task is alive.
  • Use of a hardware timer interrupt to directly kick the watchdog is a bad practice. (There is arguably an exception if the ISR keeps a record of all currently live tasks, as described later.)
  • Inferring task health by monitoring the lowest priority task alone is a bad practice. This approach fails to detect dead high priority tasks.
  • The watchdog timeout period should be set to the shortest practical value. The system should remain safe even if any combination of tasks dies for the entire period of the watchdog timeout value.
  • Every time the watchdog timer reset happens during testing of a fully operational system, that fact should be recorded and investigated. 
Discussion

Briefly, a watchdog timer can be thought of as a counter that starts at some predetermined value and counts down to zero. If the watchdog actually gets to zero, it resets the system in the hope that a system reset will fix whatever problem has occurred. Preventing such a reset requires “kicking” (terms for this vary) the watchdog periodically to set the count back to its original value. The idea is for software to convince the hardware watchdog that the system is still alive, forestalling a reset. It is not unlike asking a teenager to call in every couple of hours on a date to make sure that everything is going OK.

Watchdog timer arrangement.

Once the watchdog times out, a system reset is the most common reaction, although in some cases a permanent shutdown of the system is preferable if it is deemed better to wait for maintenance intervention before attempting a restart.
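To make the mechanism concrete, the following is a minimal conceptual sketch in C of the countdown-and-kick behavior described above. It is only a software model for illustration; a real watchdog is a hardware peripheral with its own clock and device-specific registers, and the names used here (watchdog_tick, kick_watchdog, hardware_reset) are hypothetical.

    #include <stdint.h>

    /* Conceptual model of a watchdog timer, for illustration only. A real
       watchdog is a hardware peripheral with its own clock and registers. */

    #define WATCHDOG_RELOAD_VALUE 1000u   /* illustrative timeout, in timer ticks */

    static volatile uint32_t watchdog_count = WATCHDOG_RELOAD_VALUE;

    extern void hardware_reset(void);     /* hypothetical reset hook */

    /* Called on every timer tick: count down toward a system reset. */
    void watchdog_tick(void)
    {
        if (watchdog_count > 0u) {
            watchdog_count--;
        }
        if (watchdog_count == 0u) {
            hardware_reset();             /* software never kicked in time */
        }
    }

    /* "Kicking" the watchdog sets the count back to its starting value,
       postponing the reset. */
    void kick_watchdog(void)
    {
        watchdog_count = WATCHDOG_RELOAD_VALUE;
    }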

Getting the expected benefit from a watchdog timer requires using it in a proper manner. For example, having a hardware timer interrupt trigger unconditional kicking of the watchdog is a specifically bad practice, because it doesn’t indicate that any software task except the hardware timer ISR is working properly. (By analogy, having your teenager set up a computer to automatically call home with a prerecorded “I’m OK” message every hour on Saturday night doesn’t tell you that she’s really OK on her date.)

For a system with multiple tasks it is essential that each and every task contribute to the watchdog being kicked. Hearing from just one task isn’t enough – all tasks need to have some sort of unanimous “vote” on the watchdog being kicked. Correspondingly, a specific bad practice is to have one task or ISR report in that it is running via kicking the watchdog, and infer that this means all other tasks are executing properly. (Again by analogy, hearing from one of three teenagers out on different dates doesn’t tell you how the other two are doing.) As an example, the watchdog “should never be kicked from an interrupt routine” (MISRA Report 3, p. 38), which in general refers to the bad practice of using a timer service ISR to kick the watchdog.

A related bad practice is assuming that if a low priority task is running, this means that all other tasks are running. Higher priority tasks could be “dead” for some reason and actually give more time for low priority tasks to run. Thus, if a low priority task kicks the watchdog or sets a flag that is the sole enabling data for an ISR to kick the watchdog, this method will fail to detect if other tasks have failed to run in a timely periodic manner.

Monitoring CPU load is not a substitute for a watchdog timer. Tasks can miss their deadlines even with CPU loads of 70%-80% because of bursts of momentary overloads that are to be expected in a real time operating system environment as a normal part of system operation. For this reason, another bad practice is using software inside the system being monitored to perform a CPU load analysis or other indirect health check and kick the watchdog periodically by default unless the calculation indicates a problem. (This is a variant of kicking the watchdog from inside an ISR.)

The system software should not be in charge of measuring workload over time -- that is the job of the watchdog. The software being monitored should kick the watchdog if it is making progress. It is up to the watchdog mechanism to decide if progress is fast enough. Thus, any conditional watchdog kick should be done just based on liveness (have tasks actually been run), and not on system loading (do we think tasks are probably running).

One way to kick a watchdog timer in a multi-tasking system is sketched below:
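This is a minimal classroom-style sketch in C rather than production code. The specific names (task_done, taskw, kick_hw_watchdog) and the use of one alive flag per task are illustrative assumptions; the essential property is that the hardware watchdog is kicked only when every monitored task has reported completing an execution cycle since the previous kick.

    #include <stdbool.h>

    #define NUM_TASKS 3

    /* Hypothetical hardware interface: performs the actual watchdog kick. */
    extern void kick_hw_watchdog(void);

    /* One "alive" flag per monitored periodic task. */
    static volatile bool task_alive[NUM_TASKS];

    /* Each periodic task calls this as it completes an execution cycle. */
    void task_done(int task_id)
    {
        task_alive[task_id] = true;
    }

    /* taskw: the watchdog is kicked only if every task has reported in since
       the last kick; otherwise no kick happens and the watchdog eventually
       times out and resets the system. */
    void taskw(void)
    {
        bool all_alive = true;

        for (int i = 0; i < NUM_TASKS; i++) {
            if (!task_alive[i]) {
                all_alive = false;
            }
        }

        if (all_alive) {
            for (int i = 0; i < NUM_TASKS; i++) {
                task_alive[i] = false;   /* require fresh reports before the next kick */
            }
            kick_hw_watchdog();
        }
    }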


Key attributes of this watchdog approach are: (1) all tasks must be alive to kick the WDT; if even one task is dead the WDT will time out, resetting the system. (2) The tasks do not keep track of time or CPU load on their own, making it impossible for a software defect or execution defect in a task to “lie” to the WDT about whether things are alive. Rather than making the CPU’s software police itself and shut down to await a watchdog timeout if something is wrong, this software merely has the tasks report in when they finish execution and lets the WDT properly do its job of policing timeliness. More sophisticated versions of this code are possible depending upon the system involved; this is a classroom example of good watchdog timer use. Where "taskw" is run from depends on the scheduling strategy and how tight the watchdog timer interval is, but it is common to run it in a low-priority task.

Setting the timing of the watchdog system is also important. If the goal is to ensure that a task is being executed at least every 5 msec, then setting the watchdog timer at 800 msec doesn’t tell you there is a problem until that task is 795 msec late. Watchdog timers should be set reasonably close to the period of the slowest task that is kicking them, with just a little extra time beyond what is required to account for execution variation and scheduling jitter.

If watchdog timer resets are seen during testing they should be investigated. If an acceptable real time scheduling approach is used, a watchdog timer reset should never occur unless there has been system failure. Thus, finding out the root cause for each watchdog timer reset recorded is an essential part of safety critical design. For example, in an automotive context, any watchdog timer event recordings could be stored in the vehicle until it is taken in for maintenance. During maintenance, a technician’s computer should collect the event recordings and send them back to the car’s manufacturer via the Internet.

While watchdog timers can't detect all problems, a good watchdog timer implementation is a key foundation of creating a safe embedded control system. It is a negligent design omission to fail to include an acceptable watchdog timer in a safety critical system.

Selected Sources

Watchdog timers are a classical approach to ensuring system reliability, and are a pervasive hardware feature on single-chip microcontrollers for this reason.

An early scholarly reference is a survey paper of existing approaches to industrial process control (Smith 1970, p. 220). Much more recently, Ball discusses the use of watchdog timers, and in particular the need for every task to participate in kicking the watchdog. (Ball 2002, pp. 81-83). Storey points out that while they are easy to implement, watchdog timers do have distinct limitations that must be taken into account (Storey 1996, pg. 130). In other words, watchdog timers are an important accepted practice that must be designed well to be effective, but even then they only mitigate some types of faults.

Lantrip sets forth an example of how to ensure multiple tasks work together to use a watchdog timer properly. (Lantrip 1997). Ganssle discusses how to arrange for all tasks to participate in kicking the watchdog, ensuring that some tasks don’t die while others stay alive. (Ganssle 2000, p. 125).

Brown specifically discusses good and bad practices. “I’ve seen some multitasking systems that use an interrupt to tickle the watchdog. This approach defeats the whole purpose for having one in the first place. If all the tasks were blocked and unable to run, the interrupt method would continue to service the watchdog and the reset would never occur. A better solution is to use a separate monitor task that not only tickles the watchdog, but monitors the other system tasks as well.” (Brown 1998 pg. 46).

The MISRA Software Guidelines recommend using a watchdog to detect failed tasks (MISRA Report 1, p. 43), noting that tasks (which they call “processes”) may fail because of noise/EMI, communications failure, software defects, or hardware faults. The MISRA Software Guidelines say that a “watchdog is essential, and must not be inhibited,” while pointing out that returning an engine to idle in a throttle-by-wire application could be unsafe. (MISRA Report 1, p. 49). MISRA also notes that “The consequence of each routine failing must be identified, and appropriate watchdog and default action specified.” (MISRA Report 4 p. 33, emphasis added)

NASA recommends using a watchdog and emphasizes that it must be able to detect death of all tasks (NASA 2004, p. 93). IEC 61508-2 lists a watchdog timer as a form of test by redundant hardware (pg. 115) (without implying that it provides complete redundancy).

Addy identified a task death failure mode in a case study (Addy 1991, pg. 79) due to a task encountering a run-time fault that was not properly caught, resulting in the task never being restarted. Thus, it is reasonably conceivable that a task will die in a multitasking operating system. Inability to detect a task death is a defect in a watchdog timer, and a defective watchdog timer approach undermines the safety of the entire system. With such a defective approach, it would be expected that task deaths or other foreseeable events will go undetected by the watchdog timer.

References:
  • Addy, E., A case study on isolation of safety-critical software, Proc. Conf Computer Assurance, pp. 75-83, 1991.
  • Ball, Embedded Microprocessor Systems: Real World Design, Newnes, 2002.
  • Brown, D., “Solving the software safety paradox,” Embedded System Programming, December 1998, pp. 44-52.
  • Ganssle, J., The Art of Designing Embedded Systems, Newnes, 2000.
  • IEC 61508, Functional Safety of Electrical/Electronic/Programmable Electronic Safety-related Systems (E/E/PE, or E/E/PES), International Electrotechnical Commission, 1998. 
  • Lantrip, D. & Bruner, L., General purpose watchdog timer component for a multitasking system, Embedded Systems Programming, April 1997, pp. 42-54.
  • MISRA, Report 1: Diagnostics and Integrated Vehicle Systems, February 1995.
  • MISRA, Report 3: Noise, EMC and Real-Time, February 1995.
  • MISRA, Report 4: Software in Control Systems, February 1995.
  • NASA-GB-8719.13, NASA Software Safety Guidebook, NASA Technical Standard, March 31, 2004.
  • Smith, Digital control of industrial processes, ACM Computing Surveys, Sept. 1970, pp. 211-241.
  • Storey, N., Safety Critical Computer Systems, Addison-Wesley, 1996.





Task Death In Safety Critical Systems

Safety critical systems must be designed with an expectation that one or more tasks on the same CPU might fail via hanging, might fail to be scheduled, or otherwise might not execute in a periodic manner as intended, leading to a partial software failure.

Consequences:
Failing to monitor each and every task and failing to mitigate task execution faults can lead to a task “dying” without being recovered. If a task dies, it no longer performs its functions even though other tasks are still operational. This can leave the system in an uncontrolled state, cause loss of fail-safe functionality (if the fail-safe is in a dead task), cause stale recorded failure data (if the task collecting that data dies), etc.

Accepted Practice:
  • Each task must be continually monitored for completion within a defined period. This ensures that the task is scheduled and that it is completed on time. A properly designed watchdog timer that monitors all tasks is an accepted way to accomplish this.
  • If a task fails to periodically complete execution, fault mitigation must be performed. Common fault mitigation techniques include restarting the task or restarting the system. 
Discussion

Embedded systems typically have one or more periodic software tasks that must complete execution within some maximum period (each task has its own deadline, and by default the deadline equals the period unless some other deadline is specified). In a properly designed system, careful analysis and testing has been performed to ensure that each task will continue operation and meet its deadline even under worst case operating conditions.

However, competent safety-critical embedded system designers realize that faults may occur, resulting in a partial failure of the software system, and plan accordingly. Each task must be monitored to ensure the following: each task executes at least once in every period that it is supposed to execute; each task finishes execution without abnormally terminating; and each task completes execution within its predetermined deadline. This means that the system must not only make sure that once a task is scheduled to run it meets its deadline, but also make sure that the task is scheduled to run in the first place.
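As a concrete illustration of this kind of monitoring, the sketch below shows a supervisor that checks per-task completion counters against per-task time limits and invokes a mitigation action when a task stops making progress. The task names, tick limits, and restart_task() mitigation are hypothetical placeholders; a real design would tie the mitigation policy (task restart, fail-safe mode, or system reset) to the system's safety analysis.

    #include <stdint.h>
    #include <stdio.h>

    /* Hypothetical task identifiers -- illustrative only. */
    enum { TASK_SENSOR, TASK_CONTROL, TASK_ACTUATOR, NUM_TASKS };

    /* Maximum allowed monitor ticks between completions of each task. */
    static const uint32_t max_ticks[NUM_TASKS] = { 2, 5, 5 };

    static volatile uint32_t completion_count[NUM_TASKS]; /* bumped by each task */
    static uint32_t last_count[NUM_TASKS];
    static uint32_t ticks_since_progress[NUM_TASKS];

    /* Each monitored task calls this as the last step of every execution cycle. */
    void task_report_completion(int task_id)
    {
        completion_count[task_id]++;
    }

    /* Stand-in mitigation action; a real system might restart the task via the
       RTOS, switch to a fail-safe mode, or request a system reset. */
    static void restart_task(int task_id)
    {
        printf("task %d stopped running -- restarting it\n", task_id);
    }

    /* Run periodically from a supervisor context: verify that every task has
       completed within its allowed interval, and mitigate if it has not. */
    void monitor_tick(void)
    {
        for (int i = 0; i < NUM_TASKS; i++) {
            if (completion_count[i] != last_count[i]) {
                last_count[i] = completion_count[i];     /* task made progress */
                ticks_since_progress[i] = 0;
            } else if (++ticks_since_progress[i] > max_ticks[i]) {
                restart_task(i);                         /* task appears dead */
                ticks_since_progress[i] = 0;
            }
        }
    }

    int main(void)
    {
        /* Simulation: all tasks run for a while, then TASK_CONTROL dies. */
        for (int tick = 0; tick < 20; tick++) {
            task_report_completion(TASK_SENSOR);
            task_report_completion(TASK_ACTUATOR);
            if (tick < 5) {
                task_report_completion(TASK_CONTROL);
            }
            monitor_tick();
        }
        return 0;
    }

In practice such a monitor is commonly combined with the watchdog arrangement described earlier, so that death of the monitor itself still results in a hardware watchdog reset.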

As an analogy, consider if you have a set of things you need to do every evening, but have a bad memory. One way you might handle this is to put post-it notes on your fridge to remind yourself what needs to be done every evening before you go to sleep. Let’s say you’re very methodical, so you take each note down in turn, do the action, then put it back on the fridge, in order. One of the notes says “check that basement door is locked” because you’ve found the kids have a habit of forgetting to lock it when they come in from playing. For a while this works fine. But one day you get distracted while locking the basement door and set the note down, forgetting to put it back on the fridge. The next night you might forget to check the door and not even realize it (at least not until you wake up in the middle of the night wondering about it – but embedded CPUs don’t have that feature!). What has happened is that a disruption caused one of your tasks to fall off the scheduling list. Just because you finished the entire list before morning doesn’t mean everything got done, because something was missing from the list. The same thing can happen in a computer if one of its tasks dies and doesn’t get put back on the task to-do list – the computer “thinks” it is running all its tasks, but in reality one of the tasks hasn’t been run because it isn’t even on the to-do list.

Selected Sources

Kleidermacher points out that tasks can die, saying “When a thread faults (for example, due to a stack overflow), the kernel should provide some mechanism whereby notification can be sent to the supervisor thread. If necessary, the supervisor can then make a system call to close down the faulted thread, or the entire process, and restart it. The supervisor might also be hooked into a software ‘watchdog’ setup, whereby thread deadlocks and starvation can be detected as well.” (Kleidermacher 2001, pg. 23).

Ganssle refers to the necessity of checking all tasks by saying “This problem multiplies in a system with an RTOS, as a reliable watchdog monitors all of the tasks. If some of the tasks die but others stay alive – perhaps tickling the [Watchdog Timer] – then the system’s operation is at best degraded.”  (Ganssle 2000, p. 125)

Tasks can be expected to die due to a variety of causes, and this is especially likely to happen in a system without hardware memory protection, such as a typical OSEK-based system, because software from one task can accidentally corrupt another task's data. Task death is an expected possibility in such an RTOS. For example, a technical summary of the OSEK operating system shows “Failed” as a task state, indicating that tasks can be expected to fail and need to be restarted to restore system operation. (Feiler 2003, Fig 1, shown below).


OSEK tasks can be expected to fail, requiring a task restart. (Feiler 2003, Fig. 1.)


Another contemporaneous real time operating system, VxWorks, did not support memory protection and did not support automatic task restart. As a result, it was dramatically more prone to subtle but significant partial software failures if software applications misbehaved (Koopman 2008, pp. 220-221, recounting work done several years earlier).  In particular, parts of the system would appear to be working, but key pieces of the system would have failed, giving a superficial appearance of a system that was working (the keyboard accepted commands and displayed them on the computer screen) when in fact the system was mostly dead, and programs could not be run on it. (VxWorks has since added memory protection capabilities, so these experimental results do not necessarily represent current VxWorks systems.)

Safety critical system designers should reasonably expect that a task might die (cease to finish computation in a periodic manner). They should have a complete analysis to determine which task deaths might lead to an unsafe system (by default, this is all tasks in the system), and should take proactive measures to detect and recover from one or more tasks dying, regardless of the underlying cause -- even if no specific cause for the task death can be imagined by the designers.

References:
  • Feiler, Real-time application development with OSEK: a review of the OSEK standards, Carnegie Mellon University Software Engineering Institute, CMU/SEI-2003-TN-004, Nov 2003.
  • Ganssle, J., The Art of Designing Embedded Systems, Newnes, 2000.
  • Kleidermacher, D. & Griglock, M., Safety-Critical Operating Systems, Embedded Systems Programming, Sept. 2001, pp. 22-36.
  • Koopman, P., DeVale, K. & DeVale, J., "Interface robustness testing: experiences and lessons learned from the Ballista Project," In: Kanoun, K. & Spainhower, L., Eds., Dependability Benchmarking for Computer Systems, IEEE Press, 2008, pp. 201-226.


Real Time Scheduling Analysis for Critical Systems

Critical embedded software must be created using methodical scheduling analysis that ensures that every task can meet its deadline under worst-case fault-free operational conditions. Looking just at CPU load is not only inadequate to determine if you will meet real time deadlines -- it is a specifically bad practice that is to be avoided.

Consequences:
Failing to do an accurate real time schedule design for a safety critical system means that designers do not know if the system will miss deadlines. It is impractical in a complex system (which means almost any real-world product) to do enough testing with a multitasked RTOS-based system to ensure that deadlines will be met under worst case conditions. Thus, a design team that fails to do scheduling analysis should appreciate that they have an unknown probability of missing deadlines, and can reasonably expect to miss deadlines in the worst case if the CPU load is above about 69% for a large number of non-harmonic tasks.

 Accepted Practices:
  • Using Rate Monotonic Analysis (RMA) when using an RTOS with multi-tasking capability, or using some other mathematically sound scheduling analysis. Rate Monotonic Scheduling (RMS) is what you get as a result of RMA.
  • Documenting the RMA analysis including at least listing: all tasks, WCET for each task, period of each task, and worst case system blocking time. Assumptions used in the analysis should be stated and justified (for example, an assumption that no low priority task blocks a high priority task must be justified via written explanation, or if untrue then advanced techniques must be used to ensure schedulability). CPU utilization must be no more than 100% if a set of favorable assumptions such as harmonic task periods can be documented to hold, and may be restricted to as low as 69.3% CPU used (meaning 30.7% unused CPU capacity) in the general case.
  • A specifically bad practice is basing real time performance decisions solely on spare capacity (e.g., “CPU is only 80% loaded on average”) in the absence of mathematical scheduling analysis, because it does not guarantee safety critical tasks will meet their deadlines. Similarly, monitoring spare CPU capacity as the only way to infer whether deadlines are being met is a specifically bad practice, because it does not actually tell you whether high frequency deadlines are being met or not.
  • Permitting more than one instance of a real-time task to queue is a specifically bad practice, because this can only happen when real time deadlines are being missed. This practice is a bandaid for a real-time system, and indicates that the system is missing its real-time deadlines.
Discussion

Real time scheduling is a mathematical approach to ensuring that every task in a real time embedded system meets its deadlines under all specified operating conditions. Using a mathematical approach is required because testing can only exercise some of the system operating conditions. There are almost always real time scheduling situations that can’t be adequately tested in a complex piece of software, requiring either simplifying the system to make testing feasible, or using a rigorous mathematical approach to ensure that complex scheduling is guaranteed to work. When a multi-tasking real time operating system such as OSEK is used, a mathematical approach must be used to ensure deadlines are met.

Rate Monotonic Analysis

The generally accepted method for scheduling critical real time systems is to perform Rate Monotonic Analysis (RMA), possibly with one of a number of adaptations for special circumstances, and create a system with a Rate Monotonic Schedule. RMA has the virtue of providing a simple set of rules that guarantees all tasks in a system will meet their deadlines. To achieve this, RMA requires rules to be followed, such as every task having a defined fastest period at which it will run; using 100% of the CPU capacity additionally requires that task periods be harmonic multiples of each other. (Task periods are considered harmonic if they are exact multiples of each other. For example, periods of 1, 10, 100, 1000 msec are harmonic, but 1, 9, 98, 977 msec are not harmonic.) To implement RMA, designers sort tasks based on period, and assign the fastest task to the highest priority, second fastest task to the second highest priority, and so on. If a designer wishes to have multiple tasks at the same priority that is acceptable, but it is required that all tasks at a given priority execute at the same period.

The benefit of RMA is that there is a mathematical proof that all tasks will meet deadlines if a certain set of rules is followed. If any of the rules is bent, however, then more complex mathematical analysis must be performed or other special techniques used to ensure that deadlines will be met. In particular, if task periods are not harmonic multiples of each other, guaranteeing that deadlines will be met requires leaving slack capacity on the CPU. For a system with many tasks, RMA can only guarantee that task deadlines will be met if the CPU load is a maximum of 69.3%. (For this reason, it is common for embedded systems to use harmonic multiples of task periods.)
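The 69.3% figure comes from the Liu & Layland utilization bound, U <= n*(2^(1/n) - 1), which approaches ln(2), about 0.693, as the number of tasks n grows. The sketch below applies this sufficient (but not necessary) test to a hypothetical task set with non-harmonic periods; the periods and execution times shown are illustrative only.

    #include <stdio.h>
    #include <math.h>

    #define NUM_TASKS 4

    /* Hypothetical task set with non-harmonic periods (values are illustrative). */
    static const double period[NUM_TASKS] = { 5.0, 12.0, 27.0, 65.0 };  /* msec */
    static const double wcet[NUM_TASKS]   = { 1.0,  2.0,  5.0, 10.0 };  /* msec */

    int main(void)
    {
        double utilization = 0.0;
        for (int i = 0; i < NUM_TASKS; i++) {
            utilization += wcet[i] / period[i];
        }

        /* Liu & Layland (1973) bound: n * (2^(1/n) - 1), about 69.3% for large n. */
        double bound = NUM_TASKS * (pow(2.0, 1.0 / NUM_TASKS) - 1.0);

        printf("CPU utilization = %.1f%%, RMA bound for %d tasks = %.1f%%\n",
               utilization * 100.0, NUM_TASKS, bound * 100.0);

        if (utilization <= bound) {
            printf("Deadlines are guaranteed by the utilization bound test.\n");
        } else {
            printf("Bound exceeded: a more detailed analysis (or harmonic periods) "
                   "is needed to show schedulability.\n");
        }
        return 0;
    }

Note that failing the bound test does not prove deadlines will be missed; it only means the simple test is inconclusive and a more detailed analysis is required.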


Example execution of a system scheduled with RMA.
(Source: http://blogs.abo.fi/alexeevpetr/2011/11/24/generating-multithread-code-from-simulink-model-for-embedded-target/)

Employing RMA requires at least the following steps at design time: listing all tasks in the system including interrupt service routines, determining the fastest period of each task, determining the worst case execution time (WCET) of each task, and determining the maximum blocking time of the system (longest time during which interrupts are disabled). Given those numbers, tasks can be assigned priorities and designers can know if the system will always meet its deadlines. Given correct RMA scheduling and a fault-free operational system, every task will complete by the end of its period, so it will never be possible to have a second occurrence of a task enqueued while waiting for a first occurrence to complete.
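One widely used way to turn those numbers (periods, worst case execution times, and blocking time) into a yes/no schedulability answer is iterative response time analysis, sketched below for a hypothetical three-task set. This is a simplified illustration, assuming deadlines equal to periods, a single worst case blocking term, and tasks already sorted into rate monotonic priority order; it is not a complete treatment of the technique.

    #include <stdio.h>

    #define NUM_TASKS 3

    /* Hypothetical task set, sorted by period (rate monotonic priority order).
       Times are in microseconds; the values are illustrative only. */
    static const unsigned long period[NUM_TASKS] = { 5000, 10000, 20000 };
    static const unsigned long wcet[NUM_TASKS]   = { 1000,  2500,  5000 };
    static const unsigned long blocking = 200;  /* worst case blocking time */

    /* Iterative response-time calculation for task i (0 = highest priority). */
    static unsigned long response_time(int i)
    {
        unsigned long r = wcet[i] + blocking;
        for (;;) {
            unsigned long next = wcet[i] + blocking;
            for (int j = 0; j < i; j++) {
                /* ceil(r / period[j]) preemptions by each higher priority task */
                next += ((r + period[j] - 1) / period[j]) * wcet[j];
            }
            if (next == r) { return r; }            /* converged */
            if (next > period[i]) { return next; }  /* already past the deadline */
            r = next;
        }
    }

    int main(void)
    {
        for (int i = 0; i < NUM_TASKS; i++) {
            unsigned long r = response_time(i);
            printf("Task %d: worst case response %lu, deadline %lu -> %s\n",
                   i + 1, r, period[i], (r <= period[i]) ? "OK" : "MISSES DEADLINE");
        }
        return 0;
    }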

If the CPU is over-subscribed, there are established methods of ensuring that critical tasks always complete while non-critical tasks get whatever CPU resources are left. The simplest method is to simply assign all non-critical tasks priorities lower than the lowest priority critical task (which works so long as those non-critical tasks can’t substantively interfere with the operation of the critical tasks). If it is important to share available CPU time across a number of noncritical tasks even when the CPU is overloaded, this can be done by having non-critical tasks take turns at the lowest priority. Note that an overloaded CPU does not cause any critical tasks to miss their deadlines in this scenario; it is simply a matter of making non-critical tasks wait a bit longer to execute if the CPU is overloaded. (This assumes that the CPU can handle worst case demand from critical tasks. This assumption must be ensured by the design process for the system to be safe.)

Less Than 100% CPU Load Does Not Guarantee Deadlines Are Met

A specifically bad practice is looking at idle time during testing to determine whether or not the CPU is overloaded, and inferring from less than 100% usage that the system will meet its deadlines. That simply does not account for the worst case, or potentially even just an infrequent heavy load case.

As a non-computer example of missing deadlines with less than 100% loading, let’s say you want to spend five days per week working and three days out of every 12 volunteering at a homeless shelter (the fact that it is out of 12 days instead of out of 14 days makes these periods non-harmonic). Because work has a period of 7 days, it will have higher priority than the 12 day volunteer service period. This means you’ll complete all of your work the first 5 days (Monday – Friday) out of each 7-day work week before you start volunteering on weekends. Most weeks this will work fine (you’ll have enough time for both). But when the 12-day volunteer period starts on a Monday, you’ll only have one weekend (two days) in the 12 day period for volunteering that runs Monday of one week through Friday of the next week.  Thus you’ll be a day short for volunteering whenever the volunteer period starts on a Monday. The amount of time you’ve committed is 5 days out of 7 plus 3 days out of 12:  5/7 + 3/12 = 96.4%. But you’re going to come up a whole day short on volunteering once in a while even though you are less than 100% committed and you are taking some weekend days off, performing neither task. You could solve this by committing 4 days out of 14 to volunteering, which is actually a slightly higher workload (100% total), but changes the period to be harmonic with the weekly work cycle. (3.5 days out of every 14 would also work.) Thus, as shown by this example, non-harmonic task periods can result in missed deadlines even with a workload that is less than 100%.

Example of 79% loaded CPU missing a task deadline with only 4 tasks.

As a computer-based example, the above figure shows a task in a four-task system missing its deadline on a system with non-harmonic periods even though the CPU is only 79% loaded. This scenario would happen rarely, and only when all the tasks’ periods synchronize, which for this example would be only once every Least Common Multiple of (19, 24, 29, 34) = 224,808 time units – and even then only if each task actually took its worst case time to execute. Thus, while Task 4 would be expected to miss its deadline in operation, detecting that situation might be difficult with limited duration testing.
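The exact execution times behind the figure are not reproduced here, but the effect can be demonstrated with a small simulation. The sketch below uses the same non-harmonic periods (19, 24, 29, 34) with hypothetical execution times chosen to give roughly 79% CPU load, simulates rate monotonic preemptive scheduling over one hyperperiod, and reports any deadline misses; with these particular numbers the lowest priority task misses its deadline even though utilization is only about 79.7%.

    #include <stdio.h>

    #define NUM_TASKS 4

    /* Periods from the example above; execution times are hypothetical values
       chosen to give roughly 79% CPU utilization. Tasks are listed in rate
       monotonic priority order (shortest period = highest priority). */
    static const unsigned long period[NUM_TASKS] = { 19, 24, 29, 34 };
    static const unsigned long wcet[NUM_TASKS]   = {  4,  5,  5,  7 };

    int main(void)
    {
        unsigned long remaining[NUM_TASKS] = { 0 };  /* work left in current job */
        const unsigned long hyperperiod = 224808UL;  /* LCM(19, 24, 29, 34) */

        for (unsigned long t = 0; t < hyperperiod; t++) {
            /* Release a new job of each task at the start of its period; a job
               still unfinished at that point has missed its deadline. */
            for (int i = 0; i < NUM_TASKS; i++) {
                if (t % period[i] == 0UL) {
                    if (remaining[i] > 0UL) {
                        printf("Task %d missed its deadline at t = %lu\n", i + 1, t);
                    }
                    remaining[i] = wcet[i];          /* (late work is dropped) */
                }
            }
            /* Run the highest priority ready task for one time unit. */
            for (int i = 0; i < NUM_TASKS; i++) {
                if (remaining[i] > 0UL) {
                    remaining[i]--;
                    break;
                }
            }
        }
        return 0;
    }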

Selected Sources

Rate Monotonic Scheduling was developed to address the problem of creating an easy-to-follow recipe for ensuring that real time schedules can be met in a system with multiple tasks. Influential early papers include Liu & Layland (1973) and Lehoczky et al. (1989).

Ganssle recommends the use of RMA as early as 1992 (Ganssle 1992, pg. 200-201) and sketches its use, giving a reference for more details.

Douglass provides a pattern for static priority real time scheduling and states: “the most common policy for the selection of policies is rate monotonic scheduling or RMS”.  (Douglass 2002, section 5.9.5; note that rate monotonic scheduling is what you get as a result of rate monotonic analysis).

In the context of safety critical operating systems, Kleidermacher says “Rate monotonic analysis (RMA) is frequently used by system designers to analyze and predict the timing behavior of systems.” (Kleidermacher 2001, pg. 30).

MISRA Report 3 discusses the use of real-time kernels (which for our purposes are operating systems that use some sort of scheduling approach). It notes that there are a number of accepted scheduling techniques, and that the use of “best available technology,” such as a certified RTOS, brings benefit, providing a reference to a number of texts.

References:
  • Douglass, Doing Hard Time: Developing Real-Time Systems with UML, Objects, Frameworks, and Patterns, Addison-Wesley Professional, 1999.
  • Ganssle, J., The Art of Programming Embedded Systems, Academic Press, 1992.
  • Kleidermacher, D. & Griglock, M., Safety-Critical Operating Systems, Embedded Systems Programming, Sept. 2001, pp. 22-36.
  • Lehoczky, J.; Sha, L.; Ding, Y. "The rate monotonic scheduling algorithm: exact characterization and average case behavior", IEEE Real-Time Systems Symposium, 1989, pp. 166–171.
  • Liu, C. L.; Layland, J., "Scheduling algorithms for multiprogramming in a hard real-time environment", Journal of the ACM, 1973, (1): 46–61
  • MISRA, Report 3: Noise, EMC and Real-Time, February 1995.

Mitigating Data Corruption

As previously discussed, data corruption from bit flips can happen from a variety of sources. The results from such faults can be catastrophic. An effective technique should be used to detect both software- and hardware-caused corruption of critical data. Effective techniques include error coding and data replication in various forms.

Consequences:
Unintentional modification of data values can cause arbitrarily bad system behavior, even if only one bit is modified. Therefore, safety critical systems should take measures to prevent or detect such corruptions, consistent with the estimated risk of hardware-induced data corruption and anticipated software quality. Note that even best practices may not prevent corruption due to software concurrency defects.

Accepted Practices:

  • When hardware data corruption may occur and automatic error correction is desired, use hardware single error correction/multiple error detection (SECMED) circuitry, a form of Error Detection and Correction (EDAC) circuitry sometimes just called Error Correcting Code (ECC) circuitry, for all bytes of RAM. This protects against hardware memory corruption, including hardware corruption of operating system variables. However, it does not protect against software memory corruption.
  • Use a software approach such as a cyclic redundancy check (CRC) (preferred) or a checksum to detect a corrupted program image, and test for corruption at least when the system is booted.
  • Use a software approach such as keeping redundant copies to detect software data corruption of RAM values.
  • Use fault injection to test data corruption detection and correction mechanisms.
  • Perform memory tests to ensure there are no hard faults.

Discussion:

Safety critical systems must protect against data corruption to avoid small changes in data which can render the system unsafe. Even a single bit in memory changing in the wrong way could cause a system to change from being safe to unsafe. To guard against this, various schemes for memory corruption detection and prevention are used.


Hardware and Software Are Both Corruption Sources

Hardware memory corruption occurs when a radiation event, voltage fluctuation, source of electrical noise, or other cause makes one or more bits flip from one value to another. In non-volatile memory such as flash memory, wearout, low programming voltage, or electrical charge leakage over time can also cause bits in memory to have an incorrect value. Mitigation techniques for these types of memory errors include the use of hardware error detection/correction codes (sometimes called “EDAC”) for RAM, and typically the use of a “checksum” for flash memory to ensure that all the bytes, when “added up,” give the same total as they did when the program image was stored in flash.
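As a small illustration of the program image check described above, the sketch below computes a simple additive checksum over a stand-in image buffer and detects a simulated corrupted byte. In a real system the expected value is computed by the build tools and stored with the image in flash, and a CRC (discussed further below) is preferred over a plain sum because it detects many error patterns that a simple addition misses.

    #include <stdint.h>
    #include <stdio.h>

    /* Simple additive checksum over a block of bytes. */
    static uint32_t image_checksum(const uint8_t *data, uint32_t length)
    {
        uint32_t sum = 0u;
        for (uint32_t i = 0u; i < length; i++) {
            sum += data[i];
        }
        return sum;
    }

    int main(void)
    {
        /* Stand-in for a program image stored in flash. */
        uint8_t image[64] = { 0x12, 0x34, 0x56, 0x78 };  /* remaining bytes are zero */
        uint32_t stored_checksum = image_checksum(image, sizeof(image));

        image[10] ^= 0x40u;   /* simulate a corrupted byte in the image */

        if (image_checksum(image, sizeof(image)) != stored_checksum) {
            printf("Program image corrupted -- do not start the application.\n");
        } else {
            printf("Program image OK.\n");
        }
        return 0;
    }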

If hardware memory error detection support is not available, RAM can also be protected with some sort of redundant storage. A common practice is to store two copies of a value in two different places in RAM, often with one copy inverted or otherwise manipulated. It is important to avoid storing the two copies next to each other, so that an error which corrupts adjacent bits in memory cannot damage both copies. Rather, there should be two entirely different sections of memory for mirrored variables, with each section holding only one copy of each mirrored variable. That way, if a small chunk of memory is arbitrarily corrupted, it can at most affect one of the two copies of any mirrored variable. Error detection codes such as checksums can also be used, and provide a tradeoff: more computation time whenever a value is changed, in exchange for less storage space than simple replication of the data.

Software memory corruption occurs when one part of a program mistakenly writes data to a place that should only be written to by another part of the program due to a software defect. This can happen as a result of a defect that produces an incorrect pointer into memory, due to a buffer overflow (e.g., trying to put 17 bytes of data into a 16 byte storage area), due to a stack overflow, or due to a concurrency defect, among other scenarios.

Hardware error detection does not help in detecting software memory corruption, because the hardware will ordinarily assume that software has permission to make any change it likes. (There may be exceptions if hardware has a way to “lock” portions of memory from modifications, which is not the case here.) Software error detection may help if the corruption is random, but may not help if the corruption is a result of defective software following authorized RAM modification procedures that just happen to point to the wrong place when modifications are made. While various approaches to reduce the chance of accidental data corruption can be envisioned, acceptable practice for safety critical systems in the absence of redundant computing hardware calls for, at a minimum, storing redundant copies of data. There must also be a recovery plan such as system reboot or restoration to defaults if a corruption is detected.

Data Mirroring

A common approach to providing data corruption protection is to use a data mirroring approach in which a second copy of a variable having a one’s complement value is stored in addition to the ordinary variable value. A one’s complement representation of a number is computed by inverting all the bits in a number. So this means one copy of the number is stored normally, and the second copy of that same number is stored with all the bits inverted (“complemented”). As an example, if the original number is binary “0000” the one’s complement mirror copy would be “1111.” When the number is read, both the “0000” and the “1111” are read and compared to make sure they are exact complements of each other. Among other things, this approach gives protection against a software defect or hardware corruption that sets a number of RAM locations to all be the same value. That sort of corruption can be detected regardless of the constant corruption value put into RAM, because two mirrored copies can’t have the same value unless at least one of the pair has been corrupted (e.g., if all zeros are written to RAM, both copies in a mirrored pair will have the value “0000,” indicating a data corruption has occurred).

Mirroring can also help detect hardware bit flip corruptions. A bit flip is when a binary value (a 0 or 1), is corrupted to have the opposite value (changing to a 1 or 0 respectively), which in turn corrupts the value of the number stored at the memory location suffering one or more bit flips. So long as only one of two mirror values suffers a bit flip, that corruption will be detectable because the two copies won’t match properly as exact complements of each other.
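A minimal sketch of this mirroring scheme is shown below, using a hypothetical variable name. For brevity the two copies are ordinary file-scope variables; in a real system the linker would place the mirror copies in a separate RAM section, far away from the originals, as discussed next.

    #include <stdint.h>
    #include <stdbool.h>
    #include <stdio.h>

    static uint16_t speed_value;    /* normal copy */
    static uint16_t speed_mirror;   /* one's complement copy, ideally stored far away */

    static void mirrored_write(uint16_t value)
    {
        speed_value  = value;
        speed_mirror = (uint16_t)~value;   /* inverted copy */
    }

    /* Returns true and the value if the copies agree; false if corruption is found. */
    static bool mirrored_read(uint16_t *value_out)
    {
        if ((uint16_t)~speed_mirror != speed_value) {
            return false;                  /* mismatch: at least one copy corrupted */
        }
        *value_out = speed_value;
        return true;
    }

    int main(void)
    {
        uint16_t v;
        mirrored_write(1234u);
        speed_value ^= 0x0008u;            /* simulate a single bit flip */
        if (!mirrored_read(&v)) {
            printf("Corruption detected -- apply the recovery plan (reset/defaults).\n");
        }
        return 0;
    }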

A good practice is to ensure that the mirrored values are not adjacent to each other so that an erroneous multi-byte variable update is less likely to modify both the original and mirrored copy. Such mirrored copies are vulnerable to a pair of independent bit flips that just happen to correspond to the same position within each of a pair of complemented stored values. Therefore, for highly critical systems a Cyclic Redundancy Check (CRC) or other more advanced error detection method is recommended.
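Where a CRC is used, one common arrangement (along the lines of the Stepner suggestion quoted below) is to group critical variables together and keep a CRC over the group, recomputing it after every authorized update and checking it before the values are used. The sketch below is illustrative only: the variable names are hypothetical, and CRC-16-CCITT is just one of many suitable codes.

    #include <stdint.h>
    #include <stddef.h>
    #include <stdbool.h>
    #include <stdio.h>

    /* Hypothetical group of critical variables protected as a unit by a CRC. */
    struct critical_data {
        uint16_t target_speed;
        uint16_t throttle_position;
        uint8_t  mode;
    };

    static struct critical_data critical;
    static uint16_t critical_crc;

    /* CRC-16-CCITT, bitwise implementation (polynomial 0x1021, initial value 0xFFFF). */
    static uint16_t crc16_ccitt(const uint8_t *data, size_t len)
    {
        uint16_t crc = 0xFFFFu;
        while (len--) {
            crc ^= (uint16_t)(*data++) << 8;
            for (int bit = 0; bit < 8; bit++) {
                crc = (crc & 0x8000u) ? (uint16_t)((crc << 1) ^ 0x1021u)
                                      : (uint16_t)(crc << 1);
            }
        }
        return crc;
    }

    /* Recompute the CRC after every authorized update of the group... */
    static void critical_update_crc(void)
    {
        critical_crc = crc16_ccitt((const uint8_t *)&critical, sizeof(critical));
    }

    /* ...and verify it before the values are used. */
    static bool critical_data_ok(void)
    {
        return crc16_ccitt((const uint8_t *)&critical, sizeof(critical)) == critical_crc;
    }

    int main(void)
    {
        critical.target_speed = 500u;
        critical_update_crc();

        ((uint8_t *)&critical)[1] ^= 0x04u;   /* simulate corruption of one byte */

        printf("critical data %s\n", critical_data_ok() ? "OK" : "CORRUPTED");
        return 0;
    }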

It is important to realize that all memory values that can conceivably cause a system hazard need to be protected by mirroring, not just a portion of memory. For example a safety-critical Real Time Operating System will have values in memory that control task scheduling. Corruption of these variables can lead to task death or other problems if the RTOS doesn't protect data integrity, even if the application software does use mirroring. Note that there are multiple ways for an RTOS to protect its data integrity from software and hardware defects beyond this, such as via using hardware access protection. But, if the only mechanism being used in a system to prevent memory corruption is mirroring, the RTOS has to use it too or you have a vulnerability.

Selected Sources

Automotive electronics designers knew as early as 1982 that data corruption could be expected in automotive electronics. Seeger writes: “Due to the electrically hostile environment that awaits a microprocessor based system in an automobile, it is necessary to use extra care in the design of software for those systems to ensure that the system is fault tolerant. Common faults that can occur include program counter faults, altered RAM locations, or erratic sensor inputs.” (Seeger 1982, abstract, emphasis added). Automotive designers generally accepted the fact that RAM location disruptions would happen in automotive electronics (due to electromagnetic interference (EMI), radiation events, or other disturbances), and had to ensure that any such disruption would not result in an unsafe system.

Stepner, in a paper on real time operating systems that features a discussion of OSEK (the trade name of an automotive-specific real time operating system), states with regard to avoiding corruption of data: “One technique is the redundant storage of critical variables, and comparison prior to being used. Another is the grouping of critical variables together and keeping a CRC over each group.” (Stepner 1999, pg. 155).

Brown says “We’ve all heard stories of bit flips that were caused by cosmic rays or EMI” and goes on to describe a two-out-of-three voting scheme to recover from variable corruption. (Brown 1998 pp. 48-49). A variation of keeping only two copies permits detection but not correction of corruption. Brown also acknowledges that designers must account for software data corruption, saying “Another, and perhaps more common, cause of memory corruption is a rogue pointer, which can run wild through memory leaving a trail of corrupted variables in its wake. Regardless of the cause, the designer of safety-critical software must consider the threat that sometime, somewhere, a variable will be corrupted.” (id., p. 48).

Kleidermacher says: “When all of an application’s threads share the same memory space, any thread could—intentionally or unintentionally— corrupt the code, data, or stack of another thread. A misbehaved thread could even corrupt the kernel’s own code or internal data structures. It’s easy to see how a single errant pointer in one thread could easily bring down the entire system, or at least cause it to behave unexpectedly.” (Kleidermacher 2001, pg. 23). Kleidermacher advocates hardware memory protection, but in the absence of a hardware mechanism, software mechanisms are required to mitigate memory corruption faults.

Fault injection is a way to test systems to see how they respond to faults in memory or elsewhere (see also an upcoming post on that topic). Fault injection can be performed in hardware (e.g., by exposing a hardware circuit to a radiation source or by using hardware test features to modify bit values), or injected via software means (e.g., slightly modifying software to permit flipping bits in memory to simulate a hardware fault). In a research paper, Vinter used a hybrid hardware/software fault injection technique to corrupt bits in a computer running an automotive-style engine control application. The conclusions of this paper start by saying: “We have demonstrated that bit-flips inside a central processing unit executing an engine control program can cause critical failures, such as permanently locking the engine’s throttle at full speed.” (Vinter 2001). Fault injection remains a preferred technique for determining whether there are data corruption vulnerabilities that can result in unsafe system behavior.
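A full fault injection campaign targets the running system, as in the hardware and hybrid approaches described above. The basic software side of the idea can nevertheless be illustrated in a few lines: systematically flip bits in a protected data block and confirm that the detection mechanism reports every injected fault. The block contents and the trivial XOR checksum below are purely illustrative stand-ins for whatever protection mechanism the real system uses.

    #include <stdint.h>
    #include <stddef.h>
    #include <stdio.h>

    #define BLOCK_SIZE 16u

    static uint8_t block[BLOCK_SIZE];
    static uint8_t block_check;

    /* Trivial XOR checksum, used here only as a stand-in detection mechanism. */
    static uint8_t xor_checksum(const uint8_t *data, size_t len)
    {
        uint8_t x = 0u;
        for (size_t i = 0; i < len; i++) { x ^= data[i]; }
        return x;
    }

    int main(void)
    {
        unsigned detected = 0u, injected = 0u;

        for (size_t i = 0; i < BLOCK_SIZE; i++) { block[i] = (uint8_t)(i * 7u); }
        block_check = xor_checksum(block, BLOCK_SIZE);

        for (size_t byte = 0; byte < BLOCK_SIZE; byte++) {
            for (int bit = 0; bit < 8; bit++) {
                block[byte] ^= (uint8_t)(1u << bit);        /* inject a bit flip */
                injected++;
                if (xor_checksum(block, BLOCK_SIZE) != block_check) {
                    detected++;                              /* corruption caught */
                }
                block[byte] ^= (uint8_t)(1u << bit);        /* undo the fault */
            }
        }
        printf("Detected %u of %u injected single-bit faults\n", detected, injected);
        return 0;
    }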

References:
  • Brown, D., “Solving the software safety paradox,” Embedded System Programming, December 1998, pp. 44-52.
  • Kleidermacher, D. & Griglock, M., Safety-Critical Operating Systems, Embedded Systems Programming, Sept. 2001, pp. 22-36.
  • Seeger, M., Fault-Tolerant Software Techniques, SAE Report 820106, International Congress & Exposition, Society of Automotive Engineers, 1982, pp. 119-125.
  • Stepner, D., Nagarajan, R., & Hui, D., Embedded application design using a real-time OS, Design Automation Conference, 1999, pp. 151-156.
  • Vinter, J., Aidemark, J., Folkesson, P. & Karlsson, J., Reducing critical failures for control algorithms using executable assertions and best effort recovery, International Conference on Dependable Systems and Networks, 2001, pp. 347-356.