Crashing after a prolonged up time due to a counter rollover or other problem is a classic mistake in computer software. And, it just bit the Boeing 787.
Again.
The Problem:
The Boeing 787 aircraft has three Flight Control Modules (FCMs) that are the subject of a new FAA Airworthiness Directive. Based on that sentence alone, you want to make sure whatever that involves gets fixed before you fly on a 787!
The FAA says there is "a report" that all three FCMs can fail at the same time 22 days after they have been rebooted. If you don't reboot the FCMs, the FAA says this "could result in flight control surfaces not moving in response to flight crew inputs for a short time and consequent temporary loss of controllability." This is FAA-speak for "the airplane could crash." They're telling airlines to reboot the plane every 21 days to avoid this. Hope nobody forgets to do that! (In fairness, I understand it is likely that most planes get rebooted more often than this anyway. But this is not something you want to leave to chance.)
At this point we can only guess at the cause, but the usual guess is that it is a timer overflow problem. Let's hypothesize a 32-bit signed integer is counting the passing of time in milliseconds. So a value of 32700 in that counter is 32.700 seconds.
How long until it overflows 31 bits of counting into the 32nd bit, which is the sign bit?
0x7FFFFFFF = 2147483647 ==> 2147483.647 seconds
2147483.647 seconds * (1 min/60 sec) (1 hr/60 min)(1 day/24 hr) = 24.9 days
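The arithmetic above is easy to check in a few lines of Python. This is only a sketch of the hypothesis (a signed 32-bit counter ticking once per millisecond); the helper name `days_until_overflow` is made up for illustration and says nothing about how the FCM actually keeps time.

```python
INT32_MAX = 0x7FFFFFFF  # largest value a signed 32-bit integer can hold
SECONDS_PER_DAY = 24 * 60 * 60


def days_until_overflow(max_count: int, ticks_per_second: float) -> float:
    """Days a counter can run from zero before passing max_count."""
    seconds = max_count / ticks_per_second
    return seconds / SECONDS_PER_DAY


# Hypothesized case: one tick per millisecond.
ms_days = days_until_overflow(INT32_MAX, 1000)
print(f"{ms_days:.1f} days")
```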
Hmm, a bit longer than the 22 days the FAA reports. Some time spent playing with various multipliers didn't seem to give a likely candidate. Possible factors if it is a timer rollover would include fixed point math (e.g., time keeping in 256ths of a second) or scaling from a 400 Hz aircraft AC frequency. Or there could be some divided-down crystal oscillator frequency on the FCM that is involved.
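For anyone who wants to play the same guessing game, here is a small sketch that tries a few tick rates against a 31-bit count. The rates are the hypothetical ones mentioned above (milliseconds, 256ths of a second, 400 Hz); none of them is confirmed for the FCM, and the dictionary labels are just for illustration.

```python
SECONDS_PER_DAY = 24 * 60 * 60
SIGN_BIT_COUNT = 2**31  # ticks needed to carry into the sign bit of a 32-bit signed int

# Hypothetical tick rates in Hz; none of these is confirmed for the FCM.
candidate_rates_hz = {
    "1 ms ticks": 1000,
    "1/256 s fixed point": 256,
    "400 Hz AC power": 400,
}

rollover_days = {
    name: SIGN_BIT_COUNT / rate / SECONDS_PER_DAY
    for name, rate in candidate_rates_hz.items()
}

for name, days in rollover_days.items():
    print(f"{name}: {days:.1f} days")

# Working backwards: the tick rate that would make 31 bits last exactly 22 days.
rate_for_22_days_hz = SIGN_BIT_COUNT / (22 * SECONDS_PER_DAY)
print(f"22 days would need about {rate_for_22_days_hz:.0f} Hz")
```

None of these simple rates lands on 22 days, and the rate that would (roughly 1130 Hz) is not an obviously natural one, which is why the timer explanation stays a guess.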
Or, it could be something completely different. Maybe there is memory that records operating parameters periodically, and the system crashes when that memory fills up (for example, logs that get downloaded every maintenance interval, with an expectation that the maintenance interval is more like a few days than a few weeks).
For now the cause is a bit of a mystery to us. I'll bet the FCM engineers have a pretty good idea at this point. No doubt they'll issue a fix as fast as they can get the FAA to review it.
But the big news is that for the second time, the FAA is telling the airlines they have to do a maintenance reboot of their planes. Last time it was every 248 days. This time it's every 21 days.
It's bad enough that they have to reboot the infotainment systems once in a while. For flight controls, this is not good news. This is the kind of problem that should be caught in design reviews. Always think about what happens if any counter, timer, or data structure overflows.
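As one illustration of what "thinking about overflow" can look like, here is a sketch of a timeout check on a free-running 32-bit tick counter. It is a generic technique (compare elapsed time using modular subtraction rather than comparing against an absolute deadline), not anything known about the 787 code. Python integers don't wrap, so the 32-bit behavior is simulated with a mask, and all function names are hypothetical.

```python
MASK32 = 0xFFFFFFFF  # simulate a 32-bit unsigned counter with a mask


def naive_timeout_expired(now: int, start: int, timeout: int) -> bool:
    """Compare against an absolute deadline; wrong if the deadline wraps."""
    deadline = (start + timeout) & MASK32  # wraps just as 32-bit hardware would
    return now >= deadline


def elapsed_ticks(now: int, start: int) -> int:
    """Ticks from start to now, correct across a single counter rollover."""
    return (now - start) & MASK32


def safe_timeout_expired(now: int, start: int, timeout: int) -> bool:
    """Compare elapsed time instead of absolute deadlines."""
    return elapsed_ticks(now, start) >= timeout


start = 0xFFFFFF00   # timer started shortly before the counter rolls over
timeout = 0x200      # wait 0x200 ticks
early = 0xFFFFFF80   # only 0x80 ticks later
late = 0x00000100    # 0x200 ticks later, after the rollover

print(naive_timeout_expired(early, start, timeout))  # fires too early
print(safe_timeout_expired(early, start, timeout))
print(safe_timeout_expired(late, start, timeout))
```

The safe version only works if timeouts are shorter than the counter period, which is exactly the kind of assumption a design review should write down.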
Other Examples:
This is not the first time a problem with long-running software has happened beyond the usual memory leaks in everyday applications. Some examples are:
Timer rollover bugs:
- B787 needs to be rebooted every 248 days due to a likely timer overflow bug [Blog][NY Times] [FAA]
- Air Traffic control loses contact with 400 aircraft due to a 32-bit time rollover in 2004 [IEEE Spectrum]
- IBM: Interface adapters hang after 497 days of uptime [IBM]
- Windows 95: hang after 49.7 days without reboot, counting in milliseconds [Microsoft] (I met the engineers who found that one. And congratulated them on the significant feat of actually getting Windows 95 to run that long without crashing for some other reason!)
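The uptime figures in that list line up well with powers of two. The sketch below checks them; note that only the Windows 95 millisecond count comes from the list above. The 100 Hz tick rate and the counter widths used for the 248-day and 497-day cases are assumptions made here because they happen to fit the numbers, not confirmed details of those systems.

```python
SECONDS_PER_DAY = 24 * 60 * 60


def rollover_days(bits: int, ticks_per_second: int) -> float:
    """Days until a counter with the given number of usable bits wraps."""
    return 2**bits / ticks_per_second / SECONDS_PER_DAY


# (label, usable bits, tick rate in Hz); the 100 Hz entries are assumptions.
cases = [
    ("Windows 95, 32 bits of milliseconds", 32, 1000),
    ("248 days, assuming 31 bits at 100 Hz", 31, 100),
    ("497 days, assuming 32 bits at 100 Hz", 32, 100),
]

results = {label: rollover_days(bits, rate) for label, bits, rate in cases}

for label, days in results.items():
    print(f"{label}: {days:.2f} days")
```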
There are also plenty of date roll-over bugs:
- NASA Deep Impact Comet Mission terminated unexpectedly when its clock reached 2**32 tenths of a second after Jan 1, 2000 (a time rollover bug). [IEEE Software]
- Y2K: on 1 January 2000 (overflow of 2-digit year from 99 to 00) [Wikipedia]
- GPS: 1024 week rollover on 22 August 1999 [USCG]
- Year 2038: Unix time will roll over on 19 January 2038 [Wikipedia]
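Two of those dates can be reproduced directly from the field widths involved: GPS broadcasts its week number in a 10-bit field counted from 6 January 1980, and the Year 2038 problem comes from a signed 32-bit count of seconds since 1 January 1970. A minimal sketch, ignoring leap seconds (so the GPS result is in GPS time rather than exact UTC):

```python
from datetime import datetime, timedelta, timezone

GPS_EPOCH = datetime(1980, 1, 6, tzinfo=timezone.utc)
UNIX_EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)

# A 10-bit week counter wraps after 2**10 = 1024 weeks.
gps_rollover = GPS_EPOCH + timedelta(weeks=2**10)

# A signed 32-bit seconds counter wraps 2**31 seconds after the epoch.
unix_rollover = UNIX_EPOCH + timedelta(seconds=2**31)

print(gps_rollover.date())
print(unix_rollover)
```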
There are also somewhat related capacity overflow issues such
And floating-point roundoff issues (thanks to Dan for reminding me of this one):
- Patriot Missile mishap after operating for 100 hours without a maintenance reboot [GAO]
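To give a feel for how roundoff can add up over 100 hours, here is a sketch based on the commonly repeated account of that incident: time counted in tenths of a second, with 0.1 stored chopped to 23 fractional bits. Those specifics are an assumption here, not something taken from the citation above, so treat the numbers as illustrative.

```python
from fractions import Fraction

FRACTION_BITS = 23          # assumed fixed-point precision for the stored 0.1
TICKS_PER_SECOND = 10       # assumed clock tick of one tenth of a second
HOURS_RUNNING = 100

# 0.1 has no exact binary representation; chop it to the available bits.
exact_tenth = Fraction(1, 10)
stored_tenth = Fraction(int(exact_tenth * 2**FRACTION_BITS), 2**FRACTION_BITS)
error_per_tick = exact_tenth - stored_tenth

# The per-tick error is tiny, but it is added on every tick.
ticks = HOURS_RUNNING * 3600 * TICKS_PER_SECOND
drift_seconds = float(error_per_tick * ticks)

print(f"error per tick: {float(error_per_tick):.3e} s")
print(f"drift after {HOURS_RUNNING} hours: {drift_seconds:.4f} s")
```

A third of a second sounds small until it is multiplied by the speed of an incoming missile.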
If you want to dig further, there is a "zoo" of related problems on Wikipedia: "Time formatting and storage bugs"