Epic failures: 11 infamous software bugs
10 September, 2010 00:20
Mark your calendars! Sept. 9 is hereby declared Debugging Day. It's been associated with removing bugs for more than 50 years now but is rarely formally celebrated. So let's start the tradition this year.
It all began with a log entry from 1947 by Harvard University's Mark II technical team. The now-classic entry features a moth taped to the page, time-stamped 15:45, with the caption "Relay #70 Panel F (moth) in relay" and the proud boast, "First actual case of bug being found."
[Image: moth taped to computer log]
OK, the history of computer bugs didn't really begin on this date (see "Moth in the machine" for the real story), but nevertheless, its anniversary seems a perfect time to examine famous bugs and other ghosts in the machine.
Here is a highly selective -- and therefore incomplete -- collection of infamous software bugs. Unlike the relatively benign tale of the moth in the relay, some bugs have wreaked disaster, embarrassment and destruction on the world. Some have literally killed people.
Don't expect this collection to contain tales of the Ping of Death or other faults exploited by hackers and malware -- such as the Spanair crash of 2008 or the possibly apocryphal tale of the CIA sabotaging the Soviet gas pipeline. Nor will it include deliberate decisions by programmers that came back to haunt them later, as with Y2K.
Instead, this story is about outright programming errors that caused key failures in their own right.
Have I missed anything important? Consider this a call for nominations for the biggest bugs of all time. These are my suggestions; if you have any honorable mentions, bring 'em on. The worst anyone can do is swat them.
Programming errors that derail high-profile space-exploration missions -- especially bugs that cause spectacular explosions -- are frightening, expensive and career-killingly embarrassing for those who let them slip through. They provide extremely vivid reminders for all of us to check and recheck (and recheck and recheck and recheck) every line of code.
The Mars Climate Orbiter doesn't orbit
Back in physics class, our teachers leaped all over answers that consisted of a number. If the answer was 2.5, they'd take their red pens and write "2.5 what? Weeks? Puppies? Demerits?" And proceed to mark the answer wrong.
Back then, we thought that they were just being pedantic. But it's the kind of error that can burn up a $327.6 million project in minutes. It did in 1999, when the Mars Climate Orbiter built by NASA's Jet Propulsion Laboratory approached the Red Planet at the wrong angle. At this point, it could easily have been renamed the Mars Climate Bright Light in the Upper Atmosphere, and shortly afterward been renamed the Mars Climate Debris Drifting Through the Sky.
There were several problems with this spacecraft -- its uneven payload made it torque during flight, and its project managers neglected some important details during several stages of the mission. But the biggest problem was that different parts of the engineering team were using different units of measurement. One group working on the thrusters measured in English units of pounds-force seconds; the others used metric Newton-seconds. And whoever checked the numbers didn't use the red pen like a pedantic high-school teacher.
The result: The effect of the thrusters was miscalculated by a factor of 4.45, the number of newtons in one pound-force. If this goof had been spotted earlier, it could have been compensated for, but it wasn't, and the result of that inattention is now lost in space, possibly in pieces.
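For the red-pen crowd, here is a minimal sketch of how a unit-carrying type keeps a bare number from being misread. The `Impulse` class and the 10.0 figure are hypothetical illustrations, not anything taken from the mission software; only the pound-force-to-newton conversion factor is a fixed fact.

```python
from dataclasses import dataclass

# One pound-force in newtons (exact by definition of the pound and standard gravity).
NEWTONS_PER_POUND_FORCE = 4.4482216152605


@dataclass(frozen=True)
class Impulse:
    """An impulse that always stores its value in newton-seconds."""
    newton_seconds: float

    @classmethod
    def from_pound_force_seconds(cls, value: float) -> "Impulse":
        # Convert once, at the boundary, so nothing downstream sees a bare number.
        return cls(value * NEWTONS_PER_POUND_FORCE)


# Hypothetical thruster firing, reported by one team in pound-force seconds.
reported = 10.0

misread = Impulse(newton_seconds=reported)            # bare number assumed to be metric
correct = Impulse.from_pound_force_seconds(reported)  # unit stated explicitly

ratio = correct.newton_seconds / misread.newton_seconds
print(f"misread: {misread.newton_seconds:.2f} N*s")
print(f"correct: {correct.newton_seconds:.2f} N*s")
print(f"off by a factor of {ratio:.2f}")
```

The point of the design is that the only way to build an `Impulse` from an English-unit figure is through a constructor that names the unit, which is the software equivalent of writing "2.5 what?" in the margin.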
Mariner 1's five-minute flight
On July 22, 1962, the first spacecraft of NASA's Mariner program blasted off on a mission to fly by Venus. The booster did its job, taking the spacecraft from its Cape Canaveral launchpad, but after a few minutes, Mariner 1 began to yaw off course. The guidance system failed to correct the trajectory automatically, and guidance commands sent to correct it manually failed as well.
As the rocket veered off toward North Atlantic shipping lanes, the range safety officer did the only thing he could do: blow the thing up. Four minutes and 55 seconds into the mission, the Mariner 1 exploded.
NASA was already suffering from Sputnik envy, and the Mariner 1 incident was another international embarrassment for the agency. The postmortem of this debacle revealed what NASA described as "improper operation of the Atlas airborne beacon equipment" -- though later it came out that the mistranscription of a single punctuation mark by an engineer caused the mission's fatal software error.
In his 1968 book The Promise of Space, Arthur C. Clarke described the mission as "wrecked by the most expensive hyphen in history."
That may not be strictly accurate. Although NASA did mention a hyphen in some of its reports of the incident, it appears that the agency was simplifying the story for a nontechnical audience.
A more widely accepted account is that the punctuation mark was a superscript bar over a radius symbol, handwritten in a notebook. In rocket science, the overbar signifies a smoothing function, so the formula should have calculated the smoothed value of the time derivative of a radius.
Without the smoothing function, even minor variations of speed would trigger the corrective boosters to kick in. The automobile driving equivalent would be to yank the steering wheel in the opposite direction of every obstacle in the driver's field of vision.
But few people know what an overbar is, and since it looks like a hyphen, that's how most people tell the story.
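To see what a missing smoothing step does, consider a small sketch. It does not reproduce the Mariner guidance equations; a plain moving average stands in for whatever smoothing the overbar called for, and the radius samples are invented for illustration.

```python
def finite_differences(samples, dt):
    """Raw time derivative: change between consecutive samples divided by dt."""
    return [(later - earlier) / dt for earlier, later in zip(samples, samples[1:])]


def moving_average(values, window):
    """Smoothed series: mean of each run of `window` consecutive values."""
    return [
        sum(values[i - window + 1 : i + 1]) / window
        for i in range(window - 1, len(values))
    ]


# Hypothetical radius readings: steady growth of 100 units per step,
# plus a measurement jitter of -1 or +1 on alternating samples.
radius = [100.0 * k + (1.0 if k % 2 else -1.0) for k in range(10)]

raw_rate = finite_differences(radius, dt=1.0)
smoothed_rate = moving_average(raw_rate, window=2)

print("raw rate swing:     ", max(raw_rate) - min(raw_rate))
print("smoothed rate swing:", max(smoothed_rate) - min(smoothed_rate))
```

A controller fed the raw rate sees the speed jumping back and forth on every sample and reacts to each jump; the same controller fed the smoothed rate sees a steady value and leaves the steering wheel alone.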
Moth in the machine: Debugging the origins of 'bug'
It's an oft-repeated tale that the grande dame of military computing, computer scientist and U.S. Navy Rear Admiral Grace Hopper, coined the terms bug and debug after an incident involving Harvard University's Mark II calculator.
The story goes like this:
On Sept. 9, 1945, a Harvard technical team looked at Panel F and found something unusual between points in Relay 70. It was a moth, which they promptly removed and taped in the log book. Grace Hopper added the caption, "First actual case of bug being found," and that's the first time anyone used the word bug to describe a computer glitch. Naturally, the term debugging followed.
Yes, it's an oft-repeated tale, but it's got more bugs in it than Relay 70 ever had.
For one thing, Harvard's Mark II came online in summer of 1947, two years after the date attributed to this story. For another thing, you don't use a line like "First actual case of bug being found" if the term bug isn't already in common use. The comment doesn't make sense in that context, except as an example of engineer humor. And although Hopper often talked about the moth in the relay, she did not make the discovery or the log entry.
The core facts of the story are true -- including the date of Sept. 9 and time of 15:45 hours -- but that's not how this meaning of the word bug entered the lexicon. Inventors and engineers had been talking about bugs for more than a century before the moth-in-the-relay incident. Even Thomas Edison used the word. Here's an excerpt from a letter he wrote in 1878 to Theodore Puskas, as cited in The Yale Book of Quotations (2006):
'Bugs' -- as such little faults and difficulties are called -- show themselves and months of intense watching, study and labor are requisite before commercial success or failure is certainly reached.
Word nerds trace the word bug to an old term for a monster -- it's a word that has survived in obscure terms like bugaboo and bugbear and in a mangled form in the word boogeyman. Like gremlins in machinery, system bugs are malicious. Anyone who spends time trying to get all the faults out of a system knows how it feels: After a few hours of debugging, any problems that remain are hellspawn, mocking attempts to get rid of them with a devilish glee.
And that's the real origin of the term bug. But the tale of the moth in the relay is worth retelling anyway.
Forty seconds of Ariane-5
The European Space Agency (ESA) has also suffered embarrassment on the software front. The inaugural flight of its fifth-generation Ariane launcher bested NASA's Mariner 1 score for unmanned spacecraft disaster: It took only 40 seconds to blow up.
On June 4, 1996, after the kind of dramatic vertical blastoff you'd expect from a high-profile European vehicle, cameras on the ground barely had time to focus on the Ariane-5 as it turned around and began to fall apart, before it completely exploded.
The Ariane Flight 501 disaster began with a loss of guidance and attitude information 30 seconds after liftoff. Once it veered completely off course, it automatically self-destructed.
The problem was that Ariane-5's inertial reference system took 64-bit floating-point data and converted it into 16-bit signed integer values. One of the values being converted was too large for a 16-bit signed integer, which caused an arithmetic overflow in the hardware. A software handler that could have dealt with the problem had been disabled, and so there was no levee to dam the cascade of system failures that led to the destruction.
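The narrowing conversion described above can be sketched in a few lines. This is an illustration only, not the flight code: the function names and sample values are hypothetical, and the two strategies shown (raise an error, or clamp to the nearest representable value) are generic ways of guarding such a conversion.

```python
INT16_MIN, INT16_MAX = -32768, 32767


def to_int16_checked(value: float) -> int:
    """Convert to a 16-bit signed integer, refusing values that do not fit."""
    truncated = int(value)  # drops the fractional part
    if not INT16_MIN <= truncated <= INT16_MAX:
        raise OverflowError(f"{value!r} does not fit in a 16-bit signed integer")
    return truncated


def to_int16_saturating(value: float) -> int:
    """Convert to a 16-bit signed integer, clamping out-of-range values."""
    return max(INT16_MIN, min(INT16_MAX, int(value)))


# Hypothetical sensor readings: one comfortable, one far out of range.
small_reading = 1234.9
large_reading = 40000.0

print(to_int16_checked(small_reading))
print(to_int16_saturating(large_reading))

try:
    to_int16_checked(large_reading)
except OverflowError as error:
    # Something has to catch this; an unhandled error is what takes the system down.
    print("handled:", error)
```

Either guard works only if something is listening. A range check that raises an error nobody handles just moves the failure a few instructions later.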
Some bugs are noisy: They cause explosions that destroy machines. Others are subtler in their destructiveness: They cause severe embarrassment that turns companies' good names to "Mud" and sometimes threatens the bottom line.
Pentium chips fail math
In 1994, an entire line of CPUs by market leader Intel simply couldn't do their math. The Pentium floating-point flaw ensured that no matter what software you used, your results stood a chance of being inaccurate past the eighth decimal point. The problem lay in a faulty math coprocessor, also known as a floating-point unit. The result was a small possibility of tiny errors in hardcore calculations, but it was a costly PR debacle for Intel.
How did the first generation of Pentiums go wrong? Intel's laudable idea was to triple the execution speed of floating-point calculations by ditching the previous-generation 486 processor's clunky shift-and-subtract algorithm and substituting a lookup-table approach in the Pentium. So far, so smart. The lookup table consisted of 1,066 table entries, downloaded into the programmable logic array of the chip. But only 1,061 entries made it onto the first-generation Pentiums; five got lost on the way.
When the floating-point unit accessed any of the empty cells, it would get a zero response instead of the real answer. A zero response from one cell didn't actually return an answer of zero: A few obscure calculations returned slight errors typically around the tenth decimal digit, so the error passed by quality control and into production.
What did that mean for the lay user? Not much. With this kind of bug, there's a 1-in-360 billion chance that miscalculations could reach as high as the fourth decimal place. More likely, with odds of 1-to-9 billion against, was that any errors would happen in the 9th or 10th decimal digit.
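For anyone who wants to run the odds themselves, one widely circulated check from the time divides a particular pair of numbers and then multiplies back. The operands below are not from this article; they are the pair commonly quoted in accounts of the flaw, and affected chips were reported to return 256 instead of zero. Treat this as a sketch of the idea rather than a diagnostic tool.

```python
def division_residual(x: float, y: float) -> float:
    """Divide, multiply back, and report how far the round trip lands from x."""
    return x - (x / y) * y


# Operand pair commonly quoted in accounts of the Pentium flaw (an assumption here).
numerator = 4195835.0
denominator = 3145727.0

quotient = numerator / denominator
residual = division_residual(numerator, denominator)

print("quotient:", quotient)
print("residual:", residual)
```

On hardware with correct IEEE 754 double-precision division, the residual for this pair is exactly zero.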
More math bugs
Intel's Pentium flaw wasn't the only math-related bug to cause a PR disaster. These two had Microsoft execs red in the face:
Windows Calculator 3.x: In 1994, a bug in CALC.EXE came to light that had quietly been kicking around since Windows 3.x first appeared in 1990. Propellerheads had fun subtracting 2.1 from 2.11 in Windows Calculator and getting an answer of not 0.01, but 0.00.
Excel 2007: Ask people with calculators or slide rules to multiply 850 x 77.1, and they'll answer 65,535. But in September 2007, it was discovered that Excel 2007 answered 100,000. According to Microsoft, this bizarre rounding-up occurred only in calculations that resulted in 65,535 or 65,536. What's more, Excel actually calculated the correct answer, but a bug prevented it from displaying properly.
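A short sketch shows why 850 x 77.1 is a touchy calculation in any program that uses binary floating point. It does not reproduce Excel's display bug; it only shows that the stored product sits a hair below 65,535, because 77.1 has no exact binary representation.

```python
from decimal import Decimal

# Binary floating point: 77.1 is stored as the nearest representable double.
product = 850 * 77.1

# Decimal arithmetic on the literal digits, for comparison.
exact = Decimal("850") * Decimal("77.1")

print(repr(product))         # 65534.99999999999
print(product == 65535)      # False
print(round(product, 2))     # 65535.0
print(exact)                 # 65535.0
```

The difference is about seven trillionths, far too small to matter in a spreadsheet, which fits Microsoft's account that the calculation was right and only the display went wrong.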
But wouldn't you know it? A Virginia-based math professor named Thomas Nicely needed that level of accuracy, found he wasn't getting it and figured out why.
In October 1994, he alerted Intel, then others, to the problem. Intel retorted with a response only marginally less tactful than "Oh, that thing? Yeah, we noticed that back in June."
Thus began an inexorable slide into PR hell and a costly mop-up bill. In January 1995, Intel announced a pretax charge of $475 million against earnings, most of which apparently stemmed from replacing flawed processors.
The bottom line in this arithmetic mess is this: In lookup-table and money calculations, 1,066 - 5 = -$475,000,000. Any way you look at it, that's bad math.
Call waiting ... and waiting ... and waiting
On Jan. 15, 1990, around 60,000 AT&T long-distance customers tried to place long-distance calls as usual -- and got nothing. Behind the scenes, the company's 4ESS long-distance switches, all 114 of them, kept rebooting in sequence. AT&T assumed it was being hacked, and for nine hours, the company and law enforcement tried to work out what was happening. In the end, AT&T uncovered the culprit: an obscure fault in its new software.
Here's how the switches were supposed to work: If one switch gets congested, it sends a "do not disturb" message to the next switch, which picks up its traffic. The second switch resets itself to keep from disturbing the first switch. Switch 2 checks back on Switch 1, and if it detects activity, it does another reset to reflect that Switch 1 is back online. So far, so simple.
The month before the crash, AT&T tweaked the code to speed up the process. The trouble was, things were too fast. The first switch to overload sent two messages, one of which hit the second switch just as it was resetting. The second switch assumed that there was a fault in its CCS7 internal logic and reset itself. It put up its own "do not disturb" sign and passed the problem on to a third switch.
The third switch also got overwhelmed and reset itself, and so the problem cascaded through the whole system. All 114 switches in the system kept resetting themselves, until engineers reduced the message load on the whole system and the wave of resets finally broke.
In the meantime, AT&T lost an estimated $60 million in long-distance charges from calls that didn't go through. The company took a further financial hit a few weeks later when it knocked a third off its regular long-distance rates on Valentine's Day to make amends with customers.
Windows Genuine Disadvantage
Introduced in 2006, Windows Genuine Advantage was never a popular initiative with Microsoft's customers. Consumers had trouble seeing the advantages: It did nothing to help the security or stability of a legitimate Windows installation. All it did was help Microsoft root out software piracy.
In that task, it was as vigilant as, well, a vigilante. In fact, in late-August 2007, it found piracy everywhere it looked -- even among thousands of legitimate Windows customers.
On Friday, Aug. 24, someone on the WGA team accidentally installed bug-filled preproduction software on the WGA servers. The team quickly rolled back to a tested release of the software, but they didn't check that their fix actually addressed the problem. It didn't. So for 19 hours, until around 3 p.m. the following day, the server flagged thousands of WGA clients across the globe as illegal.
Windows XP customers were told they were running pirated software. Windows Vista customers were slapped harder: They had features turned off, including the eye candy Aero theme and support for ReadyBoost virtual RAM drives.
The first official response to complaints didn't help much: Disgruntled patrons were advised to try to revalidate on Tuesday. But even when the problem was fixed, mid-Saturday afternoon, Vista clients still had to revalidate their Windows installations before they could ReadyBoost their way back into Aero.
OK, so this was a relatively mild issue in engineering terms, and strictly speaking, it was caused by human error. But the error in question was deploying buggy, untested software, and when you factor in the number of people affected, the level of anger induced and the knock-on effect of bad publicity, it was more severe than it seems at first glance.
Grievous bodily bugs
Not all bugs can be laughed off. Some of them are fatal. Medical and military software can be especially dangerous when not properly tested, as shown with these fatal flaws.
Patriot missile mistiming
During the first Persian Gulf war, Iraqi-fired Scud missiles were the most threatening airborne enemies to U.S. troops. Once one of these speeding death rockets launched, the U.S.'s best defense was to intercept it with an antiballistic Patriot missile. The Patriot worked a bit like a shotgun, getting within range of an oncoming missile before blasting out a cloud of 1,000 pellets to detonate its warhead.
A Patriot needed to deploy its pellets between 5 and 10 meters from an oncoming missile for the best results. This requires split-second timing, which is always tricky with two objects moving very fast toward each other. Even the Patriot's most prominent booster, then-President George H.W. Bush, conceded that one Scud (out of 42 fired) got past the Patriot. The single failure the president acknowledged was at a U.S. base in Dhahran, Saudi Arabia, on Feb. 25, 1991, and it cost 28 soldiers their lives. The fault was traced to a software error.
The Patriot's trajectory calculations revolved around the timing of radar pulses, and they had to be modified to deal with the high speed of modern missiles. A subroutine was introduced to convert clock time more accurately into floating-point figures for calculation. It was a neat kludge, but the programmers did not put the call to the subroutine everywhere it was needed. High-speed trajectories based on one accurately timed radar pulse and one less-precise time increased the chances of poorly timed deployment.
Apparently, the issue was known, and a temporary fix was in place: Reboot the system every so often to reset the clocks. Unfortunately, the term "every so often" wasn't defined, and that was the problem in late February at Dhahran. The system had been running for 100 hours, and the clocks were off by about a third of a second. A Scud travels half a kilometer in that time, so there was no chance the Patriot could have intercepted it.
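The arithmetic behind "a third of a second" can be sketched as follows. The details are assumptions drawn from commonly cited reconstructions of the fault, not from this article: a clock that counts tenths of a second, a stored value of 0.1 chopped to 23 binary fraction bits, and a Scud speed of roughly 1,676 meters per second.

```python
from fractions import Fraction

# Assumptions (not from the article): 23 fraction bits and a 0.1-second tick.
FRACTION_BITS = 23
TICK_SECONDS = Fraction(1, 10)
SCUD_SPEED_M_PER_S = 1676  # assumed closing speed of the incoming missile

# 0.1 has no finite binary expansion, so the stored value is chopped short.
scale = 2 ** FRACTION_BITS
stored_tick = Fraction(int(TICK_SECONDS * scale), scale)
error_per_tick = TICK_SECONDS - stored_tick

# 100 hours of uptime, counted in tenths of a second.
ticks = 100 * 3600 * 10
drift_seconds = float(error_per_tick * ticks)
miss_distance_m = drift_seconds * SCUD_SPEED_M_PER_S

print(f"error per tick: {float(error_per_tick):.3e} s")
print(f"drift after 100 hours: {drift_seconds:.3f} s")
print(f"distance a Scud covers in that time: {miss_distance_m:.0f} m")
```

Under these assumptions, the error per tick is less than a ten-millionth of a second, but it is always in the same direction, so it accumulates with uptime instead of averaging out. That is why a reboot, which zeroes the tick count, worked as a stopgap.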
On a side note, some experts did dispute the president's claims of a more than 97% success rate for Patriots vs. Scuds, so it's possible that this bug caused more (but less high-profile) damage than the incident at Dhahran.
Therac-25 Medical Accelerator disaster
Radiation therapy is a handy tool in the fight against some contained forms of cancer: Beams of electrons zap the bad stuff, and the body disposes of the dead matter. It has a strong success rate, but it depends on accurate aim and focus. That's something that the medical world leaves to machinery. Unfortunately for six patients between 1985 and 1986, the Therac-25 was the machine in question.
The Therac-25 handled two types of therapy: a low-powered direct electron beam and a megavolt X-ray mode, which required shielding and filters and an ion chamber to keep the dangerous beams safely on target. The trouble was that the software that powered the unit was repurposed from the previous model, and it wasn't adequately tested.
If the operators changed the mode of the device too quickly, a race condition occurred: Two sets of instructions were sent, and the first one to arrive set the mode. In six documented cases, this meant that megavolt X-rays were sent, unfiltered and unshielded, toward patients requiring direct electron therapy. At least two of them screamed in pain and tried to run from the room. All of them suffered radiation poisoning, which claimed several lives.
The Therac-25, which was recalled in 1987, has become an object lesson in what can go wrong with powerful medical machinery. The code didn't cause overdoses in earlier Therac models because hardware constraints prevented them. Reusing code on a new system without thorough testing is a programming no-no, with good reason.
The new system did deliver error messages during race-condition events, but the codes were cryptic, undocumented and easily overridden -- which is what operators did. With adequate documentation and training, the overdoses would never have happened. Additionally, a smaller bug that set up flag variables occasionally caused arithmetic overflows that bypassed safety checks.
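The race condition is easier to see with the interleaving written out step by step. The sketch below is a deliberately simplified model, not the Therac-25 code: the `Machine` fields, mode names and function names are all hypothetical, and the "race" is staged in a fixed order so the outcome is repeatable.

```python
from dataclasses import dataclass


@dataclass
class Machine:
    requested_mode: str = "xray"   # what the operator's screen currently says
    beam_power: str = "off"
    shielding_in_place: bool = False


def set_beam_power(machine: Machine, mode: str) -> None:
    machine.beam_power = "high" if mode == "xray" else "low"


def position_shielding(machine: Machine, mode: str) -> None:
    machine.shielding_in_place = (mode == "xray")


def is_safe(machine: Machine) -> bool:
    """High power is acceptable only with the shielding in the beam path."""
    return machine.beam_power != "high" or machine.shielding_in_place


def racy_setup(machine: Machine, quick_edit: str) -> None:
    # Each step rereads the shared mode, so a fast edit can land between them.
    set_beam_power(machine, machine.requested_mode)      # sees "xray"
    machine.requested_mode = quick_edit                  # operator retypes the mode
    position_shielding(machine, machine.requested_mode)  # sees "electron"


def snapshot_setup(machine: Machine, quick_edit: str) -> None:
    # Read the mode once and use that single value for every step.
    mode = machine.requested_mode
    machine.requested_mode = quick_edit  # edit arrives; this pass ignores it
    set_beam_power(machine, mode)
    position_shielding(machine, mode)


racy = Machine()
racy_setup(racy, quick_edit="electron")

consistent = Machine()
snapshot_setup(consistent, quick_edit="electron")

print("racy setup safe?      ", is_safe(racy))
print("snapshot setup safe?  ", is_safe(consistent))
```

The snapshot version still ends up configured for the mode the operator abandoned, so a real system would have to rerun setup after the edit. What it never does is mix the two modes, and a final `is_safe` check before firing is the software stand-in for the hardware constraints that protected the earlier models.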
Multidata Systems/Cobalt-60 overdoses
Unfortunately, the Therac-25 disaster wasn't the last software-related radiation therapy failure. Fifteen years after the Therac-25 incidents, a Cobalt-60 machine in Panama's National Cancer Institute overdosed more than two dozen patients with gamma radiation.
As with the Therac-25, the Cobalt-60 system was an accident waiting to happen. Unlike the Therac-25, the Cobalt-60 was an old, overused and undermaintained piece of hardware. The software that ran it was an aftermarket program from Multidata Systems, because the Panamanian hospital could not afford what the machine's manufacturer, Theratronics, charged.
Two of the technicians who operated the Cobalt-60 had quit, leaving the rest to work 16-hour days to keep up with treatments. Very sick patients would sometimes wait four to six hours a day for scheduled treatments.
Overworked and tired technicians requested some software maintenance, but management overlooked their requests. Somewhere along the line, the technicians hit upon a more efficient way to line up the shields that defined the radiation's target. It wasn't in the manual, but it seemed to work. Unfortunately, if you lined up the shields in a particular order, an obscure bug in the Multidata software meant that the patients were overirradiated. Because of massive overwork and undersupervision, the process went on for seven months.
By the time Multidata Systems issued an advisory about a "data entry sequence that creates a self-intersecting shape outline" in mid-2001, it was too late for many patients. The exact death toll is hard to calculate -- these were very sick patients even before their treatment -- but it's a tragic mess-up by any measure.
Osprey aircraft crash
Two weeks before Christmas in 2000, a U.S. Marine Corps Osprey, a hybrid airplane and helicopter, suffered a hydraulic system fault that should have been remedied without loss of life. A hydraulic line broke in one of the two engine cases as the Osprey was shifting from airplane to helicopter mode for landing.
According to the Marine Corps major general who presented reports during the investigation of the incident, the trouble was "compounded by a computer software anomaly." The flight-control computer stopped the rotation of the engine pods when it detected the hydraulic failure.
The pilots went through the normal procedure and pressed the primary reset button to re-engage the pods. At this point, both prop rotors went through "significant pitch and thrust changes," which led to a stall. The plane crashed into a marsh and killed all four Marines onboard.
The nature of the software flaw is still hard to track down: Boeing and Bell Helicopter made the Osprey, and Boeing's spokesman said only that changes were made in the software. Requests for details were referred to the government, and as of now, the explanation has not been forthcoming.
Remember how the world descended into nuclear oblivion on Sept. 26, 1983? No? Well, thank your lucky stars -- this is a tale of bugs so major they could have brought the entire world to a standstill.
It was all averted by the common sense of one individual, who ignored the Soviet early-warning system's faulty reports of incoming missiles and didn't launch a counterattack on the United States.
The warning system set off klaxons at half past midnight on that September morning. Apparently, the U.S. had launched five nuclear missiles toward what the U.S. president had taken to calling "the Evil Empire."
At the time, Lt. Col. Stanislav Petrov reasoned his way to a decision not to respond: The USSR was in a shouting match with the U.S. about a Soviet attack on Korean Air Lines Flight 007 three weeks earlier, but it was only a rhetorical battle at that stage. Besides, if the U.S. wanted to attack the Soviet Union, would it really launch only five missiles?
Petrov ordered his men to stand down, and 15 minutes later, radar outposts confirmed that there were no incoming missiles. The decision took less than five minutes, it was confirmed within half an hour, and the world remained at peace.
When the early-warning system was later analyzed, it was found to have more bugs than a suburban compost heap -- which meant that although Stanislav Petrov had saved the world, he'd made a serious error of judgment: He had shown up the incompetence of Soviet programmers.
This was not good for morale, or for the lieutenant colonel. He was cold-shouldered into an early retirement and was largely unsung until May 21, 2004, when a San Francisco-based organization called the Association of World Citizens bestowed its highest honor -- world citizenship -- and a financial reward on him.
The bug that never was: Black Monday's dark secret
It is a truth universally acknowledged (by people who don't know bugs) that the end of the 1980s stock boom, Black Monday of 1987, was precipitated by buggy software. It was Wall Street's greatest ever loss in a single day: The Dow Jones Industrial Average plummeted 508 points, 22.6% of its total value, and the S&P 500 dropped 20.4%. And it was all the fault of bugs in the computer models.
Except that it wasn't.
Program trading was relatively new and harder to understand back then, and people with diminished pension funds were anxious to find a scapegoat they could really lay the blame on. It was easier to point to a faulty program than to understand overvaluation, lack of liquidity, international disputes about exchange rates, and the market's notoriously bipolar psychology. So the computers became the bad guys.
Of course, program trading did contribute to the precipitous fall of American markets. The software contained strategy models for handling portfolio insurance, and it was there that the problems of Monday, Oct. 19, 1987, really lay. Portfolio insurance derivatives are tied to the condition of the market. After things nose-dived in Hong Kong and Europe, the sun rose on a Wall Street ready to react: The writers of derivatives sold on every down-tick, and plummeting values triggered a cascade of selling.
But the trading programs just did as they were instructed. The fact that they sold as the financial markets collapsed around them wasn't a bug, it was a feature -- just not a well-thought-out one.
Now it's your turn -- tell us your bug tales in the reader comments.
Matt Lake is familiar with quality control systems and auditing, but he is also writing a science book that includes a subchapter on entomology, making him a bug connoisseur in more ways than one.