RIM explains BlackBerry crash; questions remain

  • John Cox (Network World)
  • 23 April, 2007 09:25

The BlackBerry network failure in the US last week was caused by a small piece of new code and a still-unexplained problem in the network's failover process.

Research in Motion, which runs the BlackBerry service through its Canadian network operations center, says it has ruled out security and capacity issues, and hardware or software infrastructure failures, as the cause of the outage, which blocked e-mail service to subscribers in the Americas.

RIM sent an e-mail describing the cause of the outage late Thursday night, Eastern Time, its first detailed communication since the BlackBerry service was disrupted Tuesday evening. The e-mail is a model of managed communication. It only once acknowledges a "problem," and only once uses the word "failure," and then only in eliminating a potential cause.

Just hours after the e-mail was sent, RIM and T-Mobile announced the debut of the "performance-driven" BlackBerry 8800 on the carrier's cell net. The BlackBerry 8800 allows users to "stay connected and productive while on the go," according to the joint press release.

But when RIM's own net failed, users of the 8800 and most other BlackBerry handsets were left with little more than a rather expensive alarm clock or a thumb-powered console for the Brickbreaker game.

BlackBerry subscribers noticed the disruption when the usual stream of mobile e-mails dried up Tuesday evening. IT managers scrambled to figure out if the problem was related to RIM's enterprise server software, their wireless carrier, or the RIM operations center.

According to RIM, its IT staff on Tuesday deployed a new system routine intended to optimize the system's caching. The new code was not a critical routine and "was expected to be non-impacting with respect to the real-time operation of the BlackBerry infrastructure."

RIM has concluded that the pre-testing of the new code was "insufficient."

The new routine didn't behave according to plan, apparently: It "triggered a compounding series of interaction errors between the system's operational database and the cache."

Troubleshooters at the network operations center (NOC) identified the problem and tried to correct it. When those measures failed (RIM has not given a time frame for any part of this process), the NOC staff began a well-rehearsed failover process to a backup system.

And that process unexpectedly ran into problems. The RIM e-mail says the backup procedure has been "repeatedly and successfully tested previously." But, for reasons RIM has not yet explained, this time the process "did not fully perform to RIM's expectations." This second problem caused further delays in restoring service and processing the backlog of messages.

"RIM apologizes to customers for inconvenience resulting from the service interruption," according to the e-mail. The company is continuing its analysis of what happened and of what changes to make to minimize the chances of it happening again. The e-mail says that RIM has identified "certain aspects of its testing, monitoring, and recovery processes that will be enhanced" as a result of the failure.

The apology may not be enough for some enterprise users. A surprising number say they were never contacted by RIM during the outage.

"RIM did NOT contact us... Before, during or after the outage," says Rich De Brino, CIO and vice president for Advances in Technology, the technology arm of Compass HealthCare, in Everett, Wash. "We learned [about] it from Slashdot first and then other internet sites, but never RIM."

As of this writing [9 a.m. ET], there is still no word of the outage posted at www.rim.com or www.blackberry.com.

The health provider has about 100 BlackBerry users, with Exchange as the corporate e-mail server. De Brino sent his comments via his BlackBerry.

"We've come to the conclusion that we should seriously re-evaluate using a service (like BlackBerry) that we don't control and once again consider using something like Microsoft Exchange Server and Windows Mobile instead," De Brino says. He acknowledges that such an alternative would only be as reliable as their own Exchange server and wireless connection. "But ours is pretty reliable: no down time last 24 months. So I like it," he says.