RIM explains BlackBerry crash; questions remain

The BlackBerry network failure in the US last week was caused by a small bit of new code and a still unexplained problem in the network's failover process.

Research in Motion, which runs the BlackBerry service through its Canadian network operations center, says it has ruled out security and capacity issues, and hardware or software infrastructure failures, as the cause of the outage, which blocked e-mail service to subscribers in the Americas.

RIM sent an e-mail describing the cause of the outage late Thursday night, Eastern Time, its first detailed communication since the BlackBerry service was disrupted Tuesday evening. The e-mail is a model of managed communication. It only once acknowledges a "problem," and only once uses the word "failure," and then only in eliminating a potential cause.

Just hours after the e-mail was sent, RIM and T-Mobile announced the debut of the "performance-driven" BlackBerry 8800 on the carrier's cell net. The BlackBerry 8800 allows users to "stay connected and productive while on the go," according to the joint press release.

But when RIM's own net failed, users of the 8800, and most of the other BlackBerry handsets, were left with little more than a rather expensive alarm clock or thumb-powered game console for the Brickbreaker game.

BlackBerry subscribers noticed the disruption when the usual stream of mobile e-mails dried up Tuesday evening. IT managers scrambled to figure out if the problem was related to RIM's enterprise server software, their wireless carrier, or the RIM operations center.

According to RIM, its IT staff on Tuesday deployed a new system routine intended to optimize the system's caching. The new code was not a critical routine and "was expected to be non-impacting with respect to the real-time operation of the BlackBerry infrastructure."

RIM has concluded that the pre-testing of the new code was "insufficient."

The new routine didn't behave according to plan, apparently: It "triggered a compounding series of interaction errors between the system's operational database and the cache."

Troubleshooters at the NOC identified the problem and tried to correct it. When those measures failed (RIM has not given a time frame for any of this), the NOC staff began a well-rehearsed failover process to a backup system.

And that process unexpectedly ran into problems. The RIM e-mail says the backup procedure has been "repeatedly and successfully tested previously." But, for reasons RIM has not yet explained, this time the process "did not fully perform to RIM's expectations." This second problem caused further delays in restoring service and processing the backlog of messages.

"RIM apologizes to customers for inconvenience resulting from the service interruption," according to the e-mail. The company is continuing its analysis of what happened and of what changes to make to minimize the chances of it happening again. The e-mail says that RIM has identified "certain aspects of its testing, monitoring, and recovery processes that will be enhanced" as a result of the failure.

The apology may not be enough for some enterprise users. A surprising number say RIM never contacted them at all during the outage.

"RIM did NOT contact us... Before, during or after the outage," says Rich De Brino, CIO and vice president for Advances in Technology, the technology arm of Compass HealthCare, in Everett, Wash. "We learned [about] it from Slashdot first and then other internet sites, but never RIM."

As of this writing [9 a.m. ET], there is still no word of the outage posted at www.rim.com or www.blackberry.com.

The health provider has about 100 BlackBerry users, with Exchange as the corporate e-mail server. De Brino sent his comments via his BlackBerry.

"We've come to the conclusion that we should seriously re-evaluate using a service (like BlackBerry) that we don't control and once again consider using something like Microsoft Exchange Server and Windows Mobile instead," De Brino says. He acknowledges that such an alternative would only be as reliable as their own Exchange server and wireless connection. "But ours is pretty reliable: no down time last 24 months. So I like it," he says.

John Cox

Network World