RIM explains BlackBerry crash; questions remain

The BlackBerry network failure in the US last week was caused by a small bit of new code and a still unexplained problem in the network's failover process.

Research in Motion, which runs the BlackBerry service through its Canadian network operations center, says it has ruled out security and capacity issues, and hardware or software infrastructure failures, as the cause of the outage, which blocked e-mail service to subscribers in the Americas.

RIM sent an e-mail describing the cause of the outage late Thursday night, Eastern Time, its first detailed communication since the BlackBerry service was disrupted Tuesday evening. The e-mail is a model of managed communication. It only once acknowledges a "problem," and only once uses the word "failure," and then only in eliminating a potential cause.

Just hours after the e-mail was sent, RIM and T-Mobile announced the debut of the "performance-driven" BlackBerry 8800 on the carrier's cell net. The BlackBerry 8800 allows users to "stay connected and productive while on the go," according to the joint press release.

But when RIM's own net failed, users of the 8800, and of most other BlackBerry handsets, were left with little more than a rather expensive alarm clock or a thumb-powered console for the Brickbreaker game.

BlackBerry subscribers noticed the disruption when the usual stream of mobile e-mails dried up Tuesday evening. IT managers scrambled to figure out if the problem was related to RIM's enterprise server software, their wireless carrier, or the RIM operations center.

According to RIM, its IT staff on Tuesday deployed a new system routine intended to optimize the system's caching. The new code was not a critical routine and "was expected to be non-impacting with respect to the real-time operation of the BlackBerry infrastructure."

RIM has concluded that the pre-testing of the new code was "insufficient."

The new routine didn't behave according to plan, apparently: It "triggered a compounding series of interaction errors between the system's operational database and the cache."

Troubleshooters at the NOC identified the problem and tried to correct it. When those measures failed (RIM has not given a time frame for any part of this process), the NOC staff began a well-rehearsed failover process to a backup system.

And that process unexpectedly ran into problems. The RIM e-mail says the backup procedure has been "repeatedly and successfully tested previously." But, for reasons RIM has not yet explained, this time the process "did not fully perform to RIM's expectations." This second problem caused further delays in restoring service and processing the backlog of messages.

"RIM apologizes to customers for inconvenience resulting from the service interruption," according to the e-mail. The company is continuing its analysis of what happened and of what changes to make to minimize the chances of it happening again. The e-mail says that RIM has identified "certain aspects of its testing, monitoring, and recovery processes that will be enhanced" as a result of the failure.

The apology may not be enough for some enterprise users. A surprising number say RIM never contacted them at all during the outage.

"RIM did NOT contact us... Before, during or after the outage," says Rich De Brino, CIO and vice president for Advances in Technology, the technology arm of Compass HealthCare, in Everett, Wash. "We learned [about] it from Slashdot first and then other internet sites, but never RIM."

As of this writing [9 a.m. ET], there is still no word of the outage posted at www.rim.com or www.blackberry.com.

The health provider has about 100 BlackBerry users, with Exchange as the corporate e-mail server. De Brino sent his comments via his BlackBerry.

"We've come to the conclusion that we should seriously re-evaluate using a service (like BlackBerry) that we don't control and once again consider using something like Microsoft Exchange Server and Windows Mobile instead," De Brino says. He acknowledges that such an alternative would only be as reliable as their own Exchange server and wireless connection. "But ours is pretty reliable: no down time last 24 months. So I like it," he says.

John Cox

Network World
