RIM explains BlackBerry crash; questions remain

The BlackBerry network failure in the US last week was caused by a small bit of new code and a still unexplained problem in the network's failover process.

Research in Motion, which runs the BlackBerry service through its Canadian network operations center, says it has ruled out security and capacity issues, and hardware or software infrastructure failures, as the cause of the outage, which blocked e-mail service to subscribers in the Americas.

RIM sent an e-mail describing the cause of the outage late Thursday night, Eastern Time, its first detailed communication since the BlackBerry service was disrupted Tuesday evening. The e-mail is a model of managed communication. It only once acknowledges a "problem," and only once uses the word "failure," and then only in eliminating a potential cause.

Just hours after the e-mail was sent, RIM and T-Mobile announced the debut of the "performance-driven" BlackBerry 8800 on the carrier's cell net. The BlackBerry 8800 allows users to "stay connected and productive while on the go," according to the joint press release.

But when RIM's own net failed, users of the 8800, and most of the other BlackBerry handsets, were left with little more than a rather expensive alarm clock or thumb-powered game console for the Brickbreaker game.

BlackBerry subscribers noticed the disruption when the usual stream of mobile e-mails dried up Tuesday evening. IT managers scrambled to figure out if the problem was related to RIM's enterprise server software, their wireless carrier, or the RIM operations center.

According to RIM, its IT staff on Tuesday deployed a new system routine that was intended to optimize the system's caching. The new code was not a critical routine and "was expected to be non-impacting with respect to the real-time operation of the BlackBerry infrastructure."

RIM has concluded that the pre-testing of the new code was "insufficient."

The new routine didn't behave according to plan, apparently: It "triggered a compounding series of interaction errors between the system's operational database and the cache."

Troubleshooters at the NOC identified the problem and tried to correct it. When those measures failed (RIM has not given a time frame for any part of this process), the NOC staff began a well-rehearsed failover process to a backup system.

And that process unexpectedly ran into problems. The RIM e-mail says the backup procedure has been "repeatedly and successfully tested previously." But, for reasons RIM has not yet explained, this time the process "did not fully perform to RIM's expectations." This second problem caused further delays in restoring service and processing the backlog of messages.

"RIM apologizes to customers for inconvenience resulting from the service interruption," according to the e-mail. The company is continuing its analysis of what happened and of what changes to make to minimize the chances of it happening again. The e-mail says that RIM has identified "certain aspects of its testing, monitoring, and recovery processes that will be enhanced" as a result of the failure.

The apology may not be enough for some enterprise users. A surprising number say they were never contacted by RIM during the outage.

"RIM did NOT contact us... Before, during or after the outage," says Rich De Brino, CIO and vice president for Advances in Technology, the technology arm of Compass HealthCare, in Everett, Wash. "We learned [about] it from Slashdot first and then other internet sites, but never RIM."

As of this writing [9 a.m. ET], there is still no word of the outage posted at www.rim.com or www.blackberry.com .

The health provider has about 100 BlackBerry users, with Exchange as the corporate e-mail server. De Brino sent his comments via his BlackBerry.

"We've come to the conclusion that we should seriously re-evaluate using a service (like BlackBerry) that we don't control and once again consider using something like Microsoft Exchange Server and Windows Mobile instead," De Brino says. He acknowledges that such an alternative would only be as reliable as their own Exchange server and wireless connection. "But ours is pretty reliable: no down time last 24 months. So I like it," he says.

John Cox

Network World