Kings of open source monitoring
- 01 October, 2009 16:11
Network monitoring is a fact of life for IT departments. Monitoring software ranges from simple ICMP-based scripts for up/down monitoring to midrange products like SolarWinds to high-end offerings such as HP's OpenView and IBM's Tivoli — all of which have their drawbacks. Simpler monitoring systems don't provide enough information about your network, while the feature-laden high-end systems can be prohibitively expensive. At the same time, midrange systems might not scale well for monitoring large networks.
Two network monitoring systems with open source roots, OpenNMS and Zenoss, provide a bevy of features at a lower (or no) cost than their high-end competitors, and can scale to monitor large numbers of network nodes. Both solutions compete with large commercial systems such as OpenView and Tivoli. They are advanced systems capable of monitoring a wide variety of network devices. OpenNMS is completely open source, while Zenoss offers Core, a free open source edition that can be extended with free Zenoss- and community-built add-ons.
Although OpenNMS and Zenoss support a large number of common network devices, computers, and applications out of the box, some companies will have uncommon or specialized equipment that is not yet supported. Both provide facilities to extend the functionality of the system and to add custom network device support. Zenoss uses its ZenPack system, which is a Python-egg plug-in architecture.
OpenNMS has several ways to provide additional capabilities. First, most of the configuration information for OpenNMS exists in XML files in /etc/opennms. These files can be edited to add new notification methods and other small extensions to the application. Next are event automations, which allow you to specify an SQL query to look for thresholds being exceeded and create events or trigger actions (such as shell scripts). External tools — such as mib2opennms for SNMP Traps and the mibParser for data collection — can be used to convert SNMP MIB information into a format OpenNMS can use. Finally, if you're a Java programmer, it is pretty straightforward to create custom monitors and data collectors in Java.
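The event-automation idea described above (a query finds exceeded thresholds, then an action fires) can be sketched in a few lines. This is an illustration only, not the OpenNMS schema or configuration format: the `samples` table, its columns, and the callback are hypothetical, and an in-memory SQLite database stands in for the real one.

```python
import sqlite3

def find_exceeded(conn, limit):
    """Run the trigger query: return (node, value) rows above the limit."""
    cur = conn.execute(
        "SELECT node, value FROM samples WHERE value > ? ORDER BY node",
        (limit,),
    )
    return cur.fetchall()

def run_automation(conn, limit, action):
    """Call the action once per offending row and return how many fired."""
    rows = find_exceeded(conn, limit)
    for node, value in rows:
        action(node, value)
    return len(rows)

# Hypothetical data standing in for collected metrics.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE samples (node TEXT, value REAL)")
conn.executemany(
    "INSERT INTO samples VALUES (?, ?)",
    [("router1", 95.0), ("switch1", 40.0), ("router2", 88.5)],
)

fired = []
count = run_automation(conn, 85.0, lambda node, value: fired.append((node, value)))
print(count, fired)
```

In a real deployment the action would more likely create an event or launch a shell script than append to a list.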
In addition, both systems have the ability to run custom scripts, and they can use Nagios plug-ins to extend functionality. This is a handy feature, but note that it can hamper scalability for large networks. (See the article, "Maximize the performance of your monitoring system.")
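For readers who have not written one, a Nagios plug-in is simply a program that prints one status line and exits with 0 (OK), 1 (WARNING), 2 (CRITICAL), or 3 (UNKNOWN). The sketch below follows that convention; the disk check and its thresholds are assumptions chosen for illustration, not something either product ships.

```python
import shutil

OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3

def evaluate(percent_used, warn=80.0, crit=90.0):
    """Map a usage percentage to an (exit_code, status_line) pair."""
    if percent_used >= crit:
        return CRITICAL, f"DISK CRITICAL - {percent_used:.1f}% used"
    if percent_used >= warn:
        return WARNING, f"DISK WARNING - {percent_used:.1f}% used"
    return OK, f"DISK OK - {percent_used:.1f}% used"

def check_path(path="/"):
    """Measure real disk usage for a path and evaluate it."""
    try:
        usage = shutil.disk_usage(path)
    except OSError as exc:
        return UNKNOWN, f"DISK UNKNOWN - {exc}"
    return evaluate(100.0 * usage.used / usage.total)

# A real plug-in would finish with: print(message); sys.exit(code)
code, message = evaluate(93.2)
print(code, message)
```

Because each check of this kind typically runs as its own process, it is easy to see how thousands of them could weigh on a large installation.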
OpenNMS and Zenoss put in good efforts at making their applications easy to use, providing Web interfaces for management and information delivery. Nevertheless, to get the full network monitoring benefits of these two systems, you will need to roll up your sleeves and spend at least some time on the command line. For example, neither has the ability to import a list of devices via the Web interface. And we have to face the fact that an enterprise-grade network monitoring system is not a simple piece of software. Despite both companies' best efforts at making their software easy to manage, these are complex systems, and IT staff will want to read all the documentation and take full advantage of the training classes offered by both companies.
I was pleasantly surprised to find that installation is quite easy for the two products, and both support a variety of Unix-ish platforms, including various Linux distributions, FreeBSD, Solaris, and Mac OS X. OpenNMS can also run on OpenBSD and Windows. As a longtime Debian user, I'm used to being disappointed when big commercial applications do not support Debian's native apt installer or even deb packages. So I was pleased that Zenoss provides stand-alone deb packages with dependencies, and OpenNMS maintains a software repository for use with Debian's apt software installation system.
Further, both of these network monitoring systems can be run on virtualization platforms such as VMware and Xen. Zenoss maintains a VMware image of its open source Core version, and OpenNMS makes a VMware image available for download from SourceForge. My company runs OpenNMS on an Amazon EC2 cloud computing instance.
OpenNMS and Zenoss share several advanced features that separate them from their lower-end open source counterparts such as Nagios or Cacti. Companies with large network infrastructures are likely using some configuration management tool. If that happens to be the open source RANCID project, then you will be glad that both OpenNMS and Zenoss can integrate with RANCID. For IT shops with Windows servers, OpenNMS and Zenoss can use WMI to monitor the Windows machines (though large numbers of WMI-monitored network devices will cause a performance hit; again, see the "Maximize performance" sidebar). Furthermore, these two network monitoring systems will cull information from your VMware infrastructure.
Beyond the usual ICMP and service up/down monitoring that you'll get with most any network monitoring system, OpenNMS and Zenoss are able to use specific service queries and compare the responses against what you expect from your monitored servers. You can run custom SQL queries against your production databases and trigger an alert if the response changes. Maybe you want to monitor a critical Web application that uses Apache with PHP and MySQL. You can build a PHP page on your Web server that will run a database query and return the results in the Web page, thus ensuring that Apache, PHP, and MySQL all work as expected.
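The Web page check described above boils down to "fetch the page, then compare the body with what you expect." Here is a minimal sketch of that logic, assuming a health page that prints a known marker string when its database query succeeds; the function names and the marker are hypothetical, and neither product's own monitor is implemented this way as far as this review knows.

```python
import urllib.request

def fetch(url, timeout=10):
    """Fetch a page and return (status_code, body_text)."""
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        return resp.status, resp.read().decode("utf-8", errors="replace")

def page_is_healthy(status, body, expected):
    """Healthy only if the page loaded and contains the expected text."""
    return status == 200 and expected in body

def check(url, expected, fetcher=fetch):
    """Return True if the stack behind the URL answered as expected."""
    try:
        status, body = fetcher(url)
    except OSError:
        # Connection refused, DNS failure, timeout, HTTP error: all count as down.
        return False
    return page_is_healthy(status, body, expected)
```

Calling `check("http://www.example.com/health.php", "db-ok")` would exercise Apache, PHP, and MySQL in a single request, since the marker only appears when all three cooperate.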
Peek at performance
Performance monitoring is an important task for any enterprise-grade network monitoring system. Zenoss uses the RRDtool system, made popular by MRTG and Cacti, to collect performance data from network devices, create graphs for easy analysis by a human, and generate threshold-based alerts. By default OpenNMS uses a Java implementation of RRDtool's functionality to provide the same performance data collection and presentation services, but it can be configured to use RRDtool proper for compatibility with other tools that read RRDtool's data files. Both systems collect and graph all available incrementing data via SNMP for monitored devices by default. In addition to the usual network traffic statistics and RAM and CPU utilization, OpenNMS and Zenoss can track disk utilization as well as I/O throughput on supported systems. And both systems provide a way to collect and graph custom data, or perform modifications to collected data as well as calculations before graphing.
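"Incrementing data" here means SNMP counters that only ever go up, so a graphing tool has to turn two successive samples into a rate. The sketch below shows a simplified version of that calculation, including the case where a 32-bit counter wraps between polls; the sample values and the threshold are made up, and this is not the code either product or RRDtool actually runs.

```python
COUNTER32_MAX = 2**32

def counter_rate(prev, curr, seconds, modulus=COUNTER32_MAX):
    """Per-second rate between two samples of an incrementing counter.

    If the counter wrapped past its maximum between samples, curr is
    smaller than prev, so the modulus is added back before dividing.
    """
    if seconds <= 0:
        raise ValueError("seconds must be positive")
    delta = curr - prev
    if delta < 0:
        delta += modulus
    return delta / seconds

def over_threshold(rate, limit):
    """Threshold check of the kind used to raise an alert."""
    return rate > limit

# Two octet-counter samples taken 300 seconds apart (made-up values).
rate = counter_rate(1_000_000, 4_000_000, 300)
print(rate)  # 10000.0
```

One limitation of this simple approach: a device reboot also resets the counter to a small value, and the function above would misread that as a wrap.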
An important aspect of any open source software project is its community of supporters and developers, and both OpenNMS and Zenoss have strong followings. At the OpenNMS Group (the company), all but one of the employees are developers, while approximately 60 percent of the OpenNMS project development team is composed of community programmers who have no direct affiliation with the OpenNMS Group. Zenoss is developed almost entirely by company employees, but the community of Zenoss users and customers has produced more than 100 ZenPacks to provide additional functionality for Zenoss.
For an application of the magnitude of these two network monitoring systems, you will almost certainly run into some configuration issues or unclear features, so support is important. As part of my research for this review, I made use of the support groups at both companies for help with their respective products. OpenNMS and Zenoss both did a great job with support. They were able to answer my questions quickly and were completely courteous and friendly at all times.
OpenNMS: Superior value
OpenNMS is a purely open source software project. This means that you get the full version of the software for free as open source; there is no extended, "enterprise" version. The business model used by the OpenNMS company is to sell support and training services for the OpenNMS open source software. Most of the company's employees are developers or contributors to the open source project. There is also a healthy development community behind the project. While the company does not have large financial backers, it likes to brag of its prudent accounting strategy: "We spend less than we make."
One of my favorite features of OpenNMS is the ability to integrate with several help desk software applications, including Request Tracker (RT), ConcourseSuite, JIRA, OTRS, and Intuit's QuickBase. Unlike some monitoring systems (including Zenoss), OpenNMS does not merely open help desk tickets by sending e-mails to the application. OpenNMS uses the API for the help desk application, which allows OpenNMS to open, update, and close tickets, as well as provide links from the OpenNMS Web interface to the appropriate ticket in the help desk application's Web interface. As part of our GreenLight Project on-site consulting, we had the OpenNMS server tied in to my company's Request Tracker help desk system and configured OpenNMS to open tickets for network device down alerts, and update and close the tickets once the network device came back online.
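The difference between e-mailing a help desk and driving its API is easiest to see in code. The following sketch assumes a help desk that exposes open, update, and close operations; `InMemoryHelpDesk` and `TicketingBridge` are hypothetical stand-ins written for this illustration, not OpenNMS classes or the Request Tracker API.

```python
class InMemoryHelpDesk:
    """Stand-in for a help desk API; a real one would make network calls."""

    def __init__(self):
        self.tickets = {}
        self._next_id = 1

    def open(self, subject):
        ticket_id = self._next_id
        self._next_id += 1
        self.tickets[ticket_id] = {"subject": subject, "status": "open", "notes": []}
        return ticket_id

    def update(self, ticket_id, note):
        self.tickets[ticket_id]["notes"].append(note)

    def close(self, ticket_id):
        self.tickets[ticket_id]["status"] = "closed"


class TicketingBridge:
    """Remembers which ticket belongs to which node so it can close it later."""

    def __init__(self, desk):
        self.desk = desk
        self.open_by_node = {}

    def node_down(self, node):
        # Open a ticket only if this node does not already have one.
        if node not in self.open_by_node:
            self.open_by_node[node] = self.desk.open(f"{node} is down")
        return self.open_by_node[node]

    def node_up(self, node):
        # Update and close the matching ticket, if there is one.
        ticket_id = self.open_by_node.pop(node, None)
        if ticket_id is not None:
            self.desk.update(ticket_id, f"{node} is back online")
            self.desk.close(ticket_id)
        return ticket_id


desk = InMemoryHelpDesk()
bridge = TicketingBridge(desk)
first = bridge.node_down("router1")
again = bridge.node_down("router1")   # repeated alert, same ticket
closed = bridge.node_up("router1")
print(first, again, closed, desk.tickets[first]["status"])
```

An e-mail-only integration can manage the first step, but it has no ticket ID to come back to when the device recovers.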
Another strong point of OpenNMS is its alerting and notification system, which abstracts alarms into event notifications and user notification paths. Event notifications define what sort of changes in the network will trigger an alert. User notification paths define who gets notified, how they get notified, and when they get notified, as well as a notification escalation procedure for alerts. Although setting up alerts in OpenNMS is not quite intuitive, the OpenNMS design gives the user a great amount of flexibility. The event notifications and user notification paths can be reused. Once you've configured an event notification rule for one network device (to watch for high network throughput, for instance), you can use that same rule for other network devices. And if you have a standard set of employees for day shift and night shift, then you might need only a handful of rules for thousands of devices. The more devices you monitor, the more you'll appreciate the OpenNMS notification system.
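The split between "what triggers an alert" and "who hears about it, and when" can be modeled with two small data structures. The sketch below is an assumption-laden illustration of that separation, with invented rule, path, and group names; it does not reproduce the OpenNMS XML configuration.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Step:
    delay_minutes: int   # wait this long after the event before notifying
    target: str          # user or group to notify
    method: str          # for example "email" or "pager"

@dataclass(frozen=True)
class Rule:
    event: str           # kind of event that triggers the rule
    path: str            # name of the notification path to follow

# Paths are defined once and shared by any number of rules.
PATHS = {
    "day-shift": [Step(0, "noc-day", "email"), Step(15, "day-manager", "pager")],
    "night-shift": [Step(0, "noc-night", "pager"), Step(30, "on-call-lead", "pager")],
}

# Rules name an event type, not a device, so they apply to every device.
RULES = [Rule("nodeDown", "night-shift"), Rule("highThroughput", "day-shift")]

def due_notifications(event, minutes_unacknowledged, rules=RULES, paths=PATHS):
    """Return the steps that should have fired for an unacknowledged event."""
    due = []
    for rule in rules:
        if rule.event == event:
            due.extend(
                step for step in paths[rule.path]
                if step.delay_minutes <= minutes_unacknowledged
            )
    return due

print(due_notifications("nodeDown", 10))
print(due_notifications("nodeDown", 30))
```

Escalation falls out of the delay on each step: the longer an event goes unacknowledged, the further down the path the notifications reach.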
OpenNMS supports several methods of alert notification. Of course e-mail and e-mail-based pager notifications are supported out of the box, but OpenNMS can also send IM alerts via the XMPP (Jabber) protocol, as well as traditional numeric and text pager services. If you want to use another type of notification service, then you can edit an OpenNMS XML configuration file so that a command-line utility of your choice sends the notification messages. OpenNMS uses the concept of a duty schedule to provide further flexibility and reusability of the alert and notification rules. Duty schedules can be applied to users, groups, and roles, preventing off-duty employees' pagers from waking them up in the middle of the night when they're not supposed to be working.
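A duty schedule is essentially a filter applied just before a notification goes out. The sketch below shows one way such a filter could work, including a shift that runs past midnight; the users, hours, and data layout are hypothetical and are not the OpenNMS format.

```python
from datetime import datetime, time

# Hypothetical duty schedules: weekday numbers (Monday is 0) plus shift hours.
SCHEDULES = {
    "alice": {"days": {0, 1, 2, 3, 4}, "start": time(7, 0), "end": time(19, 0)},
    "bob": {"days": {0, 1, 2, 3, 4}, "start": time(19, 0), "end": time(7, 0)},
}

def on_duty(user, when, schedules=SCHEDULES):
    """Return True if the user's duty schedule covers the given moment."""
    sched = schedules[user]
    start, end, now = sched["start"], sched["end"], when.time()
    if start <= end:                       # same-day shift
        return when.weekday() in sched["days"] and start <= now < end
    if now >= start:                       # evening part of an overnight shift
        return when.weekday() in sched["days"]
    if now < end:                          # morning part: shift began yesterday
        return (when.weekday() - 1) % 7 in sched["days"]
    return False

def recipients(users, when):
    """Drop anyone who is off duty at the time of the alert."""
    return [user for user in users if on_duty(user, when)]

afternoon = recipients(["alice", "bob"], datetime(2009, 10, 1, 16, 11))
small_hours = recipients(["alice", "bob"], datetime(2009, 10, 2, 3, 0))
print(afternoon, small_hours)
```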
My company has used various versions of OpenNMS in production for more than five years now, and we have seen that it scales up very well to monitor thousands of devices. In fact, OpenNMS has at least one customer with 144,000 devices being monitored. Of those 144,000 devices, SNMP performance data is collected on 50,000 interfaces, resulting in 450,000 data points being amassed every five minutes.
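To put those figures in perspective, here is the arithmetic using only the numbers quoted above (450,000 data points, 50,000 interfaces, a five-minute interval). The daily total assumes the same rate holds around the clock.

```python
data_points = 450_000
interfaces = 50_000
interval_seconds = 5 * 60

per_second = data_points / interval_seconds                      # sustained write rate
per_interface = data_points / interfaces                         # metrics per interface
per_day = data_points * (24 * 60 * 60 // interval_seconds)       # 288 polls per day

print(per_second)     # 1500.0
print(per_interface)  # 9.0
print(per_day)        # 129600000
```

In other words, that installation sustains roughly 1,500 data points per second, or about nine metrics per monitored interface.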
A common weak spot for open source software projects is documentation, and OpenNMS is no exception. The documentation is supplied via the opennms.org wiki, which should provide for easier collaboration on documentation but only partially delivers the goods. Although there is plenty of good documentation for OpenNMS available, the organization of that documentation is odd. Instead of being written as a beginning-to-end software manual, it is a collection of docs on individual OpenNMS features. I should note that OpenNMS does offer a set of how-to and reference docs as part of the installation, but these are not extensive documents, and they are often somewhat outdated. The company recognizes that documentation is a weakness of the project, and it is working on a new set of documentation for its upcoming 1.8 release.
Another common weak spot for open source software projects is the user interface. OpenNMS seems to have almost a love/hate relationship with its Web-based GUI. The Web interface is attractive in its simplicity, but the lack of AJAX features makes it feel a bit clunky; for instance, it takes several clicks and full page loads in the browser to alter a configuration setting. However, one side effect of a simple Web interface is that it is very fast in a Web browser. The Web interface is being updated for the version 1.8 release to include more AJAX features to reduce page loads.
OpenNMS comes up short on some enterprise-grade features, and these will be especially apparent for xSP service provider companies. For example, OpenNMS does not have a full ACL system to restrict users to particular nodes or screens within the Web interface. Currently, admins can set up read-only, view-controlled dashboards for select users. This provides only some of the functionality needed in a full ACL setup, because users with limited access cannot move beyond the dashboard screen. OpenNMS is working on an implementation of full-blown ACLs for a dot-release of its upcoming version 1.8 series.
A feature that has become more common among high-end network monitoring systems over the past few years is network topology discovery and mapping. This allows the system to find switches and routers and provide a simple network diagram. Some implementations also use the network topology data to improve outage alerts by notifying administrators about a router outage, but not sending notifications about the unreachable devices behind the router. Thanks to this "root-cause analysis," instead of receiving hundreds of alerts during a router outage, administrators would receive only a single notification about the router itself.
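The suppression logic behind root-cause analysis is simple once the topology is known: a down device is only worth an alert if nothing upstream of it is also down. The sketch below assumes the topology is available as a child-to-parent map; the device names are invented, and real products may use a different algorithm.

```python
def root_causes(down, parent):
    """Keep only down nodes whose upstream path contains no other down node."""
    down = set(down)
    causes = []
    for node in sorted(down):
        upstream = parent.get(node)
        seen = set()
        while upstream is not None and upstream not in seen:
            if upstream in down:
                break              # something upstream explains this outage
            seen.add(upstream)
            upstream = parent.get(upstream)
        else:
            causes.append(node)    # reached the top without finding a cause
    return causes

# Hypothetical topology: each device maps to the device it sits behind.
PARENT = {
    "core-router": None,
    "branch-router": "core-router",
    "web1": "branch-router",
    "web2": "branch-router",
    "db1": "core-router",
}

alerts = root_causes(["branch-router", "web1", "web2"], PARENT)
print(alerts)  # ['branch-router']
```

Three unreachable devices produce one notification, which is the whole point of the feature.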
The current version (1.6.5) of OpenNMS does topology discovery, but its auto-generated network map only works with Internet Explorer. This will change with OpenNMS 1.8, when a switch from SVG 1.2 to SVG 1.1 will allow network maps to be rendered in most modern browsers. Another disappointment is that OpenNMS does not currently use topology data to automatically set up root-cause relationships. It does provide a way to manually configure root-cause relationships for smarter alert notifications, but this can require a lot of manual configuration for a large deployment. Auto configuration of root-cause relationships is not slated for the 1.8 series, which puts its inclusion at an unknown future date.
A final enterprise feature not fully implemented by OpenNMS is distributed collection of monitoring data. With a network monitoring system, we ideally should be able to add multiple collection servers to our monitoring system, all of which gather monitoring data from nearby devices and report back to the main monitoring server. Currently, OpenNMS can split the monitoring across multiple machines, but these systems will write directly to the primary database and thus need to be located on a fast network link to the primary OpenNMS server. The OpenNMS developers are in the process of making each component of the system capable of handling distributed collection. However, this will be done piecemeal as the developers make other improvements to each component over the entire remainder of the 1.x series of releases. Full-fledged distributed collection will be the defining criterion for the 2.0 release, with no target date currently set.
That said, OpenNMS is able to monitor tens of thousands of devices, with hundreds of thousands of data collection points, from a single monitoring server, so splitting OpenNMS data collection services across multiple servers is necessary only for the largest of enterprise networks.
The OpenNMS Group is developing an iPhone app for OpenNMS monitoring. The app communicates with the OpenNMS server to allow admins to view and acknowledge alarms from their iPhones. Now you have one more argument on your side to convince your boss that the IT department needs to have iPhones instead of BlackBerrys.
The OpenNMS Group has several support options as well as training and consulting. We chose to use the GreenLight Project for this report. This got us one week with an OpenNMS consultant, whom we used to help us with installation, configuration, and some specialized integration tasks. The GreenLight Project costs $22,995 and includes one year of standard support available from 7 a.m. to 7 p.m. on business days. The 24/7 GreenLight Project for $44,995 provides two weeks of on-site consulting, plus a year of around-the-clock support. The OpenNMS Group provides several other options, including standard support for $14,995 per year and 24/7 support for $29,995 per year. Training classes are available as well.
Zenoss Enterprise: Superior functionality
Billing itself as a "commercial open source" company, Zenoss uses a common business model in the open source world: It provides an open source version of its software for free with a limited feature set, as well as an enhanced "enterprise" version through an annual software subscription that also includes support.
The free, open source version of Zenoss, called Zenoss Core, provides a good set of basic monitoring tools, but is missing most of the more expansive features that bring the Zenoss Enterprise version up to the level of enterprise monitoring systems such as OpenView. Improved reporting, integration with Remedy and RANCID, Windows WMI support, role-based access control, and support for several commercial software applications (VMware, Oracle, and so on) are only available with the Zenoss Enterprise subscription. Although Zenoss Core is a nice piece of software, it has a limited feature set compared to HP, IBM, CA, and OpenNMS. Thus, we are using the Zenoss Enterprise software for comparison in this article.
The Enterprise annual software subscription with Platinum support (7 a.m. to 8 p.m. business days) costs $180 per network device, with price breaks starting at 1,000 devices. Considering there are no up-front costs to purchase the software, it presents itself as a competitor against such industry stalwarts as HP OpenView and IBM Tivoli. Included with the Platinum level support is a two-hour response time SLA for high-severity issues, four hours of deployment planning with a Zenoss architect, training in Zenoss administration for three, and unlimited support via e-mail, Web portal, and telephone. The Zenoss sales team noted that 24/7 support is available as well, though its pricing is not published.
My company tried using Zenoss about two and a half years ago and found that it was not reliable enough for production use. I'm glad to report that the developers at Zenoss and the Zenoss community supporters have been busily improving the software, and the hard work has paid off. Today Zenoss is stable, reliable, and ready for use in the largest network operations centers. Zenoss boasts OpSource and Rackspace among its many customers, and they both monitor tens of thousands of devices.
Zenoss boasts a full ACL implementation, allowing an administrator to provide fine-grained control over what a given user is allowed to see and do on the system. This is an important feature for service providers who need to provide customers with access to their — and only their — monitoring and performance information. Of course, the ACLs also provide an enterprise administrator with the level of control needed to give very different sets of permissions to various classes of employees who use the monitoring system.
The topology discovery capability in Zenoss will cull route tables from your Layer 3 network devices and use that information to organize a network topology. This is then used to create network diagrams, with collapsible nodes that allow you to view the network as a whole or drill down into individual collision domains to see a detailed view of a particular network segment. My favorite feature, though, is the automatic configuration of root-cause relationships between routers and other network nodes. After all, if a router goes down, there's no use in the monitoring system notifying you that it can't see all the other network devices behind that failed router!
If you have a well-designed NOC with plenty of computer screen real estate, you'll enjoy the geographic network maps. Zenoss uses a Google Maps mashup to display your network across geographic locations. Companies with multiple sites across a large geographic area will find this quite useful as a quick at-a-glance overview of where problems are occurring. The caveat to this is that the free Google Maps API key requires the Web site using the key to be free and publicly available. Because Zenoss requires user authentication and companies typically want to keep their monitoring system information to themselves, they would be violating the licensing terms of the Google Maps API key. To use this feature, a company would therefore need to buy a Google Maps Premier (Enterprise) API key.
Zenoss Enterprise has a number of vendor- and technology-specific monitoring capabilities, such as Oracle databases, Tomcat server, IBM WebSphere, Active Directory, and more. OpenNMS generally matches these capabilities, though Zenoss does have more advanced VMware monitoring; whereas Zenoss covers the latest VMware vSphere 4, OpenNMS covers only VMware Infrastructure 3. Zenoss lacks the full integration with help desk systems that OpenNMS provides, but it can open tickets in the Remedy help desk system via e-mails. In fact, both monitoring systems can work with any help desk system that is able to process e-mails into tickets.
Zenoss has a polished Web interface with some nice AJAX features. My favorite UI feature is the use of small pop-up windows with a dark background and white text (similar to Growl notifications) to provide feedback when a setting has been successfully changed or an operation has completed. These windows provide useful feedback, without causing extra page loads or mouse clicks and without being annoying or obtrusive.
The Web interface also makes use of a sidebar navigation menu, tabbed information panels, and pull-down action menus. This format works well in most situations, but sometimes the pull-down menus are used when a simple clickable link would be more convenient. Such things are always subjective, but I found the Web interface to be quite usable, if a little busy at times.
Once a device has been discovered and classified, Zenoss will automatically collect performance data for it and create performance graphs. Zenoss uses the same RRDtool-based graphs as OpenNMS and so many other monitoring systems. The Zenoss graphs have one minor shortcoming compared to OpenNMS: OpenNMS presents all the performance graphs for a particular device on one page, which I like, and Zenoss does not. When troubleshooting problems on a device, it can be useful to see both interface performance graphs and device resource graphs (RAM, CPU, disk) in a single view, so you can more easily discover patterns, trends, and correlations, such as high numbers of network packets on a router whose CPU utilization spiked. Of course, when faced with this situation, I can just open two windows with the performance graphs that I need to compare and arrange them side by side.
Zenoss has better documentation than OpenNMS. The Zenoss docs are well organized, and they come in both PDF and HTML formats. The material is organized by task in the Installation and Getting Started guides, and written to take you step by step through the installation and configuration of your Zenoss system. However, these docs lack some important performance-tuning tips that will be crucial to larger companies using Zenoss to monitor thousands of devices. Thus, it is important that customers take advantage of the planning and installation support provided to Enterprise Platinum customers to achieve maximum performance of their monitoring systems.
Alerts are more intuitive to configure on Zenoss than they are on OpenNMS. However, the simplicity comes at a cost: Zenoss alerts lack the flexibility of the OpenNMS model, which is so useful to larger IT shops. Also, an alert escalation in Zenoss is built as a separate alert notification rather than as part of the alert notification you want to escalate. The Zenoss notification system appears to be designed for easier configuration in smaller IT departments, but will create extra work for companies with larger IT departments who have multiple tiers of network support and 24/7 staff.
IT shops might want to keep track of software programs installed on each machine to ensure licensing compliance and to watch for employees installing undesirable programs on company workstations. Zenoss Enterprise can maintain a list of installed software packages on each server or workstation that it monitors. This is not exactly a "network monitoring" feature, but IT departments will appreciate this capability.
Distributed collectors for Zenoss are extremely easy to set up. Once a base OS installation has been completed on the remote collector machine, you can give the primary Zenoss Web console the root log-in information and it will install and configure the Zenoss software on the remote collector for you. These distributed collectors can be used to lighten the load on the primary Zenoss server or to get data from behind a NAT router if a VPN tunnel is not a feasible option for you.
Zenoss is currently working on improved features for its distributed data collection system, including simplified installation on highly secured servers and having remote collectors queue up data for later transmission when a network outage prevents them from immediately sending the data back to the central Zenoss server. A new, streamlined event console is also in the works, as is an encrypted data repository, allowing Zenoss users to protect any sensitive information being collected by the system.
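The planned "queue up data for later transmission" behavior is a classic store-and-forward pattern. The sketch below shows the general idea under stated assumptions: the `BufferingCollector` class, its bounded in-memory queue, and the `send` callback are all hypothetical, and nothing here reflects how Zenoss intends to build the feature.

```python
from collections import deque

class BufferingCollector:
    """Holds samples while the central server is unreachable."""

    def __init__(self, send, max_buffered=10_000):
        self._send = send
        # Bounded queue: once full, the oldest samples are discarded first.
        self._queue = deque(maxlen=max_buffered)

    def record(self, sample):
        """Queue a sample, then try to deliver everything that is waiting."""
        self._queue.append(sample)
        return self.flush()

    def flush(self):
        """Send queued samples in order; stop at the first failure."""
        sent = 0
        while self._queue:
            try:
                self._send(self._queue[0])
            except ConnectionError:
                break
            self._queue.popleft()
            sent += 1
        return sent

    @property
    def pending(self):
        return len(self._queue)


# Simulated link to the central server.
received = []
link = {"up": False}

def send(sample):
    if not link["up"]:
        raise ConnectionError("central server unreachable")
    received.append(sample)

collector = BufferingCollector(send)
collector.record("sample-1")
collector.record("sample-2")
pending_during_outage = collector.pending
link["up"] = True
sent_after_recovery = collector.record("sample-3")
print(pending_during_outage, sent_after_recovery, received)
```

A production version would presumably persist the queue to disk so that a collector restart during an outage does not lose data.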
Cadillac or Chevy
Zenoss Enterprise and OpenNMS are both very capable and flexible network monitoring systems, and they present great values when compared with big applications from the likes of IBM, CA, and HP. These two network monitoring systems have a long list of features, and IT staff looking for a monitoring system should carefully consider their needs, then research both OpenNMS and Zenoss to determine which is a better fit for their organization.
Zenoss Enterprise clearly has a more developed feature set, including ACLs, VMware vSphere 4 support, automatic root-cause analysis, distributed collectors, collapsible network diagrams, and software inventories. However, those features come with a significant price as compared to OpenNMS, which is free, open source (GPL) software. Further, OpenNMS provides support at a flat rate; the cost is not tied to the number of network devices your company will monitor. Because Zenoss Enterprise is sold as a subscription rather than an outright purchase, companies will have to commit to budgeting the cost of Zenoss each year. And the cost of your subscription will increase if you add more network devices to be monitored.
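A quick break-even calculation makes the pricing difference concrete. It uses only the list prices quoted in this review ($180 per device per year for Zenoss Enterprise with Platinum support, $14,995 per year for OpenNMS standard support) and deliberately ignores the Zenoss price breaks above 1,000 devices, the slightly different support hours, and the difference in feature sets, so treat it as a rough guide for smaller networks only.

```python
import math

ZENOSS_PER_DEVICE = 180      # dollars per device per year, Platinum support
OPENNMS_STANDARD = 14_995    # dollars per year, flat, standard support

def zenoss_annual(devices):
    """Annual Zenoss Enterprise cost, valid only below the 1,000-device price break."""
    return devices * ZENOSS_PER_DEVICE

# Smallest device count at which Zenoss costs more than OpenNMS standard support.
break_even = math.ceil(OPENNMS_STANDARD / ZENOSS_PER_DEVICE)

print(break_even)          # 84
print(zenoss_annual(83))   # 14940
print(zenoss_annual(500))  # 90000
```

Under these assumptions the two support bills cross at about 84 devices; beyond that point the per-device subscription is the more expensive line item.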
If you need the extra features of Zenoss Enterprise and can allocate the annual budget for it, then Zenoss will be a terrific purchase for your company. You will have no trouble selling the idea of using Zenoss to the company higher-ups when you stack it up against HP, IBM, or CA. If you have a limited budget or do not need the extra features of Zenoss Enterprise, then OpenNMS is a real winner, and support is easy on the bottom line.
Furthermore, if you are not in a hurry to deploy a new network monitoring system right now, then it would be prudent to wait a month or two to see how the version 1.8 release from OpenNMS turns out. If OpenNMS delivers on the planned features, then a number of the software's shortcomings (including VMware vSphere 4 support) will be addressed and OpenNMS will become a more suitable fit for some companies.