Apache Lucene/Solr expands across many servers

Lucene/Solr 4.0 comes with a new distributed indexing architecture and can offer real-time results

The Apache Software Foundation's widely used open source Lucene/Solr search engine package has been upgraded to accommodate its users' seemingly insatiable need to collect and use ever-larger amounts of data.

"The biggest improvement that has happened to Lucene/Solr is scalability," said Sarath Jarugula, vice president of product management at LucidWorks, which offers a commercially supported version of Lucene/Solr. "Lucene/Solr has been re-architected to index data across hundreds of servers," he said.

The keepers of the project plan to release Lucene/Solr 4.0 within the next day or so. Version 4.0 has been three years in the making.

While IT professionals may not have heard of Lucene or Solr, many probably have used these technologies at some point, as the software is embedded in a number of enterprise search products. Many e-commerce and social media sites, such as Facebook and Twitter, also use Lucene/Solr to power their search services.

Doug Cutting, who also created the Apache Hadoop data processing platform, built Lucene as a full-text search engine based on Java. While Lucene is a Java library of search functions, Solr provides an API (application programming interface) so other applications can interface with Lucene. Although Lucene and Solr started as separate projects, the two were merged into a single entity in 2010, now called Apache Lucene/Solr.

This new update reflects how organizations are ingesting and reusing more and more data.

Ten years ago, Jarugula noted, larger organizations might have stored a few million electronic documents, which collectively took up several hundred gigabytes. These days, however, such repositories have ballooned in size: It is not uncommon for Jarugula to encounter organizations that generate a terabyte of data a day.

Lucene/Solr has been updated to handle such larger workloads.

Most significantly, the Solr component includes a new technique called distributed indexing, which divides document indexing duties across multiple servers to speed response time even as the data sets grow larger. To further speed operations, Solr now can spawn multiple threads to index material, with each thread being able to write to disk concurrently.
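
The idea of dividing indexing duties across servers can be illustrated with a small sketch. This is not Solr's implementation; it assumes a simple hash of the document ID picks the shard, and the names `shard_for` and `ShardedIndex` are hypothetical. Each in-memory dictionary stands in for one server's index.

```python
import hashlib
from collections import defaultdict


def shard_for(doc_id, num_shards):
    """Pick a shard from a stable hash of the document ID (an assumed routing rule)."""
    digest = hashlib.md5(doc_id.encode("utf-8")).digest()
    return int.from_bytes(digest[:4], "big") % num_shards


class ShardedIndex:
    """Toy inverted index split across several shards (stand-ins for servers)."""

    def __init__(self, num_shards):
        self.num_shards = num_shards
        # One term -> set-of-document-IDs map per shard.
        self.shards = [defaultdict(set) for _ in range(num_shards)]

    def add(self, doc_id, text):
        # Only the shard that owns the document does the indexing work.
        shard = self.shards[shard_for(doc_id, self.num_shards)]
        for term in text.lower().split():
            shard[term].add(doc_id)

    def search(self, term):
        # A query fans out to every shard and the partial results are merged.
        hits = set()
        for shard in self.shards:
            hits |= shard.get(term.lower(), set())
        return hits
```

In this sketch each document is indexed by exactly one shard, so indexing work grows with the number of documents per shard rather than with the whole collection, while a search still has to consult every shard.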

The software can now also recognize when it operates in a clustered server environment and adjust its actions to the new setup. This set of technologies comes under the name SolrCloud. "If you have a cluster, Solr will know when any server goes down and will watch for when it comes back up," Jarugula said. To help with these duties, Lucene/Solr uses the Apache ZooKeeper cluster configuration management software.
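
The behavior Jarugula describes, noticing when a server drops out and when it returns, can be sketched as a heartbeat monitor. This is an in-memory illustration only: it does not use ZooKeeper or any Solr API, and the class name `ClusterMonitor`, the timeout value and the explicit `now` timestamps are all assumptions made for the example.

```python
class ClusterMonitor:
    """Toy liveness tracker: a node is live if it sent a heartbeat recently."""

    def __init__(self, timeout_seconds):
        self.timeout_seconds = timeout_seconds
        self.last_heartbeat = {}  # node name -> time of last heartbeat
        self.live = set()         # nodes currently considered up

    def heartbeat(self, node, now):
        # Called whenever a node reports in; `now` is a timestamp in seconds.
        self.last_heartbeat[node] = now

    def poll(self, now):
        """Return (node, "up") or (node, "down") for each state change since the last poll."""
        events = []
        for node in sorted(self.last_heartbeat):
            alive = now - self.last_heartbeat[node] <= self.timeout_seconds
            if alive and node not in self.live:
                self.live.add(node)
                events.append((node, "up"))
            elif not alive and node in self.live:
                self.live.discard(node)
                events.append((node, "down"))
        return events
```

Passing the clock in as `now` keeps the sketch deterministic; a real coordination service would track sessions itself and push notifications to interested clients.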

The distributed indexing also shortens the time indexed material is made available to users, which paves the way for real-time search. Typically, enterprise search engines only update their indices once a day, or once every few hours. Lucene can now update continuously, even with a data set of billions of documents. "You can now index on a per-second basis," Jarugula said.
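
The difference between a nightly batch update and continuous updating comes down to how often newly added documents are made visible to searches. The sketch below is a simplification under stated assumptions, not Lucene's actual mechanism: new documents wait in a buffer until `refresh` is called, and the class name `NearRealTimeIndex` is hypothetical.

```python
class NearRealTimeIndex:
    """Toy index where new documents become searchable at the next refresh."""

    def __init__(self):
        self.pending = []      # documents added since the last refresh
        self.searchable = {}   # term -> set of document IDs visible to searches

    def add(self, doc_id, text):
        # Adding is cheap: the document is only buffered here.
        self.pending.append((doc_id, text))

    def refresh(self):
        """Make all buffered documents searchable; return how many were published."""
        count = len(self.pending)
        for doc_id, text in self.pending:
            for term in text.lower().split():
                self.searchable.setdefault(term, set()).add(doc_id)
        self.pending.clear()
        return count

    def search(self, term):
        return set(self.searchable.get(term.lower(), set()))
```

Calling `refresh` once a day models the traditional batch approach; calling it every second models the per-second indexing Jarugula describes. The shorter the interval, the smaller the gap between a document arriving and appearing in results.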

As a result, as soon as a document has been entered into a repository, it can be indexed and will start appearing in search results. This feature also reflects the changing needs of the enterprise. Thanks to the influence of Twitter and Facebook, "as I send an email or update a document, I want it to be immediately available to my colleagues," Jarugula said.

Lucene/Solr 4.0 will also offer a number of other features, such as versioning -- in which older versions of data are retained -- and a new Web-based administrative interface.

One organization looking forward to the new edition is deal-of-the-day Internet service Groupon. Groupon uses the open source version of Lucene/Solr and contracts with LucidWorks for engineering support. "Lucene/Solr is highly competitive against other commercial offerings," said Jeff Ayars, who is a Groupon vice president of engineering.

Groupon uses Lucene/Solr to index all the emails it sends to its users, Ayars said. Emails are customized for each user, so as a result, "tens of millions of new documents are indexed daily," Ayars said. When a user calls the company, a representative can search for the specific email that the caller has a question about. The company also uses Lucene/Solr's geospatial indexing capabilities to provide each user information about nearby deals.

Perhaps not surprisingly, Ayars is most looking forward to the new clustering features of Lucene/Solr 4.0. "There's been recipes for clustering with Solr for a very long time. But it's helpful for us to have baked-in support," Ayars said.

The Apache Lucene/Solr project has 37 core committers, nine of whom work for LucidWorks (which was previously called Lucid Imagination). Users of LucidWorks' Lucene/Solr commercial package include AT&T, Ford, Verizon, Cisco, Raytheon, Salesforce.com, Qualcomm and eHarmony.

Joab Jackson covers enterprise software and general technology breaking news for The IDG News Service. Follow Joab on Twitter at @Joab_Jackson. Joab's e-mail address is Joab_Jackson@idg.com

Joab Jackson

IDG News Service
