Apache Lucene/Solr expands across many servers

Lucene/Solr 4.0 comes with a new distributed indexing architecture and can offer real-time results

The Apache Software Foundation's widely used open source Lucene/Solr search engine package has been upgraded to accommodate its users' seemingly insatiable need to collect and use ever-larger amounts of data.

"The biggest improvement that has happened to Lucene/Solr is scalability," said Sarath Jarugula, vice president of product management at LucidWorks, which offers a commercially supported version of Lucene/Solr. "Lucene/Solr has been re-architected to index data across hundreds of servers," he said.

The keepers of the project plan to release Lucene/Solr 4.0 within the next day or so. Version 4.0 has been three years in the making.

While IT professionals may not have heard of Lucene or Solr, many probably have used these technologies at some point, as the software is embedded in a number of enterprise search products. Many e-commerce and social media sites, such as Facebook and Twitter, also use Lucene/Solr to power their search services.

Doug Cutting, who also created the Apache Hadoop data processing platform, built Lucene as a full-text search engine based on Java. While Lucene is a Java library of search functions, Solr provides an API (application programming interface) so other applications can interface with Lucene. Although Lucene and Solr started as separate projects, the two were merged into a single entity in 2010, now called Apache Lucene/Solr.

This new update reflects how organizations are ingesting and reusing more and more data.

Ten years ago, Jarugula noted, larger organizations might have stored a few million electronic documents, which collectively took up several hundred gigabytes. These days, however, such repositories have ballooned in size: It is not uncommon for Jarugula to encounter organizations that generate a terabyte of data a day.

Lucene/Solr has been updated to handle such larger workloads.

Most significantly, the Solr component includes a new technique called distributed indexing, which divides document indexing duties across multiple servers to speed response time even as the data sets grow larger. To further speed operations, Solr now can spawn multiple threads to index material, with each thread being able to write to disk concurrently.
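
The idea of dividing indexing duties across servers can be illustrated with a generic hash-partitioning sketch. The article does not describe how Solr actually routes documents, so this is only an assumption-laden illustration of the general technique, not Solr's implementation; the function names `shard_for` and `partition` and the document IDs are hypothetical.

```python
import hashlib
from collections import defaultdict


def shard_for(doc_id: str, num_shards: int) -> int:
    """Map a document ID to a shard number in [0, num_shards).

    Uses a stable hash (MD5 here, chosen only for determinism, not security)
    so the same ID always lands on the same shard.
    """
    if num_shards < 1:
        raise ValueError("num_shards must be at least 1")
    digest = hashlib.md5(doc_id.encode("utf-8")).digest()
    return int.from_bytes(digest[:4], "big") % num_shards


def partition(doc_ids, num_shards: int) -> dict:
    """Group document IDs by the shard that would index them."""
    shards = defaultdict(list)
    for doc_id in doc_ids:
        shards[shard_for(doc_id, num_shards)].append(doc_id)
    return dict(shards)


if __name__ == "__main__":
    # Hypothetical document IDs, spread over three hypothetical servers.
    docs = [f"email-{n}" for n in range(10)]
    for shard, ids in sorted(partition(docs, 3).items()):
        print(shard, ids)
```

Because each server indexes only its own share of the documents, indexing work can proceed in parallel; a query would then be sent to every shard and the results merged.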

The software can now also recognize when it operates in a clustered server environment and adjust its actions to the new setup. This set of technologies comes under the name SolrCloud. "If you have a cluster, Solr will know when any server goes down and will watch for when it comes back up," Jarugula said. To help with these duties, Lucene/Solr uses the Apache ZooKeeper cluster configuration management software.
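
As a rough illustration of what tracking cluster membership involves, the sketch below keeps a timestamp of the last heartbeat from each server and treats a server as down once it has been silent for too long. It is a toy in-memory model under stated assumptions: it does not use ZooKeeper or any Solr API, and the class name `ClusterView`, its methods, and the node names are hypothetical.

```python
class ClusterView:
    """Toy liveness tracker: a node is live if it reported recently enough."""

    def __init__(self, timeout_seconds: float):
        self.timeout_seconds = timeout_seconds
        self._last_seen = {}  # node name -> time of most recent heartbeat

    def heartbeat(self, node: str, now: float) -> None:
        """Record that `node` checked in at time `now` (seconds)."""
        self._last_seen[node] = now

    def live_nodes(self, now: float) -> set:
        """Nodes whose last heartbeat is within the timeout window."""
        return {
            node
            for node, seen in self._last_seen.items()
            if now - seen <= self.timeout_seconds
        }

    def down_nodes(self, now: float) -> set:
        """Known nodes that have gone silent for longer than the timeout."""
        return set(self._last_seen) - self.live_nodes(now)


if __name__ == "__main__":
    view = ClusterView(timeout_seconds=5.0)
    view.heartbeat("server-a", now=0.0)
    view.heartbeat("server-b", now=0.0)
    view.heartbeat("server-a", now=8.0)  # server-b has stopped reporting
    print("down:", view.down_nodes(now=9.0))
    view.heartbeat("server-b", now=9.5)  # server-b comes back up
    print("live:", view.live_nodes(now=10.0))
```

A node that resumes sending heartbeats is counted as live again automatically, which mirrors the "watch for when it comes back up" behavior described in the quote.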

The distributed indexing also shortens the time indexed material is made available to users, which paves the way for real-time search. Typically, enterprise search engines only update their indices once a day, or once every few hours. Lucene can now update continuously, even with a data set of billions of documents. "You can now index on a per-second basis," Jarugula said.

As a result, as soon as a document has been entered into a repository, it can be indexed and will start appearing in search results. This feature also reflects the changing needs of the enterprise. Thanks to the influence of Twitter and Facebook, "as I send an email or update a document, I want it to be immediately available to my colleagues," Jarugula said.

Lucene/Solr 4.0 will also offer a number of other features, such as versioning -- in which older versions of data are retained -- and a new Web-based administrative interface.

One organization looking forward to the new edition is deal-of-the-day Internet service Groupon. Groupon uses the open source version of Lucene/Solr and contracts with LucidWorks for engineering support. "Lucene/Solr is highly competitive against other commercial offerings," said Jeff Ayars, who is a Groupon vice president of engineering.

Groupon uses Lucene/Solr to index all the emails it sends to its users, Ayars said. Emails are customized for each user, so "tens of millions of new documents are indexed daily," Ayars said. When a user calls the company, a representative can search for the specific email that the caller has a question about. The company also uses Lucene/Solr's geospatial indexing capabilities to provide each user information about nearby deals.

Perhaps not surprisingly, Ayars is most looking forward to the new clustering features of Lucene/Solr 4.0. "There's been recipes for clustering with Solr for a very long time. But it's helpful for us to have baked-in support," Ayars said.

The Apache Lucene/Solr project has 37 core committers, nine of whom work for LucidWorks (which was previously called Lucid Imagination). Users of LucidWorks' Lucene/Solr commercial package include AT&T, Ford, Verizon, Cisco, Raytheon, Salesforce.com, Qualcomm and eHarmony.

Joab Jackson covers enterprise software and general technology breaking news for The IDG News Service. Follow Joab on Twitter at @Joab_Jackson. Joab's e-mail address is Joab_Jackson@idg.com
