NoSQL offers users scalability, flexibility, speed

User case studies at the NoSQL Now conference show NoSQL being used for a variety of reasons

Users of NoSQL databases and data processing frameworks such as CouchDB and Hadoop are deploying these new technologies for their speed, scalability and flexibility, judging from a number of sessions at the NoSQL Now conference being held this week in San Jose, California.

EMC is using a mixture of traditional databases and newfangled NoSQL data stores to analyze public perception of the company and its products, explained Subramanian Kartik, distinguished EMC engineer, during one talk.

The process, called sentiment analysis, involves scanning hundreds of technology blogs, finding mentions of EMC and its products, and assessing whether the references are positive or negative based on the words in the text.

To execute the analysis, EMC gathers the full text of all the blogs and Web pages mentioning EMC and feeds it into a version of MapReduce running on its Greenplum data analysis platform. It then uses Hadoop to weed out the Web markup code and non-essential words, which slims the data set considerably. Finally, it passes the word lists into SQL-based databases, where a more thorough quantitative analysis is done.
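The flow described above (strip markup and filler words, reduce to word lists, then score in SQL) can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not EMC's actual pipeline: the stop-word and sentiment word lists, the sample pages, and the function names are all hypothetical, and an in-memory SQLite database stands in for the SQL environment.

```python
import re
import sqlite3
from collections import Counter
from html.parser import HTMLParser

# Hypothetical word lists; a real system would use a far larger lexicon.
STOP_WORDS = {"the", "a", "an", "is", "and", "of", "its", "but", "was"}
POSITIVE = {"fast", "reliable", "great"}
NEGATIVE = {"slow", "buggy", "expensive"}


class TextExtractor(HTMLParser):
    """Collects only the text nodes of an HTML page, dropping the markup."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        self.chunks.append(data)


def map_page(html):
    """Map step: strip markup and stop words, emit per-page word counts."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    words = re.findall(r"[a-z']+", " ".join(parser.chunks).lower())
    return Counter(w for w in words if w not in STOP_WORDS)


def reduce_counts(counters):
    """Reduce step: merge per-page counts into one word list."""
    total = Counter()
    for c in counters:
        total.update(c)
    return total


def sentiment_score(word_counts):
    """Load the word list into SQL; return positive minus negative mentions."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE words (word TEXT PRIMARY KEY, n INTEGER)")
    conn.executemany("INSERT INTO words VALUES (?, ?)", word_counts.items())
    conn.execute("CREATE TABLE lexicon (word TEXT PRIMARY KEY, polarity INTEGER)")
    conn.executemany(
        "INSERT INTO lexicon VALUES (?, ?)",
        [(w, 1) for w in POSITIVE] + [(w, -1) for w in NEGATIVE],
    )
    (score,) = conn.execute(
        "SELECT COALESCE(SUM(w.n * l.polarity), 0) "
        "FROM words w JOIN lexicon l ON w.word = l.word"
    ).fetchone()
    conn.close()
    return score


# Made-up sample pages standing in for crawled blog posts.
pages = [
    "<p>The array is <b>fast</b> and reliable.</p>",
    "<div>Support was slow, but the product is great.</div>",
]
counts = reduce_counts(map_page(p) for p in pages)
score = sentiment_score(counts)
```

The split mirrors the hybrid approach: the map and reduce functions shrink raw pages to a compact word list, and only that summary reaches the SQL side, where joins and aggregates are cheap to express.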

The NoSQL technologies are useful in summarizing a huge data set, while SQL can then be used for a more detailed analysis, Kartik said, adding that this hybrid approach can be applied to many other areas of analysis as well.

"There is all sorts of information out there, and at some point you will have to go through tokenizing, parsing and natural language processing. The way to get to any meaningful quantitative measures of this data is to put it in an environment you know can manipulate it well, in a SQL environment," Kartik said.

For digital media company AOL, NoSQL products provide speed and volume that would not be possible using traditional relational databases.

The company uses Hadoop and the CouchDB NoSQL database to run its ad targeting operations, said Matt Ingenthron, manager of community relations for Couchbase, during another talk.

AOL has developed a system that picks out a set of targeted ads each time a user opens an AOL page. Which ads are chosen can be based on the data that AOL has on the user, along with algorithmic guesses about which ads would be of most interest to that user. The process must be executed within about 40 milliseconds.

Source data is voluminous. Logs are kept on all users' actions on every server. They must be parsed and reassembled to build a profile of each user. The ad brokers also set a complex set of rules of how much they will pay for an ad impression, or what ads should be shown to which users.

This activity generates 4 to 5 terabytes of data a day, and AOL has amassed 600 petabytes of operational data. The system maintains more than 650 billion keys, including one for every user, as well as keys for handling other aspects of the data. The system must react to 600,000 events every second.

Much of this source data arrives through data feeds from Web server logs and outside sources. The Hadoop Flume component is used to ingest the data. The Hadoop cluster also executes a series of MapReduce jobs to parse the raw data into summaries.
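As a rough illustration of how a MapReduce job can turn raw logs into per-user summaries, the Python sketch below parses log lines and counts each user's activity by topic. It is a toy model, not AOL's system: the log format, the sample data, and the function names are assumptions made for the example, and everything runs in one process rather than on a Hadoop cluster.

```python
from collections import defaultdict

# Hypothetical log format: "<user_id> <action> <topic>", one event per line.
RAW_LOGS = [
    "u1 view sports",
    "u2 view finance",
    "u1 click sports",
    "u1 view travel",
    "malformed line here extra",
    "u2 view finance",
]


def map_event(line):
    """Map step: parse a log line into ((user, topic), 1), or None if malformed."""
    parts = line.split()
    if len(parts) != 3:
        return None
    user, _action, topic = parts
    return (user, topic), 1


def reduce_profiles(pairs):
    """Reduce step: sum the counts per user and topic into a profile."""
    profiles = defaultdict(lambda: defaultdict(int))
    for pair in pairs:
        if pair is None:
            continue  # skip lines the map step could not parse
        (user, topic), n = pair
        profiles[user][topic] += n
    return {user: dict(topics) for user, topics in profiles.items()}


profiles = reduce_profiles(map_event(line) for line in RAW_LOGS)

# The most frequent topic per user is one simple signal for ad selection.
top_topic = {user: max(topics, key=topics.get) for user, topics in profiles.items()}
```

Because each line is mapped independently and the reduce step only adds counts, the same logic can be spread across many machines, which is what makes the approach practical at the volumes described.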

AOL also uses Couchbase's CouchDB as a switching station of sorts for data arriving from the feeds. Because CouchDB can work with data without writing it to disk, it can be used to parse data quickly before sending it to the next step.

"We didn't anticipate ad targeting to be a primary [market] for us. But Couchbase ended up filling a need for AOL and other ad companies," Ingenthron said. The work is "technically complex and has a lot of challenges in processing data very quickly."

Scientific and medical publishing house Elsevier was looking for greater flexibility when it procured an XML-based, non-relational database system from Mark Logic, said Elsevier Labs Vice President Bradley Allen.

The scientific publishing world is moving from a static model to a more dynamic one, Allen explained. For the past few centuries, the printed scientific paper, collected in journals, served as the basic unit of knowledge. It contained a description of the work, the authors and contributors, references and other core components of information. While the scientific publishing world is moving to digital, paper remains the dominant medium for data communication. "We're still in the horse-and-carriage era," Allen quipped.

Over time, the scientific paper will be decomposed into individual elements, which can be used in multiple products. Individual paragraphs or even individual assertions can be annotated and indexed, Allen predicted. They can then be reassembled into new works and embedded in applications, such as programs that doctors can consult. They can also be mined for new information through the use of analytics.

With this in mind, Elsevier is in the process of annotating the papers in its journals so they can be deployed in other applications and services. An XML database was a natural fit for this work, Allen explained. New content types can easily be added into a database, and the format allows individual components to be easily reused in new composite applications and services.
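To make the idea of decomposing, annotating and reassembling a paper concrete, here is a minimal Python sketch using the standard library's ElementTree module. It does not reflect Elsevier's schema or the Mark Logic API: the element names, attributes and sample text are invented for illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical article markup; element and attribute names are illustrative only.
ARTICLE_XML = """
<article id="a1">
  <title>Example study</title>
  <para id="p1" topic="methods">Samples were collected weekly.</para>
  <para id="p2" topic="results">Treatment reduced symptoms.</para>
</article>
"""

article = ET.fromstring(ARTICLE_XML)

# Decompose: index every paragraph by its own id so it can be addressed alone.
fragments = {
    p.get("id"): {
        "source": article.get("id"),
        "topic": p.get("topic"),
        "text": p.text,
    }
    for p in article.iter("para")
}

# Annotate: attach a new attribute without altering any fixed schema.
for p in article.iter("para"):
    if p.get("topic") == "results":
        p.set("assertion", "true")

# Reassemble: build a new composite document from the annotated fragments.
digest = ET.Element("digest")
for p in article.findall("para[@assertion='true']"):
    item = ET.SubElement(digest, "item", {"ref": p.get("id")})
    item.text = p.text

digest_xml = ET.tostring(digest, encoding="unicode")
```

Adding the `assertion` attribute required no migration step, which is the kind of flexibility Allen describes: new annotations and content types can be layered onto existing documents and then queried to assemble new products.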

Elsevier has introduced a number of new products with this approach. One is SciVal, a service for academic administrators that summarizes the publishing activity within their institution, giving them a quantitative idea of the organization's academic strengths and weaknesses. Another is ScienceDirect, a full-text search engine for Elsevier's journals.

Joab Jackson covers enterprise software and general technology breaking news for The IDG News Service. Follow Joab on Twitter at @Joab_Jackson. Joab's e-mail address is Joab_Jackson@idg.com
