IBM: Puzzles provide clues to better analysis

An IBM fellow explains why data analysis is much like assembling a picture puzzle

Today's large-scale data analysis may be a high-tech undertaking, but smart data scientists can improve their craft by observing how simple low-tech picture puzzles are solved, said an IBM scientist at the GigaOm conference.

Watching how people put together picture puzzles can reveal "a lot of profound effects that we could bring to big data" analysis, said Jeff Jonas, IBM's chief scientist for entity analytics, speaking Wednesday in one of the more whimsical presentations at the GigaOm Structure Data conference in New York.

Data analysis is becoming an increasingly important component of many businesses. IDC estimates enterprises will spend more than US$120 billion on analysis systems by 2015. IBM estimates that it will reap $16 billion in business analytics revenue by 2015.

But getting useful results from such systems requires careful planning.

In a series of informal experiments, Jonas observed how small groups of friends and family worked together to assemble picture puzzles, each consisting of thousands of separate pieces.

"My girlfriend sees her son and three cousins, I see four parallel processor pipelines," he said. To make the challenge a bit harder, he removed some pieces from the puzzles and, after obtaining second copies of some of them, mixed in duplicate pieces.

Puzzles are about assembling small bits of discrete data into larger pictures. In many ways, this is the goal of data analysis as well, namely finding ways of assembling data such that it reveals a bigger pattern.

A lot of organizations make the mistake of practicing "pixel analytics," Jonas said, in which they try to gather too much information from a single data point. The problem is that if too much analysis is done too soon, "you don't have enough context" to make sense of the data, he said.

Context, Jonas explained, means looking at what is around the bit of data, in addition to the data itself. By doing too much stripping and filtering of seemingly useless data, one can lose valuable context. When you see the word "bat," you look at the surrounding data to see what kind of bat it is, be it a baseball bat, a bat of the eyelids or a nocturnal creature, he said.
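Jonas' "bat" example can be illustrated with a few lines of Python. This is only a minimal sketch, not anything Jonas or IBM presented: the cue-word lists and the guess_sense function are hypothetical, and a real system would learn such associations from data rather than hard-code them. The sketch scores each possible meaning by how many of its cue words appear around the target word.

```python
# Hypothetical cue words for each sense of "bat" (an assumption for this
# sketch; a real system would learn these from data).
SENSE_CUES = {
    "baseball bat": {"baseball", "swing", "pitch", "hit", "ball"},
    "bat of the eyelids": {"eyelid", "eyelids", "eye", "lashes", "wink"},
    "nocturnal animal": {"cave", "wings", "night", "nocturnal", "fly"},
}


def guess_sense(sentence, target="bat"):
    """Pick the sense whose cue words overlap most with the surrounding words."""
    words = [w.strip(".,!?\"'").lower() for w in sentence.split()]
    if target not in words:
        return None
    context = set(words) - {target}
    scores = {sense: len(cues & context) for sense, cues in SENSE_CUES.items()}
    best = max(scores, key=scores.get)
    # With no supporting context, refuse to guess.
    return best if scores[best] > 0 else None


print(guess_sense("He took a swing with the bat and hit the ball"))
print(guess_sense("The bat left its cave at night"))
print(guess_sense("bat"))
```

Fed the word on its own, the function returns None: with the context stripped away, there is nothing left to decide with.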

"Low-quality data can be your friend. You'll be glad you didn't over-clean it," Jonas said. Google, for instance, reaps the benefits of this approach. People who misspell a word in a search query will often get a "did you mean this?" suggestion, along with results for the word Google surmises they intended. Google makes that guess by drawing on a backlog of incorrectly typed queries.
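How a backlog of mistyped queries might feed such suggestions can be sketched in Python. This is an illustration under stated assumptions, not a description of Google's actual method: the log format (a typed query paired with the query the same user entered right afterward) and the names REFORMULATION_LOG, build_suggestions and did_you_mean are all hypothetical.

```python
from collections import Counter, defaultdict

# Hypothetical log of (typed query, query the same user retyped right after).
# The misspellings are kept on purpose: they are the useful signal.
REFORMULATION_LOG = [
    ("puzle", "puzzle"),
    ("puzle", "puzzle"),
    ("puzle", "puzzles"),
    ("analitics", "analytics"),
]


def build_suggestions(log):
    """Map each typed query to the correction users most often retyped."""
    follow_ups = defaultdict(Counter)
    for typed, retyped in log:
        if typed != retyped:
            follow_ups[typed][retyped] += 1
    return {
        typed: counts.most_common(1)[0][0]
        for typed, counts in follow_ups.items()
    }


def did_you_mean(query, suggestions):
    """Return a suggested correction, or None if the log offers none."""
    return suggestions.get(query)


suggestions = build_suggestions(REFORMULATION_LOG)
print(did_you_mean("puzle", suggestions))
print(did_you_mean("puzzle", suggestions))
```

Had the misspelled entries been cleaned out of the log beforehand, the lookup table would be empty and there would be nothing to suggest.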

With puzzles, users first concentrate on assembling one piece with another. Over time, they create small clumps of data, which they can then figure out how to connect to finish the puzzle. The edges and the corners are assembled fairly quickly. What in effect happens is that, as progress on the puzzle proceeds, "you are making faster quality decisions than before," Jonas said. "The computational costs to figure out where a piece goes declines."

Watching his teams put together the faulty puzzles, he noticed a number of interesting traits. An obvious one is that the larger the puzzle, the more time it takes to complete. "As the working space expands the computational effort increases," he said. Ambiguity also increases computational complexity. Puzzle pieces that shared the same colors and shapes were harder to fit together than those with distinct details.

"Excessive ambiguity really drives up the computational cost," Jonas said.

Jonas was also impressed with how little information someone needed to get an idea of the image that the puzzle held. After assembling only four pieces, one of his teams was able to guess that its puzzle depicted a Las Vegas vista. "That is not a lot of fidelity to figure that out," he said. Having only about 50 percent of the puzzle pieces fitted together provided enough detail to show the outline of the entire puzzle image. This is good news for organizations unable to capture all the data they are studying -- even a statistical sampling might be enough to provide the big picture, so to speak.

"When you have less than half the observation space, you can make a fairly good claim about what you are seeing," Jonas said.
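The sampling point can be tried out with a toy example in Python. It is a sketch under invented assumptions, not a reconstruction of Jonas' experiment: the puzzle is reduced to a list of hypothetical dominant colors, one per piece, and the color names and proportions are made up for illustration.

```python
import random
from collections import Counter


def color_shares(pieces):
    """Return each color's share of the given pieces."""
    counts = Counter(pieces)
    total = len(pieces)
    return {color: n / total for color, n in counts.items()}


# Hypothetical 1,000-piece puzzle, described only by each piece's dominant color.
puzzle = ["sky"] * 500 + ["neon"] * 300 + ["desert"] * 200

rng = random.Random(42)  # fixed seed so the sketch is repeatable
half = rng.sample(puzzle, len(puzzle) // 2)  # only half the pieces are observed

full_picture = color_shares(puzzle)
estimate = color_shares(half)

for color, share in full_picture.items():
    print(color, share, round(estimate.get(color, 0.0), 3))
```

The shares estimated from half the pieces should land close to the true ones, which is the sense in which a partial sample can still convey the overall picture.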

Studying how his teams finished the puzzles also gave Jonas a new appreciation for batch processing, he said.

The key to analysis is a mixture of streaming and batch processing. The Apache Hadoop data framework is designed for batch processing, in which a lot of data in a static file is analyzed. This is different from stream processing, in which a continually updated string of data is observed. "Until this project, I didn't know the importance of the little batch jobs," he said.

Batch processing is a bit like "deep reflection," Jonas said. "This is no different than staying at home on the couch mulling what you already know," he said. Instead of just staring at each puzzle piece, participants would try to understand what the puzzle depicted, or how larger chunks of assembled pieces could possibly fit together.

For organizations, the lesson should be clear, Jonas explained. They should analyze data as it comes across the wire, but such analysis should be informed by the results generated by deeper batch processes, he said.
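One way to picture that arrangement is the following Python sketch. It is a hypothetical illustration rather than IBM's design or anything tied to Hadoop: batch_baseline stands in for the periodic "deep reflection" over stored data, stream_flag judges each new value as it arrives using what the batch step learned, and the threshold of three standard deviations is an arbitrary choice.

```python
from statistics import mean, pstdev


def batch_baseline(history):
    """Batch step: reflect on everything already stored."""
    return {"mean": mean(history), "stdev": pstdev(history)}


def stream_flag(events, baseline, threshold=3.0):
    """Streaming step: judge each event on arrival, informed by the batch result."""
    for value in events:
        if baseline["stdev"] == 0:
            unusual = value != baseline["mean"]
        else:
            distance = abs(value - baseline["mean"]) / baseline["stdev"]
            unusual = distance > threshold
        yield value, unusual


# Hypothetical readings: stored history for the batch job, then live events.
history = [10, 12, 11, 9, 10, 11, 10, 9]
baseline = batch_baseline(history)

for value, unusual in stream_flag([10, 30, 11], baseline):
    print(value, "unusual" if unusual else "normal")
```

In practice the batch step would be rerun from time to time so that the streaming check keeps pace with newly accumulated context.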

Jonas' talk, while seemingly irreverent, actually illustrated many important lessons of data analysis, said Seth Grimes, an industry analyst focusing on text and content analytics who attended the talk. Among the lessons: Data is important, context accumulates, and real-time streams of data should be augmented with deeper analysis.

"These are great lessons, communicated really effectively," Grimes said.

Joab Jackson covers enterprise software and general technology breaking news for The IDG News Service. Follow Joab on Twitter at @Joab_Jackson. Joab's e-mail address is Joab_Jackson@idg.com
