Google releases data cleanser

Google's Refine 2.0, formerly Freebase Gridworks, cleans up messy data sources, links them together

Google has updated and re-released open-source software for cleaning, analyzing and transforming data sets, now called Google Refine.

The software, originally called Freebase Gridworks, came with Metaweb, a company Google purchased in July.

Google Refine is a collection of tools for extracting useful information from a data set, particularly one that contains inconsistent data.

This desktop application can, for instance, find all the variant spellings of a word in a data set and replace them with the appropriate term. This process, called normalization, is nothing new. But normalizing data usually requires writing code that is specific to one data set, noted Christopher Groskopf, a developer for the Chicago Tribune.

"The genius of Gridworks is that it is generic enough to work for a wide variety of data sets without the need to write any code at all. Even better the resulting operations are portable, so the process used to clean up 2009's data can be repeated for 2010," Groskopf wrote in a blog post.
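For comparison, the kind of one-off normalization script Groskopf describes might look something like the Python sketch below. It is only an illustration and not how Google Refine works internally: the function names, the matching rule (ignore case, punctuation and extra spaces) and the sample city names are all assumptions made for this example.

```python
from collections import Counter


def normalize_key(value: str) -> str:
    """Collapse case, punctuation and extra whitespace so variants compare equal."""
    cleaned = "".join(ch for ch in value.lower() if ch.isalnum() or ch.isspace())
    return " ".join(cleaned.split())


def normalize_column(values):
    """Replace each value with the most common spelling among its variants."""
    groups = {}
    for value in values:
        groups.setdefault(normalize_key(value), Counter())[value] += 1
    # For each group of variants, pick the spelling that occurs most often.
    canonical = {key: counts.most_common(1)[0][0] for key, counts in groups.items()}
    return [canonical[normalize_key(value)] for value in values]


# Hypothetical sample data with inconsistent spellings.
cities = ["Chicago", "chicago", "Chicago ", "CHICAGO", "Chicago",
          "Evanston", "Evanston", "evanston."]
print(normalize_column(cities))
```

A script like this has to be adjusted whenever the matching rule or the column changes, which is the per-data-set effort the quote refers to.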

The software contains a number of other tools as well. It includes an expression language that can be used to analyze a set of data. Filters can be used to isolate subsets of data, which then can be analyzed or changed through a set of transform commands.
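The filter-then-transform idea can be sketched in a few lines of Python. This does not use Refine's expression language, whose syntax is not shown here; the helper names `select` and `transform` and the sample rows are hypothetical and exist only to illustrate the pattern.

```python
# Hypothetical rows; amounts arrive as text with thousands separators.
rows = [
    {"state": "IL", "amount": "1,200"},
    {"state": "il", "amount": "950"},
    {"state": "WI", "amount": "3,400"},
]


def select(rows, predicate):
    """Filter: keep only the rows for which predicate(row) is true."""
    return [row for row in rows if predicate(row)]


def transform(rows, column, func):
    """Transform: apply func to one column of each given row, in place."""
    for row in rows:
        row[column] = func(row[column])
    return rows


# Isolate the Illinois rows, then clean up only that subset.
illinois = select(rows, lambda row: row["state"].upper() == "IL")
transform(illinois, "state", str.upper)
transform(illinois, "amount", lambda text: int(text.replace(",", "")))
```

Because the subset shares its row objects with the full list, the changes show up in the original data while rows outside the filter are left untouched.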

The software works with plain text files in which commas separate the data into columns. Results can be exported in the JSON (JavaScript Object Notation) format, which can then be easily transformed into HTML tables or other formats.
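As a rough illustration of that path from comma-separated text to JSON to an HTML table, here is a short Python sketch using only the standard library. It is not Refine's own import or export code; the function names and the two-row sample are assumptions for the example.

```python
import csv
import html
import io
import json


def csv_to_records(text):
    """Parse comma-separated text into a list of dicts keyed by the header row."""
    return [dict(row) for row in csv.DictReader(io.StringIO(text))]


def records_to_html(records):
    """Render a list of dicts as a simple HTML table, escaping cell contents."""
    if not records:
        return "<table></table>"
    columns = list(records[0].keys())
    head = "".join(f"<th>{html.escape(col)}</th>" for col in columns)
    body = "".join(
        "<tr>" + "".join(f"<td>{html.escape(str(rec[col]))}</td>" for col in columns) + "</tr>"
        for rec in records
    )
    return f"<table><tr>{head}</tr>{body}</table>"


# Hypothetical sample input.
sample = "name,city\nAda,Chicago\nBo,Evanston\n"
records = csv_to_records(sample)
as_json = json.dumps(records, indent=2)        # the JSON export step
table = records_to_html(json.loads(as_json))   # JSON turned into an HTML table
```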

The software can work with up to a few hundred thousand rows per data set, depending on how much memory the user's computer has. Unlike most spreadsheet software, it can interactively transform large subsets of data, the company asserted.

Google said this week that it has added several new features to the software, officially called Google Refine 2.0, including the ability to link records to other databases, and a number of new transformation commands and expressions.

The non-profit government watchdog organization ProPublica has used this software to aggregate data from seven different data sets to show how pharmaceutical companies pay doctors to recommend certain medications.

Joab Jackson covers enterprise software and general technology breaking news for The IDG News Service. Follow Joab on Twitter at @Joab_Jackson. Joab's e-mail address is Joab_Jackson@idg.com
