Google releases data cleanser

Google's Refine 2.0, formerly Freebase Gridworks, cleans up messy data sources, links them together

Google has updated and re-released open-source software for cleaning, analyzing and transforming data sets, now called Google Refine.

The software, originally called Freebase Gridworks, came with Metaweb, a company Google purchased in July.

Google Refine is a collection of tools that could come in handy when extracting useful information from a data set, particularly one that contains inconsistencies.

This desktop application can, for instance, find all the variant spellings of a word in a data set and replace them with the appropriate term. This process, called normalization, is nothing new. But normalizing data usually requires writing code that is specific to one data set, noted Christopher Groskopf, a developer for the Chicago Tribune.

"The genius of Gridworks is that it is generic enough to work for a wide variety of data sets without the need to write any code at all. Even better the resulting operations are portable, so the process used to clean up 2009's data can be repeated for 2010," Groskopf wrote in a blog post.
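For readers curious what normalization looks like when it is written by hand, the following Python sketch shows one common approach: grouping values by a "fingerprint" key and replacing each with the most frequent spelling in its group. It is an illustration only and is not taken from Refine's source code. The function names `fingerprint` and `normalize` are hypothetical.

```python
import re
import unicodedata
from collections import Counter, defaultdict


def fingerprint(value: str) -> str:
    """Build a key that ignores case, punctuation, accents and word order."""
    # Decompose accented characters and drop the combining marks.
    decomposed = unicodedata.normalize("NFKD", value)
    unaccented = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    # Lowercase, remove punctuation, then sort the unique tokens.
    cleaned = re.sub(r"[^\w\s]", "", unaccented.lower())
    return " ".join(sorted(set(cleaned.split())))


def normalize(values: list[str]) -> list[str]:
    """Replace each value with the most common spelling that shares its key."""
    clusters = defaultdict(Counter)
    for value in values:
        clusters[fingerprint(value)][value] += 1
    # Pick the most frequent variant in each cluster as the canonical form.
    canonical = {key: counts.most_common(1)[0][0] for key, counts in clusters.items()}
    return [canonical[fingerprint(value)] for value in values]
```

Because `normalize` makes no assumptions about any one file, the same call could be run on a 2009 list and then on a 2010 list, which is the kind of portability Groskopf describes.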

The software contains a number of other tools as well. It includes an expression language that can be used to analyze a set of data. Filters can be used to isolate subsets of data, which then can be analyzed or changed through a set of transform commands.
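The filter-then-transform pattern can be sketched in a few lines of Python. This is not Refine's expression language, whose syntax the article does not describe. The helper `apply_transform` and the sample column names are assumptions made for illustration.

```python
from typing import Callable

Row = dict[str, str]


def apply_transform(
    rows: list[Row],
    keep: Callable[[Row], bool],
    column: str,
    transform: Callable[[str], str],
) -> list[Row]:
    """Return copies of the rows, changing `column` only where `keep` matches."""
    result = []
    for row in rows:
        new_row = dict(row)  # copy so the input rows are left untouched
        if keep(row):
            new_row[column] = transform(row[column])
        result.append(new_row)
    return result


# Hypothetical sample data: tidy the city name for Illinois rows only.
rows = [
    {"state": "il", "city": " chicago "},
    {"state": "ny", "city": "New York"},
]
cleaned = apply_transform(
    rows,
    keep=lambda r: r["state"] == "il",
    column="city",
    transform=lambda s: s.strip().title(),
)
```

The filter (`keep`) isolates the subset and the transform touches only those rows, leaving the rest of the data set unchanged.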

The software works with plain text files in which the data is split into columns by commas. Results can be exported in the JSON (JavaScript Object Notation) format, which can then be easily transformed into HTML tables or other formats.
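As a rough picture of that round trip, the Python sketch below reads comma-separated text and writes the rows out as JSON using only the standard library. It assumes the first line of the file holds the column names. The function name `csv_to_json` is hypothetical and has no connection to Refine's own exporter.

```python
import csv
import io
import json


def csv_to_json(csv_text: str) -> str:
    """Convert comma-separated text (with a header row) to a JSON array of objects."""
    reader = csv.DictReader(io.StringIO(csv_text))
    # Each row becomes one JSON object keyed by the column names.
    return json.dumps(list(reader), indent=2)
```

Each row becomes a JSON object keyed by column name, a shape that is straightforward to loop over when building an HTML table.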

The software can work with up to a few hundred thousand rows per data set, depending on the user's computer memory. And unlike most spreadsheet software, this software can interactively transform large subsets of data, the company asserted.

Google said this week that it has added several new features to the software, officially called Google Refine 2.0, including the ability to link records to other databases, and a number of new transformation commands and expressions.

The non-profit government watchdog organization ProPublica has used this software to aggregate data from seven different data sets to show how pharmaceutical companies pay doctors to recommend certain medications.

Joab Jackson covers enterprise software and general technology breaking news for The IDG News Service. Follow Joab on Twitter at @Joab_Jackson. Joab's e-mail address is


