
Using Google Refine to Clean a Data Set - craig552uk
http://craig-russell.co.uk/2012/06/21/using-google-refine.html
======
richardv
I found this really helpful. Not so much the actual article, but the fact that
it made me aware of Google Refine.

Installation was a breeze. I couldn't find any instructions, but it was as
simple as downloading the Linux package, extracting it, then running the shell
script.

[http://code.google.com/p/google-
refine/downloads/detail?name...](http://code.google.com/p/google-
refine/downloads/detail?name=google-refine-2.5-r2407.tar.gz&can=1&q=)

The application automatically opens in a new Chrome window.

From here, I grabbed a data dump from one of our external providers.

We work with a lot of providers who are _really_ technologically challenged.
I'd love to be able to say: here you are, here is our API, start pushing your
content to us. But in practice they don't even know what their XML feeds do.
We need their data, but getting a consistent dataset from them when they seem
to change their format regularly is a pain! And when importing only 10 or so
items at a time it's excruciatingly painful.

Today I learnt how easy that can be with Google Refine!

~~~
bockris
Also check out Data Wrangler. <http://vis.stanford.edu/wrangler/>

It focuses on more mechanical transformations but has the ability to save the
steps to a program which you can then use in a process pipeline.

(disclaimer, I haven't played with it in a few months so this is from memory)

~~~
craig552uk
GR also allows you to export/import steps for reuse. Though I think
DataWrangler is easier to integrate into an automated pipeline.

~~~
etrain
Yeah, Wrangler will turn your "manipulation" steps into scripts in
Python/MapReduce/Javascript.

------
danso
As a data analyst-type-person, I can't recommend enough the use of Google
Refine. When someone told me about it, I thought "that's dumb, I would just
write a cleaning/regex script and connect to my DB"...but tried it out anyway,
because my colleague is a much better power programmer than I am.

That's how good Refine is...it adds an extra, GUI-driven step to the workflow,
but it's so well executed that it makes data exploration (and cleaning)
effortless.

I wrote a tutorial awhile back about how I used it in an investigative
reporting project: [http://www.propublica.org/nerds/item/using-google-refine-
for...](http://www.propublica.org/nerds/item/using-google-refine-for-data-
cleaning)

------
frankc
Is this worth looking into for someone who already knows perl, R and the unix
zoo? Or is it more targeted at people who don't deal with data on a regular
basis?

~~~
craig552uk
It's another string to the bow - so why not?

Have a quick look over the screencasts. If you're familiar with those tools
you'll map the concepts pretty quickly.

<http://code.google.com/p/google-refine/>

------
guard-of-terra
I wonder why they won't let you open local files without passing their
content through the browser. That would be very useful when running it locally.

~~~
craig552uk
Google's all about the web. Even local apps are web apps in their world.

Which isn't to say that's a bad idea...

~~~
eli
Actually this project was an acquisition. That's how it worked when they
bought it from the Freebase guys.

~~~
craig552uk
I knew about the Freebase integration but didn't know it came from them.
Thanks!

------
dpcx
This seems, on the surface at least, very similar to what ScraperWiki is
trying to do, by converting messy publicly available data into a more
structured format.

Am I correct in that understanding, or did I miss the boat?

~~~
craig552uk
ScraperWiki is a data aggregator/publisher. Google Refine is a general purpose
tool for cleaning up data sets.

------
chucknelson
Not very impressive for people who work with data sets often and probably have
tools like SAS or Excel, but good to know it exists as a free alternative.

~~~
guard-of-terra
This thingy opens a 2.3-million-line file in something under a minute. Excel
would choke, I guess.

Still, the unix toolset (awk, grep, sort) beats both for most tasks and for
huge data sets.
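The kind of one-liner meant here might look something like this: a quick
dedupe-and-reshape pass over a messy delimited file. The data and the field
layout are entirely hypothetical, just to show the pattern.

```shell
# Toy messy data: duplicates and a blank line (hypothetical sample).
# grep strips blank lines, sort -u dedupes, awk reshapes the columns.
printf '2,bob\n1,alice\n\n2,bob\n' \
  | grep -v '^$' \
  | sort -u \
  | awk -F, '{print $2 " (" $1 ")"}'
```

On real multi-gigabyte files this streams line by line, which is exactly why
it scales where a spreadsheet won't.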

~~~
craig552uk
Yeah, GR choked on a 7GB file the other day; I had to chop it up in the shell
and import each piece in turn.

That's when the export/import processing steps feature comes in handy.
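One way to do that chopping, sketched here with a tiny stand-in file (the
filename, chunk size and data are hypothetical): keep the header line aside,
split the body into fixed-size chunks with `split`, then prepend the header to
each chunk before importing it.

```shell
# Stand-in for a multi-GB export (hypothetical data; a real file
# would use a much larger chunk size, e.g. split -l 500000).
printf 'id,val\n1,a\n2,b\n3,c\n4,d\n' > big_export.csv

# Keep the header, chop the rest into 2-line chunks,
# then prepend the header to each chunk for import.
head -n 1 big_export.csv > header.csv
tail -n +2 big_export.csv | split -l 2 - chunk_
for f in chunk_*; do cat header.csv "$f" > "import_$f"; done
```

Each `import_chunk_*` file is then a small, self-describing data set you can
feed to Refine one at a time, replaying the same saved processing steps on each.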

