
For ‘Big Data’ Scientists, Hurdle to Insights Is ‘Janitor Work’ - rgejman
http://www.nytimes.com/2014/08/18/technology/for-big-data-scientists-hurdle-to-insights-is-janitor-work.html?hpw&action=click&pgtype=Homepage&version=HpHedThumbWell&module=well-region&region=bottom-well&WT.nav=bottom-well
======
kiyoto
While I am all for companies and software that expedite data cleaning, I
don't think any software can cleanly solve this problem in the near future.

1\. Data hygiene issues show up inconsistently, even within the same data
source. I experienced this first-hand as a quant trader: I had only half a
dozen well-structured time series coming from our trading apps and switches.
You would think I could get reasonably clean data every day, or, failing
that, automate the data cleaning script. Nope! I experienced every data
hygiene issue imaginable, from ntpd being broken, local time/UTC
inconsistencies, and switch firmware acting up, all the way to human errors
in the ETL process. Building a reliable data pipeline has so many moving
parts that I am fairly pessimistic about any one-stop solution.

2\. Then there is the volume issue. If the data is small enough, humans can
correct it fairly reliably, especially with some help from automation
scripts/software. Doing it at scale, however, is a very hard problem. One
thing I learned working at a big data company is that many folks use
MapReduce as a data cleanser at scale, and for that, MapReduce is a pretty
awkward tool (see the sketch after this list).

3\. Anecdotal evidence: I have talked to employees at various big data
platform/software companies, and though they have wide-ranging opinions on
stream processing, Hadoop, Spark, etc., they all agree that data cleaning is
a huge, deal-slowing (and sometimes deal-killing), unsolved problem that
their employers semi-solve by increasing Sales Engineer headcount. If solving
data cleaning at scale were easy, I feel someone would have come up with a
very effective answer by now (and as an industry insider, I should have heard
about it).
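
To make point 2 concrete, here is a minimal sketch of the "MapReduce as a
data cleanser" pattern, written as a Hadoop Streaming mapper in Python. The
tab-separated field layout is hypothetical:

    #!/usr/bin/env python3
    # Map-only cleaning job: normalize records, drop unparseable ones.
    # Assumes tab-separated lines of (timestamp, symbol, price).
    import sys

    for line in sys.stdin:
        fields = line.rstrip("\n").split("\t")
        if len(fields) != 3:
            continue                      # drop structurally broken records
        ts, symbol, price = fields
        try:
            price = float(price)
        except ValueError:
            continue                      # drop records with garbage prices
        print(f"{ts}\t{symbol.strip().upper()}\t{price:.4f}")

It works, but notice how little of the real trouble (timestamps from a broken
ntpd, local time/UTC mixups) it can catch; that is what makes MapReduce an
awkward cleanser.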

~~~
rpedela
I completely agree. However, I think there is also plenty of room for
improvement in the ETL process that doesn't necessarily require a one-stop
solution.

For example, the common-denominator data format for many use cases (not all)
is a relational database. Why isn't there something that can grab data from
anywhere, in any common format, and import it into a relational database with
only a couple of commands or mouse clicks? Right now you either have to pay a
lot of money for such a tool, which usually doesn't work very well, or piece
together the various data conversion tools with scripts. There are of course
many other steps in the ETL process, but that alone would save a lot of time.
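
The closest thing I've found to that "couple of commands" experience is
pandas plus SQLite; a rough sketch (the file and table names are
placeholders):

    import sqlite3
    import pandas as pd

    # Read any common format: read_csv, read_json, read_excel, ...
    df = pd.read_csv("orders.csv")

    # Load it into a relational database in one call.
    with sqlite3.connect("warehouse.db") as conn:
        df.to_sql("orders", conn, if_exists="replace", index=False)

But even this punts on the hard parts: type inference, encodings, and schema
drift between files.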

~~~
kiyoto
>However I think there is also plenty of room for improvement in the ETL
process that doesn't necessarily mean a one-stop solution.

I agree with you here as well. I recently wrote a blog article about this, at
least for log data:
[http://www.fluentd.org/blog/unified-logging-layer](http://www.fluentd.org/blog/unified-logging-layer)

~~~
rpedela
Very cool! I guess my particular ETL complaint above is solved for log data.
:)

------
apeconmyth
Speaking for the data janitors of the world, I can't believe there wasn't a
mention of hiring some of us to do this so-called dirty work, if for nothing
else than to save these data scientists from using the word sexy.

After twenty years of financial operations, including a decade in the back-
office of a top hedge fund, I eventually accepted that I'm a bit backwards for
my desire not to jump straight to a pivot table when encountering a new data
set. Like a farmer reaching down to touch the soil, my first step is exploring
the rows and columns with little tests here and there to find weaknesses
within the information at hand rather than paving over them with instantly
flawed reporting.

Anyway, for all the big data scientists out there too sexy to clean up their
own data, I'm looking for work and don't mind pushing a broom. Check my
profile for more info.

~~~
deathanatos
> Like a farmer reaching down to touch the soil, my first step is exploring
> the rows and columns with little tests here and there to find weaknesses
> within the information at hand rather than paving over them with instantly
> flawed reporting.

This needs to be done by more people, more often. I've worked with more than
one (often SQL, but it doesn't matter¹) database where someone will claim
that X is always true. If X is something that can be put into a DB query,
then _you should do that, and run the query_. Every time. In my experience,
if it isn't enforced by constraints in the database software, then it isn't
true.

¹The nice thing about SQL is that you can put constraints on the data, and
use the database to keep various invariants invariant.

Often, you'll find people making assumptions about the data based on the
business logic, not the constraints in the database. Such thinking is flawed:
you cannot reason about what you think the data is, you must reason from what
the data actually is.

Worse, you'll find people making assumptions about what they _think_ the
business logic is.
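
A minimal sketch of "put the claim into a query and run it" (the table and
column names are made up):

    import sqlite3

    conn = sqlite3.connect("app.db")
    # Claim to verify: "every user has an email address."
    violations = conn.execute(
        "SELECT COUNT(*) FROM users WHERE email IS NULL OR email = ''"
    ).fetchone()[0]
    print(f"{violations} rows violate the claimed invariant")

In my experience, the count is rarely zero.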

------
bane
It turns out this problem shows up even for relatively small datasets, things
that would easily fit into memory on any consumer-level laptop from a couple
of years ago. I did some contract work for the NYPD a few years ago, and the
primary data source I had to deal with was a block of free text describing a
911 call. All they really wanted was some frequency counts on specific terms
of interest, which sounds easy enough.

Oh, but their data warehouse (nobody knew what it was actually called)
inserted a random space every so often, and removed random spaces every so
often, making the text completely unreliable. Combined with all the normal
misspellings, fat-fingered acronyms, slang, and other nonstandard ways of
writing things, it turned out to be a huge job.

"Bicycle stolen at 0930 from residence on the 300 block of E 10th St, East
Village"

would get turned into "Bicycle stol en a t093 0 from r es iden ceon the 300
block of E 10 thst, Ea st Vil lage."

Good luck with that.

~~~
arthurjj
How did you end up dealing with the issue? My first idea would be to remove
all the spaces and parse that
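
Something like this, assuming you could build a domain wordlist from street
names, codes, and common terms (the vocabulary here is made up):

    def segment(text, vocab, max_len=10):
        """Strip the unreliable spaces, then re-segment by DP over vocab."""
        s = "".join(text.lower().split())
        best = [None] * (len(s) + 1)      # best[i]: words covering s[:i]
        best[0] = []
        for i in range(1, len(s) + 1):
            for j in range(max(0, i - max_len), i):
                if best[j] is not None and s[j:i] in vocab:
                    best[i] = best[j] + [s[j:i]]
                    break
        return best[-1]                   # None if no segmentation exists

    vocab = {"bicycle", "stolen", "at", "0930", "from", "residence"}
    print(segment("Bicycle stol en a t093 0 from r es iden ce", vocab))
    # ['bicycle', 'stolen', 'at', '0930', 'from', 'residence']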

~~~
bane
After contemplating the issue for a few days, I told them they needed to talk
to their data warehouse provider and fix it before we could proceed.

------
dj-wonk
It is worse than the article says: the data cleaning is often neither tracked
nor reproducible. First, think about how much data cleaning is not done by a
script or versioned at all; many people just overwrite the bad data in place.
Second, of the data cleaning/ETL that is scripted or partially automated,
think about how much is not saved, much less version controlled. As a result,
if you ever want to go back and reproduce or share your analysis, good luck!

Doing it better is simple: if you do ETL, version control your code. If you
do hand edits, track the before and after.
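
For hand edits, even something as small as diffing snapshots is a start; a
minimal sketch with pandas (the file names are placeholders, and compare()
assumes the edit didn't add or drop rows):

    import pandas as pd

    before = pd.read_csv("customers.before.csv")
    after = pd.read_csv("customers.after.csv")

    changed = before.compare(after)        # cells that differ, side by side
    changed.to_csv("customers.audit.csv")  # keep the audit trail with the data
    print(f"{len(changed)} rows were hand-edited")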

~~~
joeclark77
Is anyone actually doing that? I wonder if there's a profession, or if there
ought to be, specifically for "data wrangling". On the other hand, maybe you
can't separate it from the work of the data scientist... perhaps the data
scientist needs to be involved in the data wrangling in order to understand
the source better?

~~~
romming
> perhaps the data scientist needs to be involved in the data wrangling in
> order to understand the source better?

Founder of an ETL startup here. This is exactly what we believe: the end-user
of the data should be involved as early in the data pipeline as possible,
including the wrangling. If you eliminate the engineer from the ETL process,
you remove a lot of painful back-and-forth and get more flexible pipelines.

~~~
joncooper
I'd love to hear about your startup and/or beta test.

~~~
romming
Hit us up at info at etleap dot com.

------
cinquemb
One issue that was very time-consuming for me until recently: the machines we
had mining data upstream, across many clusters of VMs, would sometimes spit
out malformed files that needed to be fed downstream for indexing.

Each file had about 100k objects, and a malformed file would contain fewer
than 10 bad objects on average. Fixing them was easy (but annoying) at first,
but scaling up the clusters made it exponentially harder to do by hand in the
terminal.

It was conceptually easy to solve once I abstracted the problem a bit. I had
a rough idea of where the data came from on the mining servers (URLs, OCR'd
PDFs, etc.), and there are plenty of JSON-parsing libraries that report where
in the file, down to the byte, an error occurs. From there it was a matter of
traversing the file forwards and backwards from the error (in memory;
luckily the files are only ~20MB each, though they crashed every text editor
I tried before I turned to programmatic search-and-replace), looking for
anything that could help me reconstruct the malformed object, or removing the
object if not enough information was available. And since the error rate was
only ~10 in 100,000, I could usually reconstruct a bad object by inference
from the objects near it.
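
A rough sketch of the core of it: let the parser report the error offset
(json.JSONDecodeError.pos in Python), then cut out the enclosing object. This
assumes a flat array of non-nested objects, which was roughly my case:

    import json

    def drop_bad_objects(raw):
        while True:
            try:
                return json.loads(raw)
            except json.JSONDecodeError as e:
                start = raw.rfind("{", 0, e.pos + 1)  # opening brace before error
                end = raw.find("}", e.pos)            # closing brace after error
                if start == -1 or end == -1:
                    raise                             # not recoverable this way
                raw = raw[:start] + raw[end + 1:]
                # tidy up the dangling comma the cut leaves behind
                raw = raw.replace(",,", ",").replace("[,", "[").replace(",]", "]")

Reconstructing an object instead of dropping it works the same way, except
you patch raw rather than cutting it.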

------
tezka
Kathleen Fisher et al. did some exciting PL-oriented research on automatic
tool generation for ad hoc, semi-structured data at AT&T Research a while
ago. Take a look at their TOPLAS and POPL papers.

The Next 700 Data Description Languages:
[http://www.padsproj.org/papers/ddcjournal_preprint.pdf](http://www.padsproj.org/papers/ddcjournal_preprint.pdf)

From Dirt to Shovels, Fully Automatic Tool Generation from Ad Hoc Data:
[http://www.padsproj.org/papers/popl08.pdf](http://www.padsproj.org/papers/popl08.pdf)

The website for the project is:
[http://www.padsproj.org/](http://www.padsproj.org/)

------
geebee
I suspect this kind of janitor work is why I gravitate toward Python. Even in
a nice, controlled environment like Kaggle, it's remarkable how much
string/text manipulation you have to do just to get a basic benchmark or cram
stuff into a random forest - and that's when the data has already been put
into a text file (in my case, a CSV) for you. My guess is that it would take
10 times more data munging to get that file produced in the first place.
Python has good libraries and is very well suited to all that munging.
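
A typical Kaggle-flavored example of what I mean (the column names are made
up):

    import pandas as pd
    from sklearn.ensemble import RandomForestClassifier

    df = pd.read_csv("train.csv")
    df["name"] = df["name"].str.strip().str.lower()        # normalize text
    df["age"] = pd.to_numeric(df["age"], errors="coerce")  # coerce bad values to NaN
    df["age"] = df["age"].fillna(df["age"].median())       # impute missing values
    X = pd.get_dummies(df[["age", "category"]], columns=["category"])
    clf = RandomForestClassifier().fit(X, df["label"])

Four of the six working lines are munging, and this is the easy case where
someone already handed you a CSV.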

Trying to outsource this stuff is appealing, I suppose, but that has turned
out in my life to be surprisingly difficult as well. I once worked with a
supply planning division of a manufacturing organization to try to reduce
inventory by anticipating when orders would come in, so I got access to the
sales database. It appeared that there was a large spike in orders late in
the quarter, and that we were carrying inventory unnecessarily. Actually, as
we discovered during a pilot, salespeople who got new orders from the same
customer later in the quarter would just delete and re-enter the entire order
rather than make small updates, which they found to be a hassle (this was
quite a while ago, when these systems weren't as easy to use).

These attempts to section off and outsource the "low value" work often just
make things worse. It's just too unpredictable.

------
adlersantos
"Janitor Work" is an "it depends" situation. I agree with the article that
data wrangling is counter-productive, where the 80/20 ratio for
wrangling/analysis can still be improved. But I disagree with the notion that
doing such work never leads to a data scientist's technical growth. Such dirty
work allows data scientists to be more effective because they can take data of
any form, know its shortcomings, and generate insights from it with proper
scope. Those who would want to rely on 'clean' data always handed down to
them, I must say, are spoiled.

On a side note, it's narcissistic and cocky to use the word "janitor" for such
an issue. It's not like data scientists should only be worthy of doing the
illustrious part of the job and never the dirty tasks, right?﻿ I'd still go
and take the broom and clean up the mess when no one else can.

------
rgejman
Data cleaning/munging is a surprisingly pervasive problem in my field ('big
data' cancer research). I spend a good fraction of my time hunting down bugs
caused by dirty sources upstream of my scripts. One way to fix this is to
become more rigorous about using tools and libraries that check and enforce
data correctness and consistency... but it's hard to do this on an individual
level. Even if I'm careful about ensuring the correctness of my data, other
researchers who send me data may not be as scrupulous. As with any challenge
that requires teamwork and constant vigilance (e.g. unit testing), perhaps the
data science community needs to invest in tools and computational frameworks
that constantly monitor data for correctness and consistency using DDLs and
the like.
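
A minimal sketch of the kind of check-on-ingest I have in mind; the rules and
column names are hypothetical, not any particular tool's API:

    import pandas as pd

    CHECKS = {
        "no missing sample ids": lambda df: df["sample_id"].notna().all(),
        "expression values non-negative": lambda df: (df["expr"] >= 0).all(),
        "one row per sample": lambda df: df["sample_id"].is_unique,
    }

    def validate(df):
        failures = [name for name, check in CHECKS.items() if not check(df)]
        if failures:
            raise ValueError(f"data checks failed: {failures}")
        return df

    df = validate(pd.read_csv("incoming_expression_matrix.csv"))

Run at every handoff between collaborators, something like this would catch a
lot of the dirty-source bugs before they reach my scripts.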

~~~
mrdmnd
My team at Airbnb is working on a tool called "Salus" - named after the Roman
goddess of hygiene - that allows you to check the correctness of your data in
a number of common, extensible ways. Perhaps this is something that the open-
source community at large might be interested in?

~~~
rgejman
Interesting. I'd love to hear the details.

------
michaelmior
I quite like the term "data civilian."

~~~
justinph
Really? It made me think that Monica Rogati, the speaker of said term, is very
full of herself and her profession.

~~~
michaelmior
Having met Monica briefly at a conference, I don't think that's the case. But
I can understand why that would give the impression.

------
daemonk
It's a fine line between cleaning data and quantizing data. Spelling
mistakes, format error correction, and consistent formatting are all
"janitorial".

However, categorizing data into the correct groups, removing what you
consider "non-essential", or simply rounding off decimal numbers can all have
an impact on the downstream analysis.

In some ways data science is pretty much all about learning best practices for
quantizing data.

------
fiatmoney
And of course, once your data is clean, if you actually want to be able to
predict anything you get to spend the time tinkering with model structure,
feature extraction techniques, hyperparameter tuning, data size / model size
tradeoffs, and a whole bunch more yeoman's work. Depending on your temperament
you might find this boring as sin.

------
scottlocklin
The fact that nobody mentioned Quandl in this article is criminal negligence.

