
Ways Data Projects Fail - martingoodson
http://www.martingoodson.com/ten-ways-your-data-project-is-going-to-fail/
======
jackschultz
Good to see that data cleaning was #1 on that list. Whenever I do work on a
side project, it takes way way more time to get and structure the data than it
does running the algorithms. Granted, that's because I have to go out and get
the data in the first place, and then make sure it's useable and in the
correct format.

Take the recent project I'm doing on the data blog I write
([https://bigishdata.com](https://bigishdata.com)), trying to classify
country music songs by topic: the amount of time it's taking to scrape
lyrics, remove duplicate / incorrect songs, and then do manual classification
for training data is far longer than the time spent running the ML algorithms
at the end, once I've gone through that process.
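The dedup step alone is fiddly. A minimal sketch of one common approach, normalizing titles so variants that differ only in case, punctuation, or spacing collapse to one key (the song titles here are made up for illustration):

```python
import re

def normalize_title(title):
    """Lowercase, strip punctuation, and collapse whitespace so titles that
    differ only in case, punctuation, or spacing map to the same key."""
    title = title.lower()
    title = re.sub(r"[^\w\s]", " ", title)   # drop punctuation
    return re.sub(r"\s+", " ", title).strip()

def dedupe_songs(titles):
    """Keep the first occurrence of each normalized title."""
    seen, unique = set(), []
    for t in titles:
        key = normalize_title(t)
        if key not in seen:
            seen.add(key)
            unique.append(t)
    return unique

songs = ["Jolene", "Jolene (Live)", "jolene  live!", "Friends in Low Places"]
print(dedupe_songs(songs))
# ['Jolene', 'Jolene (Live)', 'Friends in Low Places']
```

Real lyric dedup needs more than this (stripping "(Live)"/"(Remix)" suffixes, fuzzy matching), but even the crude version removes a surprising share of scraped duplicates.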

I've been looking for jobs recently, and I've seen only one job posting that
mentions data cleaning as a necessity, whereas the rest only talk about data
science and algorithm knowledge, or overall ETL design on the data engineering
side. Seems like data set knowledge should be emphasized more.

~~~
elevensies
Any specific resources you'd recommend on data cleaning, verification,
etcetera? I've just started reading this: [https://www.amazon.com/Accuracy-
Economic-Observations-Oskar-...](https://www.amazon.com/Accuracy-Economic-
Observations-Oskar-Morgenstern/dp/0691003513) . I've seen a few other books on
the subject which I'm planning to get into, but I'd be interested if anyone
has specific recommendations.

~~~
chubot
Do you use R and the "hadleyverse"? (Or "tidyverse" I think as he prefers?)

I'm a programmer by trade but I use R because the people who _actually_ work
with data use it, and they write good tools for it... I think there is some
confusion in the programming world about this. Programmers work with data, but
they don't do it nearly as much as "professionals".

Tidy data is a good intro if you're not familiar with it:

[http://vita.had.co.nz/papers/tidy-
data.html](http://vita.had.co.nz/papers/tidy-data.html)

And I would recommend going through other publications by Wickham, all on his
site -- they are quite readable.
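The core tidy-data idea (one variable per column, one observation per row) doesn't require R; a minimal Python illustration of "melting" a wide table into long form, with example figures borrowed from the kind of country/year tables the paper uses:

```python
# A "wide" table: one row per country, one column per year. Messy by
# tidy-data standards, since "year" is a variable hidden in column names.
wide = [
    {"country": "Afghanistan", "1999": 745, "2000": 2666},
    {"country": "Brazil", "1999": 37737, "2000": 80488},
]

# Melt into tidy/long form: one row per (country, year) observation.
tidy = [
    {"country": row["country"], "year": int(year), "cases": cases}
    for row in wide
    for year, cases in row.items()
    if year != "country"
]
print(tidy[0])
# {'country': 'Afghanistan', 'year': 1999, 'cases': 745}
```

In R this is `tidyr::pivot_longer` (formerly `gather`/`melt`); in pandas it's `DataFrame.melt`. The long form is what most modeling and plotting tools actually want.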

~~~
elevensies
No, I don't use R; thanks for the reference. Most of what I've done lately has
been part of the software development process, along the lines of validating
the effectiveness of different techniques for solving a known problem with a
smallish test dataset -- typical engineering-style optimization. I'm looking
to impose more structure on the process.

------
mswen
In spite of assurances from business process owners that the underlying data
sources are clean ... they almost certainly are not.

Multiple legacy systems with no consistent cross-reference to unambiguously
identify the same customer. We were assured that the systems had been gone
through and all the names made consistent -- but consistency for a human is
not consistency for a computer. "Commers Ltd" is not the same as "Commers
Ltd." And isn't it lovely when a salesperson decides to add a location to a
customer name? Now we have "Commers Ltd Dallas" as a unique customer.
Business process discipline is often lacking and will mess you up.
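A computer's notion of "consistent" can be sketched concretely. Here's a crude normalization pass (the suffix list and the "Commers" names are illustrative); it fixes the punctuation case but, as the comment warns, appended locations still defeat it:

```python
import re

# Common legal suffixes to strip (illustrative, far from complete).
LEGAL_SUFFIXES = {"ltd", "inc", "llc", "co", "corp"}

def normalize_company(name):
    """Crude canonical key: lowercase, strip punctuation, drop legal
    suffixes. Real entity resolution needs much more than this
    (fuzzy matching, cross-reference tables, manual review)."""
    tokens = re.sub(r"[^\w\s]", "", name.lower()).split()
    tokens = [t for t in tokens if t not in LEGAL_SUFFIXES]
    return " ".join(tokens)

# Exact string comparison fails; the normalized keys agree:
print(normalize_company("Commers Ltd") == normalize_company("Commers Ltd."))
# True
# ...but a bolted-on location still creates a "new" customer:
print(normalize_company("Commers Ltd Dallas"))
# commers dallas
```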

Subscription data sources that change their schema with no notification to
paying customers. And, when you are scraping data from websites you need to
constantly be checking that your scrapers are still working properly. Source
websites change regularly.

Crazy processes like entering a negative invoice to indicate a refund to
customers but forgetting to zero out the cost of goods related to the invoice.
We may have refunded the money but we didn't do the work twice. Arggh! Errors
abound.

~~~
marcosdumay
> And, isn't it lovely when a salesperson decides to add a location to a
> customer name. Now we have "Commers Ltd Dallas" as a unique customer.

Good luck modeling data so as not to need this. I have caught myself actively
telling people to do this more than once, and I'm really thinking about
replicating some data so those changes are less disruptive.

That "Commers Ltd Dallas" probably has different billing and delivery
addresses, points of contact, invoice formats, customer representatives and
preferred sellers, product selection, and probably everything else you have in
your DB.

~~~
zo1
It's really the only option if your input/managing software cannot model the
complexity you require "after the fact". E.g. a company has multiple offices,
and delivery address is per office, not per company, etc.

The real problem is when there is chaotic, organic mixing, matching and
re-purposing. I've seen it many times with "non-technical" individuals. They
don't know what their software can do. E.g. Redmine: the support
individuals just log everything under the same "IssueType". They then
"categorize" it using a "Category" custom field, instead of the standard
category which has enumerations. And then they use that Category field to
drive reports/processes, instead of using a different IssueType or Tracker,
which is what it was designed for, and which has tools that help you
leverage/manage the complexity of different standardized processes.

Then they decide to add "sub-categories" into the Category field, instead
of using a project hierarchy or something. Then they want to do billing
reports from the time logged per X and of types A, B, C, and at that point
it's a giant mess and I stop caring. If they want to not use the software as
intended, then "fix" it by filtering and fiddling with Redmine CSV exports
in Excel afterwards, that's their problem. Oh, and they ask that everyone have
permissions to everything, allowing all users to change the status of each
IssueType as they please, without any process.

I just feel sorry for the poor individual who gets a raw extract of that
data and has to use it for something.

------
d--b
The big one that's missing: There is nothing you can conclude from your data.
It's clean, it makes its way properly to the analyst, and yet, there's just
nothing there...

~~~
eanzenberg
If done properly that's still a useful result. "None of the variables were
predictive of X."

~~~
hobofan
And its much less useful cousin hiding in the shadows: "The variables are
actually predictive of X, but you don't have enough data for it to show."

(The chance for which isn't that big when done "properly".)

~~~
huac
At least in online contexts, this often means 'there might be an effect, but
it is smaller than our experiment could have detected. Let's keep adding
samples.'

~~~
angry_octet
'... until we get the answer we're expecting'.

Just expecting an effect is a bias towards outliers.

[http://doingbayesiandataanalysis.blogspot.com.au/2013/11/opt...](http://doingbayesiandataanalysis.blogspot.com.au/2013/11/optional-
stopping-in-data-collection-p.html?m=1)
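The optional-stopping problem linked above is easy to demonstrate by simulation: even with a perfectly fair coin (no effect at all), peeking at a z-test after every batch and stopping at the first "significant" result inflates the false-positive rate far past the nominal 5%. A sketch, with made-up batch sizes:

```python
import math
import random

def peeking_experiment(rng, batches=20, batch_size=50):
    """Flip a fair coin in batches; after each batch, run a two-sided
    z-test for p != 0.5 and stop at the first |z| > 1.96. Returns True
    if we ever 'detect' an effect -- which, since the coin is fair, is
    always a false positive."""
    heads, n = 0, 0
    for _ in range(batches):
        heads += sum(rng.random() < 0.5 for _ in range(batch_size))
        n += batch_size
        z = (heads - 0.5 * n) / math.sqrt(0.25 * n)
        if abs(z) > 1.96:
            return True
    return False

rng = random.Random(42)
trials = 2000
rate = sum(peeking_experiment(rng) for _ in range(trials)) / trials
print(rate)  # well above the nominal 0.05 of a single fixed-n test
```

The fix is to pre-register the sample size, or use a sequential design (alpha spending, or the Bayesian approach the linked post advocates) that accounts for the repeated looks.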

------
danso
Absolutely agree that data cleaning should be at the top -- how someone
prioritizes data cleaning is, for me, the main litmus test for how effective
they are at real-world data problems. I also agree with how the author
summarizes the issue, but he also runs into the same issue I have: data
cleaning is such a broad term that it obscures how difficult and important a
problem it is.

For example, some people think data cleaning is "Convert 12-FEB-2012 to
2012-02-12" type problems, and can't believe that such a task would be 80 to
90% of the difficulty in data work (compared to, say, learning enough ggplot2
to make a nice chart).
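Even that "easy" class of cleaning has edges, because real columns mix formats. A sketch using the stdlib, with an illustrative (not exhaustive) format list:

```python
from datetime import datetime

def to_iso(raw):
    """Try a few date formats seen in the wild; return an ISO 8601 string,
    or None to flag the value for manual review rather than guessing."""
    for fmt in ("%d-%b-%Y", "%Y-%m-%d", "%m/%d/%Y"):
        try:
            return datetime.strptime(raw, fmt).date().isoformat()
        except ValueError:
            continue
    return None

print(to_iso("12-FEB-2012"))  # 2012-02-12
print(to_iso("not a date"))   # None
```

The hard part isn't the conversion; it's ambiguity (is 03/04/2012 March 4 or April 3?) and deciding what to do with the rows that don't parse.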

On the other side of the equation, you have people who want to do a JOIN-
GROUP-BY aggregate so they can calculate how much "evil" Wall Street money
goes to each political candidate, a la OpenSecrets's calculation [0], only to
find that the FEC does not classify campaign contributions by industry type or
company, nor is the "employer" field filled with normalized entries such as
"Evil Wall Street Company" that would lend itself to easy GROUP BY calls. For
fuck's sake, I've found that executive-level/professor folks can't even spell
"Goldman Sachs" and "Berkeley" correctly (even on a typed form).
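Misspelled self-reported employers are usually attacked by fuzzy-matching free text against a canonical list. The stdlib's difflib gives a crude version (the canonical list, cutoff, and misspelling below are illustrative):

```python
import difflib

CANONICAL_EMPLOYERS = ["GOLDMAN SACHS", "MORGAN STANLEY", "UC BERKELEY"]

def match_employer(raw, cutoff=0.8):
    """Map a free-text employer entry to a canonical name, or None.
    The cutoff is a guess: too low and distinct employers get merged,
    too high and obvious misspellings slip through."""
    hits = difflib.get_close_matches(raw.upper(), CANONICAL_EMPLOYERS,
                                     n=1, cutoff=cutoff)
    return hits[0] if hits else None

print(match_employer("Goldman Sacks"))  # GOLDMAN SACHS
print(match_employer("Self Employed"))  # None
```

Production entity resolution adds blocking, token-based similarity, and a manual-review queue for the ambiguous middle, but the shape of the problem is the same.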

And that doesn't even scratch the surface of how little this person knows
about the data question they purport to answer, or about how the FEC, the
American political system, and real life work. Among the data cleaning
problems they will have to mitigate are also the 2 hardest problems in
computer science (how things are named/classified, and how up-to-date the data
is).

I don't have any better ideas at the moment for how to break apart the
category of "data cleaning" in a way that reveals the many facets of the
problem but still preserves the interrelatedness of those facets. But it's
possible to be very good at some parts of data cleaning without knowing the
rest.

[0]
[https://www.opensecrets.org/industries/indus.php?Ind=F](https://www.opensecrets.org/industries/indus.php?Ind=F)

[1]
[http://www.fec.gov/finance/disclosure/metadata/DataDictionar...](http://www.fec.gov/finance/disclosure/metadata/DataDictionaryContributionsbyIndividuals.shtml)

~~~
throw_away_777
I've always disliked the term data cleaning for the reasons you mention - it
doesn't tell me anything about what is meant by "cleaning".

~~~
mcrad
Besides, cleaning implies an entry-level gig, something for a QA hack, not
someone experienced in complex systems. Marketing types love this kinda twist
as a way to maintain control of (and ensure failure of) the project. It's like
a scapegoat. Just got out of a meeting where marketing claims that data
"hygiene" is gonna be a priority in 2017.

~~~
zgramana
I have spent a lot of time talking with customers/prospects about this topic,
but I use "data remediation" which I feel brings more accurate and precise
connotations, to wit:

* Implies that the data is deficient/falls short of expectations.

* Implies that the shortcoming currently makes it ineligible to graduate to the next level.

* Implies that with hard work and additional time it can likely be made sufficient, though still not ideal.

* Implies that someone failed to help the data to meet expectations.

* Implies that you need special outside expertise, namely someone with the knowledge needed to assess the shortfall, possibly help you clarify your standards, design steps that when followed should result in "good enough" data, and who is able to articulate the remaining weakness(es) which need to be accounted for when assessing the future suitability of that dataset for a given purpose.

* Implies that your data will be stuck in school all summer while their friends are out having so much fun.

------
martingoodson
Author here, in case there are any criticisms or comments.

~~~
toxikitty_
A lot of data scientists these days (me included) are former academics with
backgrounds in numerical simulation in fields like chemistry, physics,
mechanical engineering etc.

They live and breathe numerical linear algebra and are comfortable reading
advanced theoretical books or papers.

It's easy for them to pick up the basics needed to pass interviews and find a
data science job. How would they go about adding some rigor to their
understanding of ML and statistics?

~~~
sidlls
I wouldn't expect the majority of data science jobs to be particularly focused
on the math behind the algorithms. Rudimentary understanding of probability
and how to translate the jargon into your academic background's jargon is more
important than deep understanding for these jobs. Passing the interviews for
these jobs is one thing. Unless you're specifically looking for jobs that
focus on generating new modeling techniques or algorithms for computational
statistics, expect to be far removed from even basic linear algebra in actual
practice. Source: me. I fall in your described bucket and have worked in data
science/machine learning jobs in both contexts (new modeling techniques/stats
versus application of off-the-shelf tools).

------
Declanomous
>Your Data Scientists are about to quit.

This is me. I work for a non-profit that is stuck in the stone age--not for
lack of money, mind you, but because the IT Director is an incompetent
megalomaniac who views "security" as a reasonable justification to refuse any
and all requests, and treats everyone like an enemy.

I haven't been allowed to use Python or R. In fact, the only programming
language I have access to is VBA (for applications, not the stand-alone
variant). Of course that's a huge mess because the IT director disables macros
once a month, generally right after another crypto attack makes the news.
Thankfully, he didn't even realize that it was possible to use VBA from inside
any office application until after I had already used it to create several
Access applications which made the jobs of the most important people in the
organization easier. So when he breaks VBA every director in the organization
yells at him and the functionality is restored nearly instantly.

Of course he could restrict the applications to run only signed macros, but he
won't give me permission to sign things because he is (literally) afraid I
might hack something.

On top of that, my computer is a Core 2 Duo from 2007 or so with 4 GB of RAM.
He bought over 100 of them used from a computer recycler about 2 years ago.
For the first three months at this job I had a Pentium D, which literally
couldn't run Excel and Firefox at the same time. I'm not allowed to get a
better computer, because the employee handbook states that every computer
needs to be the same for "security" reasons. If my director used our budget to
purchase a computer I wouldn't be granted access to any of the databases
containing our data because of "HIPAA compliance." (For the record, we don't
have any medical data whatsoever. We only have names, addresses, and donation
amounts. We don't even know the birthdays of our constituents.)

The worst part is that we randomly started losing data after all of our
network drives were moved offsite at one point to provide "redundancy." I
created several tickets about this issue, and each time I was told that it
couldn't have possibly happened, and there was no record of the file ever
existing. I created a script that created a log file each hour with a list of
files and their attributes from each directory, to try to record proof of this
happening. After I recorded about a week of files disappearing randomly
overnight, he reported me to HR for hacking.
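The hourly-manifest trick generalizes well. The commenter was limited to VBA, but the same idea in Python is a few lines (paths and schedule below are hypothetical):

```python
import csv
import os
from datetime import datetime

def snapshot(directory, log_dir):
    """Write one timestamped CSV listing every file under `directory`
    with its size and modification time. Diffing successive manifests
    shows exactly which files vanished, and when."""
    stamp = datetime.now().strftime("%Y%m%d-%H%M%S")
    out_path = os.path.join(log_dir, "manifest-" + stamp + ".csv")
    with open(out_path, "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(["path", "bytes", "mtime"])
        for root, _dirs, files in os.walk(directory):
            for name in files:
                full = os.path.join(root, name)
                st = os.stat(full)
                writer.writerow([full, st.st_size, st.st_mtime])
    return out_path

# Run hourly via cron / Task Scheduler, e.g. (hypothetical paths):
# snapshot(r"\\fileserver\shared", r"C:\manifests")
```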

Once I proved nothing I did was wrong, he amended the "IT security" section of
the employee handbook. Several of these measures were impossible to follow
because of restrictions he had placed on the computers/network. I brought this
up with HR, and they removed these measures from the handbook. Once this
happened, he sent an email to me cc'ing my boss and HR accusing me of trying
to frame him by deleting files. I don't know how that accusation even made
sense, because the files would still have to show up in transaction logs.

Despite all this, I KNOW my director and HR aren't going to believe me when I
tell them I'm quitting because our IT director is an incompetent tyrant. From
their perspective, IT issues are something that can be solved by compromise,
just like everything else. So IT has to let me use VBA, and that should be
enough.

Anyways, long story short, anybody hiring in Chicago?

~~~
jupiter90000
That sounds horrible. Ever thought of writing a VBA for data science book? ;-)

~~~
Declanomous
I've thought about putting all my code on github, just for kicks. My code is a
nightmare though, because I don't have version control, and Office VBA doesn't
allow inheritance. As such, the amount of code that has been copy-pasted is
bonkers.

That being said, the amount of VBA I've seen that doesn't work with _Option
Explicit_ is a little staggering, so maybe I shouldn't be so self-conscious.

------
intrasight
An honest question. If I can make really good money helping businesses make
sense out of "smallish" data in Excel, why would I subject myself to the
miasma that is "data science"? Will I be able to charge lawyer-like rates?

------
alecco
One more:

Moving huge amounts of data into some cloud daily and then back out might be
too slow or too expensive.

------
angry_octet
A big fallacy seems to be that it is meaningful to just use existing data in
whatever cruddy non-normalized form it comes in, and let the 'algorithms' sort
it out.

There needs to be strong management support for getting outcomes, because the
production team is going to have to change to support it, and they usually
like doing things their own way -- typically without analytics, sensible
logging formats, or a clue as to why the outside world behaves the way it does.

------
zero-x
This is great advice. I'd add that the culture around data engineering
projects tends to be very different.

I've seen companies treat data projects as if they were some great unknown
where the developers could get away with using bad patterns or none at all,
and not follow the patterns that other applications in the company use.

Technologies like Spark have made it more common and easier to develop big
data applications and implement design patterns that regular engineers can
understand and follow.

Couple a great data engineer with a great data scientist using tools like
Spark, R, H2O, Alluxio, Parquet, etc. and companies can truly exploit their
large data sets effectively.

The problem is DevOps and bridging the gap between a scientist's environment
and a production environment and keeping both as flexible and testable as
possible.

We started a company to bootstrap companies into this culture by providing
DevOps services and UIs which simplify the deployment of Kubernetes, Spark,
Druid, H2O, etc. clusters. We also provide tools and services for simplifying
and automating ETL pipelines with which models can be trained.

If you are interested in finding out more about these services contact us at:
miguel@zero-x.co.

------
zby
What I have seen in my limited practice with machine learning and big data
projects is that it is easy to fool yourself that your methods work. And the
problem is that the people who are good at this get promoted, and those who
find the mistakes are shunned.

------
ernestbro
"Somebody heard: Data is the new Oil: No it isn't. Data is not a commodity, it
needs to be transformed into a product before it's valuable"

Err...Oil needs to be transformed and refined before it can be called a
product (like gasoline, plastics). So the analogy is good and even supports
#1!

------
intrasight
> Data is not a commodity, it needs to be transformed into a product before
> it's valuable.

Commodities need to be transformed into products before they're valuable too.

------
swingbridge
Very good points; the article picks up on some accurate stereotypes that often
appear in business.

------
RickChen
I was recently working with a startup that is trying to tackle some of these
issues. Feel free to check them out and provide some feedback!
([https://datablade.com](https://datablade.com))

------
maverick_iceman
This is related to point 7, solution in search of a problem. I myself have
been guilty of this when I wanted to use deep learning models just because I
could. My much more experienced boss gently dissuaded me and I ended up with a
'boring old' logistic regression, which was completely adequate for the job.

Another time I was working with an engineer who built a neural net to predict
something. Turned out it was a really poor choice as interpretability was
important for the problem and the neural net's predictive power was actually
worse than more traditional models.

