
In "What barriers are faced at work?", I really wish they had broken down the "dirty data" response into more categories. In particular, I'd love to know if people are dealing with data quality issues, feature engineering issues, or something else altogether.

In my opinion, this is representative of the problems with data science tools today. There is so much focus on the machine learning algorithms rather than on getting data ready for the algorithms. While there is a question that lets respondents pick which of 15 different modeling algorithms they use, there's nothing that asks what technologies people use to deal with "dirty data", which is agreed to be the biggest challenge for data scientists. I think the formal study of data preparation and feature engineering is too frequently ignored in the industry.

Completely agree. Data quality issues were a big part of our motivation with Kaggle Datasets (an open data platform where the quality of the dataset improves as more people use it) and Kaggle Kernels (a reproducible data science workbench that combines versioned data, code, and compute environments to create reproducible results).

Two examples of this: Kaggle Datasets supports wiki-like editing of metadata (file and column descriptions) and makes it easy to see, fork, and build on all the analytics created on the data so far.

We're just getting started with each of these products: we want Kaggle Datasets to support a fully collaborative model around working with all your data in the future, and Kaggle Kernels to support every analytics and machine learning use case.

Of course everyone agrees that "cleaning data" is difficult and boring, and it's always mentioned, but what I don't really understand is what kind of tools people expect for this beyond what is already available. E.g. pandas is pretty good at merging tables, re-ordering, finding duplicates, filling or dropping unknowns, etc. There are also tools for visualizing large amounts of data, looking for outliers, etc. Beyond the basic tools it seems to me that each dataset requires decisions to be made that can't be automated (e.g. do I drop or fill the unknowns?). I don't see how this could be improved, as every decision has a solid, semantic implication related to whatever is the overarching research question.
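To make the "drop or fill" point concrete, here is a minimal sketch in plain Python (no pandas). The `readings` values are made up for illustration; the only point is that each choice gives a different answer to the same question.

```python
from statistics import mean

# Hypothetical column; None marks an unknown value.
readings = [10.0, None, 30.0, None, 50.0]

# Option 1: drop the unknowns.
dropped = [x for x in readings if x is not None]

# Option 2: fill the unknowns with zero.
zero_filled = [0.0 if x is None else x for x in readings]

# Option 3: fill the unknowns with the mean of the known values.
known_mean = mean(dropped)
mean_filled = [known_mean if x is None else x for x in readings]

# Same column, three different datasets to hand to a model.
summary = {
    "dropped": (len(dropped), mean(dropped)),
    "zero_filled": (len(zero_filled), mean(zero_filled)),
    "mean_filled": (len(mean_filled), mean(mean_filled)),
}
```

Dropping keeps the mean but shrinks the sample, zero-filling drags the mean down, and mean-filling keeps both the mean and the row count while understating the spread. Which one is right depends on what an unknown means in the dataset at hand.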

So statements like "getting data ready for the algorithms" seem kind of meaningless to me, in the sense of general methodologies. How could you possibly "get the data ready" without considering what it is, how it will be used, etc.? How can it possibly be generalized to anything beyond the specific requirements of each problem instance?

I'm just really curious what you are imagining when you say that better tools are needed here.

I am the lead contributor of a python library called Featuretools[0]. It is intended to perform automated feature engineering on time-varying and multi-table datasets. We see it as bridging a gap between pandas and libraries for machine learning like scikit-learn. It doesn't handle data cleaning necessarily, but it does help get raw datasets ready for machine learning algorithms.

We have actually used it to compete successfully on Kaggle without having to perform manual feature engineering.

[0] https://www.featuretools.com

Wow, this looks very cool!

I'm starting to build up various utilities to help with this kind of thing, but I fully agree. The decisions require understanding the business requirements (do I use source X or Y for field 1, what errors are OK, what types of error are worst, etc), but the process of finding some of these could be better.

One simple one is missing data. Missing data is rarely a null; I've seen (on one field, in one dataset):

    " "
    Blank # literally the string "Blank"
    NULL # Again, the string
    No data
    No! Data
    No data was entered for this field
    No data is known
    The data is not known
    There is no data

And many, many more. None of these can be reliably identified automatically, but some processes are useful, such as:

- Pull out the most common items, manually mark some as "equivalent to blank", and remove them.

- Identify common substrings of known text (N/A, NULL, etc.) and bring up those examples.

I'd like to extend this with more clustering and analysis to bring out issues that are common in general but rare in any specific form. There are lots of similar things with encodings, etc., too.
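A rough sketch of those two processes in plain Python, assuming the column is just a list of strings. The function names and the `KNOWN_MARKERS` list are hypothetical, and a person still has to review whatever gets surfaced.

```python
from collections import Counter

# Hypothetical marker substrings; extend these per dataset.
KNOWN_MARKERS = ("n/a", "null", "blank", "no data", "not known")

def normalize(value):
    """Lowercase and collapse whitespace so near-identical strings group together."""
    return " ".join(str(value).lower().split())

def most_common_values(values, top=10):
    """Most frequent normalized values, for manual marking as 'equivalent to blank'."""
    return Counter(normalize(v) for v in values).most_common(top)

def marker_candidates(values, markers=KNOWN_MARKERS):
    """Distinct normalized values that are empty or contain a known marker."""
    hits = set()
    for v in values:
        norm = normalize(v)
        if norm == "" or any(m in norm for m in markers):
            hits.add(norm)
    return sorted(hits)

def blank_out(values, equivalent_to_blank):
    """Replace values a reviewer marked as blank-equivalent with None."""
    return [None if normalize(v) in equivalent_to_blank else v for v in values]

column = [" ", "Blank", "NULL", "No data", "No! Data",
          "The data is not known", "42", "17", "No data"]
candidates = marker_candidates(column)
```

With this `column`, "No! Data" is not picked up because the punctuation breaks the substring match. That kind of rare variant is where clustering or fuzzy matching would have to take over. Substring matching can also flag legitimate values that happen to contain a marker, so the output is a review list, not a delete list.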

Other things that might be good are clearer ways to supply general conditions I expect to hold true, and then have the most egregious violations brought to my attention so I can either clear them out or deal with them in some way. A good way of recording issues that have already been analysed and found to be OK would be great too.
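One way this could look, as a minimal sketch: each expected condition is a function returning a severity score (0 means the row is fine), and an `accepted` set records violations that were already reviewed and found OK. The names `check_rules`, `rules` and `accepted`, and the sample rows, are all hypothetical.

```python
def check_rules(rows, rules, accepted=frozenset()):
    """Return violations sorted worst-first, skipping ones already reviewed.

    rules maps a rule name to a function returning a severity score
    (0 means the row satisfies the rule; larger means more egregious).
    accepted holds (row_id, rule_name) pairs already found to be OK.
    """
    violations = []
    for row in rows:
        for name, severity_of in rules.items():
            score = severity_of(row)
            if score > 0 and (row["id"], name) not in accepted:
                violations.append((score, row["id"], name))
    return sorted(violations, reverse=True)

rows = [
    {"id": 1, "age": 34, "price": 9.99},
    {"id": 2, "age": -5, "price": 12.50},
    {"id": 3, "age": 212, "price": -1.00},
]
rules = {
    # Severity is the distance outside the expected 0..120 range.
    "age_in_range": lambda r: max(0 - r["age"], r["age"] - 120, 0),
    # Severity is how far below zero the price is.
    "price_non_negative": lambda r: max(-r["price"], 0),
}
# Row 2's negative age was already looked at and signed off.
accepted = {(2, "age_in_range")}

report = check_rules(rows, rules, accepted)
```

Severity scores from different rules are not really comparable (years versus currency here), so in practice you would want to rank within a rule or normalize the scores somehow. The `accepted` set would also need to live somewhere persistent to be useful across runs.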

Yes, completely agree that each dataset requires decisions to be made that can't be automated, but there are huge opportunities for tools to assist users in understanding what cleaning decisions they might want to make and how those decisions affect the data. Most data cleaning tools do a very poor job of helping the user visualize and understand the impact cleaning has on data - they're usually very low level (such as pandas).

As an example of a tool: Trifacta (disclaimer: I work here) https://www.trifacta.com/products/wrangler/. We're trying to improve data cleaning with features such as suggesting transforms the user might want, integrating data profiling through all stages to help users discover and understand their data, and transform previews so the user can understand the impact.

I think there's a huge opportunity for better tools in the problem space.

That's precisely the problem with Kaggle. The data is mostly cleaned for you, yet cleaning is most of the job of a DS in industry. Cleaning your data improves performance way more than working hard on optimizing your ML algo.

> There is so much focus on the machine learning algorithms rather than getting data ready for the algorithms.

Generally, once a problem at work has come to the point of being a "kaggle problem", it's trivially easy. The main problem is unstructured data, with endless ways of recording and measuring the same attribute, and lots of leeway to build an unmaintainable data pipeline between the data generation process and the model at the end.

Not all Kaggle problems are created equal. Some look like a train matrix, a single target, and a test matrix.

Others are far more complex and start with much messier data and/or complex formulations.


- www.kaggle.com/c/nips-2017-non-targeted-adversarial-attack/
- www.kaggle.com/c/the-allen-ai-science-challenge

I disagree that a "kaggle problem" style problem is trivially easy, but I strongly agree with the sentiment that dealing with unstructured data is often a much bigger, deeper, and broader problem than the choice of a particular algorithm or ensemble of them.

The ability to efficiently and effectively derive insights from such data is scarce.

Right, by "kaggle problem" I mean the general case where we roughly know what we're going to want to have on the right hand side of the model we're going to run (plus or minus some feature engineering, model choice and other hyperparameter specification, etc.)

Dirty data is not as much of a problem for me as human-biased data. Dirty data engineering, like modeling, will soon be largely automated.

Let's say you are predicting store sales. You create a feature that holds the store sales of one year back. The feature works really well and you are happy with your evaluation. But you captured bias: The previous model the store used was "predict today's sales by looking at last year's sales". Store managers fitted their sales tactics to this model (when the model predicted too many sales, the store managers did their best to get rid of the surplus inventory, for instance by adding discounts or moving the products to a more prominent spot in the store).

So in the end you end up with a model with good evaluation, but you actually have (over)fitted to previous policies/models. You have not created the best possible sales predictor. How to ever find this out, without a costly intimate deep-dive in the data and data generation processes?

>Dirty data engineering, like modeling, will soon be largely automated.

I don't agree. For every modern tech company that collects data that lends itself to automated data cleaning, there's a 40+ year old company that defined what data to collect in 1990, designed an "automated system" in 1995 and has been shoehorning improvements into that system since then.

At my last job I was given access to a database with 150+ tables with no data dictionary. The person who wrote the load process and ETL (the output was a lot of summaries) had left 10 years before and nobody truly understood how anything actually worked or the downstream dependencies. It took me a week of digging just to find out which of those 150 tables were just temp tables for one of the many queries that executed on that system.

It's going to be a while before somebody figures out how to clean that data automatically, or even find issues in that data. That's the reality of the world of data for many organizations.

It seems to me you were given three jobs: database admin, data engineer, and data scientist.

When I am talking about automated data cleaning, I am talking more about preprocessing text, dealing with missing variables, discarding duplicates, noisy/uninformative variable and outlier removal, spelling correction, feature interactions and transformations. All of these can be (and are being) largely automated. [1] [2]

A data lake with 150+ undocumented tables is garbage in, garbage out, both for machines and humans. I'd almost label that as the barrier "Data not available", not "Dirty data". While a reality for some companies, such a company really needs a DB admin or data engineer, rather than shoehorning an (expensive) data scientist into these roles.

[1] https://people.csail.mit.edu/kalyan/dsm/

[2] https://www.ijcai.org/proceedings/2017/352

If I understand you correctly, the way you'd address this is by using counterfactuals. See this course[1] for an overview and this paper[2] which talks about the bias problem in the context of movie recommendations.

[1] http://www.cs.cornell.edu/courses/cs7792/2016fa/

[2] http://www.cs.cornell.edu/people/tj/publications/schnabel_et...

Yes, counterfactual inference is relevant to this. But it is not so much about answering "what would have happened if?", but more about control theory and feedback loops: Your model never being a static function, but a node inside a giant recursive net composed of other models and humans.

Another example (this time on the output-end): You build a model to route emails to sets of experts inside an organization. Your proxy loss is multi-class logistic loss on topic classes. You are interested in improving response times (which you can more or less measure in aggregate) and quality of response (which is harder to measure, if at all).

You build a first iteration of the model and response times improve. Then you create new features and modeling techniques and you improve logistic loss, but when you deploy this model, response times get much worse. What happened? Maybe the experts fitted/adapted to the model output: They learned how to quickly answer a specific type of email because it keeps getting routed to them. The new model does better at matching topics to emails, resulting in those emails now being sent to another expert. While this expert may in the long term become better at answering emails closer to his/her topic expertise, in a faster and more informative manner, in the short term he/she will be slower and give lower-quality answers, needing to adapt to the new types of emails and lacking the priors for dealing with ambiguous ones.

Both on the input and the output of models there are all sorts of these nasty human-feedback loops that are very hard to even identify and harder to solve.

Leon Bottou gave a talk about these challenges in the context of ML at Facebook: http://leon.bottou.org/slides/2challenges/2challenges.pdf (he mentions the decisions of two separate ML teams adversely influencing their individual experiments). This paper (https://research.google.com/pubs/pub43146.html) talks about "hidden feedback loops" and "entanglement".

As the saying goes, 80% of data science is cleaning the data and 20% is complaining about cleaning the data.
