In my opinion, this is representative of the problems with data science tools today. There is so much focus on the machine learning algorithms rather than getting data ready for the algorithms. While there is a question that lets respondents pick which of 15 different modeling algorithms they use, there's nothing that talks about what technologies people use to deal with "dirty data", which is agreed to be the biggest challenge for data scientists. I think more formal study of data preparation and feature engineering is too frequently ignored in the industry.
Two examples of this: Kaggle Datasets supports wiki-like editing of metadata (file and column descriptions) and makes it easy to see, fork, and build on all the analytics created on the data so far.
We're just getting started with each of these products: we want Kaggle Datasets to support a fully collaborative model around working with all your data in the future, and Kaggle Kernels to support every analytics and machine learning usecase.
So statements like "getting data ready for the algorithms" seem kind of meaningless to me, in the sense of general methodologies. How could you possibly "get the data ready" without considering what it is, how it will be used, etc. How can it possibly be generalized to anything beyond the specific requirements of each problem instance?
I'm just really curious what you are imagining when you say that better tools are needed here.
We have actually used it to compete on Kaggle without having to perform manual feature engineering with success.
One simple one is missing data. Missing data is rarely a null, I've seen (on one field, in one dataset):
Blank # literally the string "Blank"
NULL # Again, the string
No data was entered for this field
No data is known
The data is not known
There is no data
Pull out the most common items, manually mark some as "equivalent to blank" and remove.
Identify common substrings with known text (N/A, NULL, etc) and bring up those examples.
Are useful, I'd like to extend with more clustering and analysis to bring out other common general issues but rare specific issues. Lots of similar things with encodings, etc. too.
Other things that might be good are clearer ways I could supply general conditions I expect to hold true, then bring the most egregious ones to my attention so I can either clear out / deal with them in some way. A good way of recording issues that have already been analysed and found to be OK would be great too.
As an example of a tool: Trifacta (disclaimer I work here) https://www.trifacta.com/products/wrangler/. We're trying to improve data cleaning with features such as suggesting transforms the user might want, integrating data profiling through all stages to discover and understand, and transform previews so the user can understand the impact.
I think there's a huge opportunity for better tools in the problem space.
Generally, once a problem at work has come to the point of being a "kaggle problem", it's trivially easy. The main problem is unstructured data, with infinite ways of specifying similar ways to measure the same attribute, and lots of leeway to build an unmaintainable data pipeline between the data generation process and the model at the end.
Others are far more complex and start with much messier data and/or complex formulations.
The ability to efficiently and effectively derive insights from such data is scarce.
Let's say you are predicting store sales. You create a feature that holds the store sales of one year back. The feature works really well and you are happy with your evaluation. But you captured bias: The previous model the store used was "predict today's sales by looking at last year's sales". Store managers fitted their sales tactics to this model (when the model predicted too much sales, the store managers do their best to get rid of the surplus inventory, for instance: by adding discounts or moving the products to a more prominent spot in the store).
So in the end you end up with a model with good evaluation, but you actually have (over)fitted to previous policies/models. You have not created the best possible sales predictor. How to ever find this out, without a costly intimate deep-dive in the data and data generation processes?
I don't agree. For every modern tech company that collects data that lends itself to automated data cleaning, there's a 40+ year old company that defined what data to be collected in 1990, designed an "automated system" in 1995 and has been shoehorning improvements on that system since then.
At my last job I was given access to a database with 150+ tables with no data dictionary. The person who wrote the load process and ETL (the output was a lot of summaries) had left 10 years before and nobody truly understood how anything actually worked or the downstream dependencies. It took me a week of digging just to find out which of those 150 tables were just temp tables for one of the many queries that executed on that system.
It's going to be a while before somebody figures out how to clean that data automatically, or even find issues in that data. That's the reality of the world of data for many organizations.
When I am talking about automated data cleaning, I am talking more about preprocessing text, dealing with missing variables, discarding duplicates, noisy/uninformative variable and outlier removal, spelling correction, feature interactions and transformations. All of these can be (and are being) largely automated.  
A data lake with 150+ undocumented tables is garbage in-garbage out, both for machines and humans. I'd almost label that as the barrier: "Data not available", not: "Dirty data". While a reality for some companies, such a company really needs a DB admin or data engineer, not try to shoehorn an (expensive) data scientist in these roles.
Another example (this time on the output-end): You build a model to route emails to sets of experts inside an organization. Your proxy loss is multi-class logistic loss on topic classes. You are interested in improving response times (which you can more or less measure in aggregate) and quality of response (which is harder to measure, if at all).
You build a first iteration of the model and response times improve. Then you create new features and modeling techniques and you improve logistic loss, but when you deploy this model, response times go way down. What happened? Maybe the experts fitted/adapted to the model output: They learned how to quickly answer a specific type of email because it keeps getting routed to them. The new model does better matching topics to emails, resulting in those emails now being send to another expert. While this expert in the long-term may become better at answering emails closer to his/her topic expertise, in a faster and more informative manner, in the short-term he/she will be slower and of lower quality, as they need to adapt to the new types of emails they are getting, and lack the priors for dealing with ambiguous emails.
Both on the input and the output of models there are all sorts of these nasty human-feedback loops that are very hard to even identify and harder to solve.
Leon Bottou gave a talk about these challenges in the context of ML at Facebook: http://leon.bottou.org/slides/2challenges/2challenges.pdf (he mentions the decisions of two separate ML teams adversely influencing their individual experiments). This paper (https://research.google.com/pubs/pub43146.html) talks about "hidden feedback loops" and "entanglement".