
Ask HN: What tools are available for data cleaning? - iamwil
I just started messing around with some machine learning algorithms, and discovered that most of my time is spent cleaning up data. What tools are available for spotting anomalies in data and cleaning it up?

I know there's OpenRefine (open source, but abandoned) and DataWrangler (a research project, also abandoned as its originators went on to create a company) for non-technical people. For programmers, it's usually some version of SQL, Python's pandas, or R.

Something like OpenRefine looks great, but it doesn't operate on databases. I'm willing to pay for something, but not the usual enterprise software prices. I'm looking for something like...an Excel for data cleaning. Does anyone know of something that might fit the bill?
======
tixocloud
I'd recommend checking out RefinePro, which is based on OpenRefine:
[http://refinepro.com/](http://refinepro.com/)

They're doing a private beta and might have what you're looking for. I can put
you in touch with the founder if you need more information.

------
tgflynn
I think most people code this themselves because what "data cleaning" means is
usually specific to each dataset/problem.
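
As a rough illustration of what "coding it yourself" tends to look like, here is
a minimal Python sketch using only the standard library. The sample values, the
list of missing-value spellings, and the 1.5 x IQR cutoff are all assumptions
made up for the example; a real dataset would need its own rules.

```python
import statistics

# Spellings treated as "missing" (an assumption; adjust per dataset).
MISSING = {"", "na", "n/a", "null", "none", "-"}

def clean_value(raw):
    """Strip whitespace and map common missing-value spellings to None."""
    if raw is None:
        return None
    text = str(raw).strip()
    return None if text.lower() in MISSING else text

def to_float(text):
    """Parse a cleaned string as a float; return None if it is not numeric."""
    if text is None:
        return None
    try:
        # Drop thousands separators such as "1,000" before parsing.
        return float(text.replace(",", ""))
    except ValueError:
        return None

def iqr_outliers(values, k=1.5):
    """Return the values lying outside [Q1 - k*IQR, Q3 + k*IQR]."""
    present = [v for v in values if v is not None]
    if len(present) < 4:
        return []  # too few points for quartiles to mean much
    q1, _, q3 = statistics.quantiles(present, n=4, method="inclusive")
    spread = q3 - q1
    low, high = q1 - k * spread, q3 + k * spread
    return [v for v in present if v < low or v > high]

# Hypothetical messy column, as it might arrive from a CSV export.
raw_prices = [" 10.5", "11", "N/A", "9.8", "1,000", "", "10.1", "abc", "10.9"]
prices = [to_float(clean_value(p)) for p in raw_prices]
suspicious = iqr_outliers(prices)
```

With this sample column, the unparseable entries become None and the only value
flagged is 1000.0. Whether that is a real anomaly or a legitimate price is
exactly the dataset-specific judgment a generic tool can't make for you.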

------
T-A
Asking GitHub about "anomaly detection" turns up a bunch of stuff for time
series data. If you need something more generic, maybe
[https://en.wikipedia.org/wiki/ELKI](https://en.wikipedia.org/wiki/ELKI) could
be useful?

