
Steps to a clean dataset with Pandas - NicoJuicy
https://towardsdatascience.com/3-steps-to-a-clean-dataset-with-pandas-2b80ef0c81ae?source
======
westurner
To add to the three points in the article:

Data quality
[https://en.wikipedia.org/wiki/Data_quality](https://en.wikipedia.org/wiki/Data_quality)

Imputation
[https://en.wikipedia.org/wiki/Imputation_(statistics)](https://en.wikipedia.org/wiki/Imputation_\(statistics\))

Feature selection
[https://en.wikipedia.org/wiki/Feature_selection](https://en.wikipedia.org/wiki/Feature_selection)

datacleaner can drop NaNs, do imputation with "_the mode (for categorical
variables) or median (for continuous variables) on a column-by-column basis_",
and encode "_non-numerical variables (e.g., categorical variables with
strings) with numerical equivalents_" with Pandas DataFrames and scikit-learn.
[https://github.com/rhiever/datacleaner](https://github.com/rhiever/datacleaner)
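
A rough sketch of that column-by-column idea in plain pandas, for illustration only. This is not datacleaner's own code or API; the `impute_and_encode` helper and the toy column names are made up here, and the integer encoding uses pandas category codes as a stand-in for whatever encoder the library applies.

```python
import pandas as pd

# Hypothetical toy data; the column names are illustrative only.
df = pd.DataFrame({
    "age": [25.0, None, 40.0, 31.0],
    "city": ["Oslo", "Lima", None, "Lima"],
})

def impute_and_encode(frame):
    """Fill NaNs per column (median or mode), then integer-encode non-numeric columns."""
    out = frame.copy()
    for col in out.columns:
        if pd.api.types.is_numeric_dtype(out[col]):
            # Continuous column: fill gaps with the column median.
            out[col] = out[col].fillna(out[col].median())
        else:
            # Categorical column: fill gaps with the most frequent value,
            # then replace each distinct string with an integer code.
            out[col] = out[col].fillna(out[col].mode().iloc[0])
            out[col] = out[col].astype("category").cat.codes
    return out

cleaned = impute_and_encode(df)
print(cleaned)
```

Dropping rows instead would be `df.dropna()`; which one is appropriate depends on how much data is missing and why.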

sklearn-pandas "_[maps] DataFrame columns to transformations, which are later
recombined into features_", and provides "_A couple of special transformers
that work well with pandas inputs: CategoricalImputer and FunctionTransformer_"
[https://github.com/scikit-learn-contrib/sklearn-pandas](https://github.com/scikit-learn-contrib/sklearn-pandas)
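
The "map columns to transformations, then recombine" pattern can be sketched with pandas alone. This is only a sketch of the idea, not the sklearn-pandas API; the `column_transforms` dict and the toy data are hypothetical.

```python
import pandas as pd

# Hypothetical toy data.
df = pd.DataFrame({
    "pet": ["cat", "dog", "cat"],
    "weight": [4.0, 10.0, 6.0],
})

# Hypothetical mapping: column name -> function that turns that column
# into one or more feature columns (returned as a DataFrame).
column_transforms = {
    "pet": lambda s: pd.get_dummies(s, prefix="pet", dtype=int),
    "weight": lambda s: ((s - s.mean()) / s.std()).to_frame("weight_scaled"),
}

# Apply each transformation to its column, then recombine side by side.
features = pd.concat(
    [transform(df[col]) for col, transform in column_transforms.items()],
    axis=1,
)
print(features)
```

Keeping the mapping in one place makes it easy to see which transformation each column gets, which is roughly the convenience the library is offering.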

Featuretools
[https://github.com/Featuretools/featuretools](https://github.com/Featuretools/featuretools)

> _Featuretools is a python library for automated feature engineering._ [using
> DFS: Deep Feature Synthesis]

auto-sklearn has a feature preprocessing step (with e.g. PCA, which reduces
dimensionality rather than selecting features), as well as data preprocessing:
"_One-Hot encoding of categorical features, imputation of missing values and
the normalization of features or samples_"
[https://automl.github.io/auto-sklearn/master/manual.html#tur...](https://automl.github.io/auto-sklearn/master/manual.html#turning-off-preprocessing)
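
For a feel of what normalization followed by PCA does, here is a minimal NumPy sketch. It is not how auto-sklearn implements or configures its preprocessing; the synthetic data, the choice of `k = 2`, and all variable names are assumptions for illustration.

```python
import numpy as np

# Synthetic data (assumption): 50 samples, 3 features,
# where the third feature is almost a copy of the first.
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))
X[:, 2] = X[:, 0] + 0.01 * rng.normal(size=50)

# Normalize each feature to zero mean and unit variance.
X_std = (X - X.mean(axis=0)) / X.std(axis=0)

# PCA via SVD: the rows of Vt are the principal directions,
# and the squared singular values are proportional to explained variance.
U, S, Vt = np.linalg.svd(X_std, full_matrices=False)
explained = S**2 / np.sum(S**2)

# Keep the first k components (k chosen by hand here).
k = 2
X_reduced = X_std @ Vt[:k].T

print(explained)
print(X_reduced.shape)
```

Because one feature is nearly redundant, the first two components should carry almost all of the variance, so little is lost by dropping the third.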

auto_ml uses "_Deep Learning [with Keras and TensorFlow] to learn features
for us, and Gradient Boosting [with XGBoost] to turn those features into
accurate predictions_"
[https://auto-ml.readthedocs.io/en/latest/deep_learning.html#...](https://auto-ml.readthedocs.io/en/latest/deep_learning.html#feature-learning)

