Steps to a clean dataset with Pandas (towardsdatascience.com)
4 points by NicoJuicy 4 months ago | 1 comment

To add to the three points in the article:

Data quality https://en.wikipedia.org/wiki/Data_quality

Imputation https://en.wikipedia.org/wiki/Imputation_(statistics)

Feature selection https://en.wikipedia.org/wiki/Feature_selection

datacleaner can drop NaNs, do imputation with "the mode (for categorical variables) or median (for continuous variables) on a column-by-column basis", and encode "non-numerical variables (e.g., categorical variables with strings) with numerical equivalents" with Pandas DataFrames and scikit-learn. https://github.com/rhiever/datacleaner
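As a rough illustration of that kind of column-by-column cleaning, here is a minimal sketch in plain pandas. It is not datacleaner's actual code or API; the helper name `clean_frame` and the sample columns are made up for the example, and it assumes numeric columns are the "continuous" ones and everything else is categorical.

```python
import numpy as np
import pandas as pd


def clean_frame(df: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical helper: impute, then encode, one column at a time."""
    out = df.copy()
    for col in out.columns:
        if pd.api.types.is_numeric_dtype(out[col]):
            # Continuous column: fill missing values with the median.
            out[col] = out[col].fillna(out[col].median())
        else:
            # Categorical column: fill missing values with the mode.
            mode = out[col].mode(dropna=True)
            if not mode.empty:
                out[col] = out[col].fillna(mode.iloc[0])
            # Replace each distinct string with an integer code
            # (codes follow the sorted order of the distinct values).
            out[col] = pd.factorize(out[col], sort=True)[0]
    return out


# Made-up sample data with one gap per column.
raw = pd.DataFrame(
    {
        "age": [22.0, np.nan, 35.0, 41.0],
        "city": ["Oslo", "Lima", None, "Oslo"],
    }
)
cleaned = clean_frame(raw)
print(cleaned)
```

Integer codes imply an ordering that the original categories may not have, so for many models one-hot encoding is the safer choice.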

sklearn-pandas "[maps] DataFrame columns to transformations, which are later recombined into features", and provides "A couple of special transformers that work well with pandas inputs: CategoricalImputer and FunctionTransformer" https://github.com/scikit-learn-contrib/sklearn-pandas

Featuretools https://github.com/Featuretools/featuretools

> Featuretools is a python library for automated feature engineering. [using DFS: Deep Feature Synthesis]

auto-sklearn does feature selection (e.g. with PCA) in a "preprocessing" step, along with "One-Hot encoding of categorical features, imputation of missing values and the normalization of features or samples" https://automl.github.io/auto-sklearn/master/manual.html#tur...
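For a sense of what those preprocessing steps look like when wired up by hand, here is a sketch using plain scikit-learn. This is not auto-sklearn's API or its internal pipeline; the column names and data are invented, and the choice of median/most-frequent imputation and two PCA components is just an assumption for the example.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.decomposition import PCA
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Hypothetical column names for the example.
numeric_cols = ["height", "weight"]
categorical_cols = ["color"]

# Numeric columns: impute missing values, then normalize.
numeric_steps = Pipeline(
    [
        ("impute", SimpleImputer(strategy="median")),
        ("scale", StandardScaler()),
    ]
)

# Categorical columns: impute missing values, then one-hot encode.
categorical_steps = Pipeline(
    [
        ("impute", SimpleImputer(strategy="most_frequent")),
        ("onehot", OneHotEncoder(handle_unknown="ignore")),
    ]
)

preprocess = ColumnTransformer(
    transformers=[
        ("num", numeric_steps, numeric_cols),
        ("cat", categorical_steps, categorical_cols),
    ],
    sparse_threshold=0.0,  # always return a dense array for PCA
)

# PCA reduces the preprocessed features to two components.
model = Pipeline([("preprocess", preprocess), ("pca", PCA(n_components=2))])

# Made-up data with a few missing values.
frame = pd.DataFrame(
    {
        "height": [1.70, 1.82, np.nan, 1.65, 1.90],
        "weight": [65.0, np.nan, 72.0, 58.0, 88.0],
        "color": pd.Series(["red", np.nan, "blue", "red", "blue"], dtype=object),
    }
)
features = model.fit_transform(frame)
print(features.shape)
```

The point of auto-sklearn is that it searches over choices like these automatically instead of having them hard-coded.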

auto_ml uses "Deep Learning [with Keras and TensorFlow] to learn features for us, and Gradient Boosting [with XGBoost] to turn those features into accurate predictions" https://auto-ml.readthedocs.io/en/latest/deep_learning.html#...
