

Data Analysis: The Hard Parts - SanderMak
http://blog.mikiobraun.de/2014/02/data-analysis-hard-parts.html

======
mswen
Great article outlining some of the issues that make really sound data
analysis hard work, even though tool sets have progressed substantially.

One thing I do is to remind myself to walk through the basics of a new set of
data before I start constructing models.

These include:

1) Measurement - Manually examine the data in the rawest form possible. You
may find, for example, that 14% of the data collected at stage 1 has not made
it to stage 2. Why? To look for measurement error you need to actually dig
into how your data is being captured. And ask yourself: to what extent do
these measured variables capture the relevant business reality?

2) Data fusion - Manually examine individual cases after the raw data has been
fused into a dataset/frame for analysis. Can you verify that the merge
actually worked properly? In particular, sort your data to find cases where
one or more of the variables are extreme. Extremes often reveal either errors
or a glimmer of the predictive gold you are mining for.

3) Run frequency distributions on all your variables, looking for
out-of-bounds data, oddly shaped distributions, etc.

4) Run pairwise comparisons on individual variables. Create scatterplots, run
basic correlations, and think about the relationships between the measured
variables. This will start to give you a basis for creating more
sophisticated models.
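
For anyone who wants something concrete, here is a rough sketch of those four checks in plain Python. Everything in it is made up for illustration: the rows, the column names (`id`, `age`, `spend`), the 0 to 120 age bounds, and the helper names are all assumptions, not anything from the article. In practice you would probably do the same thing with whatever data frame tool you already use.

```python
from collections import Counter
from math import sqrt


def lost_between_stages(stage1_ids, stage2_ids):
    """Return (fraction of stage-1 IDs absent from stage 2, sorted missing IDs)."""
    s1, s2 = set(stage1_ids), set(stage2_ids)
    missing = sorted(s1 - s2)
    return len(missing) / len(s1), missing


def extremes(rows, column, n=2):
    """Return the n lowest and n highest rows by `column` for manual inspection."""
    ordered = sorted(rows, key=lambda r: r[column])
    return ordered[:n], ordered[-n:]


def out_of_bounds(rows, column, low, high):
    """Return rows whose value in `column` falls outside [low, high]."""
    return [r for r in rows if not (low <= r[column] <= high)]


def pearson(xs, ys):
    """Plain Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / sqrt(vx * vy)


# Hypothetical fused dataset; id 4 carries a deliberately implausible age.
rows = [
    {"id": 1, "age": 34, "spend": 120.0},
    {"id": 2, "age": 41, "spend": 150.0},
    {"id": 3, "age": 29, "spend": 90.0},
    {"id": 4, "age": 230, "spend": 95.0},
    {"id": 5, "age": 52, "spend": 210.0},
]
stage1_ids = [1, 2, 3, 4, 5, 6, 7]        # IDs captured at stage 1 (made up)
stage2_ids = [r["id"] for r in rows]      # IDs that survived to stage 2

# 1) Measurement: how much data was lost between stages, and which cases?
loss, missing = lost_between_stages(stage1_ids, stage2_ids)

# 2) Data fusion: sort and look at the extreme cases by hand.
lowest, highest = extremes(rows, "age", n=1)

# 3) Frequency distribution (ages in 10-year bands) and out-of-bounds values.
age_bands = Counter(r["age"] // 10 * 10 for r in rows)
suspect = out_of_bounds(rows, "age", 0, 120)

# 4) Pairwise relationship, computed on the rows that passed the bounds check.
valid = [r for r in rows if r not in suspect]
r_age_spend = pearson([r["age"] for r in valid], [r["spend"] for r in valid])

print(f"lost between stages: {loss:.0%}, missing ids: {missing}")
print("highest age row:", highest)
print("age bands:", sorted(age_bands.items()))
print("suspect rows:", suspect)
print(f"age vs spend correlation: {r_age_spend:.3f}")
```

None of this replaces actually reading the raw records. The point is only that each check is a few lines, so there is little excuse to skip it before modeling.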

Obviously there is more, but if I discipline myself to do these basics before
moving on to building predictive models, I save myself a lot of grief. The
danger, as the post points out, is that easy-to-use tools for the
model-building stages of analysis tempt one to forgo much of that foundational
work and go straight to the fun of building models.

