
Data Is Never Raw - collapse
https://www.thenewatlantis.com/publications/why-data-is-never-raw
======
clhodapp
Previously:
[https://news.ycombinator.com/item?id=18735492](https://news.ycombinator.com/item?id=18735492)

------
danso
Related: I’ve been trying to get away from describing data as “dirty”, which,
among other things, implies that there is an obviously better “clean” state
for the data. I think it is better to call data “complicated”, and rather than
cleaning it, we have ways to simplify it for processing.

~~~
LanceH
Have 100,000 people enter their address with a text field for the state and
get back to me with how many states there are.

~~~
eximius
50 states, probably. ;)

Edit: assuming a very US-centric dataset...

~~~
icedchai
50 states, represented probably 500 different ways... full names, partial
names, abbreviations, abbreviations with dots, abbreviations with dots and
extra spaces, different cases...
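
A minimal sketch of how those variants might be collapsed into one code, assuming a US-only dataset. The lookup table is a hypothetical three-state excerpt and `normalize_state` is an illustrative name, not something from the article or an existing library. Partial names and misspellings are deliberately left unresolved, since guessing at them is exactly where the "simplification" starts to encode the processor's own assumptions.

```python
import re

# Hypothetical excerpt of a full-name -> postal-code table (assumption:
# a real table would list every state and territory you care about).
STATE_ABBREVIATIONS = {
    "california": "CA",
    "new york": "NY",
    "texas": "TX",
}
VALID_CODES = set(STATE_ABBREVIATIONS.values())


def normalize_state(raw):
    """Return a two-letter code, or None if the input is not recognized."""
    # Drop dots, collapse runs of whitespace, trim, and ignore case.
    cleaned = re.sub(r"\s+", " ", raw.replace(".", "")).strip().lower()

    # Handles "NY", "ny", "N.Y." and "N. Y." (letters separated by spaces).
    compact = cleaned.replace(" ", "").upper()
    if compact in VALID_CODES:
        return compact

    # Handles full names in any case with irregular spacing.
    return STATE_ABBREVIATIONS.get(cleaned)


if __name__ == "__main__":
    for entry in ["NY", " n.y. ", "N. Y.", "New   York", "Calif"]:
        print(repr(entry), "->", normalize_state(entry))
```

Anything that comes back as `None` (such as "Calif" here) still needs a human decision or an explicit rule, which is the part that no lookup table settles on its own.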

~~~
porphyrogene
That data happens to take the form of human input. In that case the different
ways in which participants choose to enter their address are part of what you
are collecting. It does not suggest that the data is raw or dirty; you just
gathered superfluous information as a side effect of your methodology. I think
it is better described as overly complicated.

If the problem were the opposite, it would also make sense to say that the data
is too simple rather than too clean. That gets to what the article is saying:
that all data collection is inherently biased.

------
rhacker
If someone hired me a data scientist and I said: "Can you help me process this
raw data into something that we can work with in this tensorflow model?" And
they replied with:

"Don't call it Raw Data"

I would ask my manager to get me someone else.

