[dupe] Data Is Never Raw (thenewatlantis.com)
33 points by collapse 10 days ago | 13 comments

Related: I’ve been trying to get away from describing data as “dirty”, which, among other things, implies that there is an obviously better “clean” state for the data. I think it is better to call data “complicated”: rather than cleaning it, we have ways to simplify it for processing.

Have 100,000 people enter their address with a text field for the state and get back to me with how many states there are.

There are definitely cases in which the possible values are enumerable and known. I’m thinking more about situations in which the form tries to collect a many-to-one relation or some other ambiguity, such as coming across multiple values in a race/gender field in a police interaction database because multiple officers were involved. My problem with using “dirty” is that it is so ambiguous that it encompasses relatively simple issues that can be translated via a lookup table, as well as messiness that actually reflects the messiness of the reality beyond the input form’s common case. In the latter case, there isn’t a “clean” state so much as there is a rethinking of what you thought you were measuring.

Did the people who replied with 50 really get the correct answer? Doesn't that assume that there is a "known" list of states, that addresses are entered correctly, and that we're talking about a specific country?

The only way we really know the answer to this question is by placing a lot of bias into our answer.

50 states, probably. ;)

Edit: assuming a very US-centric dataset...

50 states, represented probably 500 different ways... full names, partial names, abbreviations, abbreviations with dots, abbreviations with dots and extra spaces, different cases...
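
For what it's worth, the "50 states, 500 spellings" case is the kind of thing a lookup table handles. A rough sketch, assuming a US-centric dataset; `STATE_LOOKUP`, `normalize_state`, and `summarize` are made-up names, and the table is deliberately incomplete:

```python
import re

# Hypothetical and deliberately incomplete lookup table. A real one would
# cover every state plus DC, the territories, and military codes.
STATE_LOOKUP = {
    "ca": "CA",
    "calif": "CA",
    "california": "CA",
    "ny": "NY",
    "new york": "NY",
    "wa": "WA",
    "washington": "WA",
}

def normalize_state(raw):
    """Return a canonical code, or None if the input is not in the table."""
    # Drop dots, collapse runs of whitespace, ignore case.
    key = re.sub(r"\s+", " ", raw.replace(".", "")).strip().lower()
    # Second attempt with spaces removed catches forms like "N. Y.".
    return STATE_LOOKUP.get(key) or STATE_LOOKUP.get(key.replace(" ", ""))

def summarize(entries):
    """Split entries into canonical codes and raw values that did not map."""
    known, unknown = set(), set()
    for entry in entries:
        code = normalize_state(entry)
        if code is None:
            unknown.add(entry)
        else:
            known.add(code)
    return known, unknown
```

Anything that lands in the `unknown` set is the interesting part: those entries are the ones a lookup table can't translate, so someone has to decide what they mean.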

That data happens to take the form of human input. In that case the different ways in which participants choose to enter their address are part of what you are collecting. That does not suggest that the data is raw or dirty; you just gathered superfluous information as a side effect of your methodology. I think it is better described as overly complicated.

If the problem were the opposite, it would also make sense to say that the data is too simple rather than too clean. That gets to what the article is saying: that all data collection is inherently biased.

Well, you can't answer 50 states just because you had 100,000 people. What if the source of that data was a CA insurance query form where they must enter, specifically, their CA address? (I'm assuming you'd still get other states, but it might not be 50.)

Plus Washington DC, the US territories, and US military bases abroad.

There are validated data types and unvalidated data types.

“Our logs are stored as newline-delimited JSON objects.” No, it’s improperly escaped so it’s actually a JSON-like string of unknown character encoding.
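
One way to find out whether the logs really are newline-delimited JSON objects is to parse them line by line and keep the failures instead of assuming. A minimal sketch; `parse_ndjson` is a made-up helper, and UTF-8 as the default encoding is an assumption:

```python
import json

def parse_ndjson(raw, encoding="utf-8"):
    """Parse bytes as newline-delimited JSON objects.

    Returns (records, errors), where errors is a list of
    (line_number, reason) tuples for lines that did not validate.
    """
    records, errors = [], []
    for lineno, line in enumerate(raw.split(b"\n"), start=1):
        if not line.strip():
            continue  # skip blank lines, including a trailing newline
        try:
            obj = json.loads(line.decode(encoding))
        except (UnicodeDecodeError, json.JSONDecodeError) as exc:
            # Wrong encoding or broken escaping both end up here.
            errors.append((lineno, str(exc)))
            continue
        if isinstance(obj, dict):
            records.append(obj)
        else:
            errors.append((lineno, "valid JSON but not an object"))
    return records, errors
```

If `errors` comes back non-empty, the honest description of the field is "mostly JSON", not "JSON".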

“This is a postal code field.” Nope, it’s a user-generated string that occasionally converges on something approximating a postal code.

“This is the user’s IP address.” Well I clearly see localhost in here so unless folks are browsing the site from our servers, this is “sometimes” the user’s IP address.
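
The same goes for the IP field: Python's standard `ipaddress` module can at least tell you which values cannot be a remote user's address. A sketch, where `classify_ip` and its four labels are my own invention:

```python
import ipaddress

def classify_ip(value):
    """Label a string as 'invalid', 'loopback', 'private', or 'other'."""
    try:
        addr = ipaddress.ip_address(value.strip())
    except ValueError:
        return "invalid"  # not parseable as IPv4 or IPv6
    # Check loopback first: 127.0.0.1 also counts as private.
    if addr.is_loopback:
        return "loopback"
    if addr.is_private:
        return "private"
    # 'other' is not proof of a real client address (proxies, multicast, etc.).
    return "other"
```

Counting the labels over the whole column tells you how often "the user's IP address" is actually something else.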

Semantic normalization?

If someone hired me a data scientist and I said, "Can you help me process this raw data into something that we can work with in this TensorFlow model?" and they replied with:

"Don't call it Raw Data"

I would ask my manager to get me someone else.
