
Data Is Never “Raw” - seinundzeit
https://www.thenewatlantis.com/publications/why-data-is-never-raw
======
yuliyp
I think the author decides that "raw data" means "reality". I don't think
that's a useful definition, since, as the author shows, it leads to the
conclusion that raw data does not exist.

Instead, a more useful definition is that "raw data" are observations of
things. Those observations can occur in lots of different ways (surveys,
logging calls, billing system records, just to name a few). It's important to
always distinguish between the method of observation and the thing that's
being observed, though.

Similarly, once you have "raw data", you can analyze it. That analysis too is
yet another transformative step, which can introduce more biases and errors.

~~~
simonh
Quite, it’s context dependent. Raw format for sensor data from a camera is a
good example. There are many ways image sensor data can be processed to
generate a generic image file such as a JPEG, and that involves selective
choices. I think it’s useful to make that distinction.

In general it just means the original source data as distinct from processed,
selected, edited or derived data.

------
astazangasta
Ooh, what, we're talking about epistemology and science, my favorite subject?
Why, yes, it turns out that all observation is an imposition onto the world,
starting with our lyin' eyes, which can only see in a narrow band and don't
even pick up UV. To some extent this is what "science" is - you create a new
kind of instrument which you purport produces a specific kind of observation,
you collect those observations and make claims about reality, summarily
ignoring all of the ways your instrument leaks, breaks and fogs up. This is
true whether the instrument is a Western blot (is your extraction reliable? Is
your antibody specific? Did your gel break?) or a microscope (is there out-of-
plane light in your image? photo-bleaching? Are your fluorophores working?) or
a thermometer in a bucket of sea-water (is your thermometer calibrated
correctly? Is the water artificially warmed by your boat?).

Et cetera. To some extent the ability to do science consists of saying, "Oh,
this is good enough to go on," ignoring the bumps, and being mostly right
about that.

~~~
yesenadam
Great comment. (Thanks HN, in the real world the only people I bump into who
know anything about philosophy of science are nutcases!)

Also, I'm fascinated by Delbrück's principle of limited sloppiness, where
being lax with the protocols (e.g. letting your nose drip into your petri
dish) is what leads to breakthroughs - be too careful and the unexpected
doesn't happen.

------
Waterluvian
In undergrad, when doing satellite and aerial data processing, we labelled
data "1st order", "2nd order", etc., according to how many stages of
processing it had undergone. This was, of course, somewhat subjective. There
are two main things I remember:

1. There's no "raw", because even the data collection device and platform will
do some form of processing (e.g. quantization).

2. There's no such thing as "ground truth", only "ground reference", because
truth is fuzzy.
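A minimal sketch of point 1 (the range and bit depth here are hypothetical):
even the act of storing a sensor reading quantizes it, so two physically
different values can collapse to the same recorded number.

```python
def quantize(reading: float, lo: float = 0.0, hi: float = 1.0, bits: int = 8) -> int:
    """Map a continuous reading onto 2**bits discrete levels,
    discarding any detail finer than the quantization step."""
    levels = 2 ** bits - 1
    clamped = min(max(reading, lo), hi)  # the device also clips out-of-range values
    return round((clamped - lo) / (hi - lo) * levels)

# Two distinct physical readings become indistinguishable once recorded:
print(quantize(0.50001))  # 128
print(quantize(0.50190))  # 128
```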

------
l0b0
This fits really well with a pet peeve - that "data" is used by various
frameworks to mean everything from a request body to file contents to a stream
of database records. It is _always_ possible to be more specific than "data".
Conversely it's really confusing to deal with two _different_ "data" in a
single piece of code.
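A tiny illustration of the peeve (the function and field names are
hypothetical): naming each "data" for what it actually is makes the code
self-documenting, and distinguishes the two sources being merged.

```python
import json

# Ambiguous: which "data" is which, and where did each come from?
def merge(data, other_data):
    return {**other_data, **data}

# Specific names say what each value is and where it originated:
def apply_request_overrides(request_body: str, stored_record: dict) -> dict:
    """Overlay fields from a JSON request body onto an existing record."""
    overrides = json.loads(request_body)
    return {**stored_record, **overrides}

print(apply_request_overrides('{"name": "Ada"}', {"id": 1, "name": "?"}))
```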

~~~
trustmath
It's hard to say which layer you're working with without knowing the whole
model. You can always treat data as the initial input to a computation, but 1.
that's not much different from looking at the code, and 2. everything
"initial" can probably be defined in terms of whatever computed it, unless you
want to draw some arbitrary wiggly line between perception and the act of
taking a record, and there are tons of cases where one defines the other, no
matter what sort of nomenclature you start with.

------
oftenwrong
>We tend to think of data as the raw material of evidence. Just as many
substances, like sugar or oil, are transformed from a raw state to a processed
state, data is subjected to a series of transformations before it can be put
to use.

The best raw data is "raw" as in food, but often it is "raw" as in sewage.

I was recently working with some open public data, and it was very much the
latter. For example, there were a few important columns that were filled as
free-form text by police officers, and contained all sorts of random entries
that made the column basically useless. Sometimes they were unclear
abbreviations or jargon, and sometimes they were what appeared to be keyboard-
mashing, and possibly some bad OCR. It is obvious that the data collection was
only designed for events to be reviewed individually, if at all, whereas in
the tech world we know to always design our data to be analysed collectively.

I prefer to work with data that is "raw" as in food.

On a personal project, I added a processing stage before my "raw" stage. I
decided to call it the "alive" stage.

~~~
rabidrat
Can you explain what the alive stage does? I have a similar concept and I
wonder if it should be formalized.

~~~
oftenwrong
The joke is that I should have used descriptive names. "Alive" is pre-
columnated text data (e.g. CSV rows). "Raw" is columnated text data (e.g. CSV
cells). This makes it easier to accommodate quirks.
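A minimal sketch of such a two-stage split (the column count and the
quarantine rule here are hypothetical): the "alive" stage keeps rows exactly
as received, and the "raw" stage splits them into cells, setting aside rows
that don't parse cleanly instead of losing them.

```python
import csv

messy = 'id,note\n1,"ok"\n2,bad,extra\n'  # a feed with one quirky row

# "Alive" stage: pre-columnated rows, untouched.
alive = messy.splitlines()

# "Raw" stage: columnated cells; quirky rows go to quarantine, not the bin.
raw, quarantine = [], []
for line in alive:
    cells = next(csv.reader([line]))
    (raw if len(cells) == 2 else quarantine).append(cells)

print(raw)         # [['id', 'note'], ['1', 'ok']]
print(quarantine)  # [['2', 'bad', 'extra']]
```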

------
bageldaughter
"Raw" data is that which contains the maximum currently feasible total
information content. I think it's surprising that feasibility enters into it,
but it does.

In analytics, "raw" often means "the unaltered contents of the application
database". This is hardly "unprocessed" or "natural", but to alter it ("clean"
it) might lose information which turns out to be important later. An analytics
person may express exasperation that the application database is so
idiosyncratic. If it were up to them, the "raw" data would be cleaner, or more
complete, or less noisy.

But the application database is the way it is because it would be infeasible
to drastically change it. Certainly nothing can be done for the historical
data that's already been collected. Perhaps in the future, data could be
collected in a cleaner or less noisy way, the schemas normalized or
redesigned, but any proposed changes must compete with the present inertia of
the system, and with the need to maintain existing functionality. That is, any
such changes must be feasible.

For physical experiments, "raw" data is that produced by sensors that were
feasible to construct and operate given available technology and resources at
the time. One might imagine that "rawer" data than that might be collected
some day in the future. :)

------
grawprog
I've always considered 'raw data' to be data before interpretation or
processing. When you collect data, you just collect everything you can,
whether or not it's necessarily useful. After you've processed it and
extracted the values useful to whatever study you're doing, the data becomes
'interpreted data'. Interpreted data typically leaves out anything unrelated
or not pertinent to the study, and includes values derived during processing:
means, medians, standard deviations, etc.

Usually raw data is available but not included in its entirety in reports.
The data included in reports will typically be data after processing. This has
just always been my experience with the way studies and such tend to be
carried out.

------
cfmcdonald
For an interesting book on this premise (that there is no data without a
model), see
[https://mitpress.mit.edu/books/vast-machine](https://mitpress.mit.edu/books/vast-machine).

------
barrkel
Eh. Raw data to me means maximally noisy, maximally biased by collector and
detector mechanisms, unfiltered, uncorrected and uncalibrated. Give me the raw
data along with your adjustments, rationale and feedback loops that keep your
adjustments in line.

------
mgamache
The article makes some good points, but why do I get the feeling like this is
a postmodern attack on science? If you dismiss the data, you can dismiss any
conclusions drawn from it.

------
rogerb
I have this pet peeve towards "unstructured data". There's no such thing -
only data you don't understand the structure of. :P

~~~
kirubakaran
People usually mean "inconsistently structured data" when they say
"unstructured data"

~~~
brokensegue
or free text?

~~~
lmkg
Free text is the traditional definition of unstructured data (along with audio
& video equivalents). But I agree with GP, that people usually _mean_
inconsistently-structured data when they use the term, whether it's accurate
or not.

~~~
ACow_Adonis
Free text being the definition of unstructured data is kind of funny if you
think about it (and I agree that amongst computer people that's how they think
about it), since the whole point of text is to structure data using alphabets,
syllabaries or logograms, such that it can be meaningfully transferred and
understood by multiple biological interpreters :)

------
yarrel
This critique is never underdetermined.

------
plg
Sorry to be that guy, but data are plural. So data _are_ never raw.

~~~
laurieg
In modern usage, 'data' is a mass noun and is used with 'is' very frequently.
To flatly declare this usage wrong is very prescriptive.

