
Math versus Dirty Data - furcyd
https://jeremykun.com/2019/06/08/math-versus-data/
======
anbop
Is it the new normal to vent about your job on a public blog post, talking
about how it's not very exciting and how it involves mostly "poop-smithing,"
and how you'd be thinking of quitting if you didn't have other stuff you were
doing, and for the company to be totally OK with that?

~~~
baron_harkonnen
A long time ago, HN used to be a place where people had a general distrust of
large corporations, and it was much more impressive to be working for an
ambitious startup than to be highly paid at a large company just grinding out
the status quo.

I'm really happy to see posts like this here. Jeremy has done lots of great
work, so it's somewhat refreshing to see that even someone like him
struggles at times at a place like Google. It's also important because it shows
that FAANG isn't all it's cracked up to be, and that it's a completely
legitimate career path to remain at small startups, maybe getting paid less,
but having more fun earning it.

I've worked at both big and small companies, and this post definitely resonates
with me and certainly makes it easier to go into work knowing that I'm not the
crazy one.

~~~
islon
> It's also important because it show that FAANG isn't all it's cracked up to
> be and so it is a completely legitimate career path to remain in small
> startups

I don't think there are just these two options. There's the other 90% in between.

------
ridaj
It's funny how Google is perceived by the public as an all-knowing organism,
gobbling up petabytes of public and private data to fuel a superior
intelligence... when actually it can't even get clean data about its own
internal operations...

~~~
TheSpiceIsLife
Do you believe the two scenarios are mutually exclusive?

I've definitely seen people be extremely disorganised in some aspects of life
but sufficiently, moderately, or even fairly successful in others.

I suspect companies are also capable of exhibiting similar behaviours.

------
tomkat0789
I dealt with this issue in my PhD research on data mining, and it motivated me
to search for a new line of work. The benefits of data-based products never
justify the customers changing everything just to make life easier for the
data nerd in the corner (in my case, at least). What made them happiest was
when I assembled some basic tools so they could mine their data on their own -
"teaching a man to fish, ruining a business opportunity".

I think dirty data are like the refs in a football game. Nobody comes to see
them, but they'll be part of the game until you have perfect players.

------
rxm
In most organizations I have interacted with getting data with known sampling
is always a problem. “The temperature sensor sends a measurement every 5
minutes.” Except, I learned after a lot of debugging, when the air-conditioner
gets maintained and the readings become a fixed value.

It’s not the organization’s fault. Discovering and maintaining a known sampling
for the data is part of the process. I consider it a win when I can get a team
to accept that the data process will need to be debugged just like code.
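The "debug the data process like code" idea can be made concrete with a small sanity check. This is a hypothetical helper (the function and its thresholds are my own, not from the comment): it flags gaps in the "every 5 minutes" cadence and runs of identical readings, like the air-conditioner flatline described above.

```python
from datetime import datetime, timedelta

def audit_readings(readings, expected=timedelta(minutes=5), max_run=3):
    """readings: list of (timestamp, value) pairs, sorted by time.

    Returns (gaps, stuck): gaps between samples longer than `expected`,
    and runs of `max_run`+ identical values (a likely stuck sensor),
    each reported as (start_timestamp, run_length).
    """
    gaps, stuck = [], []
    run_start, run_len = 0, 1
    for i in range(1, len(readings)):
        t_prev, v_prev = readings[i - 1]
        t_cur, v_cur = readings[i]
        if t_cur - t_prev > expected:
            gaps.append((t_prev, t_cur))
        if v_cur == v_prev:
            run_len += 1
        else:
            if run_len >= max_run:
                stuck.append((readings[run_start][0], run_len))
            run_start, run_len = i, 1
    if run_len >= max_run:  # close out a run that reaches the end
        stuck.append((readings[run_start][0], run_len))
    return gaps, stuck
```

Running a check like this on ingest is one way to turn the implicit sampling assumption into an explicit, testable one.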

------
cjohnson318
I just came back from an oil and gas conference, and this dog running after
missing/wrong data is painfully accurate. Datasets are often partially or
completely suspect. Two geologists might have contradictory opinions regarding
the interpretation of logging data. Even when you have good data, it might not
be useful six feet into a formation.

------
avmich
> I don’t have data-intensive applications or problems of scale, but rather
> policy-intensive applications.

As I understand it, this "policy-intensive" problem means a lot of changing
requests from the customers of the system. In other words, customers don't
fully know what they need and produce a stream of requests. This stream of
requests may converge on some stable "global" requirements, or, alternatively,
may represent a "moving target" (not converge).

Additionally, some of those requests are caused by different (supposedly
better) understanding of the nature of data - the object of the system. That
is, with evolution of the system customers (with the help of developers)
understand more and more specific details about the data - missing parts,
ambiguous parts, alternative sets of attributes etc.

The most promising approach for such problems, so far, is to organize the
system as a set of independent operations that are as composable as possible.
When another request comes from a customer, or another detail about the data
becomes known, a system built from such composable components better allows
incremental modification to handle the change.

A good set of operations sometimes develops only with time. This approach
requires constant reflection on what a particular change means for the existing
process, uncovering assumptions and making them explicit and changeable...
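A minimal sketch of the composable-operations idea: each policy request becomes one small, independent transform over a record, and the pipeline is just their composition, so a new request adds a step rather than rewriting the system. The transform names here are illustrative, not from the comment.

```python
from functools import reduce

def compose(*steps):
    """Compose record transforms left to right into one pipeline."""
    return lambda record: reduce(lambda r, step: step(r), steps, record)

def normalize_name(r):
    # Policy request #1: trim whitespace and title-case the name.
    return {**r, "name": r["name"].strip().title()}

def fill_missing_country(r):
    # Policy request #2: default a missing country to "unknown".
    return {**r, "country": r.get("country") or "unknown"}

clean = compose(normalize_name, fill_missing_country)
```

When the next customer request arrives, it ideally becomes one more independent step in `compose(...)` rather than an edit to existing steps.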

------
odomojuli
I now want to start a math cafe. Chalkboards. Coffee. Hang pieces from local
generative artists. I'll stop short of making puns about pie. But it'd be
nice. I already know what to name it! Satz. (German for theorem / coffee
residue).

"A mathematician is a machine for turning coffee into theorems."

------
breck
If anyone is interested, we are working on a solution to the dirty data
problem at [http://treenotation.org](http://treenotation.org).

~~~
JD557
I'm sorry to sound pessimistic, but I don't think there's any _technical_
solution to dirty data.

The problem is usually not "this data was sent as JSON without any schema
and with syntax errors"; it's "this Avro file has a completely useless schema
(e.g. everything is typed as string|null), and there are multiple enumerations
where the same value is encoded as 3 different strings (e.g. yes, y, true)".

~~~
breck
If we make it 10x+ easier to define schemas and reuse existing ones - by making
authoring, concatenating, extending, sharing, and discovering such schemas
easier; providing great translation mechanisms to/from existing formats like
SQL, JSON, CSV, and XML; and providing new automatic schema generation and data
fixing from better deep learning models... I think we can get there.

We are working on all of that.

