

Data trumps everything - outside1234
http://timpark.io/data-trumps-everything/

======
maxharris
Not everything - "data" does not trump philosophy. That we ought to go by the
facts (i.e. data) and that we are capable of discovering the facts are
examples of philosophical ideas.

~~~
outside1234
You got me there! The larger trend I'm trying to point out, though, is the
steady move away from intuition-based companies toward companies driven
almost entirely by data.

The shift that we are seeing in lean startups reflects this - it is not a
guru-based process for starting a company but a data-driven one. Obviously,
execution counts as well, but given two otherwise equal companies, one with a
guru-based approach (a la Steve Jobs) and one with a data-driven process, I
believe the inflection point has been reached where the data-driven company
wins consistently.

~~~
stfu
Wasn't that the approach of some Cybernetic idealists in the 50s?

That the world would be such a better place, if we just could ban intuition
and let everything get run by machines?

Even if that were hypothetically possible, I would neither want to work
nor live in a world like that. Predictability kills any excitement.

~~~
halvsjur
Isn't that a tad black and white?

------
saosebastiao
I'm a pretty big fan of data...but this is too extremist. There are plenty of
things for which data does nothing and more data does more of nothing. Data
only improves your decisions if the past can be helpful in predicting the
future, and it only improves your knowledge to the extent that truth is
immutable and the data is completely unbiased. And this is all before we get
into practical problems where data is messy, unstructured, inaccurate, and
still requires intelligent brains to interpret and understand in all of its
dimensionality.

Sometimes you are just better off with a solid theoretical abstraction, or a
good intuition. If you are smart, you will understand that they make great
priors that can get better with data. If you take the dogmatic route and only
trust data, you will constantly feel like someone else had a head start on
you.

------
gpcz
If you're looking for more examples and information about using simple
algorithms with lots of data, I recommend Peter Norvig's "The Unreasonable
Effectiveness of Data": <https://www.youtube.com/watch?v=yvDCzhbjYWs>

------
msellout
> "But the vast majority of these are changes that frankly they don’t
> completely understand"

Sounds like over-fit to me. I much prefer interpretable and plausible models
to magically accurate predictions.

------
disco565
Statistics is the new plastics.

~~~
outside2344
Exactly - if I could make any change to my engineering education, it would be
to have taken even more statistics.

~~~
chubot
Really? Can you name some examples of where you've needed it in your work?
(honest question)

I am a programmer who works with a bunch of statisticians, doing "big data"
stuff. What I've observed is that most of them don't really spend very much
time doing statistics. They spend all their time finding, collecting, and
massaging data. That generally involves a lot of programming. Once you get the
data, the conclusions are fairly obvious without any statistics. Just make
some plots and the glaring order-of-magnitude deficiencies jump out.

I also wanted to learn more statistics... but basically with software, you are
overflowing with data. The challenge in science usually is to gather data. In
software the challenge is the opposite -- you have so much data and you need
to make sense of it. To be concrete I'm talking about stuff like logs from web
servers and various other systems.

So I learned a lot about sampling algorithms to cut down data, as well as
various streaming algorithms. But I haven't actually learned that much about
statistics. So I wonder if I am missing something.

~~~
alexchamberlain
It sounds like the sampling and streaming algorithms you know are statistical
algorithms. Have you got any good references for streaming algorithms?

~~~
chubot
They're not really statistical algorithms -- you certainly wouldn't learn them
in a statistics class.

I actually looked online for some references on streaming algorithms ... but
somewhat surprisingly, I couldn't find anything really. I realized a lot of
this knowledge has been hard-earned, I guess that is good :) But there really
should be a reference.

Definitely look up "reservoir sampling", which gives you a fixed size sample
of an infinite length stream. This algorithm has come up over and over for me,
and I've implemented it in multiple contexts. There is a way to do it in
MapReduce which is very useful.
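For reference, the basic variant (Knuth's Algorithm R) is short enough to sketch in a few lines of Python. This is the plain single-machine version, not the MapReduce formulation:

```python
import random

def reservoir_sample(stream, k, rng=random):
    """Return a uniform random sample of k items from a stream of
    unknown (possibly unbounded) length, using O(k) memory."""
    reservoir = []
    for i, item in enumerate(stream):
        if i < k:
            # Fill the reservoir with the first k items.
            reservoir.append(item)
        else:
            # Item i survives with probability k / (i + 1):
            # pick a random slot in [0, i]; if it lands inside
            # the reservoir, replace that slot.
            j = rng.randint(0, i)
            if j < k:
                reservoir[j] = item
    return reservoir
```

Each item ends up in the final sample with probability exactly k/n, which you can verify by induction on the stream length; that's what makes it work when n isn't known in advance.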

Probably the most condensed version I can think of is quite hidden: see the
open source Sawzall code:

[http://code.google.com/p/szl/source/browse/#svn%2Ftrunk%2Fsr...](http://code.google.com/p/szl/source/browse/#svn%2Ftrunk%2Fsrc%2Femitters)

All those functions aggregate some property of a stream. Someone (maybe me if
I finish the 30 projects I've started...) really should write up some real
documentation about all those algorithms, because they're not only fun, but
useful.

A lot of them are (necessarily) approximations. You don't learn that many
approximate algorithms in a traditional CS class. None of this will be in any
stats class for sure.

------
ucee054
Example of data, August 2007:

Oh no, there's no risk to these mortgage backed securities at all! They're
priced to be AAA by everyone on Wall Street!

September 2007:

Oh shit...

