
Data is good, code is a liability - Anon84
http://glinden.blogspot.com/2008/11/data-is-good-code-is-liability.html
======
snprbob86
I worked on Google's statistical machine translation system during my
internship. There, I learned that data really is king. The Google Translator
team spends as much effort collecting data as it does improving its
algorithms.

The 2008 NIST results [1] show that Google's translator swept every category
with unconstrained training sets. That is, when Google was allowed to use all
of the data that they collected, they smoked the competition. When the
training sets were constrained to a common set for all competitors, better
algorithms prevailed. You can be sure that the very talented team at Google
will be improving their algorithms to ensure that never happens again. But you
can also be sure that competitors will be collecting even more data to counter
Google's victories.

[1]
[http://www.nist.gov/speech/tests/mt/2008/doc/mt08_official_r...](http://www.nist.gov/speech/tests/mt/2008/doc/mt08_official_results_v0.html)

~~~
liuliu
But who remembers how many web pages Google indexed in 1998, and how many
AltaVista indexed? Data is important, for spam filters, translation, etc.,
but it is far from "king". We fancy that data matters much more than
algorithms only because we now have some really good statistical learning
methods.

------
zmimon
I see this at a micro level frequently when coding. Very often code that is a
complex bunch of if / else statements is dramatically simplified by turning it
into a map / dictionary with pointers to either data or functions to handle
that type of data (object oriented polymorphism being an instance of this).

There are also interesting parallels with REST vs RPC. You can create
a rich API of function calls for accessing and manipulating data, but it's
nearly always less flexible than just exposing the data and letting people
manipulate it directly.
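As a toy illustration of that flexibility gap (all names here are invented; this is a sketch, not a claim about any real API): with an RPC-style object, every new kind of access needs a new method, while plain data can be queried and manipulated with generic tools.

```python
# Hypothetical sketch (names invented): the same account, two styles.

# RPC style: a rich API of function calls; each new need means a new method.
class AccountRPC:
    def __init__(self):
        self._data = {"balance": 0, "owner": "alice"}

    def get_balance(self):
        return self._data["balance"]

    def set_owner(self, name):
        self._data["owner"] = name
    # ...one method per operation, forever.

# Data style: expose the representation and manipulate it directly.
account = {"balance": 0, "owner": "alice"}
account["owner"] = "bob"       # no new API surface needed
fields = sorted(account)       # generic tools work on plain data
```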

I think the tendency to favor algorithms when it might otherwise not be wise
to do so comes from how our minds work: we remember things primarily in terms
of stories, scenarios, sequences of events. This causes us to interpret the
world in terms of behavior as if behavior is the primary construct on which
the universe is modeled. But of course behavior is not primary, data is
primary, things are primary - behavior is just a fiction we impose on them.
This often leads our instincts in the wrong direction.

~~~
jamongkad
Hmmm replacing if/else statements with a map/dictionary with pointers to
either data or functions. A little off topic here but how do you propose to do
this? Assuming we know what polymorphism is. Your map/dictionary style is
quite interesting.

~~~
etal
For languages with first-class functions, it looks like this (Python 3):

    # Algorithms
    def double(x): return x * 2
    def square(x): return x * x
    def fact(x): return x * fact(x - 1) if x > 1 else 1

    # Data
    choices = {'A': double, 'B': square, 'C': fact}

    # I/O
    choice = input('Choose A, B or C: ')
    x = int(input('Enter a number: '))
    if choice in choices:
        print(choices[choice](x))
    else:
        print('Initiating self-destruct sequence.')


It's actually similar to how a switch block works, if each case in the switch
statement just calls a function or evaluates one expression.

Also worthwhile: Instead of functions, let the dictionary values be lists of
arguments for another (multi-argument) function. Then the lookup is like
choosing from a set of possible configurations for that function. A little
redundancy is OK, since the table is so easy to read and edit.
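A minimal sketch of that configurations-as-data idea, with a made-up multi-argument function (all names here are invented for illustration):

```python
# One shared multi-argument function; the dictionary holds argument lists.
def draw_bar(width, fill_char, label):
    return label + ": " + fill_char * width

# Data: each entry is a "configuration" for draw_bar. A little
# redundancy, but the table is easy to read and edit.
styles = {
    "short": [3, "#", "S"],
    "long":  [8, "#", "L"],
    "dots":  [5, ".", "D"],
}

def render(name):
    return draw_bar(*styles[name])   # lookup picks a configuration

# render("short")  ->  "S: ###"
```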

------
fauigerzigerk
I know it's a popular opinion nowadays, but here's what keeps me from adopting
this view (huge amounts of data over algorithms) wholesale: Humans make smart
decisions on very little data. How many faces does a child have to "process"
in order to learn to recognise faces? Not many. How much does a person have to
read in order to learn correct spelling? Not the entire Google index, I
suppose.

Humans work neither on simple deterministic rules nor on huge amounts of data.
It's something else. Some very smart "algorithm" that we haven't found yet
(Bayes nets don't get there either but they look promising).

If there's a way for humans to be smart without much data there must be a way
for machines to do the same. That is unless you believe in some kind of
spirit/soul/god cult and I don't.

~~~
pchristensen
Humans are superior to machines in several ways:

- we get _tons_ of data, just not all textual. We have visual (~30fps in much
bigger than HD resolution all day long), audio (again, better than CD quality
all day long), smell, taste, and touch, not to mention internal senses
(balance, pain, muscular feedback, etc). By the time a baby is 6 months old,
she's seen and processed a lot of data. Don't know if it's more than Google's
18B pages, but it's a lot.

- we get _correlated_ data. Google has to use a ton of pages for language because it only gets usage, not context. Much (most?) of the meaning in language comes from context, but using text you only get the context that's explicitly stated. Speech is so economical because humans get to factor in the speaker, the relationship with the speaker, body language, tone of voice, location, recent events, historical events, shared experiences, etc, etc, etc. Humans have a million ways to evaluate everything they read or hear, and without that, you need a ton of text to make sure you cover those situations.

- we have a _mental model_. Everything we do or learn adds to the model we have of the world, either by explicit facts (a can of Coke has 160 calories) or by relative frequencies (there are no purple cows but a lot of brown ones). My model of automobile engines is very crude and inaccurate while my model of programming is very good. Also, because I have (or can build) a model, I have a way to evaluate new data. Does this add anything to a part of my model (pg's essays did this for me)? Does it confirm a part of the model that wasn't sure (more experimental data)? Does it contradict a weakly held belief? Does it contradict a strongly held belief? Is it internally consistent? Is the source trustworthy?

This mental model might just be a bunch of statistically relevant
correlations, but that sounds like neurons with positive or negative
attractions of varying strength. Kind of like a brain. I believe Jeff Hawkins
is on to something (see On Intelligence
<http://www.amazon.com/o/asin/0805078533/pchristensen-20>), but there needs to
be correlated data (like vision/hearing/touch are correlated) and the ability
to evaluate data sources.

I agree that if humans can do it, machines can do it, but I think you're
vastly underestimating the amount and quality of data humans get.

~~~
evgen
Don't want to be pedantic here, but your info on our visual bandwidth is a bit
out of date. We actually only process about 10M/sec of visual data. Your brain
does a very good job of fooling your conscious self, but what you are
perceiving as HD-quality resolution is actually only gathered in the narrow
cone of your current focal point. The rest of what you "see" is of much lower
bandwidth and mostly a mental trick. We also don't store very much of this
sensory data for later processing.

~~~
pchristensen
Yeah, I knew all that but my comment was already pretty long. Still, 10M/sec *
every waking hour of life is still a lot of data.

------
Retric
But, data creates algorithms. For some sets of problems, using machine
learning / AI works well. But it's important to understand what limitations
your data creates, in the same way that you need to understand what bugs exist
in your code.

------
felideon
Sounds like a paradox in Lisp since code is data.

Does this mean all Lisp code is good? :)

~~~
sridharvembu
I believe we need a "converse of Lisp" - in Lisp code is data, I believe what
we need is the notion "data replaces (most) code". That leads to the question,
what really is data, and I believe Codd supplies the best answer to that
question. One of the truly original ideas in Computer Science that post-dates
Lisp (and is not anticipated by Lisp) is Codd's relational model of data,
which is not to be confused with relational databases used for storage.

Note that Codd's model is not Turing-complete, while all but the most trivial
definitions of code lead to Turing-complete systems, hence the parenthetical
most in my "data replaces (most) code". "Data is easy, code is hard" could be
another way to state it.
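One way to picture "data replaces (most) code" (a hypothetical sketch, not anything Codd-specific; all names invented): move branching logic into rows of data interpreted by a tiny generic evaluator, so that new behavior is added by editing data rather than writing code.

```python
# Hypothetical sketch: validation logic as plain data rows.
# Each row: (field, predicate name, argument). A small generic
# interpreter applies them; adding a rule means adding a row.
PREDICATES = {
    "min_len": lambda value, arg: len(value) >= arg,
    "max_len": lambda value, arg: len(value) <= arg,
    "one_of":  lambda value, arg: value in arg,
}

RULES = [
    ("username", "min_len", 3),
    ("username", "max_len", 12),
    ("role",     "one_of",  {"admin", "user"}),
]

def validate(record):
    """Return (field, rule) pairs that failed for this record."""
    return [(f, p) for (f, p, a) in RULES
            if not PREDICATES[p](record[f], a)]
```

The interpreter itself stays tiny and Turing-complete code; the bulk of the system's behavior lives in the (much easier) data.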

We have experimented with such ideas, and we can report that they do
significantly improve clarity and therefore productivity.

As an aside to a pg essay, I believe clarity is _not_ the same as succinctness
and as a corollary, succinctness does not imply productivity except in the
somewhat trivial sense of ease of typing.

~~~
ken
It almost sounds like you're describing Subtext: <http://subtextual.org/>

~~~
magoghm
Subtext looks wonderful. Thanks for the link!

