

Intuition & Data-Driven Machine Learning - igrigorik
http://www.igvita.com/2011/04/20/intuition-data-driven-machine-learning/

======
phren0logy
Nice talk. I especially like the example of using gzip to test similarity.
Basically:

    
    
      size_a = length(compress(a))
      size_b = length(compress(b))
      size_c = length(compress(a + b))
      score  = (size_a + size_b) - size_c
    

If size_c is smaller than size_a + size_b, then a and b share some redundancy
that the compressor was able to exploit. The difference between the two
totals is a similarity score.

Pretty cool, and requires no domain-specific insights about a and b. Obviously
it's not a perfect solution, but I bet it will solve a lot of interesting
problems.
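
For concreteness, here's a minimal sketch of that test in Python, using zlib
as the compressor (the function names are just for illustration, and the
exact numbers will vary with the compressor and the inputs):

    
    
      import os
      import zlib
      
      def csize(data: bytes) -> int:
          # Compressed size of `data` under zlib at maximum effort.
          return len(zlib.compress(data, 9))
      
      def similarity(a: bytes, b: bytes) -> int:
          # Bytes saved by compressing a and b together rather than
          # separately: larger means more shared structure, near zero
          # means the compressor found nothing in common.
          return csize(a) + csize(b) - csize(a + b)
      
      english = b"the quick brown fox jumps over the lazy dog " * 50
      related = b"the quick brown fox naps beside the lazy dog " * 50
      noise   = os.urandom(len(english))
      
      print(similarity(english, related))  # clearly positive
      print(similarity(english, noise))    # near zero
    

A normalized variant of the same idea is the normalized compression distance,
NCD(a,b) = (C(ab) - min(C(a), C(b))) / max(C(a), C(b)), which keeps scores
comparable across inputs of very different sizes.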

~~~
gojomo
Though, keep in mind the limits of your compression algorithm when applying
this insight. GZIP uses only a 32KiB sliding window (its lookbehind
dictionary), so it works well for short documents:

    
    
      $ head -c16384 /dev/random > 16KiB
      $ cat 16KiB | gzip | wc -c
       16407
      $ cat 16KiB 16KiB | gzip | wc -c
       16628
      $ cat 16KiB 16KiB 16KiB | gzip | wc -c
       16790
    

But it's not so great for anything 32KiB or larger. Here it fails to detect
even exact content repetition, because each repeated copy begins a full 32KiB
behind the bytes being compressed:

    
    
      $ head -c32768 /dev/random > 32KiB
      $ cat 32KiB | gzip | wc -c
       32791
      $ cat 32KiB 32KiB | gzip | wc -c
       65564
      $ cat 32KiB 32KiB 32KiB | gzip | wc -c
       98337
    

BZIP2 uses (by default, and at most) a 900KB block size, so it does better as
a comparator on larger files, but it likewise fails to find exact duplication
once the copies fall into separate 900KB blocks:

    
    
      $ cat 32KiB | bzip2 | wc -c
       33266
      $ cat 32KiB 32KiB | bzip2 | wc -c
       41263
      $ cat 32KiB 32KiB 32KiB | bzip2 | wc -c
       41396
      $ cat 32KiB 32KiB 32KiB 32KiB | bzip2 | wc -c
       45426
    
      $ head -c450000 /dev/random > 450KB
      $ cat 450KB | bzip2 | wc -c 
      452422
      $ cat 450KB 450KB | bzip2 | wc -c 
      563107
      $ cat 450KB 450KB 450KB | bzip2 | wc -c 
     1015533
      $ cat 450KB 450KB 450KB 450KB | bzip2 | wc -c
    
      $ head -c900000 /dev/random > 900kB
      $ cat 900kB | bzip2 | wc -c
      903828
      $ cat 900kB 900kB | bzip2 | wc -c
     1807770
      $ cat 900kB 900kB 900kB | bzip2 | wc -c
     2711478
    

(I'm kind of surprised the BZIP2 results jump so much from the 1x32 to the
2x32 trial but then barely grow for the 3x32 and 4x32 trials, and that the
2x450 to 3x450 size jumps so much after the 1x450 to 2x450 jump was modest.
The 3x450 jump is presumably the stream spilling past the first 900KB block,
leaving the third copy in a fresh block that shares no context with the
earlier ones. Either way, it demonstrates that the specifics of the
compression algorithm matter a lot, so you might want to be careful using
this technique as a scalar magnitude-of-similarity measure even within the
dictionary/block size.)
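
One more data point, offered as a sketch rather than a measured result: a
compressor with a larger window shouldn't hit this wall nearly as early.
Python's lzma module, for example, defaults to a multi-megabyte dictionary,
so it should catch the 900KB-scale duplication that bzip2 misses above:

    
    
      import lzma
      import os
      
      chunk = os.urandom(900_000)           # incompressible on its own
      print(len(lzma.compress(chunk)))      # ~900k: nothing to exploit
      print(len(lzma.compress(chunk * 2)))  # still ~900k: the exact repeat
                                            # fits inside lzma's dictionary
    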

------
icandoitbetter
This is such a beautiful example.

------
helwr
Great talk. I'd only add that CS schools don't teach ML algorithms; they
teach theoretical models. Just look at Bishop's PRML book, the de facto
standard textbook these days, or Andrew Ng's lectures on YouTube. While they
present a good overview of learning-theory concepts, the actual algorithms
are largely left for students to devise on their own (or to pull from
existing ML libraries used as black boxes). The distance between academia and
industry is even larger than what Ilya describes.

