

Predictive learning vs. representation learning - cottonseed
https://hips.seas.harvard.edu/blog/2013/02/04/predictive-learning-vs-representation-learning/

======
eli_gottlieb
I read all the way into the next post, "What the hell is representation?". I
don't really see why the question is all that confusing (first Red Alert for
ignorance and bullshit goes here). My first halfway-informed guess is: a
representation is a three-tuple of a compression function, a resulting data
structure, and a decompression function that, taken all together, capture at
least some of the causal/computational structure of the process generating the
data.

Of course, this extreme generality is what makes representation learning Very
Hard, but also very powerful when you can manage to get it working at all.

Compression function: because we choose to assume (without ever being able to
prove, see Chaitin) that our data is _not_ algorithmically random.
Conveniently, most real-world data isn't algorithmically random, so learning
about the underlying process _must_ result in a description that can compress
the data more efficiently than just writing down the sample-set.
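
A rough way to see the compressibility point, sketched in Python. This uses
zlib as a stand-in compressor, which is an assumption of the example: a
practical compressor only gives an upper bound on description length and says
nothing definitive about algorithmic randomness. The byte pattern and the
seeded generator are made up for illustration.

```python
import random
import zlib

N = 10_000

# Structured data: produced by a short rule (a repeating 16-byte pattern).
structured = bytes(i % 16 for i in range(N))

# Pseudo-random bytes stand in for "algorithmically random" data. They come
# from a seeded generator, so strictly speaking they also have a short
# description, but zlib has no way of finding it.
rng = random.Random(0)
noisy = bytes(rng.getrandbits(8) for _ in range(N))

structured_size = len(zlib.compress(structured, 9))
noisy_size = len(zlib.compress(noisy, 9))

print(f"structured: {N} -> {structured_size} bytes")
print(f"noisy:      {N} -> {noisy_size} bytes")
```

The structured bytes shrink to a small fraction of their size, while the
pseudo-random bytes do not shrink in any meaningful way.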

Data structure: the actual representation, which (ideally) explains (most of)
the data on its own, plus or minus some kind of noise.

Decompression function: learning would be useless if we couldn't perform _any_
kind of inference or prediction. We need a way to take the internal
representation, plug in some parameter values (either ones we've learned from
data or deliberately counterfactual ones), and then make a prediction about
further samples. That prediction is what we actually act on to perform a
useful task.
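
A minimal sketch of the three pieces together, assuming the simplest possible
case: samples generated by a straight line. The names `Line`, `compress` and
`decompress` are hypothetical, chosen to mirror the terms above, and
least-squares fitting is just one illustrative choice of compression function.

```python
from typing import NamedTuple, Sequence, Tuple


class Line(NamedTuple):
    """The data structure: two numbers summarising the whole sample set."""
    slope: float
    intercept: float


def compress(samples: Sequence[Tuple[float, float]]) -> Line:
    """Compression function: ordinary least-squares fit of y = slope*x + intercept."""
    n = len(samples)
    mean_x = sum(x for x, _ in samples) / n
    mean_y = sum(y for _, y in samples) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in samples)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in samples)
    slope = sxy / sxx
    return Line(slope, mean_y - slope * mean_x)


def decompress(rep: Line, x: float) -> float:
    """Decompression function: predict y for an x that may never have been seen."""
    return rep.slope * x + rep.intercept


# Noise-free toy data from y = 2x + 1. Real data would carry noise that the
# representation deliberately leaves out.
samples = [(float(x), 2.0 * x + 1.0) for x in range(10)]
rep = compress(samples)

prediction = decompress(rep, 100.0)                          # learned parameters
counterfactual = decompress(rep._replace(slope=0.0), 100.0)  # "what if slope were 0?"
print(rep, prediction, counterfactual)
```

Ten samples collapse into two parameters, and the same `decompress` call
serves both ordinary prediction and a counterfactual query.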

So it's a question, I think, of whether we're doing information-theoretic
learning, or _algorithmic_ information-theoretic learning. Mind, this could
all be bullshit, as I'm a total novice at this stuff.

~~~
tristanz
Everybody brings their own terminology to these issues, given their
background. This is all just statistics. While I think the intuition of your
terminology makes some sense, it's probably the wrong frame. It's important
for people to use the same terminology, with a clear mathematical definition.
Unfortunately there's still a big gap between the statistics and machine
learning communities.

Everything in "learning" follows from a good parameterization of p(y, x), the
joint distribution of the data, whether observed or not. If you have that, you
can get everything else.
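
To make "get everything else" concrete, here is a small sketch with a made-up
joint table over two binary variables. The probabilities and function names
are assumptions for illustration only. Note that the table is keyed as (x, y).

```python
from fractions import Fraction

# Hypothetical joint distribution over binary x and y, keyed as (x, y).
joint = {
    (0, 0): Fraction(3, 10),
    (0, 1): Fraction(1, 10),
    (1, 0): Fraction(2, 10),
    (1, 1): Fraction(4, 10),
}


def marginal_x(x):
    """p(x): sum the joint over every value of y."""
    return sum(p for (xi, _), p in joint.items() if xi == x)


def marginal_y(y):
    """p(y): sum the joint over every value of x."""
    return sum(p for (_, yi), p in joint.items() if yi == y)


def y_given_x(y, x):
    """p(y | x) = p(x, y) / p(x): the usual prediction direction."""
    return joint[(x, y)] / marginal_x(x)


def x_given_y(x, y):
    """p(x | y) = p(x, y) / p(y): the same table, queried the other way."""
    return joint[(x, y)] / marginal_y(y)


print(marginal_x(0), y_given_x(1, 0), x_given_y(1, 1))
```

Both conditionals come out of the one table, so nothing in the joint itself
privileges predicting y from x over predicting x from y.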

The core idea of representation learning is important, even though it's
obvious in hindsight. When doing statistics, many people assume the parameters
governing the conditional distribution p(y | x) are distinct from those
governing p(x). So if you're just interested in predicting y, you don't need
to model p(x) to get the conditional distribution p(y | x). Representation
learning suggests that the parameters of these distributions are coupled. It's
saying that if you understand the structure of the world -- you have a good
representation of p(x) -- you can make better predictions p(y | x) with less
labeled data. This makes a lot of sense because the distinction between x and
y is totally arbitrary.
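
One way to write the coupling argument down, as a sketch rather than a
definitive treatment. The symbols here are assumptions introduced for the
example: theta parameterizes p(y | x), phi parameterizes p(x), (X, Y) is the
labeled data, and X_u is extra unlabeled data.

```latex
% Factorized model with separate parameters for the two pieces:
%   p(y, x \mid \theta, \phi) = p(y \mid x, \theta)\, p(x \mid \phi)

% Joint posterior given labeled data (X, Y) and unlabeled data X_u:
\[
p(\theta, \phi \mid X, Y, X_u) \;\propto\;
  p(Y \mid X, \theta)\; p(X, X_u \mid \phi)\; p(\theta, \phi)
\]

% Decoupled case: if the prior factorizes, p(\theta, \phi) = p(\theta)\, p(\phi),
% then integrating out \phi leaves
\[
p(\theta \mid X, Y, X_u) \;\propto\; p(Y \mid X, \theta)\; p(\theta),
\]
% so the unlabeled inputs X_u carry no information about \theta.

% Coupled case: if the prior does not factorize (or parameters are shared),
\[
p(\theta \mid X, Y, X_u) \;\propto\;
  p(Y \mid X, \theta) \int p(X, X_u \mid \phi)\; p(\theta, \phi)\, d\phi,
\]
% and the integral now depends on both \theta and X_u, so knowing more about
% p(x) sharpens the predictions p(y \mid x).
```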

