
Universal Prediction - lainon
http://philsci-archive.pitt.edu/14486/
======
voidmain
The argument of this very long and technical paper seems to boil down to this:

Solomonoff induction is (in the limit) as good as any computable prediction
method, which sounds pretty "universal". But it is not itself a computable
prediction method (it's just semi-computable). And if you are satisfied with
semi-computability in a predictor, then you should demand instead that a
"universal" predictor always be as good as any _semi-_ computable predictor,
and it can't meet that more demanding requirement. And so on in infinite
regress.

I find this kind of underwhelming. Solomonoff induction represents, in
practice, an unattainable ideal that prediction systems can aspire to
approximate. This argument shows that if you somehow could reach the ideal,
there is an infinite regress of super-ideals beyond it. So what?

~~~
cousin_it
I think you did a better job than all other commenters here (who seem to be
mostly free-associating on the title and abstract), but your comment still
doesn't quite capture the contribution of the paper or the reasonable
takeaways from it.

The key is section 4.3.3, which draws a distinction between measures and
predictors. The Solomonoff lower-semicomputable semimeasure is universal for
all lower-semicomputable semimeasures, not just for some narrower class. But
if we forget the idea of lower-semicomputable semimeasures and treat
Solomonoff as a member of a wider class - limit-computable predictors - it
turns out to be not universal for that class. In fact, moving from measures to
predictors is so lossy that no predictor can be universal for its own class,
by a simple diagonal argument. To me that means Solomonoff's approach with
measures is more nuanced and informative than the approach with predictors.
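
To make the diagonal argument concrete, here's a toy sketch (my own
illustration, not the paper's construction; I'm modeling a predictor as a
function from a bit prefix to P(next bit = 1), and `laplace` is just an
example predictor):

```python
def diagonal_sequence(predictor, n=50):
    """Build the sequence that always takes the bit the given
    predictor considers *less* likely at each step."""
    prefix = []
    for _ in range(n):
        p_one = predictor(prefix)           # predictor's P(next bit = 1)
        prefix.append(0 if p_one >= 0.5 else 1)
    return prefix

def laplace(prefix):
    """Example predictor: Laplace's rule of succession."""
    return (sum(prefix) + 1) / (len(prefix) + 2)

# On its own diagonal sequence, the predictor never assigns more than
# 1/2 to the bit that actually comes next.
seq = diagonal_sequence(laplace)
```

The same construction works against any predictor in a class rich enough to
contain its own diagonalization, which is why no predictor can be universal
for its own class.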

~~~
voidmain
In my admittedly very cursory skim of the paper, I gave it the benefit of the
doubt on this point, because this seemed convincing:

"for the purpose of prediction, we are, of course, not so much interested in
the probabilities issued by the measure functions, as in the conditional
probabilities that give the corresponding predictors’ outputs."

Do you disagree?

(Also, my original summary is wrong at least in that I should have said
limit-computable where I said semi-computable!)

~~~
cousin_it
I guess the question is whether we want universality over all limit-computable
predictors, or only those derived from lower-semicomputable semimeasures.
Since the latter class contains its own universal element and also every
computable predictor, maybe it's okay? But yeah, I can see how you can have a
different takeaway. Sorry for being a bit harsh in the previous comment.

~~~
voidmain
No need to apologize, I thought your comment was perfectly civil. It can be
hard to read tone on the internet!

------
Xcelerate
I'm honestly surprised there's not more research / funding on universal
prediction. It seems like a top-down approach to true AI (start from theory
and work backwards to something feasible/practical), as opposed to the current
engineering mindset (start with something that works, i.e., neural networks,
and try to do more with them).

The ideas related to universal prediction were initially developed in the
1960s by Chaitin, Solomonoff, and Kolmogorov, but little has been done with
the concept since, partly due to the fact that computers weren't very powerful
back then and partly due to the fact that a lot of their theorems ran up
against barriers of uncomputability.

Marcus Hutter and Shane Legg come to mind as people who have done some work in
this area, but I'm not sure if Shane has continued investigating this approach
separately from his work with neural networks at DeepMind.

~~~
acutesoftware
I think there are 2 main problems with working on universal prediction:

1. Getting all the information to the program in a usable, computational
format.

2. Defining the requirements (an old problem, but a good one). How would you
actually design an algorithm to predict something completely unknown? ML works
well here, as it is fairly narrow and 'easily' definable.

It certainly is the most interesting problem though, and I'd love to see more
research attempted.

~~~
QML
How would you make an algorithm to predict something unknown?

This is essentially the field of sequential decision making, which encompasses
everything from reinforcement learning to multi-armed bandits. [1]

As an example, let’s pretend we have the task of predicting labels over a
size-2 alphabet {0, 1}. These labels come to us via a stream; so in contrast
to classical ML, we are not given access to the full “dataset” beforehand to
train our predictor. Instead, a label is revealed each round, and between
rounds we can only predict the next label from the set of labels previously
revealed.

Note that we have no model of how the labels are generated — this is what we
mean by “predict something unknown”. In general, we can assume two cases: it’s
either adversarial or predictable (shaky definitions, I know).

In the adversarial case, we assume the labels are generated such that we
incur a heavy loss: say, if we predict 0, the adversary outputs 1, and vice
versa. It turns out that with deterministic strategies and a certain
adversarial model, we always incur loss linear in the number of rounds; this
goes a bit into game theory. To counter this, we can employ a randomized
strategy where we flip a fair coin to decide what to predict. This is
analogous to playing rock-paper-scissors by choosing each of {rock, paper,
scissors} with probability 1/3.
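
A minimal sketch of why deterministic strategies lose (the names here, like
`predict_last`, are my own illustration): an adversary that can simulate the
predictor just emits the opposite of each guess, forcing a mistake every
round.

```python
def adversary(predict, rounds=100):
    """Generate the label stream that makes a given deterministic
    predictor (history -> label) wrong on every round."""
    history = []
    for _ in range(rounds):
        guess = predict(history)   # the adversary simulates the predictor
        history.append(1 - guess)  # ...and emits the opposite label
    return history

def predict_last(history):
    """Example deterministic rule: repeat the last label seen."""
    return history[-1] if history else 0

seq = adversary(predict_last)
mistakes = sum(predict_last(seq[:i]) != seq[i] for i in range(len(seq)))
# mistakes equals the number of rounds: loss linear in the horizon
```

Any other deterministic rule plugged in for `predict_last` fares exactly the
same, which is what motivates randomization.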

But what if the labels weren’t generated adversarially? Remember, we don’t
have a model of how the labels were generated. Say the sequence was 000000...
Predicting with a fair coin would be a bad strategy then; it’d be better to
simply predict 0. A better strategy here would be something like
follow-the-leader, where you predict the most frequent label seen so far.
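
Follow-the-leader is a one-liner; a quick sketch (tie-breaking toward 0 is my
own arbitrary choice):

```python
def follow_the_leader(history):
    """Predict the label seen most often so far (ties go to 0)."""
    return 1 if history.count(1) > history.count(0) else 0

# On the predictable stream 000000... it is right from round one,
# while a fair coin would err about half the time:
stream = [0] * 10
mistakes = sum(follow_the_leader(stream[:i]) != stream[i]
               for i in range(len(stream)))
# mistakes == 0
```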

But recall, we don’t know how the labels are generated — it could very well
alternate between periods of predictability and unpredictability.

What we need is a mixed strategy which does well in both cases. One algorithm
that turns out to do this is called multiplicative weights, which assigns a
“weight” to each label and increases or decreases the weights proportionally
each round depending on which label appears.
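
A minimal sketch of the multiplicative-weights idea for the two-label case
(the learning rate `eta` and the two "experts" are my own illustrative
choices, not tuned values):

```python
import random

def multiplicative_weights(labels, eta=0.5):
    """Two experts ('always 0', 'always 1') start with equal weight;
    each round we predict by sampling proportionally to the weights,
    then shrink the weight of whichever expert was wrong."""
    weights = [1.0, 1.0]
    mistakes = 0
    for true_label in labels:
        total = weights[0] + weights[1]
        guess = 0 if random.random() < weights[0] / total else 1
        mistakes += (guess != true_label)
        weights[1 - true_label] *= (1 - eta)  # penalize the wrong expert
    return mistakes

random.seed(0)
# On the predictable stream 000... the weight on 'always 1' decays
# geometrically, so the algorithm quickly locks onto predicting 0;
# against an adversary, the randomness avoids the linear-loss trap above.
easy = multiplicative_weights([0] * 1000)
```

The standard analysis bounds the expected number of mistakes by a small
constant factor over the best single expert in hindsight, which is what
"doing well in both cases" means here.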

There’s a whole lot of other ground that can be covered but I’d recommend
reading a book instead of reading it here. One resource is “Prediction,
Learning, and Games”—but feel free to find another book since this one assumes
a background in probability.

[1] Actually the first section of the linked pdf above.

~~~
Xcelerate
> How would you make an algorithm to predict something unknown?

Wouldn't AIXI address this? I believe the algorithm is known; the hard part is
how to approximate it in a practical (i.e., remotely feasible) way...

------
adyavanapalli
From the abstract:

In this dissertation I investigate the theoretical possibility of a universal
method of prediction. A prediction method is universal if it is always able to
learn what there is to learn from data: if it is always able to extrapolate
given data about past observations to maximally successful predictions about
future observations. The context of this investigation is the broader
philosophical question into the possibility of a formal specification of
inductive or scientific reasoning, a question that also touches on modern-day
speculation about a fully automatized data-driven science.

I investigate, in particular, a specific mathematical definition of a
universal prediction method, that goes back to the early days of artificial
intelligence and that has a direct line to modern developments in machine
learning. This definition essentially aims to combine all possible prediction
algorithms. An alternative interpretation is that this definition formalizes
the idea that learning from data is equivalent to compressing data. In this
guise, the definition is often presented as an implementation and even as a
justification of Occam's razor, the principle that we should look for simple
explanations.

The conclusions of my investigation are negative. I show that the proposed
definition cannot be interpreted as a universal prediction method, as turns
out to be exposed by a mathematical argument that it was actually intended to
overcome. Moreover, I show that the suggested justification of Occam's razor
does not work, and I argue that the relevant notion of simplicity as
compressibility is problematic itself.

------
gumby
It's amazing to see a paper referring to Solomonoff on HN. It does seem that
if ever there was a time for his work it is now.

I find his writing obscure, and in person I could almost never understand
anything he said. He really never fit in with the world of symbolic AI (called
GOFAI these days).

------
joe_the_user
It's possible to have universal prediction for some definition of universal
and prediction.

The problem is that any universe one starts predicting could look like an
indefinitely nested set of conditional distributions.

So your system could be going along, learning and predicting, when suddenly
things changed and it would have to start all over. It would have some
mechanism for pivoting at that point, but of course that would be less than
perfect. So there are inherent limits if you're talking about arbitrary
universes.

~~~
laretluval
Interesting. Here's a possible way out. Suppose the universe is always
changing according to some state-transition function that takes you from time
t to time t+1. We have a machine existing in this universe which is trying to
predict the state at t+1. You point out this is impossible because the
transition function might suddenly change for some particular t+1. But if the
change is massive enough, it will also destroy the internal workings of the
prediction machine. Conditional on the universe retaining enough regularity
for the prediction machine to survive, maybe it would be possible to make
"universal" predictions.

~~~
onoffswitch
This seems a bit circular.

The universe (U) itself would be the prediction machine.

To be fair, there would be no definition of "circular" without the universe.

The reality we (or inhabitants of other universes or other dimensions) can
measure seems to be some kind of "fractal collapsing" at the
dimension/universe where the "measurer" is located.
---

Kind of related:

Dostoevsky, in his book Notes from Underground, proposes an interesting
problem.

I paraphrase it below:

What would you do if you received from God himself a manuscript of the actions
(down to the minimum possible dt) that will maximize your "satisfaction
function" over your whole life?

---

Rick and Morty also explored this using another analogy:

Psychologists are like horses on a carousel trying to understand its
mechanism. They'll do their best, but in the end it won't help much and
they'll feel terrified.

------
qwerty456127
Isn't being able to reliably and precisely predict anything just a matter of
building a sufficiently precise model of the system? So, if we could build a
sufficiently precise model of the world we live in, wouldn't it also "automate
science"? In fact, some people say our universe and we ourselves actually are
such a model already.

And although it may be impossible to model the whole universe with model
scientists in it, I'm rather certain it will at some point be possible to
build an artificial brain that is a model of an extremely smart human,
enhanced with a direct interface to huge databases and vast computational
power, and thereby able to figure out almost anything swiftly: an extremely
resourceful automatic scientist.

------
paraschopra
I recently made a video on the possibility of using the insights from deep
networks to automate science.
[https://youtu.be/Y-WgVcWQYs4](https://youtu.be/Y-WgVcWQYs4)

I find the possibility of automatically deriving science from raw data
intriguing. Is anyone working in the area?

------
RytusOfVa
I only read some of the first two parts of this dissertation.

I wonder whether there are any good classes of learnable patterns. In, say,
continued fractions, or decimal expansions, there is a fine but essential
line between patterns of coefficients with repetends and those which do not
repeat.

------
nurettin
>> I show that the suggested justification of Occam's razor does not work, and
I argue that the relevant notion of simplicity as compressibility is
problematic itself

This is just sad.

------
georgewsinger
Judging by the abstract, this looks like a non-trivial piece of philosophy.

