
Entropy Maximization and intelligent behaviour - aidanrocke
http://paulispace.com/intelligence/2017/07/06/maxent.html
======
RangerScience
Okay, TL;DR:

"Causal Entropic Forcing" is something like an AI's utility function, where
the agent attempts to maximize future possibilities. Since this is meaningless
(all possible futures are possible), what you actually want to do is make it
as easy as possible to _get_ to those futures - aka, their entropic adjacency,
hence the name, causal entropic forcing.
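
A minimal sketch of that loop, assuming a hypothetical simulate(state,
action) that samples one successor (standing in for the paper's perfect
simulator) and hashable states; the endpoint-entropy estimate here is a
simplification of the paper's actual path entropy:

    import random
    from collections import Counter
    from math import log

    def path_entropy(state, simulate, actions, n_paths=100, horizon=10):
        # Monte-Carlo stand-in for causal path entropy: sample random
        # futures from `state` and measure the entropy of where they land.
        ends = Counter()
        for _ in range(n_paths):
            s = state
            for _ in range(horizon):
                s = simulate(s, random.choice(actions))
            ends[s] += 1
        return -sum(c / n_paths * log(c / n_paths) for c in ends.values())

    def cef_action(state, simulate, actions):
        # Pick the action whose successor keeps the most futures open.
        return max(actions,
                   key=lambda a: path_entropy(simulate(state, a),
                                              simulate, actions))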

However, CEF requires that the agent can actually _predict_ possible future
states of the system, which comes with some serious issues. In the original
paper, this is covered by access to perfect simulators, but those aren't
available in real-world situations.

This post discusses how to (possibly) use recurrent neural networks to make
such predictions: how to do so effectively, and with consideration of the
NN's confidence in its predictions.
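
One common way to get such a confidence signal (a sketch, not the post's
actual method; `model.predict` is a hypothetical one-step predictor) is to
roll an ensemble forward and treat disagreement as uncertainty:

    import numpy as np

    def predict_with_confidence(models, state, action_seq):
        # Roll the candidate action sequence through each member of an
        # ensemble of predictors; spread across members is a crude
        # confidence signal (high variance = low confidence).
        outcomes = []
        for model in models:
            s = state
            for a in action_seq:
                s = model.predict(s, a)  # hypothetical one-step predictor
            outcomes.append(s)
        outcomes = np.stack(outcomes)
        return outcomes.mean(axis=0), outcomes.var(axis=0)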

It's pretty cool!

~~~
highd
In that case, aren't the big benchmark gains being claimed mostly the result
of changing the problem to allow a perfect simulator in the system? The
original author is claiming benchmark results either with a perfect simulator
or with a pre-trained neural network mimicking a simulator. It seems like a
massive change to the original problem. Otherwise the utility function is very
similar to Q learning, just optimizing for future "flexibility" instead of
future "reward".

Basically we should be considering the current posted results versus Q
learning with an equally accurate pre-trained forward simulator, which I don't
think anyone has done.

~~~
RangerScience
> gains being claimed

AFAIK, the gains being claimed are that they didn't have to supply the system
with any goals, and it "figured out" the basic tests - including tool use and
cooperation.

> with an equally accurate pre-trained forward simulator

AFAIK, that's exactly what the OP is discussing: how would this system perform
when you replace the perfect simulator with an RNN trained to predict?

~~~
highd
I'm referring to posts like this:
https://entropicai.blogspot.fr/2017/06/openai-first-record.html

~~~
RangerScience
I'm not familiar with these works. Reading...

[edit]

Okay, I don't really understand what they're doing, so my guess is that they
have a component that predicts future states of the game, and they use
something inspired by fractals to determine which future states to sample.
Then, they're either using the score as the metric for the "value" of that
future state (in which case it's not CEF), or they're ignoring the score and
measuring something corresponding to entropy (or future-possibilities-
remaining, since it's Pac-Man), in which case they _are_ using CEF.

It's like Data playing that weird chess game; the CEF aspect doesn't help you
play any better, but it gives you a different win condition that turns out to
"win" better than directly trying to win.

If, that is, my bad understanding is in any way accurate.

------
RangerScience
This (causal entropic forcing) is one of the coolest ideas I've ever come
across; it's one of my go-to stories about machine learning and philosophy.

It combines well with Jeremy England's theory that life is entropically
inevitable:
https://www.scientificamerican.com/article/a-new-physics-theory-of-life/

...and I wonder sometimes if you could make a religion out of all this;
morality and existence based on entropic math. Consider that the goal of CEF
is maximized possibilities, smoke a bowl, and think about fractals and
holograms.

One of the things I find really interesting about CEF is that it doesn't
specifically help with understanding or predicting the world around you; it
just gives a very effective way to determine what possible actions you should
actually do. Given that (AFAIK) the human brain/mind is itself a combination
of many systems, it seems very elegant to me that a CEF agent is also a
combination of systems, each of which has its own limitations and issues.

~~~
eli_gottlieb
>One of the things I find really interesting about CEF is that it doesn't
specifically help with understanding or predicting the world around you; it
just gives a very effective way to determine what possible actions you should
actually do.

Well, it gives _one_ way to prescribe actions, but _no_ way to prescribe
actions we actually care about. Maximizing possible futures rightfully ought
to be a mere subgoal or consequence of the actual prescriptions we care about.

Personally I like free-energy theory best, but it really still needs some work
to distinguish which "predictions" change to accommodate prediction-errors and
which drive action. The original equations basically claim they _both_ change
_at the same time_ to minimize the free-energy, but by then the generative
models and recognition densities themselves approach tautology.

In a certain sense, you could view _anything_ which behaves "teleologically",
which self-organizes and moves itself preferentially into some states over
others, as engaging in active inference on _some_ generative density. The
problem is to describe or prescribe what the generative and recognition
densities _actually are_ , lest the theory just be mere philosophy.

~~~
MrQuincle
Isn't free energy nothing more than a maximization approximation to Bayesian
inference?

p(u|x) = p(x|u) p(u) / p(x)

\+ take the max in log space: max_u [ ln p(x|u) + ln p(u) ]

\+ use a variational approximation q(u) instead of a point estimate, measured
by the Kullback-Leibler divergence: min KL(q(u) || p(u|x))

\+ define the free energy F = E_q[ln p(x,u)] - E_q[ln q(u)], so that
ln p(x) = F + KL(q(u) || p(u|x)) and F can be maximized

Hence, rather than minimizing the KL against the conditional p(u|x) directly,
we maximize an expectation over the joint p(x,u). So we optimize for both
likelihood and prior.

Sounds to me like a sloppy Bayesian approach. :-) Normally the prior is
intended to be used as a full distribution, not as something to extract a
maximum-probability value from.

In this approach we maximize for both prior and likelihood, so it seems
logical that we get all kinds of possible trade-offs. Which should we choose?
And why is free energy so perfect?
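
For what it's worth, the identity ln p(x) = F + KL(q(u) || p(u|x)) is easy to
check numerically on a made-up discrete toy model (the numbers below are
arbitrary):

    import numpy as np

    p_u = np.array([0.7, 0.3])          # prior p(u)
    lik = np.array([0.2, 0.9])          # likelihood p(x|u) for one observed x
    q = np.array([0.5, 0.5])            # variational approximation q(u)

    p_xu = lik * p_u                    # joint p(x,u)
    p_x = p_xu.sum()                    # evidence p(x)
    post = p_xu / p_x                   # exact posterior p(u|x)

    F = np.sum(q * np.log(p_xu)) - np.sum(q * np.log(q))  # free energy
    kl = np.sum(q * np.log(q / post))                     # KL(q || p(u|x))

    assert np.isclose(np.log(p_x), F + kl)  # max F over q <=> min KL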

~~~
eli_gottlieb
The "free-energy principle" in this case isn't just variational inference via
a free-energy cost derived from the KL divergence. It involves treating
_action_ (control signals) as a variational parameter to the recognition
density. The agent thus treats their likelihood function as a descriptive
model of the world, and their prior as a _prescriptive_ model: they update
their beliefs (recognition density) to be accurate Bayesian inferences, but
then _act_ so as to reduce prior improbability (despite accumulation of
likelihood data).
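
A toy discrete version of that split might look like this (my own caricature,
not Friston's actual equations; the observation likelihoods, per-action
transition matrices and prescriptive prior are all assumed inputs):

    import numpy as np

    def active_inference_step(q, obs_likelihood, T, log_prior):
        # Perception: move the recognition density q to the Bayesian
        # posterior given the latest observation's likelihood p(obs|state).
        q = q * obs_likelihood
        q = q / q.sum()
        # Action: pick the control whose expected next state is least
        # improbable under the prescriptive prior - acting to reduce
        # prior improbability rather than updating beliefs further.
        scores = [q @ T_a @ log_prior for T_a in T]
        return q, int(np.argmax(scores))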

------
highd
I've been trying to parse this body of work - there doesn't seem to be a
writeup on the exact implementation, just that it's using "Causal Entropic
Forces". They have this writeup on the optimization implementation here:
https://arxiv.org/pdf/1705.08691.pdf

One red flag for me is that they're simultaneously claiming that there's no
training during the OpenAI Gym while also claiming that the optimization
approach is relevant. In that case, what is being optimized? It seems like
they might be optimizing over previous simulations - there's frequent
reference to having access to a "simulator". In that case, that should
effectively count as training, right? I was under the impression that the
OpenAI Gym was supposed to benchmark untrained approaches so they could be
compared by learning time. Hence the gradually increasing training curves in
the other approaches.

~~~
gabrielgoh
I think an analogy can be made with Bayesian statistics. In principle,
Bayesian statistics requires no training, just a way of sampling from the
posterior, usually done with expensive MCMC methods.

Here, we do not need training of any kind either, just a Monte Carlo
simulation of the environment and an approximation of which path has the
greatest path entropy. Basically, given a state, you do

\- Compute the path entropy for all states you can move to

\- Move into the state with greatest path entropy

The tradeoff here is that all the work occurs at inference time - every
decision requires a complex simulation. In training-based approaches, the
heavy lifting is done during training, and inference is cheap.
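
In code the loop itself is tiny (a sketch; `simulate` and `path_entropy` are
assumed to be supplied, e.g. along the lines sketched upthread) - all the
expense hides inside `path_entropy`:

    def cef_step(state, simulate, actions, path_entropy):
        # The two bullets above, literally: score each reachable
        # successor by its path entropy, then move to the best one.
        # Every single step costs a fresh batch of simulator rollouts,
        # where a trained policy would cost one forward pass.
        return max(actions,
                   key=lambda a: path_entropy(simulate(state, a)))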

~~~
highd
Yes - the issue is that the work is currently presented as requiring "no
training", but it has simply relocated that problem to constructing a perfect
simulation of the environment. It then uses the fact that current benchmarking
systems have available simulations to "cheat" rather than learning that
function itself. One of the most difficult and interesting parts of
reinforcement learning is constructing the function that determines the
evolution of the system. If you know the evolution function a priori, the
problem is mostly trivial - e.g. alpha-beta search, graph search, etc.

It's interesting that this merit function works in the absence of a real
reward signal, but there's no fair comparison against systems using a reward
signal due to this huge alteration to the problem that is providing a perfect
simulation.

~~~
gabrielgoh
I agree completely - what's happening is nothing more than brute-force
search. Though I do think this is still interesting, as the reward here is
potentially much better conditioned than the rewards in RL.

Having said that, there are situations where this will fail completely, e.g.
maze solving, where the goal is not to play to keep playing but to play to
reach the end.

~~~
highd
It seems like a more comparable reinforcement learning thing to do would be to
combine the entropy criterion with a known reward when available in some way
and then do Q learning on that without the simulation requirement. Then in
cases where reward is uncertain or infrequent you fall back to a flexibility
heuristic.
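
Something like this, perhaps (a sketch of the combination, not anything from
the paper; `entropy_bonus` is a hypothetical learned estimate of a state's
"flexibility"):

    from collections import defaultdict

    Q = defaultdict(float)  # tabular action-value function

    def q_update(s, a, r, s_next, actions, entropy_bonus,
                 alpha=0.1, gamma=0.99, beta=0.5):
        # One Q-learning step on a shaped reward: environment reward
        # plus a weighted "flexibility" term; beta decides how hard to
        # lean on the heuristic when real reward is sparse or absent.
        target = (r + beta * entropy_bonus(s_next)
                  + gamma * max(Q[(s_next, b)] for b in actions))
        Q[(s, a)] += alpha * (target - Q[(s, a)])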

~~~
robertsdionne
Maybe like https://pathak22.github.io/noreward-rl/

------
pizza
I remember this -- Entropica -- from, well, it must have been like 5 or 6
years ago now

http://www.entropica.com/

------
pzone
This blog post seems to be a comment or response aimed at people who already
understand the paper, not an exposition for someone encountering it for the
first time. I think I'm moderately well versed in probability and information
theory and couldn't make heads or tails of it.

~~~
RangerScience
I think I understand it all well enough to explain! Want to ask some
questions?

------
mehwoot
_Maximizing your number of future options is not always a good idea. Sometimes
fewer options are better provided that these are more useful options_

I guess I'm missing something, because this seems to negate the entire
point... isn't the point that the number of future options is a good measure
of "more useful options"?

~~~
TuringTest
I think that sentence is meant to highlight one _weakness_ of that measure.
It may be a good heuristic in many circumstances, but if you have direct
knowledge about the problem domain (like in the football players example),
applying that specific knowledge may give better results than using the
generic heuristic. To solve difficult problems with approximate methods, you
usually need to combine several heuristics anyway.

------
mrdrozdov
At a quick glance, this work seems related to information maximization as
done in the papers on InfoGAN, VIME, and Intrinsic Motivation (for automatic
goal-setting in RL).

------
canjobear
How does this relate to concepts like AIXI and Solomonoff induction?

~~~
chriswarbo
Solomonoff induction is an (uncomputable) method which takes a sequence of
inputs and predicts the subsequent inputs. If that sequence comes from some
sensor, like a camera, then it can be used to predict what that sensor will
detect in the future (and hence, indirectly, what the future state of the
world will be). Solomonoff induction is completely passive: it doesn't say
anything about how to choose an action to take.

AIXI applies Solomonoff induction to a reinforcement learning (RL) setting:
the sequence is split into three parts: "observations" (passive input, e.g.
from a camera), "actions" (which are under the agent's control) and "rewards"
(which are numbers). AIXI uses Solomonoff induction to calculate what the
total future rewards will be, _if_ the sequence so far were followed by action
A; or by action B; etc. and then performs whichever of those actions gave the
largest predicted reward. This _does_ tell us which action to take (at least,
computable approximations do), but it relies on there being a source of
reward; all sorts of "AI safety" research (e.g. intelligence.org) is based
around what such a reward should look like, and ways that an AI might achieve
high reward whilst subverting our intentions.
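
A computable caricature of that decision rule, for concreteness (nothing here
is the real AIXI; the finite `models` list and its `expected_total_reward`
method are hypothetical stand-ins for the Solomonoff mixture):

    def toy_aixi_action(history, actions, models, horizon):
        # Finite, computable caricature of AIXI: `models` is a list of
        # (weight, model) pairs, each weight standing in for 2^-K(model),
        # each model predicting total future reward following an action.
        def expected_reward(a):
            return sum(w * m.expected_total_reward(history, a, horizon)
                       for w, m in models)
        return max(actions, key=expected_reward)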

This 'causal entropic force' is a sort of implicit reward: the system is
rewarded when it is able to efficiently reach other states; so it ends up
'putting itself in a good position', whatever that might mean in a particular
situation.

It hand-waves away a few key points: it needs a good predictor (e.g.
Solomonoff induction, or something computable), and it also seems to need a
world model which tells it what the "states" are. Solomonoff and AIXI don't
need to be given a model: they build their own implicitly. They _do_ need
their input to be hooked up, e.g. to take pixels from a camera or whatever,
but that's a known property of the implementation (e.g. the hardware available
on a robot), whereas there's usually a bunch of ways we could model the world,
with no "obvious" right answer, and that can directly affect how the system
behaves.

