Why traditional reinforcement learning will probably not yield AGI [pdf] (philpapers.org)
82 points by xamuel 37 days ago | 115 comments

I applaud the effort but the problem with RL as a model of learning is in the definition of RL itself. The idea of using "rewards" as a primary learning mechanism and a path to actual cognition is just wrong, full stop. It's a wrong level of abstraction and is too wasteful in energy spent.

Looking at it from a CogSci perspective, it is essentially an offshoot of behaviorism, using a coarse and extremely inefficient model of learning as reward and punishment, an iterative trial-and-error process.

This 'Skinnerism' has been discredited in cognitive psychology decades ago and makes absolutely no biological sense whatsoever for the simple reason that any organism trying to adapt in this way will be eaten by predators before minimizing its "error function" sufficiently.

Living learning organisms have limited resources (energy and time), and they cut the search space drastically through shortcuts and heuristics and hardcoded biases instead of doing some kind of brute force optimization.

This is the case where computational efficiency [1] comes first and sets the constraints by which cognitive apparatus needs to be developed.

As for actual cognition models a good place to start is not the ML/AI field (which tends to get stuck in local minima as a whole), but state-of-the-art cognitive psychology, and maybe looking at research in "distributional semantics", "concept spaces", "sparse representations", "small-world networks" and "learning and memory" neuroscience.

You'd be surprised how much knowledge we have gained about the mind since those RL & ANN models were developed in the 1940s.

[1] https://www.amazon.com/Circuits-Mind-Leslie-G-Valiant/dp/019...

> "This 'Skinnerism' has been discredited in cognitive psychology decades ago and makes absolutely no biological sense whatsoever for the simple reason that any organism trying to adapt in this way will be eaten by predators before minimizing its "error function" sufficiently."

> "Living learning organisms have limited resources (energy and time), and they cut the search space drastically through shortcuts and heuristics and hardcoded biases instead of doing some kind of brute force optimization."

But those heuristics and hardcoded biases were developed through brute force optimization over the course of billions of years, a massive amount of energy input and many organisms being devoured.

> But those heuristics and hardcoded biases were developed through brute force optimization over the course of billions of years, a massive amount of energy input and many organisms being devoured.

This is true in the context of the universe as a whole, not by the organism itself.

Except no organism is born a blank slate. Parent is correct in that our prior was massively expensive to construct

So we can expect our ANNs to yield AGI in a few million or billion years? That doesn’t sound like a good place to put our current efforts then.

That does not necessarily follow, as I imagine you well know.

Why wouldn't it follow? Human intelligence evolved in the real world with all its vast information content. Deep learning systems are only trained on a few terabytes of data of a single type (images, text, sound etc). Even if they can be trained faster than the rate at which animals evolved, their training data is so poor, compared to the "data" that "trained" animal intelligence, that we'll be lucky if we can arrive at anything comparable to animal intelligence by deep learning in a billion years.

Or unlucky, as the case may be.

You elided the "necessarily".

One can rationally argue either way over the speculative proposition that reinforcement learning will yield AI in less than a few million years, but that it took evolution half a billion years is hardly conclusive, and certainly not grounds for stopping work.

Not grounds for stopping work[1], but perhaps grounds to explore other avenues[2] to see if something else might yield faster results.

I’m no expert, but my personal opinion is that AGI will probably be some hybrid approach that uses some reinforcement learning mixed with other techniques. At the very least, I think an AGI will need to exist in an interactive environment rather than just trained on preset datasets. Prior context or not, a child doesn’t learn by being shown a lot of images, it learns by being able to poke at the world to see what happens. I think an AGI will likely require some aspect of that (and apply reinforcement learning that way).

But like I said, I’m no expert and that’s just my layperson opinion.

[1] if the goal is AGI, if it’s not then of course there’s no reason to stop

[2] some people are doing just that, of course

Fair enough, though I do not think the evidence from evolution moves the needle much with respect to the timeline. For one thing, evolution was not dedicated to the achievement of intelligence.

Sounds reasonable.

>> You elided the "necessarily".

Well, if it follows, then it follows necessarily. But maybe that's just a déformation professionnelle? I spend a lot of time working with automated theorem proving where there are no ifs and buts about conclusions following from premises.

If I am not mistaken, it does not necessarily follow unless it turns out to be a sound argument in every possible world.

Ah, so you are making a formal argument? In that case you should stick to formal language. And probably publish it in a different venue :)

No, I am simply responding to your rather formal point, in kind. Unless you are arguing for it being an established fact that the time evolution took to produce intelligent life rules out any form of reinforcement learning producing AI in any remotely reasonable period of time, then that original point of yours does not seem to be going anywhere.

In your work on theorem proving, am I right in guessing that there are no 'ifs' or 'buts' because the truth of premises is not an issue? In the "evolution argument", the premises/lemmas are not just that evolution took a long time, but also something along the lines of significant speedup not being possible.

You might notice that in another comment, I suggested that we might still be in the AI Cambrian. I'm not being inconsistent, as no-one knows for sure one way or the other.

I didn't make a formal point- my comment is a comment on an internet message board, where it's very unlikely to find formal arguments being made. But perhaps we do not agree on what constitutes a "(rather) formal point"? I made a point in informal language and in a casual manner and as part of an informal discussion ... on Hacker News. We are not going to prove or disprove any theorems here.

But, to be sure, as is common when this kind of informal conversation suddenly sprouts semi-formal language, like "argument", "claim", "proof", "necessarily follows" etc, I am not even sure what exactly it is we are arguing about, anymore. What exactly is your disagreement with my comment? Could you please explain?

"Necessarily" has general usage as well, you know... why would you read it otherwise, especially given the reasonable observation you make about this site? And my original point is not actually wrong, either: whether reinforcement learning will proceed at the pace of evolution is a topic of speculation - it is possible that it will, and possible that it will not.

Insofar as I have an issue with your comment, it is that it is not going anywhere, as I explained in my previous post.

>> Insofar as I have an issue with your comment, it is that it is not going anywhere, as I explained in my previous post.

I see this god-moding of my comment as a pretend-polite way to tell me I'm talking nonsense, that seems to be designed to avoid criticism for being rude to one's interlocutor on a site that has strong norms against that sort of thing, but without really trying to understand why those norms exist, i.e. because they make for more productive conversations and less wasting of everyone's time.

You made a comment to say that unless I claim that X (which you came up with), then my comment is not going anywhere. The intellectually courteous and honest response to a comment with which one does not agree is to try and understand the reasoning of the comment. Not to claim that there is only one possible explanation and therefore the comment must be wrong. That is just a straw man in sheep's clothing.

And this is not surprising given that it comes on the heels of nitpicking about supposedly important terminology (necessarily!). This is how discussions like this one go, very often. And that's why they should be avoided, because they just waste everyone's time.

"Necessarily", when read according to your own expectations for this forum, made an important difference to my original post (without it, I would have been insisting that the issue is settled already), so it was reasonable for me to point out its removal. The nitpicking over it began with your response to me doing so, and you have kept it going by taking the worst possible reading of what I write. This is, indeed, how things sometimes go.

Meanwhile, in a branching thread, I had a short discussion with the author of the post I originally replied to, in which I agreed with the points he made there. Both of us, I think, clarified our positions and reached common ground. That is how it is supposed to go.

I did not set out to pick a fight with you, and if I had anticipated how you would take my words, I would have phrased things more clearly.

I think the point being made is that individual people (or models) don't learn that way. It’s not like models training models, all the way down.

Individual people are not trained from scratch. ML models often have to be (modulo fine-tuning) since the field is still young.

That's already changing. That we have only relatively recently moved beyond always starting from scratch might indicate that we are still in the Cambrian of AI, however...

Reinforcement learning is Turing complete [1], so if AI is possible at all, then it can be realised through RL.

> cognitive psychology
You are overselling the insights of this discipline. Has cognitive psychology solved its replication problems? Where is the world-beating AI that is based on "concept spaces", "sparse representations", "small-world networks" and "learning and memory" neuroscience?

[1] https://arxiv.org/abs/1505.00521

Note that Turing complete means undecidable. An algorithm that can learn any program that can be computed by a Universal Turing Machine must, in the worst case, search an infinite program space. So, even if a Neural Turing Machine can learn arbitrary programs (I haven't read the paper so I can't say) it might need to consume infinite resources before learning any particular program.

In short- Turing completeness is no guaranteed path to AGI. Assuming an "AGI program" exists, it is hidden away in an infinity of almost identical, but not quite, programs.
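To put a rough number on the size of that search space, here is a minimal sketch that counts candidate programs by length. It assumes, purely for illustration, an 8-symbol instruction alphabet (as in Brainfuck); `count_programs` is a hypothetical helper and has nothing to do with the linked paper. The count includes strings that are not syntactically valid, so treat it as an upper bound.

```python
ALPHABET_SIZE = 8  # assumption: an 8-instruction language such as Brainfuck


def count_programs(max_length: int, alphabet_size: int = ALPHABET_SIZE) -> int:
    """Count all instruction strings of length 1..max_length."""
    return sum(alphabet_size ** n for n in range(1, max_length + 1))


# The number of candidates grows exponentially with the length bound.
for length in (1, 5, 10, 20):
    print(length, count_programs(length))
```

Any finite length bound gives a finite (if enormous) space; without a bound the enumeration never terminates, which is the worst case described above.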

>Where is .. AI

it's always 5 years away because mainstream AI researchers are stuck with yak shaving their gradient descents.

I mean you can't just throw things at the wall and hope they stick, but it's literally the state of the art, if you follow ML conferences and their world-beating toy benchmarks results, with a lot of pseudo-rigorous handwaving for theory.

The reason physics has been so successful is that their theory closely followed empirical data and constraints imposed by nature.

I think the only hope to achieve common sense in AI is to align it with hard constraints living organisms have, using those constraints as a guide.

A few terms I mentioned are coming from that POV, if you dig a bit deeper they all have direct physical manifestation in natural learning systems.

>> it's always 5 years away because mainstream AI researchers are stuck with yak shaving their gradient descents.

A small correction: that's deep learning researchers, not AI researchers and not all machine learning researchers even. To be charitable, it's not even all deep learning researchers. It's just that the field of deep learning research has been inundated with new entrants who are sufficiently skilled to grok the practicalities but lack understanding of AI scholarship and produce unfortunately shoddy work that does not advance the field (any field, any of the aforementioned ones).

As a personal example, my current PhD studies are in Inductive Logic Programming which is, in short, machine-learning of logic programs (you know, Prolog etc). I would not be able to publish any papers without a theoretical section with actual theoretical results (i.e. theorems and their proofs - and it better be a theorem other than "more parameters beget better accuracy", which is not really a theorem). Reviewers would just reject such a paper without second thought, regardless of how many leaderboards I beat in my empirical results section.

And of course there are all the other fields of AI where work continues - search, classical planning, constraint satisfaction, automated theorem proving, knowledge engineering and so on and so forth.

Bottom line- the shoddy scholarship you flag up does not characterise the field of AI research, as a whole, it only afflicts a majority of modern deep learning research.

>Reinforcement learning is Turing complete [1], so if AI is possible at all, then it can be realised through RL.

Brainfuck is also Turing complete, so logically if we just do, for instance, Markov chain Monte Carlo for Bayesian program learning in Brainfuck, we can realize AGI that way.

"Everything is possible, but nothing is easy." The Turing tarpit.
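For anyone who hasn't seen the tarpit up close, here is a minimal sketch of a Brainfuck interpreter. It assumes balanced brackets, 8-bit wrapping cells and a fixed 30,000-cell tape; `run_brainfuck` and its `max_steps` cut-off are my own illustrative choices, not a reference implementation.

```python
def run_brainfuck(program: str, input_data: str = "", max_steps: int = 100_000) -> str:
    """Interpret a Brainfuck program, giving up after max_steps instructions."""
    # Precompute matching bracket positions (assumes brackets are balanced).
    stack, jumps = [], {}
    for pos, ch in enumerate(program):
        if ch == "[":
            stack.append(pos)
        elif ch == "]":
            start = stack.pop()
            jumps[start], jumps[pos] = pos, start

    tape = [0] * 30_000
    ptr = pc = steps = 0
    inp = iter(input_data)
    out = []
    while pc < len(program) and steps < max_steps:
        ch = program[pc]
        if ch == ">":
            ptr += 1
        elif ch == "<":
            ptr -= 1
        elif ch == "+":
            tape[ptr] = (tape[ptr] + 1) % 256
        elif ch == "-":
            tape[ptr] = (tape[ptr] - 1) % 256
        elif ch == ".":
            out.append(chr(tape[ptr]))
        elif ch == ",":
            tape[ptr] = ord(next(inp, "\0"))  # read 0 once input is exhausted
        elif ch == "[" and tape[ptr] == 0:
            pc = jumps[pc]  # skip the loop body
        elif ch == "]" and tape[ptr] != 0:
            pc = jumps[pc]  # jump back to the loop start
        pc += 1
        steps += 1
    return "".join(out)


# 8 * 8 + 1 = 65, the character code of "A".
print(run_brainfuck("++++++++[>++++++++<-]>+."))  # prints "A"
```

Twenty-four instructions to print a single letter: possible, but not easy.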

> Reinforcement learning is Turing complete [1], so if AI is possible at all, then it can be realised through RL.

This seems like overstating your point. Nobody has been able to rigorously define "AI" yet, so there's no way of saying whether it's possible with a Turing machine architecture. The human brain, at least, doesn't seem that similar to a Turing architecture. Neurons don't carry out anything like discrete operations.

Maybe it's possible to run AGI on a Turing machine, maybe it's not, but there are more options than simply "possible with a Turing machine" or "completely impossible".

While I agree that we don't have an agreed upon definition of AI, the problem is firmly in the "I" part of AI! The "A" part is taken to mean implementable by a computer, i.e. a Turing machine. This is the content of the Church–Turing thesis [1].

[1] https://en.wikipedia.org/wiki/Church%E2%80%93Turing_thesis

> It states that a function on the natural numbers can be calculated by an effective method if and only if it is computable by a Turing machine.

The article at the top of this thread is specifically about properties ("The Generalized Archimedean Property") that real numbers do not possess.

There's also a little bit of slipperiness around the use of "AI" vs. "AGI" - you could easily argue (and people do!) that we've already achieved "AI" for many specialized domains. It's the General bit that seems to be the sticking point, and that this article focuses on.

> Maybe it's possible to run AGI on a Turing machine, maybe it's not,

Arguing from ignorance, of course nothing is knowable for certain. However there has been a lot of work on the universality of Turing machines, showing that a Turing machine can simulate any conceivable concept of finite computation and can approximate any conventional physical system.

I think a more useful way to express your intuition is to note that if human-built AGI comes into existence, it might be runnable on a Turing machine but quite possibly not efficiently so.

> The idea of using "rewards" as a learning mechanism and a path to actual cognition is just wrong, full stop.

I'm a layman (just a software engineer) but am curious, I train my cat only with rewards (never punishment because it apparently doesn't work on cats) and the kitty learned how to high-five me, sit, jump, follow me etc. It seems to work really well for us. Basically, every time he does something desirable, I click my pen and give him his favorite treats. Is this ineffective?

Your cat was already capable of cognition before you started training it. GP is talking about generating a cognition where it did not previously exist.

Why couldn't a mechanism (say A Neural Turing or whatever) that you train be "cognition capable" when you start and then be trained to actual behavior after that?

You would need something that is "cognition capable" first and that has not been invented yet.

It's hard to know. Maybe something "cognition capable" exists, it's just that the proper training routine hasn't been provided to it.

But regardless, the broader point is yeah, combine something akin to cognition capability and the proper training routine and there you go, AGI from "reinforcement learning", broadly defined.

Also a layman, but I think OP's point wasn't that it isn't possible, but that it's not effective or analogous to how humans or other species learn.

For example, your cat's brain isn't just a randomly initialised neural net. Your cat comes pre-wired in such a way that it understands certain things about its environment and has certain innate biases that allow you to train it to do simple tricks with relative ease through a reward mechanism.

A more analogous example would be building a cat-like robot with four legs and a neural processor then switching it on and expecting to be able to train it with treats. Without a useful initial neural state (founded with an understanding of cognitive psychology and neuroscience) it would be almost totally useless.

As someone who's a software engineer (not data scientist) but is interested in consciousness and by extension AGI and dabbled in some ML algorithms, I find it surprising how often I see the sentiments of AGI being possible or impossible using some sort of algorithm.

Obviously I could be missing some great breadth and depth of research (there's definitely a lot I don't know) but from what I've read "we have no idea" is a pretty accurate description with how far we've come when it comes to consciousness, and I would imagine even less for the newer field of AI/AGI (consciousness has been around for a while P: and our theories have mostly sidestepped this real world phenomenon).

> "The idea of using "rewards" as a learning mechanism and a path to actual cognition is just wrong, full stop."

This to me is a huge red flag (mostly of ego/hubris). I think if we rephrased the goal to not talk about "AGI" and maybe around quantitative things like the things you've listed ("computational efficiency", likelihood of being stuck in local minima, etc) then I'd happily concede that we should be looking at "X" and not "Y" but unless I've missed something, again likely, when we're talking about AGI, we're talking about consciousness (epiphenomena that come about through physical/deterministic interactions). A quick way to gut check myself here is twisting what you state is not a good place to start "ML/AI field... gets stuck in local minima" and ask myself is it possible that local minima (which we consider "bad" for current/traditional tasks) could be necessary for consciousness? I think the widely accepted answer to this is currently "We don't know".

If I think that achieving AGI is going to be similar to the algorithms and architecture we currently use (where the likelihood of being stuck in a local minimum is something we can look at) then sure, your opinions stand. But that is just a guess and unless I'm mistaken AGI hasn't been achieved because we don't know how to do it.

This isn't to say that we should have 100% of the data before making strong judgements like this about a subject. It's just that the subject of "consciousness" is a big one (I'd say THE big one) so making such strong statements about something we know we don't know much about is interesting. <- this is where I get flashbacks to SE world where a missing piece of data can really throw you off or leads to wrong assumptions and when I think about consciousness we know we don't know a lot.

Stumbled upon this the other day, seems interesting ¯\_(ツ)_/¯


"Penrose argues that human consciousness is non-algorithmic, and thus is not capable of being modeled by a conventional Turing machine, which includes a digital computer. Penrose hypothesizes that quantum mechanics plays an essential role in the understanding of human consciousness. The collapse of the quantum wavefunction is seen as playing an important role in brain function."

Penrose is a superlatively brilliant physicist, but his opinion on AI is the worst sort of woo. It’s little better than reading Chopra. His argument is most salient as an example of attempting to use authority in one field to garner authority in another.

My point was that RL/DL is being used like some kind of massive hammer to hit all the nails. Cognition requires different, specialized, energy-efficient tools.

> consciousness

All talk about this is premature and "pre-science", before we figure out more basic, fundamental things like object storage and recall from memory, object recognition from sensory input, concept representation and formation, the exact mechanism of "chunking" [1], "translational invariance" [2], generalization along concept hierarchy and different scales, representation of causal structures, proximity search and heuristics, innate coordinate system, innate "grammar".

Even having a working, biologically-plausible model of navigation in 3d spaces by mice, without spending a ton of energy training the model, would be a good first step. In fact there is evidence that navigational capacity [3] is the basis of more abstract forms of thinking.

On all of these things we have decades worth of research and widely published, fundamental, Nobel-winning discoveries which are almost completely ignored by the AI field stuck in its comfort zone. Saying "we have no idea" is just being lazy.

Edit: As for OP's actual paper I think something like complex-valued RL [4] might bypass his main claims entirely. But my point is that RL itself is a dead end, trivializing the problem at hand.

[1] https://en.wikipedia.org/wiki/Chunking_(psychology)

[2] http://www.moreisdifferent.com/2017/09/hinton-whats-wrong-wi...

[3] http://www.scholarpedia.org/article/Grid_cells

[4] https://scholar.google.com/scholar?hl=en&as_sdt=0%2C5&q=%22c...

I have learned to play piano and drive a car. IMO both of these took two completely different sets of systems and algorithms in order to accomplish the learning task. Nothing I learned from piano applies to driving and vice versa. The only thing in common is my brain. We want a computer though to apply those algorithms I learned driving and playing piano to golf and have it work. We will then have "AGI". Obviously that fails. Obviously.

I think it will require at least a few more centuries to build AGI.

>state-of-the-art cognitive psychology, and may be looking at research in "distributional semantics", "concept spaces", "sparse representations", "small-world networks" and "learning and memory" neuroscience.

Look, uh, I've read Gardenfors too, but are those really the state of the art? I don't remember there being anything about them at CogSci this past summer. Maybe I wasn't paying close-enough attention?

In the RL/DL context any CogSci development after 1943 is the state of the art.

Some interesting recent work [1] related to Gardenfors ideas was combining them with discovery of place & grid cells, and extending the "cognitive maps" and spatial navigation machinery into concept spaces, treating the innate coordinate system as foundation for abstraction and generalization facilities.

And they actually found empirical data to prove it in [1] and related papers, so Gardenfors was right.

I believe it has to be the starting point for anyone seriously considering an AI, kind of like a Cartesian foundation. It also aligns nicely with rich "distributional semantics" work and popular vector space models.

[1] https://pubmed.ncbi.nlm.nih.gov/27313047/

> The idea of using "rewards" as a primary learning mechanism and a path to actual cognition is just wrong, full stop. It's a wrong level of abstraction and is too wasteful in energy spent.

A lot of research in RL is focused on intrinsic motivation and the question of whether we can bootstrap our own 'rewards' from our ability to predict and control the future according to some self-defined goals/hypotheses.

do you know about the dopamine reward error hypothesis? https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6721851/ is it so wrong? what does cognitive psychology have to say about how these neurons work? this is a lot more recent than the 40s and behaviorism.
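For anyone unfamiliar with the hypothesis: it identifies phasic dopamine activity with the temporal-difference (TD) error from RL. Below is a minimal sketch of tabular TD(0); the state names, learning rate and discount factor are illustrative assumptions, not values taken from the linked article.

```python
def td_error(reward, value_next, value_current, gamma=0.9):
    """Reward prediction error: delta = r + gamma * V(s') - V(s)."""
    return reward + gamma * value_next - value_current


def td0_update(values, state, reward, next_state, alpha=0.1, gamma=0.9):
    """Apply one tabular TD(0) update in place and return the error."""
    delta = td_error(reward, values.get(next_state, 0.0), values.get(state, 0.0), gamma)
    values[state] = values.get(state, 0.0) + alpha * delta
    return delta


values = {}
# Hypothetical two-step episode: a cue is followed by a reward.
for _ in range(200):
    td0_update(values, "cue", 0.0, "reward_delivery")
    td0_update(values, "reward_delivery", 1.0, "end")
print(values)
```

As training proceeds the error at reward delivery shrinks and the cue itself acquires value, which is the qualitative pattern the hypothesis attributes to dopamine neurons.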

dopamine rewards operate on a different time scale vs. that required by these error correction models. I don't remember the exact paper, will need to look it up, but it was orders of magnitude difference in response times.

Edit: for authoritative reference on biologically-plausible learning see anything by Edmund Rolls [1]. He explicitly stated in his recent book [2] that something like back-propagation, or similar error correction mechanisms have no supporting evidence in experimental data collected so far

[1] https://www.oxcns.org/profile.html

[2] https://www.amazon.com/Cerebral-Cortex-Principles-Edmund-Rol...

thanks for the link, it's been a lovely rabbit hole :)

Although I have only skimmed the paper, I think it's kinda trying to say (although someone with a better mathematical background than me might poke me for this) that the reward hypothesis (http://www.incompleteideas.net/rlai.cs.ualberta.ca/RLAI/rewa...) - the notion that every goal or purpose can be framed as the maximization of a real-valued function - isn't really applicable for most of the time. This is quite intuitively agreeable even without the math - do we really think that the many things we do in our lives were performed to optimize an "oracle" loss function? Our human mind is comprised of ridiculously complex systems of neurons and cells that generate a variety of emergent behaviors, and saying that those emergent behaviors are actually a solution of a very complex optimization problem is very, very bold. Often the reward functions are just abstractions of what we perceive (although they aren't entirely useless - keep in mind that all models are wrong but some are useful).

Although the paper is trying to say that the real number system isn't robust enough to express the goal/purpose of more complicated, "abstract" tasks, it speculates that a higher-order number system (such as the hyperreal or surreal numbers) would be able to achieve this. I currently disagree with this view - I view "intelligence" as we know it today more as an emergent phenomenon of complex systems of autonomous agents (in the case of human intelligence, the emergent phenomenon of neurons and other cells interacting with the external world), but that's a topic for another day.

I think you understood the basic gist of the paper quite well, that's a good way of describing it, and thanks for the link.

>it speculates that a higher-order number system (such as the hyperreal or surreal numbers) would be able to achieve this

I didn't mean to give that impression, sorry if it came off that way. Rather, what I say is that those number systems don't suffer the particular flaw that the real numbers suffer. There might still be other flaws. That's why in the beginning of Section 4 I wrote: "There are at least two potential ways to change RL so as to make it applicable to such tasks and, thus, at least potentially capable of leading to AGI. Of course, there is no guarantee that removing the roadblock in this paper will cause RL to lead to AGI. There might be other roadblocks besides the inadequate reward number system"

Author here. One particularly topical observation (topical because HN has recently featured discussions about ways to merge statistical and symbolic approaches to AI), from Section 4.2.3: certain cutting-edge number systems such as Conway's "surreal numbers" are so sophisticated that they require lots of symbolic logic just to do basic operations with. Of course this makes these number systems hard to work with, but they do not suffer the flaws pointed out in the paper, which the easier-to-work-with real numbers suffer. Since surreal numbers inherently require symbolic logical methods to work with, it follows that any sort of statistics-based RL agent for environments with surreal number rewards, would automatically combine symbolic logic and statistics.

I think you are hiding a key assumption - that an AGI must be capable of performing optimally on a task with Archimedean measure. This is a very strong assumption - performing eps-close to optimal may certainly fall in the definition of AGI, and this might be achieved by a non-Archimedean approximation. Defining AGI as performing optimally on any set of tasks is problematic from a computational theory perspective in general - even with real reward signals.

Furthermore, infinite rewards are not compatible with human behavior. Humans never optimize for a single event at infinite expense w.r.t. other goals.

> That an AGI must be capable of performing optimally on a task with Archimedean measure.

If that's correct, this may be just one of those many problems where the optimal solution is far harder than a near-optimal solution. Examples include integer linear programming and the traveling salesman problem, where a true optimum is NP-hard to find, but you can get very close with far less work.
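A minimal sketch of that gap for the traveling salesman problem: exhaustive search over all tours versus the greedy nearest-neighbour heuristic. The point set is made up, and nearest neighbour is only the simplest heuristic to write down; it carries no guarantee of being close to optimal in general, so take this as an illustration of the cost difference rather than of solution quality.

```python
import itertools
import math


def tour_length(points, order):
    """Total length of the closed tour visiting points in the given order."""
    return sum(
        math.dist(points[order[i]], points[order[(i + 1) % len(order)]])
        for i in range(len(order))
    )


def brute_force_tour(points):
    """Exact optimum by checking every permutation: O(n!) time."""
    rest = range(1, len(points))
    return min(
        ((0,) + perm for perm in itertools.permutations(rest)),
        key=lambda order: tour_length(points, order),
    )


def nearest_neighbour_tour(points):
    """Greedy heuristic: always visit the closest unvisited point, O(n^2) time."""
    order, unvisited = [0], set(range(1, len(points)))
    while unvisited:
        last = points[order[-1]]
        nxt = min(unvisited, key=lambda i: math.dist(last, points[i]))
        order.append(nxt)
        unvisited.remove(nxt)
    return order


# Hypothetical city coordinates, chosen only for the demonstration.
points = [(0, 0), (3, 1), (1, 4), (5, 5), (2, 2), (6, 0), (4, 3)]
exact = tour_length(points, brute_force_tour(points))
greedy = tour_length(points, nearest_neighbour_tour(points))
print(f"optimal {exact:.2f}, greedy {greedy:.2f}")
```

With 7 cities the exact search already checks 720 tours; at 20 cities it would be about 1.2e17, while the greedy pass stays at a few hundred distance evaluations.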

Hi, thanks for looking at my paper. I do not assume that an AGI must be capable of performing optimally on all tasks in general--indeed, that's quite impossible. When measuring the performance of RL agents, one must come up with some way of aggregating performance across many environments, but that's beside the point of this paper. The point of this paper is that if you're forced to use real numbers as rewards, you can't even communicate all environments to the agent without misleading the agent. Whether the agent could perform well or poorly in the environments once communicated, is beside the point.

"Assuming AGI agents are Turing computable, no individual AGI can possibly comprehend codes for all computable ordinals, because the set of codes of computable ordinals is badly non-computably-enumerable."

I was going to criticize this paper as crankery in the vein of Penrose, but first I thought I'd just compute all possible ordinals in my brain to make sure I'm a general intelligence.


Hi, thanks for looking at my paper. If you're interested in the relation between Lucas-Penrose stuff and enumerability of ordinal codes, you might like I.J. Good (1969), "Godel's theorem is a red herring" (2 pages). Can you elaborate on what it is about my paper that strikes you as crankery? I'm a fan of yours so it would be much appreciated.

Sadly, I was only able to count through a vanishingly small subset of the reals in a finite time, and therefore am not a general intelligence, and so it would be foolish of me, a machine made of a handful of atoms, to try to criticize this paper. It sure would be nice if I could appreciate music, but you've proven that's impossible, so it is what it is.

Whether man is machine can't be trivially answered using a few handwavy applications of Godel's theorem nor of observations about the structure of the reals. My paper does not make any attempt to weigh on that question, neither does it claim nor imply that humans have any supernatural powers of enumerating reals or ordinals, and you grossly misrepresent me by implying it does.

Rather, my paper is on the less ambitious question of whether the traditional RL model (with its real-valued rewards) accurately captures the full set of reward-giving environments an AGI should be capable of comprehending.

None, one, many, all.

Hah. Still got it.

I think you are right in constructing situations where real numbers are inadequate. It is also right that you do not claim that hyperreals or surreals suffice, you are merely pointing out that they may help you to do better than the reals.

But I have often wondered - why are people hung up on linear ordering? Why not non-total partial orders?


Is this insistence on linear orders because of simplistic modeling on the part of cognitive science and AI people, or is there some problem with general partial orders?

One could certainly contemplate versions of RL with non-linear orderings. I guess the reason people care about linear ordering is because you want the agent to at least understand "this outcome is better than that outcome". How would we hope for a good nonlinear-RL agent to behave in an environment with 2 buttons, one of which always gives reward X, and the other of which always gives reward Y, where X and Y are incomparable?

That's understandable. But many decisions in life are like that. Do you want to be close to your roots and your parents, or do you want a high-flying career in a remote city? Choices involve sacrifices as well as gains, and many meaningful outcomes are incomparable among themselves.
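For what it's worth, here is a minimal Python sketch of what "incomparable" rewards could look like, assuming rewards are pairs compared componentwise. The `compare` helper and the (family, career) components are hypothetical, made up only to illustrate the two-button question.

```python
from typing import Optional, Tuple

Reward = Tuple[float, float]  # hypothetical (family, career) components


def compare(x: Reward, y: Reward) -> Optional[int]:
    """Componentwise (product) partial order.

    Returns 1 if x is better, -1 if y is better, 0 if equal,
    and None if the two rewards are incomparable.
    """
    if x == y:
        return 0
    if all(a >= b for a, b in zip(x, y)):
        return 1
    if all(a <= b for a, b in zip(x, y)):
        return -1
    return None


near_family = (1.0, 0.0)
remote_career = (0.0, 1.0)
both = (1.0, 1.0)

print(compare(both, near_family))           # 1
print(compare(near_family, remote_career))  # None
```

An agent built on such an order has no "better button" to learn when the answer is `None`, which is exactly the open question about how it should behave.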

The article was pretty over my head, but does your argument hold to the various augmented neural net systems such as Neural Turing Machines, or Differentiable Neural Computers?

The argument isn't so much about the type of agent (which I think is what Neural Turing Machines etc. are about), it's about the type of environment. In traditional Reinforcement Learning, environments give real-number-valued rewards (or even rational-number-valued rewards which is even more constrained). Presumably this was a decision that was made with hardly a second thought because real numbers are most familiar to people... but such "appeals to familiarity" are totally irrelevant for such an alien field as AGI :) The point of the paper is that a genuine AGI should be able to comprehend environments that involve rewards with a more sophisticated structure than can be accurately represented using real numbers.

There are multi-reward agents and multi-task agents but all rewards get added into a final scalar value. And gradient based methods need to have this one scalar value to derive gradients from.

Since you mentioned higher dimensional representations for rewards, I want to remind that the sub-fields of Inverse RL and Model-based RL are concerned with reward representation and prediction by neural nets.

Also, it doesn't seem like a good idea to try to disprove an entire field with a purely theoretical (a-priori) argument. There should be at least some consideration given to the state of the art in the field.

I make no attempt to disprove an entire field, rather to question an implicit assumption, namely that real number rewards are flexible enough to capture all relevant environments an AGI should be able to comprehend. Quite the opposite, I indicate ways reinforcement learning could be modified to get past the roadblock I point out with real numbers.

IIUC, the claim is that the very idea of a (real valued) “objective function” to be “optimized” is broken?

Broken in the sense that it's not flexible enough to apply to all conceivable environments a genuine AGI could navigate, without misleading that AGI. But I should stress that real number objective functions are probably fine for many specific interesting environments, I'm not trying to say that real number objective functions are useless. Just that they aren't flexible enough to cover all environments :)

Fair enough. It would be interesting/instructive to construct (relatively simple) examples where we can see that they’re broken :-)

Easy to come up with examples using exotic money-related constructions. Suppose there's something called a "superdollar". If you have a superdollar, you can use it to create an arbitrary number of dollars for yourself, any time you want, which you can trade for goods and services. If you want, you can also trade the superdollar itself. Now picture an environment with two buttons, one of which always rewards you one dollar, and the other of which always rewards you one superdollar. Shoe-horning this environment into traditional RL, you'd have to assign the superdollar button some finite reward, say a million. But then you would mislead the traditional-RL-agent into thinking a million dollars was as good as one superdollar, which clearly is not true.
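A rough Python sketch of the same point, assuming the environment is flattened to a single float. The helper names and the stand-in values (a million, a hundred trillion) are arbitrary choices for illustration, not anything from the paper.

```python
import math


def scalar_reward(dollars: int, superdollars: int, superdollar_value: float) -> float:
    """Traditional-RL style reward: everything collapsed to one real number."""
    return dollars + superdollars * superdollar_value


def dollars_that_match(superdollar_value: float) -> int:
    """Smallest whole number of dollars whose scalar reward is >= one superdollar."""
    return math.ceil(superdollar_value)


for value in (1e6, 1e14):
    n = dollars_that_match(value)
    # The scalar encoding says n dollars is at least as good as a superdollar,
    # although a superdollar holder can mint n + 1 dollars on demand.
    assert scalar_reward(n, 0, value) >= scalar_reward(0, 1, value)
```

Whatever finite stand-in is picked, the same loop finds a finite pile of dollars that the encoding ranks at least as high.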

But isn't the sophisticated structure what leads to the real number. If you change the rewards to something more complex surely you still have to pick between actions and at some point you'll have to evaluate which one is "better" and I can't see why you couldn't use real numbers to represent utility.

I mean, humans are general intelligences, and you can translate pretty much any human reward into money, which is a real number.

The paper is super long though. Maybe someone can give a TL;DR that makes sense.

Question: when you say "I can't see why you couldn't use real numbers to represent utility", does your reasoning for that have anything to do with Dedekind cuts, Cauchy sequences, or complete ordered fields? Because that's what the real numbers _are_. If your reasoning has nothing to do with these sort of things, then it can't possibly be sound because in order to argue that X has such-and-such property, you need to know what X actually _is_.

To repeat an example I posted for someone else: Suppose there's something called a "superdollar". If you have a superdollar, you can use it to create an arbitrary number of dollars for yourself, any time you want, which you can trade for goods and services. If you want, you can also trade the superdollar itself. Now picture an environment with two buttons, one of which always rewards you one dollar, and the other of which always rewards you one superdollar. Shoe-horning this environment into traditional RL, you'd have to assign the superdollar button some finite reward, say a million. But then you would mislead the traditional-RL-agent into thinking a million dollars was as good as one superdollar, which clearly is not true.

Good example, although what if you just assigned it a reward of like 100 trillion dollars? It might not be exactly correct but then you're assuming that exactly correct rewards are required for AGI which seems like a pretty big assumption.

Actually I thought about this some more, and maybe money wasn't the best example, but I think there must be some internal measure of utility that humans use that can be represented by real numbers.

Imagine you are presented with an array of possible actions with associated (possibly estimated) rewards. You can only pick one. Maybe there are some doors but you can only open one - behind the first is $1m, behind the second is a superdollar, behind the third is a button that cures world hunger, behind the 4th is your loving family, whatever.

As a human I can pick one. No matter what the rewards are. Even if one reward is "you essentially become God". That means I can order them, and therefore that they can be represented by real numbers (plus infinity for the god option).

I don't see why the infinity would cause an issue: the "you can now do literally anything" reward is worth more than every other reward, but it's the only one. Also it doesn't actually exist so who cares?

Actually I guess it can exist in games, e.g. God mode in Quake. But that should have an infinite reward and agents should choose it over everything else so I can't see the problem really.

>I mean, humans are general intelligences, and you can translate pretty much any human reward into money, which is a real number.

A lot of people have written quite a lot of arguments that this is false.

A lot of people have written a lot of arguments about everything. Has anyone actually demonstrated that it isn't true?

do your claims apply in complex-valued RL context?

I'm surprised to see that that's actually a thing people write about. I'm afraid I can't answer your question right now, as I have no idea why anyone would do that (as opposed to, say, vector-valued rewards---i.e., why is the ability to complex-multiply two complex rewards relevant, as opposed to merely comparing separate components of rewards from R^2). I'll have to read up on the subject :)

TLDR: reinforcement learning cannot handle AGI, because reinforcement learning rewards must be finite, but a true artificial intelligence could reason about infinite numbers.

I think this is complete nonsense. Humans don't receive infinite rewards, either, but we still think about infinite numbers. We typically think about infinite numbers in terms of finite representations, like finite proofs about their properties. Reinforcement learning could in theory do the same thing.

The real reason that traditional reinforcement learning will not yield AGI is that the model is too limited. The current state of the art in reinforcement learning relies on representing the entire world and your entire strategy as a single vector. For most RL tasks we haven't even gotten deep networks working; RL is somewhat behind areas like vision or natural language processing.

The way forward isn't to worry about infinite numbers, it's to develop better architectures that let RL solve more problems.

Agreed. In terms of base assumptions, running AGI on computers also puts us at a disadvantage due to multiple layers of abstraction. That is to say, AGI is what organic brains are as opposed to what we're asking computers to do. It's like a complexity difference between building a machine to throw a ball and building one to accurately simulate the physics of throwing a soft rubber ball through turbulent air onto grass. A dragonfly uses just sixteen neurons to take visual input from thousands of ommatidia and use it to track prey in 3D space. How many transistors would we have to use to accomplish the same task? Now scale that up to 86 billion neurons in a human brain. That's the scale of the problem we're looking at after we figure out how to program it.

>Humans don't receive infinite rewards

Maybe you're right, but I don't think it's as obvious as you imply.

Consider this question. Can there exist hypothetical rewards x1,x2,x3,..., each one of which is significantly better than the previous, and at the same time another hypothetical reward y such that y is significantly better than each x_i?

If such rewards can exist, and if "significantly better" implies "at least +1 better", then that necessarily means y must be infinite.
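Spelled out, under the stated assumption that "significantly better" means at least +1 on a real-valued scale:

```latex
x_{i+1} \ge x_i + 1 \ \text{for all } i \ge 1
\;\Longrightarrow\; x_n \ge x_1 + (n-1) \ \text{for all } n \ge 1 \quad \text{(by induction)},
\qquad
y \ge x_n + 1 \ge x_1 + n \ \text{for all } n \ge 1 .
```

No real number y satisfies the last inequality for every n (the Archimedean property of the reals), so y cannot be a finite real.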

Now, how would you argue that such rewards can't exist? If you say, "because then y would have to be infinite", then you're arguing in circles, assuming what you want to prove in the first place. It might well be that such rewards indeed can't exist, but if so, the argument would have to be different, and I suspect nontrivial.

Humans don't receive infinite rewards because any "reward system" in our brain is implemented by receiving a finite amount of electrochemical pleasure signals over a finite amount of time.

Human-equivalent minds can obviously be implemented on top of a framework that does not have inherently infinite or infinitely divisible values, because human minds are implemented on top of a substrate that uses a finite amount of discrete neurotransmitters to do everything, and manages to work just fine.

If human minds can reason about infinite concepts using "hardware" where all the neural signals used to represent that concept are finite, then why should it be impossible for a reinforcement learning system to learn to reason about infinite concepts while using only finite numbers to form that representation?

>Humans don't receive infinite rewards because any "reward system" in our brain is implemented by receiving a finite amount of electrochemical pleasure signals over a finite amount of time.

This is like saying computers can't represent infinity because they have only finitely many bytes.

Suppose the treasury rewarded you a "superdollar", which is a special object that allows you to create any number of dollars that you want, on demand, as many times as you want. How many dollars would you say this superdollar is worth? Obviously, no finite number of dollars would be worth that one superdollar. The human mind can certainly understand the relative value of a superdollar vs. any number of dollars. That the human mind is implemented through finite electrochemical processes is irrelevant.

The worth of a superdollar is equal to as many dollars as you can spend per second times the number of seconds you can expect to live. Both numbers are large but finite: the first is bounded by the value of the global economy (no matter how many dollars you create, the total purchasing power of these dollars can't grow beyond that), and the second is bounded, if not by a few hundred years, then by the time until the heat death of the universe.

In a similar manner, the total amount of reward that you might ever get is capped by the amount of pleasure your brain can perceive at any given moment (which is finite) times your lifespan (which is finite).

You might reason and hypothesise about agents perceiving infinite rewards (as we are doing now), but this has nothing to do with the reality of homo sapiens rewards system(s), or, in fact, the rewards system of any agent existing in our physical reality, which is effectively bounded both in time and space.

Another problem with your paper is related to this one: AGI is defined as a program that can solve any task a human can. You haven't shown that humans are capable of solving arbitrary tasks with transfinite rewards; therefore you have not demonstrated that the (even potential) inability to solve them within the current framework implies an inability to create AGI in it.

I don't think the numbers really have to be "infinite" exactly (well, I suppose it depends what you mean). Suppose you introduce a data type which consists of an ordered pair of floating point numbers, along with an ordering relation, such that the elements of the data type are ordered lexicographically (i.e. if the first entries differ, then the one with a larger first entry is larger. Otherwise, the one with the larger second entry. If they are equal then they are equal.)

If you have an environment which gives rewards of this type, and you want the model which gets the highest reward, you are likely to have an issue if you try to represent the rewards using a single floating point number. (well, you could just use the first entry, and do decently well, but you would lose out a bit on what could be accomplished on the second number.) Of course, because there are, in actuality, only finitely many floating point values of a given precision, you can actually give an enumeration of the values of this type in order, and if you use that enumeration for the rewards, then that could work.

However, when we use floating point numbers, we sometimes sorta-pretend that they don't have a finite range of actual-number-values. We sometimes sorta-pretend that they are the actual real numbers (with a little fuzziness and errors tacked on, when we are being careful). We use them in computing "derivatives" and such. And this works pretty well! But if we wanted to use the enumeration of the pairs of floats, if we just treated them as the (bigint) integer index of the pair, we would lose all the nice interpretation that goes with floating point numbers, and be left with only the ordering. The notion of the relative distances between the different values would be lost. The whole "use the derivative of this function (except we are using floating point numbers)" trick stops being applicable.

While adding together any number of copies of (0.0, 1.0) would never surpass (1.0, -5.0), that's not really a reason to make whatever we use to define the preferences we want to represent, uh, unable to handle that situation.
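A small Python sketch of the pair-of-floats idea, assuming plain tuples as the data type (Python already compares tuples lexicographically). The `add` and `scalarize` helpers and the weight of 1000.0 are hypothetical, picked only to show where a single-float encoding goes wrong.

```python
from typing import Tuple

LexReward = Tuple[float, float]  # Python tuples already compare lexicographically


def add(a: LexReward, b: LexReward) -> LexReward:
    """Componentwise sum of two pair-valued rewards."""
    return (a[0] + b[0], a[1] + b[1])


def scalarize(r: LexReward, weight: float) -> float:
    """Collapse a pair to one float; `weight` is an arbitrary, hypothetical choice."""
    return weight * r[0] + r[1]


small, big = (0.0, 1.0), (1.0, -5.0)

total = (0.0, 0.0)
for _ in range(1_000_000):
    total = add(total, small)

print(total < big)  # True: a million copies of (0.0, 1.0) still rank below (1.0, -5.0)
print(scalarize(total, 1000.0) < scalarize(big, 1000.0))  # False: the scalar version flips the order
```

A larger weight only postpones the flip; for any fixed weight, enough copies of the small reward overtake the big one after scalarizing.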

So in this paper (https://arxiv.org/pdf/1707.02389.pdf), (if I'm grokking it correctly) Terence Tao seems to embed a Turing Machine in a vector space. So is using a single vector not a real problem, or am I just not getting it?

Humans do have infinite rewards (or at least penalties). Death is a pretty common one for example.

We do not treat death as an infinite penalty in our decisionmaking, as evidenced by various cases of people choosing to die for one reason or another, prioritizing the achievement of some other goal (with a finite reward) over survival.

People can and do use statements like "death must be avoided at all costs" (it's also mentioned in the paper), but that statement is simply hyperbole, not a literally true description of reality.

Is that an accurate TL;DR? I'm pretty sure no...

It's not saying you need rewards that are infinite in value per se. It's saying that you might have more situations that need to be differentiated from each other than there are real numbers. For instance, you might need to have one set of rewards that map to the real numbers, and then another set of rewards that also map to the real numbers when compared to themselves, but are all considered strictly greater than the first set.

It's asserting that there's a need to differentiate more situations from each other than there are real numbers, and basing its conclusions on that assertion, but it is not providing a convincing basis for why that assertion/assumption/hypothesis is true.

The author demonstrates that a system without such a capability would not be able to solve a certain set of problems. However, going from that claim to a claim that this capability is needed for human level AGI is a non-sequitur - there is no evidence that such a capability is needed for human-level intelligence, and there's no evidence (at least not mentioned in the paper) that humans have the exact capability described.

I like reductionist maths counter-proofs based on "x cannot contain y", but I try to be skeptical they apply, because it is plain real mathematicians can reason about infinite things, from finite symbols, as chains of symbols. The "cannot contain" set includes things I can state, but not enumerate.

I'm skeptical about AGI anyway. The proof is unnecessary, I tend to an even more reductionist model: Our lack of understanding where intelligence is, in the brain, goes to our failure to model it. "morally" making claims "it's alive" based on a lack of understanding, is a bit like chemistry by alchemy. If you don't know why, you didn't make it any more than making a baby would have.

Feynman said it better: lots of physics proofs are built on partial models of sub-atomics, which if you ask questions about become unknowables too. It's turtles-all-the-way-down stuff.

traditional reinforcement learning will probably not yield AGI, any more than any current method, based on not understanding GI, and until we understand GI, I do not believe any connectionist, or learning model will derive it.

So cockroaches most probably do not have a big neural network, but they are able to do lots of stuff, including reproduction.

I think this kind of machine intelligence is already within reach of current models, or we are close to being able to make an "e-cockroach".

I think we'll learn lots of stuff just seconds after having put in the world that kind of limited artificial intelligence.

And then the models, through sensors, will have the entire world at reach to begin auto-improvement tasks.

And this would solve the issue "we're just training the models with very limited data, how could they evolve faster than millions of years?"

I think the FAANG, the big players, realized this limitation a LONG time ago (10 years maybe).

And they are already trying some things to solve the limited data issue, giving their models all the information they can extract from cellphones, the current iteration of a "massive network of sensors" to train BIG - multiple, almost hidden from the public - models.

If I have to bet, I'd bet the whole information is being stored to be "replayed" when whole new more advanced models emerge eventually.

People love talking about intelligence but I'm yet to see a measurable definition. The best anyone can manage is "you know it when you see it" (like porn). Am I missing something or is this all just a dark road to an unknown destination (possibly just a dead end?)

The value of definability is moot at best imho. "Mathematics" is also not easy to define; there are volumes by philosophers of mathematics discussing this, and some even say "mathematics is what mathematicians find interesting" (similar to your porn example), but this doesn't stop us from studying mathematics. Same goes for science. People like Popper or Kuhn spent a lot of mental cycles arguing what is and what is not science, yet people still do science every day without reading them.

In some ways this way of thinking is too meta. In order to be a good mathematician, it is not necessary to understand the nature of mathematics from an outside perspective. This view can be very useful e.g. if you're working on foundations, but that doesn't mean before being good at mathematics one must be good at understanding the nature of mathematics. That seems like the job of philosophers, not mathematicians. (Well, sometimes the set has intersections, e.g. Brouwer, Hilbert, Godel and Penrose (theo. physicist) wrote some works on phil. of math).

EDIT: To express this slightly more formally: in order to understand a theory of a model, you do not need to understand the model comprehensively. You can be an expert in intelligence science by studying the falsifiable and predictive theories of intelligence science without understanding the nature of intelligence itself.

> Same goes for science e.g.. People like Popper or Kuhn spent a lot of mental cycles arguing what is and what is not science, yet people still do science every day without reading them.

I'm quite fond of a terrible mathematical joke which this reminded me of.

"I'm worried about my nephew, I was trying to teach him to add numbers, but he can't even pronounce zermelo fraenkel set theory, how's he ever going to learn it?!"

One approach is by Legg&Hutter; "Intelligence measures an agent's ability to achieve goals in a wide range of environments", which they also try to formalize in https://zoo.cs.yale.edu/classes/cs671/12f/12f-papers/legg+hu...

This goes hand in hand with a functional definition of knowledge, where we judge whether a system or an agent "really knows" something through measuring success or failure in a variety of scenarios where knowing that thing (and properly applying that knowledge) is necessary to make an effective decision. Or, quoting Forrest Gump, stupid is as stupid does.

We have to make machines feel pain

I've long thought that to reproduce animal-like intelligence artificially, you need to simulate an environment and stimuli, some unequivocally positive and some unequivocally negative.

Of course AGI does not need to be "animal-like intelligence".

How do you suppose this is to happen? Some just call it a cost function.

Pretty sure the IRS would figure out a way to defeat any sufficiently advanced AI system.

Yeah of course not, computers today as we know them will never lead to agi.

The argument is totally flawed and therefore BS. It rests on a clearly wrong assumption, that you can't represent number sets with higher cardinality (there are more real numbers, than natural numbers) using the sets of lower cardinality.

You can use natural numbers to represent real numbers with arbitrary precision. You can also use natural numbers to represent non-Archimedean number systems.

Yes you can, but said representations will necessarily be misleading. It's illuminating to consider Big-O notations: why don't we "simplify", since real numbers are so much easier, why don't we declare, e.g., that O(n) is "1", O(n^2) is "2", etc.? Well, then what should O(2^n) be? A million, perhaps? But then what about O(n^1000000)? To be consistent, you'd have to say O(n^1000000) was something below a million, since that complexity level is below O(2^n). And then whatever you chose for that, that would constrain you to pack O(n^billion), O(n^trillion), etc., into smaller and smaller ranges of possible numbers. You'd inevitably be forced to eventually assign almost identical numbers to, say, O(n^(2^100)) and O(n^(2^999)), which is misleading because those are not at all close to each other in complexity.
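The squeeze can be stated in one line, assuming the encoding is only required to preserve order, with r_i the real number assigned to O(n^i) and R the real number assigned to O(2^n) (both symbols are introduced here just for illustration):

```latex
r_1 < r_2 < r_3 < \cdots < R
\;\Longrightarrow\; r_i \to \sup_i r_i \le R
\;\Longrightarrow\; \lim_{i \to \infty} \, \sup_{j > i} \, (r_j - r_i) = 0 .
```

A bounded increasing sequence of reals converges, so the gaps between the numbers assigned to far-apart complexity classes are forced to shrink toward zero.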

I don't see how representation of floats by natural numbers that we all use on daily basis is misleading.

In fact, your entire comment is just an ordered sequence of natural numbers, and it does not seem to be very misleading (though it tried to trick).

Besides, the article purports to be a rigorous mathematical proof that current representations in RL are unsuitable for AGI, but I haven't seen a definition of "misleading" there.

?? Current reinforcement learning algorithms don't score an action by producing a line of text that is read by a human. It assigns a value by writing a floating point number into a register on the computer somewhere. If it were otherwise, and there was some kind of interpretation unit that was needed to compare two rewards and decide which one was bigger, then hey, you're not using the reals for your rewards anymore, just like the paper suggested you'd have to.

In that interpretation the article basically says "you can't precisely optimize function F in domain X by optimizing function F' in domain Y, when cardinality of Y < cardinality of X".

Well, obviously! That is basically the definition of cardinality!

The argument in the paper has nothing to do with cardinality.

Review my comment a few steps above about Big-O. The reasoning there shows the real numbers, despite having cardinality of the continuum, are inadequate for measuring even the following countable set of Big-O complexity classes: O(2^n) together with O(n),O(n^2),...,O(n^i),...

Is there something special about intelligence that implies we only need to consider RL-environments with real-valued rewards, and not, say, big-O-complexity-class rewards? Maybe there is. But if there is, the proof will have to be nontrivial---at a bare minimum, it should make some reference to what the reals are (e.g., the unique complete ordered field). This would be quite world-shattering, to real analysts if to no-one else. But none of the RL literature seems to make any reference to what the reals actually are, instead just taking it for granted that the reals are a magical number system as flexible as anybody could ever want. (Other authors essentially pointed this out before I did; see footnote 10 in my paper.)

I re-read your Big-O comment, and it still does not make sense. The fact that mapping all n^i to arctan(i) and i^n to arctan(i) + pi is somehow "misleading" does not make it theoretically impossible for RL algorithms to find an optimal solution in that metric. If for a specific instance you're saying n^1e300 is wrong because it is impractical, and 2^n would be preferred, you simply posed the original task incorrectly by asking for a wrong metric.

It's not about whether the agent will or will not find an optimal solution (no agent can possibly find good solutions in every environment). It's about whether the agent could even understand the true environment based on your real-value-reward description of it.

Suppose the true environment has various tasks the agent can do which give big-O-complexity-valued rewards, one task giving a reward of O(2^n), and others giving rewards of O(n^i) for various i. For concreteness, say Task A rewards O(2^n), Task B rewards O(n^10000), and Task C rewards O(n^20000).

Now suppose you present this to the agent using real-valued rewards, say, where O(n^i) is replaced by arctan(i) and O(2^n) is replaced by arctan(2)+pi, as you suggest, then the agent will be deluded into thinking, e.g., that Task B and Task C give almost identical rewards (Task B gives reward 1.57069633 and Task C gives reward 1.57074633, which barely differ from each other at all). This is misleading because in the true environment, Task C gives much more reward than Task B. Yes, the agent understands Task C gives a bigger reward, but the agent totally mis-understands how much bigger :)
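The numbers are easy to check. A minimal Python sketch, assuming the arctan encoding described above; the function and variable names are made up for illustration.

```python
import math


def arctan_reward(exponent: int) -> float:
    """Real-valued stand-in for O(n^exponent), as in the arctan encoding above."""
    return math.atan(exponent)


reward_b = arctan_reward(10_000)      # Task B: O(n^10000)
reward_c = arctan_reward(20_000)      # Task C: O(n^20000)
reward_a = math.atan(2) + math.pi     # Task A: O(2^n)

print(f"{reward_b:.8f}")              # 1.57069633
print(f"{reward_c:.8f}")              # 1.57074633
print(f"{reward_c - reward_b:.8f}")   # 0.00005000
```

The order A > C > B survives, but the gap between B and C is about 0.00005, while the exponent doubles.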

Your paper does not prove anything regarding what the agent "mis-understands". "Understanding" is not a mathematical concept in this case. It would still progressively find more and more optimal solutions to this task, so I see no problem there.

Suppose I claimed it's impossible to understand Shakespeare with all the consonants deleted, leaving only vowels. Most people would accept that claim without pedantically asking me to define "understand". Same story here. I'm arguing that certain environments, when shoe-horned into the traditional RL model, are like Shakespeare with all the consonants removed.

The agent might (or might not) find progressively more and more optimal solutions to the interpreted environment, but not to the true environment (unless by dumb luck), because it does not see what the true environment actually is. For example, in the concrete environment described above, you could suppose completing Task C takes more steps than Task B. Then the agent is likely to conclude, on the basis of the misleading arctan rewards, that Task C is not worth the extra steps. Depending on the extra steps, this conclusion could be badly false.
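A toy Python sketch of that last scenario. Everything specific here is an assumption made for illustration: the step costs (1 for Task B, 2 for Task C), an agent that maximizes encoded reward per step, and a crude model of the true environment in which a finite sum of O(n^k) rewards stays in O(n^k).

```python
import math

# Hypothetical step costs, chosen only for illustration.
STEPS = {"B": 1, "C": 2}
EXPONENT = {"B": 10_000, "C": 20_000}   # true rewards O(n^10000), O(n^20000)


def arctan_rate(task: str) -> float:
    """Reward per step as seen through the arctan encoding."""
    return math.atan(EXPONENT[task]) / STEPS[task]


def true_class_after(task: str, budget_steps: int) -> int:
    """Exponent of the total true reward after spending the budget on one task.

    Adding finitely many O(n^k) rewards stays in O(n^k), so repetition
    does not raise the exponent.
    """
    repeats = budget_steps // STEPS[task]
    return EXPONENT[task] if repeats > 0 else 0


preferred_by_encoding = max(STEPS, key=arctan_rate)
preferred_in_truth = max(STEPS, key=lambda t: true_class_after(t, budget_steps=2))

print(preferred_by_encoding)  # B
print(preferred_in_truth)     # C
```

Under these assumptions the encoded rewards steer the agent to Task B even though, in the true environment, one run of Task C outranks any number of runs of Task B.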

misleading in the sense of "doesn't preserve the desired properties of closeness and such."

No, non-Archimedean means transfinite arithmetic (or nonstandard reals, depending on whether you're looking for discrete or continuous).

What is your point exactly?
