Hacker News

Having studied this extensively back when they were called Genetic Algorithms, I would like to offer a few insights.

1) One of the biggest reasons they fell out of favor for more "mathematical" approaches was that no one could really explain why exactly they worked. It makes sense on the surface that "survival of the fittest" and doing something akin to multiple stochastic gradient descents would work, but no one has really been able to produce a mathematical proof as to why.

Since other folks are producing good examples of "explainable AI", I don't know how Genetic Algorithms/programming could be made 'explainable' as to why they achieved an optimal solution other than hand-waving to how evolution works in nature.

2) The most important thing to define is the fitness function: it determines what the search space looks like and how easily a globally optimal solution can be found. For a good example of an interesting search space that a genetic program would have a difficult time with, see Schwefel functions [0]. Back when I researched these things closely, my intuition was that reality rarely fits neatly into good fitness functions, and I felt that by the time you understand the problem well enough to define one, you may be better off with a direct approach, which leads to

3) Genetic programming should only really be considered when there are no known alternatives, or when the known alternatives are way too computationally expensive.

In either case, I would welcome a resurgence in a topic I once knew quite well, though I haven't been in that field for a few years now.

[0] https://jamesmccaffrey.files.wordpress.com/2011/12/schwefels...
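For concreteness, here's a rough Python sketch (my own toy code, not from the comment or any library) of the Schwefel landscape and a naive mutate-and-select GA; on a landscape like this, the population can easily settle into one of the deep local minima far from the global one near x_i ≈ 420.97:

```python
import math
import random

def schwefel(xs):
    """Schwefel function: global minimum ~0 at x_i ~= 420.9687, but deep
    local minima lie far away, so mutation-based search is easily trapped."""
    d = len(xs)
    return 418.9829 * d - sum(x * math.sin(math.sqrt(abs(x))) for x in xs)

def simple_ga(fitness, dim=2, pop_size=50, gens=200, sigma=30.0, seed=0):
    """Toy (mu+lambda)-style GA: Gaussian mutation, keep the best half."""
    rng = random.Random(seed)
    pop = [[rng.uniform(-500, 500) for _ in range(dim)] for _ in range(pop_size)]
    for _ in range(gens):
        children = [[min(500.0, max(-500.0, x + rng.gauss(0, sigma))) for x in p]
                    for p in pop]
        # select: sort parents + children by fitness (minimizing) and truncate
        pop = sorted(pop + children, key=fitness)[:pop_size]
    return pop[0]

best = simple_ga(schwefel)
print(best, schwefel(best))
```

The point is that nothing in the GA changes between an easy landscape and this one; the fitness function alone decides whether the search is tractable.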

> 1) One of the biggest reasons they fell out of favor for more "mathematical" approaches was that no one could really explain why exactly they worked.

Kind of like how nobody can really explain how the brain works, or life in general. My gut feeling is that it is hubris to think that we are going to "figure out" intelligence with increasingly sophisticated mathematical models anytime soon. We are not giving proper credit to how complex it is, and the multi-billion year developmental process that it took. We think we can just short-circuit that with some fancy math because we've had success with planetary orbits and other comparatively rudimentary phenomena.

The current industry approaches are great for extracting certain kinds of value out of large data sets, but in terms of producing a result that could even begin to be considered as interesting as life (i.e. AGI or "strong AI"), I believe we will have to rely on creating a system whose inner workings are too complex for us to understand.

In other words, going off of Arthur C Clarke's definition, life is magic. And we're trying to create something equally magical. Almost by definition, if we can analytically understand it, it's not going to be interesting enough.

> We are not giving proper credit to how complex it is, and the multi-billion year developmental process that it took.

Or we are simply not ready to accept that it's just a big book of heuristics fine-tuned over biological eons.

It's just big. We have too many interwoven, interdependent, synergistic faculties. Input, output, and a lot of mental stuff for making the right connections between the ins and the outs. Theory of mind, basic reasoning, the whole limbic system (emotions, basic behavior, dopaminergic motivation), the executive functions in the prefrontal cortex, all are very specialized things, and we have a laundry list of those, all fine-tuned for each other.

And there's no big magic. Nothing to "understand", no closed formula for consciousness. It's simply a faculty that makes the "all's good, you're conscious" light go green, and it's easy to do that once all the other machinery that does the heavy lifting to make sense of reality is working well.

I'd call this pulling a Dennett: trivializing complexity to something that cannot or just doesn't have to be explained. Being unable to conceive of consciousness at this moment doesn't mean there's nothing to conceive of: even if we never get to the final satisfactory answer, there is undoubtedly much more room left for useful concepts we don't have yet, around or inside this idea.

I'd call bullshit. Dennett's argument is not that the brain is complex but that it's not obvious it is not reductible.

That's a lot of nots. So you are saying that Dennett says that the brain might be reducible[1]?

I don't think that's a strong claim or that it even qualifies as a claim at all. Lots of things might decompose into simple components if subjected to the right analysis, very few things definitely won't - for example many clever people have spent a great deal of time attempting to reduce quantum and cosmic scale physics to simple intuitively founded laws... If Dennett's claim is that the human brain is the same order of object as the universe I can accept it only if we agree that all objects share the same order. Where does that get us?

[1] Apologies, I don't know what reductible means, but guessed typo - I'm open to education though and unworried by typos!

Your last paragraph seems to contain the kind of overconfidence that I'm talking about. I don't understand how you can say "consciousness is simply X" or "it's easy to do that [if you handwave away the hard parts]." Clearly it's not that simple or easy, or we would have done it.

We can't even create life from non-life. How can we begin to understand all the stuff you're talking about that's been layered on top? We don't understand this stuff well enough to just handwave it away as unimportant or trivial.

I'm assuming a simpler model, no need for magic, because so far I don't see what behavior/data this simple model cannot explain.

> Clearly it's not that simple or easy, or we would have done it.

We don't have the computational power yet. Not to mention the vast amount of development required. Think of climate models, which are huge (millions of lines of code) but still nowhere near complete, and they only have to model sunlight (Earth's rotation, orbital position, albedo), clouds, flows (winds and currents), some topography (big mountains, big flats), ice (melting, freezing), and some chemistry (CO2, salts). And they only have to match a simple graph, not the behavior of a human mind (e.g. the Turing test).

So, it's not easy, even if simple.

> We can't even create life from non-life.

We understand life. Cells, RNA, DNA, proteins, mitochondria, actins, etc. It's big, it's a lot of moving parts, and we understand it, but we can't just pop a big chunk of matter into an atomic assembler and make a cell.

And I think intelligence/sentience is similar. It's big, not magic.

> We don't have the computational power yet

Certainly you do realise that this has been a moving goalpost for half a century? It seems that lately people have started to avoid giving a concrete estimate of the power required, though. It was so easy in the '90s! «The human visual/verbal system processes gigabytes per second/has n flops to the xth» or the like. Well, now we have that and more; how come a deeper modeling, a finer processing, a more complicated network has come to be needed?

And your examples are incorrect; for example, the Navier-Stokes equations plus some general physics knowledge have always allowed us to estimate how much data we need for a certain fidelity of a finite-term weather forecast. Certainly we need more for a complete climate model, but we know what we need. No such thing for the brain.

It's an easy way to score some rationality points by voicing rejection of "magic", but it's a strawman. Nobody will bother arguing for a mythical homunculus in the seat of the soul, nor even for a concise formula summing up the workings of the mind. Pick harder targets. "It's just big" or "it's just a bunch of heuristics cobbled together" is a non-explanation. The brain is not a Rube Goldberg machine that manages to produce any sort of work simply due to its excessive complexity – it is energetically economical, taking into account that neurons are living cells that need to sustain their metabolism and not merely "compute" when provided with energy. Its discrete elements aren't really small by today's standards, nor are they fast. The number of synapses is ridiculous, but since they aren't independent, at a glance it doesn't add that much complexity either (unless we abandon reason and emulate everything close to the physical level).

Yet we have failed to realistically emulate a worm. By all accounts we have enough power for 302 neurons already. There's no workload we could give that would overwhelm the available supercomputers. It's knowledge and understanding that we lack, and it's high time to give up on the delusion that more power, naturally coming in the future, will somehow enable the creation of a predictive brain model, for this would truly be magic.

I know that people have constantly underestimated the required computing power, as more and finer details of the brain and cognition unravel. That doesn't make my argument invalid. I don't think we need to do a full brain emulation. That's the worst-case scenario.

We're getting pretty good at computer vision; what's lacking is the backend for reasoning, for generating the distributions for object segmentation and scene interpretation. Basically the supervisor. (Unsupervised learning of course just means that the supervision and goal/utility functions are external/exogenous to the ML system, such as natural selection in the case of evolution.)

My example illustrates that yes, we can give an upper bound on molecule-by-molecule climate modeling, but that's just a large exponential number, not interesting. What we're interested in is useful approximations, which are polynomial, but being models, they need a lot of special treatment for the edge cases. (Literally the edges of homogeneous structures, like ice-water-air, water-air, water-land, and air-land [mountains, big flats, etc] interfaces. And the second-order induced effects, like currents, and so on.) That means precise measurements of these effects, and modelling them. (Which would be needed anyway, even if we were to do a back-to-basics N-S hydrodynamics model, as there are a lot of parameters to fine-tune.)

For the brain we know the number of neurons, the firing activity, the bandwidth of signals, etc. We can estimate the upper limit in information terms, no biggie, but that doesn't get us [much] closer for the requirements of a realistic implementation.

> Yet we have failed to realistically emulate a worm.

http://openworm.org/getting_started.html#goal seems to be a matter of time, not lack of understanding. ( https://github.com/openworm/OpenWorm#quickstart ) But maybe I'm not up to date on the issues.

> it's high time to give up on the delusion that more power, naturally coming in the future, will somehow enable a creation of predictive brain model, for this would truly be magic.

a) people have been saying exactly this for years: that we have enough data already, we need better theories/models

b) they fail to accept that more computing power and data is the way to test and generate theories.

> The brain is not a Rube Goldberg machine that manages to produce any sort of work simply due to its excessive complexity

A Rube Goldberg machine is simple, just has a lot of simple failure modes. (A trigger fails to trigger the next part, either because the part itself fails, or the interface between parts failed.)

> Its discrete elements aren't really small by today's standards,

If you mean cells, or cortices, agreed.

If you mean functional cognitive constituents, I also agree, though I disagree a bit, as they are small parts of a big mind, all interwoven, influencing, inhibiting, motivating, restricting, reinforcing, calibrating, guiding, enhancing each other to certain degrees.

So in that sense consciousness is a big matrix which gives the coefficients for the coupling "constants" between parts. A magical formula if you will. But not more magical than the Standard Model of physics.

If intelligence/sentience isn't magic for you, then consider existence. How strange it is to be anything at all.

Yes, indeed, but that's philosophy at its best. Thinking about nothing or everything. Zero and/or infinite complexity.

No predictive power whatsoever.

Also, I'm amazed by consciousness, by reasoning, by our cognition, by intelligence, how we apply it, day-to-day, from pure math to messy, but useful engineering, through the ugliness of realpolitik, and the beautiful and dreadful human tangle that our civilization is. The contrasts, the why-s. (Consider the so stark difference between US and Mexico especially the border towns, which is of course deceiving, as the problems don't stop at the border, the cities, states, nations are connected. The warring cartels, the corruption, the hopeless have-nots, the dealers, the addicts, the war on drugs/terror/smuggling/slavery/yaddayadda, the DEA, ATF, their foreign counterparts, the policy going against the market, the hard on crime ideology, the big data vs gerrymandering case just on the Supreme Court's plate, the pure math and reasoning behind all that again, are all connected, just harder to frame them in a "deep" picture.)

But so far, none involves any actual irreducible complexity. No magical formula, just layers upon layers of complexity and fine-tuning.

We don't fully understand life. We don't even understand all the proteins. We sure don't understand a single neuron.

We understand many small and big things about life, yes.

In part I agree with you. The big difference is that we don't understand the fundamental difference between what is alive and what isn't. We have many different ideas about the quality that is called "life" or living. We have no clue about what it is.

We have little or no understanding of the complex protocols that occur within a cell. If we did, our standard manufacturing techniques would be vastly different.

We can modify DNA and RNA in interesting ways, but they are not living. It is not until we put them into an already existing living cell that we can reprogram some characteristics of that cell.

It's a spectrum. A rock is non-alive, and a human talking to another human is rather alive. A virus is closer to a rock than a cockroach is to a human newborn in terms of life, but a brain dead patient is probably closer to a tree than to a butterfly, and so on.

We have a pretty fine understanding of cells, but our materials science and manufacturing technology is not "vastly parallel incremental molecular", but "big precise drastic pure chunk" based compared to cellular manufacturing. Not to mention protein folding and self-assembling biomachines and so on. We're getting there.

> We can't even create life from non-life.

We can't really define life in the first place.

But this is not necessarily in your favour. I think it's more of an indication of how the world doesn't fit into our... anthropomorphic way of thinking. That is, everything follows the laws of physics, no magic involved. We aren't special.

We certainly can define life, it's just that people don't generally agree on a definition. Some people get offended if you don't include their favorite things in your definition.

Even so, we understand it just the same no matter how you define it, because what we understand is not a function of word choice or definition. It's a function of capability.

> We can't even create life from non-life.

Have you not been following the work of Craig Venter? Depending on your point of view, he's already done it. Even if you don't agree, you have to admit that he's probably one of the few closest to actually doing it.

Craig Venter has not created life from non-life. He has synthesised code that can reproduce and grow into a synthetic life form once implanted into an already living cell. So no, not life from non-life.

200 years ago there was nothing to understand in electricity: it was just a liquid :) https://en.wikipedia.org/wiki/Fluid_theory_of_electricity

There were experiments that were not explained by the liquid theory.

Now we have data and people for some reason want to claim that a theory with magical super complex and not-even-yet-describable and very-very-irreducible element(s) is a better fit than a good old box full of tiny yet specialized parts fine-tuned to work together over millions of years.

Can you think of any experiment that cannot be explained by your theory?

You mean, is this a falsifiable/testable theory? Yes, it's testable, we see very specific neuropathologies, almost like on-off switches affecting very specific functions/faculties of the mind, and they usually correspond well to brain damage locations.

So in that sense the experiment is to enumerate the basic (built-in) functional components of the mind and the corresponding implementation-level machinery, and of course the reverse (try to enumerate the implementation components and match them with functions) can generate important data (is there a function that has no implementation?).

That said, the claim is that there's no magical component in the mind, and that's kind of hard to prove, but easily falsifiable: just find a/the magical component.

The problem is the same as with the soul, and the self, and so on.

There is plenty of magic going on. Today we can neither replicate nor understand how emergent properties are born of biological structures. Not even in "simple" systems such as the metabolic pathways.

We mapped the whole genome and connectome of C. elegans, no? And most of that is understood. For example, it seems to be a good model for substance addiction (especially for nicotine). That seems a pretty complex emergent behavior to me.

If you mean bigger biology, yes, sure, we don't have a full map of functional genomics for humans, but we're getting there.

Or maybe not, maybe it's so exponentially more complex, that it'd take as much time to understand it as it took for evolution to work it out. (Especially considering that evolution played with every individual, whereas we like to constrain our data gathering to non-aggressive methods.)

We have the connectome of C. elegans, yes, but we are still pretty far from understanding how it 'works'. The functional connections are still an active area of research, and it is greatly complicated by connections not explicit in the connectome (neuromodulator effects), as well as the internal dynamics of neurons and non-linear network dynamics.

The connectome is necessary, but far from sufficient, to 'understand' a brain, even one made from only 302 neurons as in C. elegans.

Same way waves emerge in water.

The rules that govern a system can create patterns, which themselves behave according to rules, but with a set of rules that was "hard to predict" from the underlying system.
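A classic toy illustration of this (my example, nothing specific to brains or water): in Conway's Game of Life, the update rules only mention cell neighborhoods, yet a "glider" follows a higher-level rule of its own, reappearing one cell diagonally every four steps:

```python
from collections import Counter

def step(cells):
    """One Game of Life generation; `cells` is a set of live (x, y) cells."""
    # count live neighbors for every cell adjacent to a live cell
    counts = Counter((x + dx, y + dy)
                     for (x, y) in cells
                     for dx in (-1, 0, 1) for dy in (-1, 0, 1)
                     if (dx, dy) != (0, 0))
    # birth on exactly 3 neighbors; survival on 2 or 3
    return {c for c, n in counts.items()
            if n == 3 or (n == 2 and c in cells)}

glider = {(1, 0), (2, 1), (0, 2), (1, 2), (2, 2)}

later = glider
for _ in range(4):
    later = step(later)

# The low-level rules say nothing about "motion", yet after four steps
# the glider reappears translated one cell diagonally.
assert later == {(x + 1, y + 1) for (x, y) in glider}
```

The glider's "moves diagonally" law is real and predictive, but you'd be hard-pressed to read it off the neighbor-counting rule directly.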

That is exactly my point. We can use fluid dynamics and PDEs for waves. We understand some properties and processes. We are nowhere near as close with biological systems.

I put forward the example of the metabolic pathways because the last time I checked (~2015), the most advanced things in the field were extremely simple and without any predictive power. Things like calculating the kernel of a stoichiometric matrix or the centrality of a node in the interactomic graph.

But you've now shifted the "how" question. (Or I misunderstood the original.)

It is not a question of how there is emergence, why there is magic. The answer is that systematic interactions at a low level can create higher-level playing fields.

So the "how" is now a technical question, what is this system, how complex is it, and at which levels can we understand it. And since this system has been learning how to avoid erasure by entropy or by competition for 3.5 billion years, it has searched quite a possibility space, namely 2^1277500000000, if we assume making a copy every day.

There's probably no need to go that low-level for modeling a mind, but of course the aggregate effects of biochemistry have to be taken into account (and it's full of non-linearities).

None of that means we don't understand the principles. I'd say it's pretty much like fusion. Yes, we know how the Sun works, but putting it into a bottle is a bit of a pickle, similarly with brains. (Except brains have a lot more complexity.)

Keep in mind the genome really isn't big enough to store enough heuristics to make a functioning human.

Yes, of course; the problem of bootstrapping consciousness from blueprints of a human mind is that we depend on our parents' whole epigenetic and other extra-informational makeup, plus their support for years while our mind finishes setting up.

While I can’t predict that we can solve intelligence completely any time soon, I would argue that machine intelligence allows us to examine the phenomenon of intelligence more thoroughly than anything else in the past.

I would expect this to yield new insights. So at a minimum, I’d think we can learn more now than we have been able to before. Maybe that will lead to a great increase in understanding of intelligence or a slight one, but it will lead to an increase in our understanding of the phenomenon.

The only insights will be related to the programming we are doing with machines. That is of an entirely different order from providing insights into even the simplest of organic neurological systems.

Computer simulations can give insight into simple systems like water flows, etc. Simulating more complex systems, like a single living cell or any system built on living cells, would require systems that would not really be worth building. It would be simpler to use cells directly.

> going off of Arthur C Clarke's definition, life is magic. And we're trying to create something equally magical.

I assume you're referring to his "any sufficiently advanced technology is indistinguishable from magic"? If so, you're misrepresenting it, because he's clearly saying it's not magic, it just appears that way to the unadvanced. And there's a big difference between "appears to be" and "is".

But we are the unadvanced on this matter, so it's magic.

Also 'appears to be' != 'indistinguishable', the latter is far closer to 'is' imo.

The quote is about something seeming a certain way when we don't understand it, not about it actually being that way. It is not saying the thing itself is literally indistinguishable from magic. Anyone who understood the thing (eg an advanced alien race who created a technology far beyond our current capabilities) would not see it as magic, and would know that it isn't magic.

Taking your point a step further, I often wonder if mathematics is a local minimum for humans. It has been so damn effective in so many ways that we can’t imagine that there might be some other mechanism for solving hard problems. In cases where the mathematics gets really complex, I wonder if this is a hint that there’s some other way to represent the situation.

Computational processes have picked up where mathematics stops.

>The current industry approaches are great for extracting certain kinds of value out of large data sets, but in terms of producing a result that could even begin to be considered as interesting as life (i.e. AGI or "strong AI"), I believe we will have to rely on creating a system whose inner workings are too complex for us to understand.

Speak for yourself.

> My gut feeling is that it is hubris to think that we are going to "figure out" intelligence with increasingly sophisticated mathematical models anytime soon.

We did it already. Computers understand language, translate it, react to it. They can recognize items in a picture. Is there a task left which can't be done by computers better and faster than by humans?

>Almost by definition, if we can analytically understand it, it's not going to be interesting enough.

I think current ML is magic. I understand the math behind it. But still, I'm amazed every time the training is over and it actually works as intended. Everything which is big enough is more than the sum of its parts.

>I'm amazed every time when the training is over and it actually works like intended. Everything which is big enough is more than the sum of its parts.

What about when it doesn't work as intended and fails ridiculously, even though it usually works perfectly well?


Humans do the same thing: not working as intended some of the time. It's just that the failure modes for ML are different, and so we see them as ridiculous.

Well, if the result of the different failure modes is that a machine can't tell the difference between something that obviously resembles a turtle and a rifle, or between a cat and guacamole, then anyone who watched the video is better at this. You can call it a different failure mode, but these are things no human would misclassify unless seriously ill, and being able to classify simple objects is important in our day-to-day lives.

Imagine some sort of Robocop deciding to neutralize someone for holding a turtle toy.

> Is there a task left which can't be done by computers better and faster than by humans?

Are you serious? You think we're done?

Probably better if you gave a task example like being a mother or a father or a grandparent or an uncle or an aunt or a mentor or a friend or anything to do with human interaction.

There is no market for fathers and uncles. Nobody wants a robot as a father. Mentor or friend sounds more interesting, though.

Show this leopard sofa to 100 humans for 1 second each and see how many will make the same mistake. We are 99% there and you point to the 1% to prove how badly we failed. Sure, the error rate will go down even further in the next years. But what we have is more than good enough to be used in commercial products. It's not like we are trying to win a contest of man vs. machine (which we will, or already have).

> Is there a task left which can't be done by computers better and faster than by humans?

All the tasks humans still earn money doing. And given that we're nowhere near full automation, I'd say it's quite a few tasks.

Like the collector in the supermarket who scans the products and takes my cash? Surely it must be impossible to automate such a complex task.

Industry/economy lags behind state-of-the-art technology by decades.

That's a poor example, since self-checkout scanners have existed for a while now. But notice how they aren't used exclusively. The bigger orders still require the manned scanners, and the self-checkout always has someone on duty.

A better example is plumbing. How would you go about automating a human plumber who handles all sorts of piping and crawl spaces in a large variety of settings?

Self-checkout is not automated, it is just the customer who does the work instead of an employee.

It was an example of a job which can be automated (and as you pointed out, is already automated), but which is still dominated by human workers. You suggested that all the jobs done by humans can't be automated. My response is: most (or at least some) can, but aren't.

Sure, plumbing is a much more challenging example. But it is an economic problem. The cost to automate plumbing is much higher than the utility of it.

You should study ML a bit more, to see just how wrong you are about it

I said two things. Which am I wrong about?

1. We haven't 'figured out' intelligence, far from it, we don't even know what we don't know.

Your comment sounds like Lord Kelvin proclaiming that physics is "over" a couple years before people figured out there were huge holes in the theory which eventually led to quantum mechanics. Our understanding of intelligence is probably less complete than our understanding of physics _back then_.

2. ML is really underwhelming if you measure it against actually intelligent behavior. Figuring out cool regression mechanisms is neat and all, but that's what it is, and it has nothing to do with intelligence in the general sense, much like expert systems had nothing to do with actual domain knowledge, they were just one of the most primitive models, low-hanging fruit that we could exploit.

We definitely have not 'figured out' intelligence.

Wait, I'm confused...you're saying genetic approaches fell out of favor because they're basically just stochastic gradient descent? Most of modern DL relies heavily on SGD at various points during training.

My impression is that they fell out of favor precisely because they don't actually use any gradients, and end up converging on good maxima more slowly than you could if you used the gradients from the net. Am I off base here?

If you mutate genomes by small additive modifications to a vector of continuous parameters, then taking lots of samples and keeping the best is essentially a stochastic approximation to gradient descent. However, unlike the SGD used in deep learning, it doesn't make use of calculus and therefore requires many more samples (exponentially more, in the worst case) to get a gradient of equivalent accuracy. I.e. it's slow.

If your mutations aren't small, or your parameters are not continuously valued, or your fitness function is hard to differentiate analytically, genetic algorithms might still come out ahead.
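The "stochastic approximation to gradient descent" point can be sketched with a toy quadratic (my example; the estimator is the usual finite-sample evolution-strategies form, with the baseline fitness subtracted to reduce variance):

```python
import random

def f(x, y):
    # toy smooth objective; true gradient is (2x, 4y)
    return x * x + 2 * y * y

def analytic_grad(x, y):
    return (2 * x, 4 * y)

def es_grad(x, y, sigma=0.1, samples=4000, seed=0):
    """Estimate the gradient purely from fitness evaluations of random
    mutations: grad ~= E[(f(theta + sigma*eps) - f(theta)) * eps] / sigma.
    No calculus needed, but many samples for a decent estimate."""
    rng = random.Random(seed)
    base = f(x, y)
    gx = gy = 0.0
    for _ in range(samples):
        ex, ey = rng.gauss(0, 1), rng.gauss(0, 1)
        delta = f(x + sigma * ex, y + sigma * ey) - base
        gx += delta * ex
        gy += delta * ey
    return (gx / (samples * sigma), gy / (samples * sigma))

print(analytic_grad(1.0, 1.0))  # exactly (2.0, 4.0)
print(es_grad(1.0, 1.0))        # roughly (2, 4), but noisy
```

Each call to `es_grad` spends thousands of fitness evaluations to approximate what one backward pass gives exactly, which is the "slow" part; but nothing in it requires `f` to be differentiable, continuous, or even a closed-form function.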

On the other hand, even if evolutionary algorithms require a lot of samples, they are embarrassingly parallel: you can easily try all samples simultaneously. If you have enough resources to throw at the problem, it can be faster (although more resource-intensive) to estimate the gradient this way than to compute an accurate gradient analytically.

SGD is embarrassingly parallel as well. You can train a net on several different examples simultaneously and combine the gradients or learned weights.

The reason it's not done so much is because the bandwidth of moving huge numbers of gradients or weights between computers is pretty significant. There's been all sorts of research into compressing them or reducing the precision. However this is a problem for evolutionary algorithms as well.

Not exactly. The required bandwidth for the evolutionary strategy is actually very small: if every node knows each other's random seed, they can reconstruct the best model themselves, using the seed of whichever node declares the best result. There is no need to transfer any weights. Of course, this trick only works if it's cheap to compute the weights other nodes are using, which is not the case for SGD. So evolutionary strategies have an advantage here, even if it's not a decisive one.
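A minimal sketch of the seed trick (my toy code, with a trivial fitness function standing in for a real model evaluation):

```python
import random

def fitness(weights):
    # toy objective to maximize: negative sum of squares
    return -sum(w * w for w in weights)

def perturb(base, seed, sigma=0.1):
    """Deterministically reconstruct a candidate from (base, seed) alone."""
    rng = random.Random(seed)
    return [w + sigma * rng.gauss(0, 1) for w in base]

base = [1.0, -2.0, 0.5]

# Each "worker" evaluates one seeded mutation and broadcasts only
# (seed, fitness) -- a few bytes, instead of the whole weight vector.
reports = [(seed, fitness(perturb(base, seed))) for seed in range(100)]
best_seed, best_fit = max(reports, key=lambda r: r[1])

# Any node can now rebuild the winning model locally from the seed.
best_model = perturb(base, best_seed)
```

The trick works because `perturb` is cheap and deterministic given the seed; for SGD the equivalent reconstruction would require redoing the other node's full forward/backward computation.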

Well that would work for simple hillclimbing. In a full genetic algorithm, every member of the population is the product of thousands of previous individuals.

I mean, we were talking about simple hill climbing.

Still, even more complex systems do not require particularly high bandwidth. You only need to broadcast the fitness of the individuals, and from there each node can independently recalculate and combine the best ones.

Is it true that genetic algorithms have the benefit of being able to find the global optima more consistently due to the incorporation of randomness in subsequent generations? Whereas DL models often get stuck at local optima?

DL models don't often get stuck at local optima. In theory, they could be vulnerable to that, but in practice they are not, it simply doesn't happen in most practical supervised learning applications.

I'm not up to date on theoretical research about this topic, but as far as I recall there are some interesting demonstrations on realistic problems showing that all the different "local" optima resulting from different random initializations are actually all "connected", i.e. there exists a nondecreasing route from a worse "local optimum" to a better one, so in reality it's not a local optimum; it's just that it's nontrivial to find the path to a better optimum when the space is very high-dimensional.

The theories I've heard for why you don't get stuck in local optima when using deep learning include:

  1. It's hard to get stuck in multidimensional space
  2. There are more saddles than convex local optima
  3. There are many local optima, but they are all useful
  4. Something related to spin glass theory (which I don't understand)
  5. There is no theory, or we haven't found it yet; all we know is that it works in practice and the theory will have to catch up later

I can't comment on #4 but with regards to 1-2 we can make an argument that high dimensional space is generally friendly towards stochastic gradient descent.

Consider a neural net that produces a single output given `N` inputs – it's basically a function `f(x) = f(x_1, x_2, …, x_N)`. A critical point x* has gradient `∇f(x*) = (0, 0, …, 0)` and an NxN Hessian of the form `[H•f(x*)]_{ij} = (∂^2 f)/(∂x_i ∂x_j)`.

The critical point is a local minimum if the Hessian is positive definite at x* and a local maximum if it's negative definite; it's a saddle point otherwise. Being a saddle corresponds to the Hessian having a mixture of positive and negative eigenvalues.

Heuristically, we have N eigenvalues, and absent prior information we can expect that the probability of each eigenvalue being positive is 1/2, and 1/2 for the eigenvalue being negative instead. So the probability of an arbitrary critical point being a local minimum is `(1/2)^N`, which becomes extremely small as N grows large. Large values of N are kinda the bread-and-butter of deep learning, so we expect that most critical points we'll encounter are saddle points. Additionally, the inherent noise in SGD means that we're unlikely to stay trapped at a saddle point, because once you're nudged away from the saddle, the rate at which you slide off it increases rapidly.
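A quick numerical check of that heuristic (the eigenvalues of a random symmetric matrix are actually correlated, so the true decay is even faster than `(1/2)^N`, but the direction of the effect is the same):

```python
import numpy as np

rng = np.random.default_rng(0)

def fraction_minima(n, trials=20000):
    """Fraction of random symmetric 'Hessians' whose eigenvalues are all
    positive, i.e. critical points that would be local minima."""
    count = 0
    for _ in range(trials):
        a = rng.standard_normal((n, n))
        h = (a + a.T) / 2          # random symmetric matrix
        if np.all(np.linalg.eigvalsh(h) > 0):
            count += 1
    return count / trials

print(fraction_minima(2))   # a noticeable fraction
print(fraction_minima(8))   # essentially zero: nearly every critical point is a saddle
```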

So if your net appears to have converged, it's probably at a local optimum with a reasonably deep basin of attraction, assuming you're using the bag of tricks we've accumulated in the last decade (stuff like adding a bit of noise and randomizing the order in which the training data is presented).

As for 3 & 5, we kinda cheat because if your model appears to have converged but is not performing adequately, we step outside of the learning algorithm and modify the structure of the neural net or tune the hyperparameters. I don't know if we'll ever have a general theory that explains why neural nets seem to work so well, because you'd have to characterize both the possible tasks and the possible models, but perhaps we'll gradually chip away at the problem as we gather more empirical data and formalize heuristics based on those findings.

I think it's a combination of (1) and (2): in a high-dimensional space, for a local optimum to be convex it has to be convex in every dimension, the probability of which falls off exponentially in the number of dimensions. So in practice, they're all saddles.

I would not be comfortable making such assumptions about the topologies of fitness functions in high-dimensional spaces. Not so long ago I read an article here on HN about how weird high-dimensional spaces are, but I can't find it quickly.

Further, the mind-boggling size of high-dimensional spaces makes me all but certain that not a single non-trivial neural network made by Homo sapiens has ever been at a global maximum. But I have nothing but my hunch on this.

This is common wisdom I think is false. You absolutely do get stuck in local optima frequently with reinforcement learning. OpenAI has a good example somewhere of a robot trying to put a peg through a hole. Trained with regular gradient descent it just gets stuck putting the peg pretty close to the hole, but not through it.

I'm not sure it isn't a problem in general. I've frequently watched NNs get stuck in local minima even on incredibly simple datasets like XOR or spirals. SGD and dropout are widely used in part because they add noise to the gradients that can help break out of local optima, but that's not a perfect method.

This is field dependent; it feels more like an attribute of certain types of data than of certain algorithms.

Reinforcement learning was on my mind when I was writing about "practical supervised learning applications" because yes, RL is different in that regard. And various function calculation examples (starting with XOR) indeed do so.

However, if we're applying neural networks for the (wide and practically important!) class of "pattern recognition" tasks like processing image or language data, then it's different, and those are full fields where you can easily spend a whole career working on just one of these types of data. Perhaps there's a relation with the structure and redundancy inherent in data like this.

It depends, but I don't think general statements like that are necessarily true. Most methods allow trade-offs between converging quickly on local optima vs finding global optima. The mutation aspect of genetic algorithms is just one way to do this. With well tuned mutation parameters, genetic algorithms can definitely be successful at finding global optima, but similar measures exist for most methods.

What do you mean by "find" and how do you define "global optima"?

For "find" you could discuss convergence rates vs. the points that are being converged to. If you add randomness at every iteration you're not even converging, at which point annealing rate becomes a related issue.

For "global optima" are we talking about training error, or test error, or some other kind of function value?

Deep learning has enough randomness. The random weight initialization, the random choice of samples (as in name Stochastic Gradient Descent), the random flip/crop/noise in data augmentation, and randomness injected by dropout.

You are not off base at all; thanks for clarifying, and sorry for the confusion. I did not mean to say it was using gradient descent, it's been a while. The term I was thinking of was multiple runs of "simulated annealing".

Genetic algorithms essentially run coordinate descent with some tweaks. On the positive side, coordinate descent doesn't need gradients and thus happily optimizes complex models. On the negative side, coordinate descent is roughly N times slower than gradient descent where N is the number of parameters being optimized.
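For illustration, a minimal derivative-free coordinate descent (a sketch, not a tuned optimizer): each sweep costs roughly N function evaluations just to decide how to nudge each parameter, which is where the factor-of-N slowdown comes from.

```python
import numpy as np

def coordinate_descent(f, x, step=0.1, sweeps=200):
    # Try nudging one parameter at a time; keep a nudge if it helps.
    fx = f(x)
    for _ in range(sweeps):
        for i in range(len(x)):          # one sweep: ~N evaluations
            for delta in (step, -step):
                y = x.copy()
                y[i] += delta
                fy = f(y)
                if fy < fx:
                    x, fx = y, fy
                    break
    return x, fx

f = lambda v: float(np.sum((v - np.arange(3)) ** 2))   # minimum at [0, 1, 2]
x, fx = coordinate_descent(f, np.zeros(3))
```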

GAs are not "basically just stochastic gradient descent." That's the confusion.

Why did they fall out of favour? Fashion, perhaps a natural break in progress - hit a wall and couldn't get further. But there is plenty of GA research going on.

Sentient employee here. I'll give an example on a problem for which we use evolutionary algorithms: website optimization.

Say you want to try many changes, like the title of your page, the color of the background, the position of your buy button, etc. We solve this problem by trying out random variations of these websites - like A/B testing with more candidates - and by crossing the best performing ones to create a new generation of websites. This helps us find well-performing variations in a very big search space.

This would be hard to do with deep learning: we start with no data at all, measuring performance is quite noisy, and you can't compute a gradient to know how to evolve your website. There is no smooth path from one title to another.

You could also try to fit a linear model to see what effect each change has, but that doesn't take into account all the dependencies, which can be complex - for example, which title goes well with which background color. Evolutionary computation implicitly optimizes without having to formulate a model.

What is the advantage of using evolutionary algos in this case over using something like Thompson Sampling or Contextual Multi-Armed Bandit? (http://www.kdd.org/kdd2017/papers/view/an-efficient-bandit-a...)

They can actually work together.

Thompson Sampling / Multi-armed bandit algorithms are traffic allocation strategies to use the least amount of traffic to find the optimal variant.

So say you have variants A/B/…/N of a website; you can split the traffic equally, or use a bandit algorithm, and find the best performing variant.

But that only helps you test N variants. If you want to test for 4 titles and 3 color buttons, 3 layouts and 5 images you already have 4x3x3x5=180 different variations and you can't test them all. Evolutionary algorithms can help you search through a much bigger search space:

First you try 10 variations, and allocate traffic equally or through a bandit algorithm. Then you combine the top performing candidates multiple times to create a new generation and start over again.

Evolution helps you find which candidates to try in a large search space and then a bandits algorithm can help you allocate traffic optimally.

You have to be careful with A/B testing with Bandit algorithms though. If the conversion rate changes over time or if you have visitors that don't always convert instantly you have to take that into account: https://www.chrisstucchio.com/blog/2015/dont_use_bandits.htm...

The paper linked above does directly address the case of multiple experiments occurring in the same context. They address this with hill-climbing over those 180 different variations. The Bayesian linear regression takes the place of the exploration found in Thompson sampling.

You're right, the paper linked above is a different way of solving the same problem. In their case they use a model to decide which website variants to show. Their model accounts for independent effects and pairwise dependencies. Evolution allows you to optimize without needing an explicit model.

I don't think they account for potentially changing conversion rates over time or delayed conversions.

Aside from that, I'd be curious to see how these two approaches compare in a real-life situation.

They can work in conjunction. In the domain of website optimization, where visitor attributes are often greater predictors of value than website content, a system driven by search-space optimization can more easily take into account changes in those variables (e.g. time of day, traffic source, device type) and incorporate those inputs to climb multiple 'hills' simultaneously.

The allocation of traffic based on the evolving optimal search space (blue button for visitors from Facebook) can then be driven through an MAB or something similar.

I think, not much. This is an optimization problem to which you can apply any algorithm that "fits" --- no closed-form functional form, interactive feedback --- so Bayesian optimization (with GPs), some form of RL, evolutionary algorithms, and so on.

Thanks everybody for your responses. Food for thought :-)

Ok... so... let's say your A/B only optimizes the colour and location of a single CTA button, and conversion rate is your fitness. 2 previous generations have top/blue and bottom/red. How does the next one look?

Each generation has like 10~20 candidate websites. The better the conversion rate, the more likely it is to get chosen as a parent.

With your example, let's say two parents with top/blue and bottom/red are chosen, then their offspring will either be top/red or bottom/blue because we make sure the same website isn't tested twice.

More generally each feature of the offspring will be randomly picked from the features of the two parents.

So two parents A/A/B and B/C/B can give the offspring A/C/B or B/A/B. There is also some random mutations that are possible with a low likelihood.
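That scheme is just uniform crossover plus mutation over a dictionary of page features. A toy sketch (the feature names and values are invented, and it omits the duplicate-checking mentioned above):

```python
import random

FEATURES = {
    "title":  ["Buy now", "Limited offer", "Try it free"],
    "color":  ["red", "blue", "green"],
    "layout": ["one-column", "two-column", "grid"],
}

def crossover(parent_a, parent_b, mutation_rate=0.05):
    # Each feature is inherited from a randomly chosen parent; with low
    # probability it mutates to any allowed value instead.
    child = {}
    for feature, options in FEATURES.items():
        if random.random() < mutation_rate:
            child[feature] = random.choice(options)
        else:
            child[feature] = random.choice([parent_a[feature],
                                            parent_b[feature]])
    return child

a = {"title": "Buy now", "color": "blue", "layout": "grid"}
b = {"title": "Try it free", "color": "red", "layout": "grid"}
child = crossover(a, b)
```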

link to the product: https://www.ascend.ai/

Regarding 3), do you know of any work on genetic programming as a method of doing research into evolvability itself? So basically, as a form of simulation?

Tierra obviously counts, but I was thinking of more specific examples. Say, something like this paper, which showed that adding a tiny cost-function to a network spontaneously makes it more modular:

[0] http://rspb.royalsocietypublishing.org/content/280/1755/2012...

Yes there has been some very interesting recent work. In particular, how evolvability emerges and is harnessed in evolutionary computation. A few papers come to mind:

1. Evolvability is Inevitable: http://journals.plos.org/plosone/article?id=10.1371/journal....

2. Extinction Events can Accelerate Evolution (2015): http://journals.plos.org/plosone/article?id=10.1371/journal....

3. Evolvability Search: Directly selecting for evolvability in order to study and produce it (2016): http://www.evolvingai.org/mengistu-lehman-clune-2016-evolvab...

Evolvability can also evolve away. For instance, consider a gene that decreases the mutation rate to zero. Most mutations are harmful, so any organism with the gene will be more likely to have successful children. Eventually the gene will become dominant, there will be no more mutations, and evolution will stop.

What you say is fairly hypothetical, since the way genetics works at a fundamental level prevents this possibility.

You are also oversimplifying by not distinguishing local and global success. Locally, this gene will be successful. Globally, some mutations will be beneficial, and the children with those mutations will be more successful.

The gene may become "dominant" (and then gain the beneficial mutations through sexual reproduction), but it will not achieve a monopoly and not remove the existence of evolution.

It is somewhat comparable to why some people are left-handed[0].

However, what you allude to is basically the necessity of evolution of forms of "error correction" against mutations for complicated organisms. Sex plays a large part in that as well[1]. And interestingly, DNA repair, another involved mechanism, may actually help evolvability[2].

[0] https://www.youtube.com/watch?v=TGLYcYCm2FM

[1] https://www.quantamagazine.org/missing-mutations-suggest-a-r...

[2] https://www.quantamagazine.org/beating-the-odds-for-lucky-mu...

It's not hypothetical. We haven't observed it in nature because any organism that evolved this would simply go extinct. But similar "evolving to extinction" phenomena have been observed, such as genes in mice that make the entire population male, or transposons in plants: http://lesswrong.com/lw/l5/evolving_to_extinction/

>it will not achieve a monopoly and not remove the existence of evolution.

Yes it will. It's easy to do simulations. Statistically the children of the organism with less mutations will have an advantage. Eventually the gene will reach 100% of the population. The population will stop evolving and eventually go extinct.

To be clear on this: I agree with almost all of your reasoning (in this and the previous post), but this is not exactly the most convincing way to open your counter-argument:

> It's not hypothetical.

> We haven't observed it in nature

Also, the fact that in the long term such a mutation would cause a population to go extinct eventually is not really evidence that we should not find this in the wild, because it would still be effective in the short term. Look at that recent story about the mutated lobster taking over the waters of Germany for an example.

Anyway, the crucial disagreement lies here:

> It's easy to do simulations.

Yes, if a gene would evolve that brings the mutation rate to zero, this simulation works. But achieving such a thing sounds a lot like beating the laws of thermodynamics and stopping entropy from increasing. And sure, reducing entropy locally is possible by globally increasing it - in this case that would mean increasing the energy/resource budget spent on it, but I suspect that this in itself will be at a cost too high to give an advantage.

I basically don't believe (and I fully admit that this is a subjective point of view) that the "easy simulations" are noisy, large or complex enough to reflect the messy biological reality here. Essentially, there's too much approximation going on.

And btw, evolving to extinction is not the same thing: a gene that makes the entire population male is not in itself advantageous like reducing mutations is. It is more generic than the "mutation rate zero"-gene scenario, the latter is basically a specific hypothetical example of it.

Yep. The original wave of genetic algorithms largely depended on some hand-wavy "building block" ideas that no one could really prove. It turned out that proving them is impossible in the general sense, as we found out from the NFL theorems in the mid-to-late 90s, and for a while it wasn't even clear the field had a scientific basis at all.

So I was surprised to see them make a return about a decade later. Hopefully there is a little more rigor this time around.

> as we found out from the NFL theorems in the mid-to-late 90s, and it wasn't even clear the field had a scientific basis at all.

NFL theorems are, should I say, purely theoretical and provide no insight on real-world problems. Say we try to find a function that is an optimal solution to something. NFL theorems consider the space of all possible functions, the overwhelming majority of which are discontinuous. Whereas real life problems tend to have functions that are at least more or less continuous.

Not just discontinuous: they have high Kolmogorov complexity (effectively meaning the value of the objective function is random and has no real relation to the input arguments), so it's no surprise that you can't do better than random!

Honestly, there's no justification to be using NFL theorems to explain why we can't optimize well on real world tasks.

Edit: And such high-Kolmogorov-complexity functions constitute most possible objective functions -- i.e. exponentially more than the number of objective functions without high Kolmogorov complexity. And all real-world objective functions have comparatively low Kolmogorov complexity.

Good point, bringing up Kolmogorov complexity.

Yeah. You have a function, so basically a long array of numbers, and you want to find the maximum. If the data in the array has some structure, like it's sampled from a sine wave or something, you can use some strategies to find the maximum. Like gradient descent, or binary search. Something.

But if the array is filled with random numbers, looking at other array elements gives absolutely no hint about what might be in an element you haven't yet looked at. So no strategy for finding the maximum is more efficient than linear or random search.

And the space of all possible functions mostly consists of discontinuous functions that are, for all purposes, just samples of random noise.

This is all the NFL theorems say. I really don't understand how they got to be such a big deal.
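To make the contrast concrete, a few lines of Python: on a structured (unimodal) array you can find the maximum with O(log n) probes, while on pure noise no probing strategy beats chance.

```python
import random

def ternary_search_max(f, lo, hi):
    """Argmax of a strictly unimodal function on integers [lo, hi],
    using O(log n) evaluations -- possible only because of the structure."""
    while hi - lo > 2:
        m1 = lo + (hi - lo) // 3
        m2 = hi - (hi - lo) // 3
        if f(m1) < f(m2):
            lo = m1 + 1     # the peak cannot be at or left of m1
        else:
            hi = m2 - 1     # the peak cannot be at or right of m2
    return max(range(lo, hi + 1), key=f)

n = 10_000
structured = [-(i - 6123) ** 2 for i in range(n)]     # one smooth peak
peak = ternary_search_max(structured.__getitem__, 0, n - 1)

# On pure noise, probing k cells finds the max with probability only k/n;
# no strategy does better than linear/random search.
noise = [random.random() for _ in range(n)]
best_probed = max(random.sample(range(n), 100), key=noise.__getitem__)
```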

>So there doesn't exist any more efficient strategies to find the maximum number, than linear or random search.

For classic old school computers yes. I'm not so sure about quantum computers. Consider: https://en.wikipedia.org/wiki/Grover%27s_algorithm

I don't think Grover ("the other GA") helps us here. https://qbnets.wordpress.com/2010/01/06/grovers-algorithm-fo...

I hear this sentiment a lot lately, and I do not agree with it at all. You're basically hand-waving the issue away and retroactively deciding what is a real-world, practical problem after we've already found a solution to it. The NFL theorems are not inherently worthless; you are making them worthless by throwing away nearly all of the possible problem space, under the assumption that that space is not relevant. But that is just a guess; you don't actually know.

I mean, yeah, sure, not knowing if we've hit the global minimum on some optimization problem may not matter as humans, because no one gives a damn, but that is a completely arbitrary line, not a mathematical one. Too often I hear people say "well, NFL is irrelevant" as an argument for why they are mathematically correct, or why their results are mathematically significant, and that's simply a load of crap. Maybe you're right, maybe you're wrong, but you're basically just throwing darts at a dartboard.

> make a return about a decade later.

Maybe that is how it looked from the outside. But the field continued as normal with no major problems, solving lots of industry problems with little hype.

> no one could really explain why exactly they worked

Genetic algorithms work on problems where some subsets of variables are approximately separable (uncorrelated) from some other subsets of variables.

F(a,b,c,d,e,f) ≈ F1(a,b,c) × F2(d,e,f)

So if you have found a good combination of a, b, c, it makes sense to try it with any promising combinations of d, e, f. Some real-world problems really have this property.
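A toy illustration of why this helps (using an additive rather than multiplicative combination for simplicity; the point is the same): when the fitness separates into blocks, crossover can combine two partially-good parents into a child better than either.

```python
def f1(x):  # block 1 is best at all 1s
    return -sum((xi - 1) ** 2 for xi in x)

def f2(x):  # block 2 is best at all -1s
    return -sum((xi + 1) ** 2 for xi in x)

def fitness(genome):
    # Separable fitness: the two halves don't interact.
    return f1(genome[:3]) + f2(genome[3:])

parent_a = [1, 1, 1, 0, 0, 0]     # has solved block 1 only
parent_b = [0, 0, 0, -1, -1, -1]  # has solved block 2 only
child = parent_a[:3] + parent_b[3:]   # crossover at the block boundary
```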

My thesis was on Genetic Algorithms. I stopped and started working on Deep Learning mainly because, like you said, GAs don't really have a strong mathematical foundation. Ironically, no one can really explain why CNNs work mathematically either. I've heard a lot of hand-wavy arguments about local search, local sensitivity, etc., but no one has really proven anything meaningful. There are some papers showing that certain architectures are invariant under certain affine transformations, but all of them sound like us trying to convince ourselves rather than a firm mathematical framework to guide our research. Maybe that's why nature-inspired algorithms are getting attention: the community is throwing stuff at the wall to see what sticks. It's funny to me, because Genetic Algorithms were once frowned upon by the majority of the community. I guess the moral lesson is to stop chasing what's trendy.

Why should we expect there to be any mathematical foundation to this stuff? It's quite possible to imagine an alternate universe where GAs and neural nets don't work. Because they have different datasets that don't fit the structure of NNs well. Or problems that happen to not be solvable by the search strategies of GAs.

In fact we have many such problems in our own universe. I can give many examples of things NNs and GAs don't work well on right now. Those are just ignored by the research.

> Why should we expect there to be any mathematical foundation to this stuff?

I would be surprised if "this stuff" were an exception to the unreasonable effectiveness of mathematics. Mathematics underpins virtually every observed phenomenon, including theoretical physics, computer science, and economics. In fact, the mathematical structure of any physical theory often points the way to further advances in that theory, and even to empirical predictions.

To expect that "this stuff" has no mathematical foundation is a fantastically naive view.

No, there is no mathematical foundation for most of that stuff. No mathematician can prove that gravity exists. You can describe the laws of physics with mathematical expressions. Just as you can describe machine learning algorithms with mathematical expressions. But you can't prove the laws of physics are true with math. And I don't expect anyone to ever prove that Machine Learning should work.

why does the existence of such problems disprove the existence of a mathematical foundation? A well-founded mathematical foundation would prove/predict/explain why such problems don't "fit" with the "structure of NNs" with precise lower/upper bounds. Anything that works, and especially everything that doesn't work, must have an explanation. God doesn't play dice.

Maybe there's a reason. But I don't think you will ever be able to "prove" it. In any kind of formal way that would be satisfying to mathematicians.

Solomonoff induction is the best attempt to try to formalize machine learning. And in theory any machine learning algorithm that works, works because it approximates Solomonoff induction somehow. But proving anything approximates Solomonoff induction is absurdly difficult or impossible. Because it's incomputable and involves the search space of all possible computer programs.

I'll admit that finding the right fitness function is hard, but I find that multi-dimensional fitness functions are well-suited to finding a set of solutions that a human can choose from. If you are finding paths for a fleet of delivery trucks, you can optimize separately for time, distance, and cost. Then, with the solutions that are better at all three than every other solution, pick 10 different ones and let a human make a decision.

I agree that when there's a fast, perfect solution, it doesn't make sense to use genetic algorithms. But when finding solutions to a non-general problem (optimize CNC tooling to produce a list of orders, each of which has a series of operations that require a certain amount of time, on certain machines, and require being moved from machine to machine, such that you produce the most on-time orders for high-priority clients), genetic algorithms can work very well.
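The "better at all three than every other solution" filter is just Pareto dominance. A minimal sketch (the numbers are made up):

```python
def dominates(a, b):
    # a dominates b if it's no worse on every objective and strictly
    # better on at least one. All objectives are minimized here.
    return all(x <= y for x, y in zip(a, b)) and any(x < y for x, y in zip(a, b))

def pareto_front(solutions):
    # Keep only solutions not dominated by any other -- the set a
    # human would pick from.
    return [s for s in solutions
            if not any(dominates(o, s) for o in solutions if o is not s)]

# (time, distance, cost) for candidate truck routings -- invented numbers
candidates = [(8, 120, 300), (9, 100, 280), (7, 130, 350), (10, 125, 320)]
front = pareto_front(candidates)
```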

Another reason: for most demos and examples of Genetic Algorithms, Simulated Annealing almost always performs better.

Let me first say that I'm not certain this is the case for the purposes lined out in the article, because I'm not entirely sure what they're trying to do.

Unless you have a crossover operator that really makes sense for your fitness function and problem, a GA is basically nothing but a bunch of SA processes running in parallel. In that case you would prefer SA, because it has fewer hyperparameters to tune, convergence is better defined, and there are proven methods for tuning the hyperparameters.

If you have a working GA optimizer, it's not that hard to test how well it does against SA as a baseline, because the algorithms are fairly similar. Unfortunately, not many demos do this baseline comparison, and I'm pretty sure most of them wouldn't do significantly better with GAs.
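For reference, the SA baseline really is only a few lines; a sketch like this (toy energy function, untuned geometric cooling schedule) is enough for such a comparison:

```python
import math, random

random.seed(0)  # reproducibility

def simulated_annealing(energy, neighbor, x0, t0=1.0, cooling=0.999, steps=20000):
    # Accept worse moves with probability exp(-dE/T); cool T geometrically.
    x, e, t = x0, energy(x0), t0
    best, best_e = x, e
    for _ in range(steps):
        y = neighbor(x)
        ey = energy(y)
        if ey < e or random.random() < math.exp((e - ey) / t):
            x, e = y, ey
            if e < best_e:
                best, best_e = x, e
        t *= cooling
    return best, best_e

# Toy problem: minimize a slightly rugged 1-D function.
energy = lambda x: (x - 3) ** 2 + math.sin(5 * x)
neighbor = lambda x: x + random.gauss(0, 0.5)
x_best, e_best = simulated_annealing(energy, neighbor, x0=0.0)
```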

I think you've confused genetic algorithms with genetic programming. They're not the same thing.

> I don't know how Genetic Algorithms/programming could be made 'explainable' as to why they achieved an optimal solution other than hand-waving to how evolution works in nature.

This is quite confused. You're comparing different levels of the systems. In neural networks, we would like to know why a numerical model (which has been optimised by gradient descent) gives the outputs it gives. In GAs, (1) the objects being created usually aren't numerical models -- think instead of solutions to TSPs; (2) the reason the object is good is rather easy to see -- one just has to look at the objective function and verify that the object has the desired properties; (3) we don't really care what other objects were considered during the search process.
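For instance, with a TSP the GA's output is just a tour, and checking its quality is a one-liner against the objective; nothing about the search history is needed:

```python
import math

# Four cities on a unit square; a candidate "solution" is a permutation
# of city indices.
cities = [(0, 0), (0, 1), (1, 1), (1, 0)]

def tour_length(tour):
    # Total length of the closed tour visiting the cities in order.
    return sum(math.dist(cities[tour[i]], cities[tour[(i + 1) % len(tour)]])
               for i in range(len(tour)))

print(tour_length([0, 1, 2, 3]))   # the square's perimeter: 4.0
```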

> (2) the reason the object is good is rather easy to see -- one just has to look at the objective function and verify that that the object has the desired properties;

Minor point: whether the result of evolution is easy to understand or not depends on the representation (encoding). Even GA results can be difficult to understand if they describe complex objects. In the case of GP, bloat (rapid increase in average program size in your population) can make results very difficult to interpret.

Regarding 1, that seems surprising: biology has had many mathematical proofs and models (for almost a century now, e.g. Fisher) explaining why evolution works. Evolution works in nature for non-hand-wavey reasons, doesn't the same logic justify artificial evolution in genetic algorithms?

Regarding your points:

1) "explainable AI" is better in GA than in deep learning. GA gives you a structure that works and probably easier to understand than any Deep Learning model (which it is a huge math function). There are so many things that also doesn't make sense why they work in deep learning but we still use them, that is the same with GA.

2) Knowing the fitness function doesn't mean you can solve the problem. When the search space is very big you need something to search it, and that is where GAs can shine. The same goes for deep learning and most evolutionary computation: the search space is far too huge for a brute-force algorithm. You need heuristics, and GAs work well for some problems, the same way gradient descent works well for others.

> I don't know how Genetic Algorithms/programming could be made 'explainable' as to why they achieved an optimal solution other than hand-waving to how evolution works in nature.

With either GA or ANN, you don't know that the algorithm has achieved the optimal solution and I would be surprised if you could ever prove that in the general case.

What you can know is that your algorithm performs well, e.g. by testing how well it plays go or drives a car or recognises faces, or whatever.

Re 3, last resort: isn't it also true that any other method can likely be improved upon by taking whatever network results and using it as a starting point for genetic improvements? Sometimes the extra expense won't be worthwhile, of course, but if you want to reduce the number of edge cases for autonomous driving, say, it might well be worthwhile IMHO.

(There's an earlier discussion about the myth of local optima here that might or might not answer my question.)

> I felt that at the point you are understanding the problem, you may just be better off with a direct approach

I formed a similar impression in my PhD research.

Not that relevant, but I wanted to elaborate a little: I exhaustively solved a toy version of my problem on a cluster, but found no discernible gradients. Good solutions were isolated spikes, with no adjacent elevations that could lead to them. So random sampling would be as good as you'd get.

The thing to do is change the problem, transform the space/dimensions, so better solutions were spatially proximate. But then, I'd be solving the problem.

Another approach is to have the computer do this: search the space of search spaces. But this higher-level space is even less likely to have informative gradients.

OTOH, the data was in terms of a language, which would have introduced its own artefacts. A better search space would compensate for those, and might have been easy to find.

> no one could really explain why exactly they worked

That is the first time I have heard that claim, and since we have a large body of knowledge describing how evolution works (sampo's description is one of the clearest I've seen) and how it can be optimized, I imagine you are talking about some other problem.

Is it about predicting the causes of some learned trait? Is there some interesting research on that?

They are referring to Genetic Algorithms. There was a theory called the "building block hypothesis" but no one could prove it (turned out it was impossible to prove). The field was sort of run on hand waving for several decades.

The fact that it is impossible to prove is what is new to me.

Even more so because it's clearly not a property of genetic algorithms in general, but a very powerful effect that one aims to achieve with genetic algorithms and good domain modeling. I don't really understand what it means for something like that to be impossible to prove.

Clearly lack of a sound theoretical basis or proof for why deep learning works has not stopped its proliferation. For a practitioner, the proof is in the pudding: generalized results, novel solutions that provably work, new designs that fulfill the given objective(s). At the end of the day, those are what really matter for practical applications.

Crossing two genotypes, and mutating them in a way that keeps the encoding consistent, is what relegated genetic algorithms to a purely academic area for me. For things like the TSP, I either couldn't come up with a good crossover function, or it produced mostly the same results after applying some "consistentification".
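For what it's worth, permutation-preserving operators do exist for the TSP; a simple variant of order crossover (OX) needs no repair step at all. A sketch:

```python
import random

def order_crossover(p1, p2):
    # Copy a random slice from parent 1, then fill the remaining
    # positions with the missing cities in parent 2's order. The child
    # is always a valid permutation -- no "consistentification" needed.
    n = len(p1)
    i, j = sorted(random.sample(range(n), 2))
    child = [None] * n
    child[i:j] = p1[i:j]
    fill = [city for city in p2 if city not in child]
    for k in range(n):
        if child[k] is None:
            child[k] = fill.pop(0)
    return child

a = [0, 1, 2, 3, 4, 5, 6, 7]
b = [7, 6, 5, 4, 3, 2, 1, 0]
child = order_crossover(a, b)
```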

If you like GP, you might like PGE more (self bias)


In your experience, are GA/EA suitable for combinatorial problems?

My domain is basically a combinatorial problem on large, sparse graphs. I am currently focused on DL with lots of custom feature engineering. I am making progress, but it's slow.

I don't understand why A.I. can't be explainable. Can't they just add logging every time it makes a decision and then trace through the trail of decisions to the final result?

AI that is not explainable is not so because it cannot log things; it's because the semantic interpretation of what it can log is hard. Starting from the real-world input (which we understand), these algorithms progressively apply mathematical transformations until reaching the output. It is the real-world "meanings" of those transformations, and of what is eventually learned (the stack of those transformations), that are hard to grasp.

If you are willing to accept that as an explanation, then yes, it's easily explainable. A big list of anonymous decisions is easy; naming those decisions is hard.

I pretty much agree with what you said, but DL (and even more so deep reinforcement learning) is really computationally expensive.

Where's the posts on explainable AI? I did a ctrl+f but didn't see any.

Isn't that the same question as why the whole universe exists? Why chaos results in order? Isn't there a proof that 'given such a random process, a universe like ours is inevitably created'?
