
It would also be nice to remove the "magical thinking" around machine learning. It's mathematically related to all prior signal processing techniques (mostly a proper superset), but it also has fundamental limits that no one talks about seriously. ML et al. are NOT MAGIC but they are treated as if they were.

And that is in itself a dangerous moral and ethical lapse.




> It would also be nice to remove the "magical thinking" around machine learning.

To be honest, it would be a morally and ethically less dangerous world if we could get our feet back on the ground in relation to digital technologies in general.

> fundamental limits that no one talks about seriously.

I am starting to touch and stumble into the invisible cultural walls that I think make people "afraid" to talk about limitations. I am not yet done analysing that, but suspect it has something to do with the maxim that people are reluctant to question things on which their salary depends. That seems to be a difference between "scientists" and "hackers" in some way.

Going back to Hal Abelson's philosophy, "magic" is a legitimate mechanism in coding, because we suppose that something is possible, and by an inductive/deductive interplay (abduction) we create the conditions for the magic to be true.

The danger comes when that "trick" (which is really one of Faith) is mixed with ignorance and monomaniacal fervour, and so inflated to a general philosophy about technology.


> suspect it has something to do with the maxim that people are reluctant to question things on which their salary depends.

I once worked on a team that spent a lot of time building models to optimize parts of the app for user behavior (trying to intentionally remain vague for anonymity reasons). Through an easy experiment I ran, I ended up (accidentally) demonstrating that the majority of the DS work was adding no more than minimal improvements, and so little monetary value that it did not justify any of the time spent on it.

I was let go not long after this, despite having helped lead the team to record revenues by using a simple model (which ultimately was what proved the futility of much of the work the team did).

Just a word of caution as you

> start to touch and stumble into the invisible cultural walls that I think make people "afraid" to talk about limitations


It's long been my suspicion that much of tech is just throwing more and more effort into ever-diminishing returns, and I think a lot of us at least feel that too, but the pay is good and you don't have to dig ditches, so what are you going to do?


Good story. I guess you were done with your work there. Sometimes teams/places have a way of naturally helping us move to the next stage.

Competences work at multiple levels, visible and invisible. Being good at your job. Showing you're good at your job. Believing in your job. Getting other people to believe in your job. Getting other people to believe that you believe in your job... and so on ad absurdum. Once one part of that slips the whole game can unravel fast.


Honestly at this point it kind of is magic. These things are knocking out astonishing novel tasks every month, but the state of our knowledge is "why does sgd even work lol". There is no coherent theory.


> "why does sgd even work lol"

I find this hand a little overplayed.

It depends on the degree of fidelity we demand of the answer and how deep we want to go questioning the layers of answers. However, if one is happy with a LOL CATS fidelity, which suffices in many cases, we do have a good enough understanding of SGD -- change the parameters slightly in the direction that makes the system work a little bit better, rinse and repeat.
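
To make that lolcats-level description concrete, here is a toy sketch (plain Python/NumPy, a linear fit rather than a deep net, made-up numbers):

    import numpy as np

    # Nudge the parameters a little in the direction that makes the loss a
    # little smaller, rinse and repeat. Toy problem: fit y = 3x + 1.
    rng = np.random.default_rng(0)
    xs = rng.normal(size=256)
    ys = 3.0 * xs + 1.0 + 0.1 * rng.normal(size=256)

    w, b, lr = 0.0, 0.0, 0.1
    for step in range(2000):
        i = rng.integers(len(xs))        # one random example: the "stochastic" part
        err = (w * xs[i] + b) - ys[i]    # how wrong we currently are on it
        w -= lr * err * xs[i]            # small step against the gradient of 0.5*err^2
        b -= lr * err
    print(w, b)                          # ends up close to 3 and 1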

No one would be astonished that using such a system leads to better parameter settings than one's starting point, or at least not significantly worse ones.

It's only when we ask more questions, deeper questions, that we get to "we do not understand why SGD works so astonishingly well".


Yeah, I didn't mean to imply "Why does SGD result in lower training loss than the initial weights?" is an open question. But I don't think even lolcatz would call that a sufficient explanation. After all, if the only criterion is "improves on initial training loss", you could just try random weights and pick the best one. The non-convexity already makes SGD pretty mysterious, and that is without even getting into the generalization performance, which seems to imply that SGD is somehow implicitly regularizing.


With over-parameterized neural networks, the problem essentially becomes convex and even linear [1], and in many contexts provably converges to a global minimum [2], [3].

The question then becomes: why does this generalize [4], given that the classical theory of Vapnik and others [5] becomes vacuous, no longer guaranteeing lack of over-fitting?

This is less well understood, although there is recent theoretical work here too.

[1] Lee et al (2019). Wide Neural Networks of Any Depth Evolve as Linear Models Under Gradient Descent. https://proceedings.neurips.cc/paper/2019/hash/0d1a9651497a3...

[2] Allen-Zhu et al (2019). A convergence theory for deep learning via over-parameterization. https://proceedings.mlr.press/v97/allen-zhu19a.html

[3] Du et al (2019). Gradient Descent Finds Global Minima of Deep Neural Networks. http://proceedings.mlr.press/v97/du19c.html

[4] Zhang et al (2016). Understanding deep learning requires rethinking generalization. https://arxiv.org/abs/1611.03530

[5] Vapnik (1999). The nature of statistical learning theory.


I don't disagree, except perhaps with the lolcatz's demand for rigour. Improve with small and simple steps till you can't is not a bad idea, after all.

BTW your randomized algorithm, with a minor tweak, is surprisingly (unbelievably) effective -- randomize the weights of the hidden layers, then do gradient descent on just the final layer. Note the loss is even convex in the last-layer weights if a matching/canonical activation function is used. In fact you don't even have to try different random choices, though of course that would help. The random kitchen sinks line of results is a more recent heir to this line of work.

I suspect that you already know this, and that the noise in SGD does indeed regularize, and that the way it does so for convex functions has been well understood since the 70s, so I am leaving this tidbit for others who are new to this area.
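
For the curious, a rough sketch of that tidbit (toy data; the convex last-layer fit is done with closed-form least squares rather than gradient descent, just to keep it short):

    import numpy as np

    # Freeze random hidden weights, train only the final linear layer.
    # With squared loss the last-layer problem is convex (ordinary least squares).
    rng = np.random.default_rng(0)
    X = rng.uniform(-3, 3, size=(2000, 1))
    y = np.sin(X[:, 0]) + 0.1 * rng.normal(size=2000)   # toy target

    W = rng.normal(size=(1, 512))                       # random hidden weights, never trained
    b = rng.uniform(0, 2 * np.pi, size=512)             # random phases
    H = np.cos(X @ W + b)                               # random "kitchen sink" features

    w_out, *_ = np.linalg.lstsq(H, y, rcond=None)       # fit only the last layer
    print(np.mean((H @ w_out - y) ** 2))                # small training error on sin(x)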


Why are there so few local minima, you mean?

I think it’d have to be related to the huge number of dimensions it works on. But I have no idea how I’d even begin to prove that.


It's not even certain that there are few. What's rather unsettling is that, with these local moves, SGD settles on a good enough local minimum in spite of the fact that we know many local minima exist that have zero or near-zero training loss. There are glimmers of insight here and there, but the thing is yet to be fully understood.


> Honestly at this point it kind of is magic.

How much of that magic is smoke and mirrors? For example, the First Tech Challenge (from FIRST Robotics) used TensorFlow to train a model to detect the difference between a white sphere and a golden cube using a mobile phone's on-board camera.

The first time I saw it, it did seem pretty magical. Then in testing I realized it was basically a glorified color sensor.

I think these things make for great and astonishing demos but don't hold up to their promise. Happy to hear real-world examples that I can look into though.


Even if it were practically useless (which it is not, although the practical applications are less impressive than the research achievements at this point), it would be magical. Deep learning has dominated ImageNet for a decade now, for example. One reason this is magical is that the SOTA models are extremely overparametrized. There exist weights that perform perfectly on the training data but give random answers on the test data [0]. But in practice these degenerate weights are not found by SGD. What's going on there? As far as I know there is no satisfying explanation.

[0] https://arxiv.org/abs/1611.03530


If you look at these “degenerate” parameterizations, they’re clearly islands in the sea of weight parameter space. It’s clear that what you’re searching for is not a “minimum” per se but an amorphous, fuzzy, blobby manifold. Think of it like sculpting a specific 3D shape out of clay. Sure, there are exact moves to sculpt the shape, but if you’re just gently forming the clay you can get very close to the final form and still have some rough edges.

As for a formal analysis, I just can’t imagine there existing a formal analysis of ML that can describe the distinctly qualitative aspects of it. It’s like coming up with physics equations to explain art.


I mentored an FTC team that was using the vision system this year, and my overall impression was that the TensorFlow model was absolute garbage and probably performed worse than a simple "identify blobs by color" algorithm would have.

The vision model was tolerably decent at tracking incremental updates to object positioning, but for some reason would take 2+ seconds to notice that a valid object was now in view (which is quite a lot, in the context of a 30s autonomous period), and frequently identified the back walls of the game field as giant cubes.


There's a big difference between a glorified color sensor and a well-trained deep learning model (I can say this with authority because I hired an intern at Google to help build one of those detectors). It's still not magic, but a well-trained network is robust and generalizable in a way that a color sensor cannot be.


It depends on the angle from which people approach solving the problem. In my current field of cancer biology / drug response, people often don't know the features well enough, compared to normal everyday features such as natural images or natural text. In that setting, understanding of the feature space / biological systems is more important than understanding of the models themselves. The models are (if I may say, merely) a tool to search for and narrow down the factors. After that, scientists can design experiments to further interrogate the complex system itself. As the ML models grow bigger, the interrogative space also grows. Depending on the goal, it may not be necessary to have a fully interpretable model as long as the features themselves help advance the understanding of the complex biological system.


No, neural networks are stagnant on most key NLP tasks. While there have been some advances on cool tasks, the tasks needed for NLU are firmly wintered.


I think this is a common problem, and it comes about because we stressed how these models are not interpretable. It is kinda like talking about Schrödinger's cat. With a game of telephone, people think the cat is both alive and dead, and not that our models can't predict definite outcomes, only probabilities. Similarly with ML, people do not understand that "not interpretable" doesn't mean we can't know anything about the model's decision making, but that we can't know everything the model is choosing to do. Worse, I think a lot of ML folks themselves don't know a lot of stats and signal processing. Those just aren't things that are taught in undergrad, and frequently not in grad school either.


Along with that, it becomes remarkably more difficult to distinguish causation vs correlation, although I'm sure that point is heavily debated.


> difficult to distinguish causation vs correlation

I mean, this is an extremely difficult thing to disentangle in the first place. It is very common for people to recite in one breath that correlation does not equate to causation and then in the next breath propose causation. Cliches are cliches because people keep making the error. People really need to understand that developing causal graphs is really difficult, and that there is almost always more than one causal factor (a big sticking point for politics and the politicization of science, to me, is that people think there is one and only one causal factor).

Developing causal models is fucking hard. But there is work in that area in ML. It just isn't as "sexy" because the results aren't as good. The barrier to entry is A LOT higher than for other types of learning, which prevents a lot of people from pursuing this area. But still, it is a necessary condition if we're ever going to develop AGI. It's probably better to judge how close we are to AGI by causal learning than by something like DALL-E. But most people aren't aware of this because they aren't in the weeds.

I should also mention that causal learning doesn't necessitate that we can understand the causal relationships within our model, just within the data. So our model wouldn't be interpretable, although it could interpret the data and form causal DAGs.
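
A tiny toy example of why disentangling this is hard (made-up numbers; Z is a confounder that causes both X and Y, and X has no causal effect on Y at all):

    import numpy as np

    rng = np.random.default_rng(0)
    n = 100_000
    z = rng.normal(size=n)                    # hidden confounder
    x = z + 0.5 * rng.normal(size=n)          # Z -> X
    y = z + 0.5 * rng.normal(size=n)          # Z -> Y  (note: no X -> Y edge)
    print(np.corrcoef(x, y)[0, 1])            # ~0.8: strong observational correlation

    # Intervene: do(X) breaks the Z -> X edge, Y's mechanism is unchanged.
    x_do = rng.normal(size=n)
    y_do = z + 0.5 * rng.normal(size=n)
    print(np.corrcoef(x_do, y_do)[0, 1])      # ~0: the "effect" of X on Y vanishes

An observational model happily learns the 0.8 correlation; only the causal graph tells you that intervening on X does nothing to Y.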


>It is kinda like talking about Schrödinger's cat. With a game of telephone, people think the cat is both alive and dead, and not that our models can't predict definite outcomes, only probabilities.

That is literally the point of the thought experiment: https://en.wikipedia.org/wiki/Wave_function_collapse

It isn't just our models that can't explain it, there are real physical limits which mean that _no_ model can predict what state the cat is in.

The only reason why cats are a more outrageous example than electrons is that we see cats behave classically all the time.

The only vaguely plausible explanation for why cat states are impossible in general is that large quantum systems become spontaneously self-decoherent at large enough numbers of particles.


Yes, _no model can_ is an important part. But what I was focusing on is that the cat is either alive or dead in the box, not both. Just because we can't tell doesn't mean that's not true. Particles are observers, and the wave function is collapsed from the perspective of the cat, but not from our perspective, where we can't measure. But people misunderstand this as the cat behaving in a quantum state, which isn't happening. People have also assumed "observer" means "human", when particles themselves are observers. That is why the cat is actually in a classical state (either alive or dead, not both): within that box the particles are interacting. The confusion comes from the analogy being misinterpreted (the analogy assumes a lot of things that can't actually happen, because it is... an analogy, and to understand it you really need a lot of base knowledge about the assumptions being made).


> _no_ model can predict what state the cat is in.

Perhaps you meant to say "...state the cat will be in when observed"?

Otherwise, an important nitpick applies: superposition means that the system is not in any single state, so there's nothing to "predict" - it's a superposition of all possible states.

Prediction comes in when one asks what state will be observed when a measurement is made. As far as we know, that can only be answered probabilistically. So no model can specifically predict the outcome of a measurement, when multiple outcomes are possible.


I mean even without being observed by humans, the cat is in either the alive or dead state. It can't be in both. It is just that our mathematical models can't tell us with complete certainty which state that is. (People also seem to think humans are the only observers. Particles are observers too)


I think what is not mentioned nearly enough is the need for isolation to prevent decoherence of the cat. You need to make sure the box is its own universe, totally disconnected from the rest. Then I'd say it is a little more intuitive that parallel insides of the box might exist in superposition.


But it also can't be a real cat, because if it were a real cat then the cat itself would collapse the wave function. Literally any particle interaction does. Really what is important here is that, being on the outside and in a different reference frame (we're assuming we can't do any measurements of things inside the box; think information barrier), we can't obtain any definite prediction of the cat's state, only a probabilistic one. The information barrier is the important part here.


Maybe this is just my soft, theory-laden pure math brain talking, but I'd be a lot less impressed with machine learning if we had a decent formal understanding of these models. As is, they're way weirder than I think most engineering types give them credit for. But then again, that's how I feel about a lot of applied stuff; it all feels a little magic. I can read the papers, I can mess around with it, but somehow it's still surprising how well it can work.


Ultimately it comes down to gradient descent (which is pretty magical in its own right), but what's most surprising to me is that the loss landscape is actually organized enough to yield impressive results. Obviously the difficulties of training large NNs are well documented, but I'm surprised it's even that easy.


People will (and I'm sure do) use this magical thinking politically, persuading people to trust the computer and therefore, unwittingly, trust the persons who control the computer. That, to me, is the greatest threat - it is an obvious way to grab power, and most people I know don't even question it. It's a major consequence of mass public surveillance.


Bureaucracies would love for a blackbox to delegate all of their decisions and responsibilities to in an effort to shift liability away from themselves.

You can't be liable for anything, you were just doing what the computer told you to do, and computers aren't fallible like people are.


NNs are just glorified logistic regression. People should simply understand that neural networks cannot emulate a dumb calculator accurately; this simple fact is enough to realize that being a universal approximator is in practice a fallacy, and that true causal NLU or AGI is essentially out of reach of neural networks, by design. Only a brain-faithful architecture would have hope; however, C. elegans reverse engineering is underfunded and spiking neural networks are untrainable.


> NNs are just glorified logistic regression.

2015 called, they want you back! Now seriously, "just" does an amazing amount of work for you. How do you "just" make logistic regression write articles on politics, convert queries into SQL statements, or draw a daikon radish in a tutu?

Humans are "just" chemistry and electricity, and the whole universe just a few types of forces and particles. But that doesn't explain our complexity at all.


Neural networks do achieve impressive things, but they also have essential failures that preclude them from any AGI or causal NLU ambition, such as the inability to approximate a dumb calculator without significant accuracy loss.


It's a model mismatch, not an inherent impossibility. A calculator needs an adaptive number of intermediate steps. Usually our models have a fixed depth, but in auto-regressive modelling the tape can become longer as needed by the stepwise algorithm. Recent models show LMs can do arithmetic, symbolic math and common-sense reasoning with step-by-step chains of thought and reach much higher accuracies.

In other words, we too can't do three digit multiplication in our heads reliably, but can do it much better on paper, step by step. The problem you were mentioning is caused by the bad approach - LMs need intermediate reasoning steps to get from problem to solution, like us. We just need to ask them to produce the whole reasoning chain.

- Chain of Thought Prompting Elicits Reasoning in Large Language Models https://arxiv.org/abs/2201.11903

- Deep Learning for Symbolic Mathematics https://arxiv.org/abs/1912.01412
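
Very roughly, the trick looks something like this (the prompt format is made up here for illustration, not taken from the papers):

    # Instead of asking for "A: <answer>" in one shot, let the model write out
    # intermediate products, so each generated token only does a small local step.
    few_shot = "\n".join([
        "Q: 47 * 23",
        "Work: 47 * 20 = 940; 47 * 3 = 141; 940 + 141 = 1081",
        "A: 1081",
        "",
        "Q: 86 * 59",
        "Work: 86 * 50 = 4300; 86 * 9 = 774; 4300 + 774 = 5074",
        "A: 5074",
        "",
        "Q: 34 * 76",
        "Work:",
    ])
    print(few_shot)
    # Feed this to whatever LM API you use and read the continuation up to the
    # next blank line; with the worked examples the model fills in the partial
    # products before committing to an answer (34 * 76 = 2584), instead of
    # having to guess it in a single step.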


I can’t approximate a dumb calculator without significant accuracy loss. Not without emulating symbolic computation, which current AI is perfectly capable of doing if you ask it the right way.

Whatever makes you think it’s necessary for AGI, when we don’t have it?


NNs fail at anything algorithmic, like pathfinding, sorting, etc. The point is not that you have it; it's that you can acquire it by learning and by using pen and paper. Natural language understanding requires both neural-network-like pattern recognition abilities and advanced algorithmic calculation. Since neural networks are pathetically bad at algorithmics, we need neuro-symbolic software. However, the symbolic part is rigid and program synthesis is exponential. Therefore the brain is the only technology on earth able to dynamically code algorithmic solutions. Neural networks have only solved a subset of the class of automatable programs.


There are about 3,610 results for "neural network pathfinding" in Google Scholar since 2021. Try a search.


And as you can trivially see, it is outputting nonsense values: https://www.lovebirb.com/Projects/ANN-Pathfinder?pgid=kqe249... (see the last slide), at least in this implementation.

Even if it had 80% accuracy (optimistic), it would still be too mediocre to be used at any serious scale.


Using gradient-based techniques does a LOT to force neural network weights to resemble surfaces that they do not at all look like when using global optimization and gradient-free techniques to optimize them.

Most of the stupid crap that people cite about degenerate cases where deep learning doesn't work (CartPole in reinforcement learning, sine / infinite unbounded functions) showcases how bad gradient-based training is, not how bad deep learning is at solving these problems. I can solve CartPole with neural networks within seconds using neuroevolution of the weights.
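
For anyone who wants to try it, a crude sketch of the idea (a (1+1) hill climber on a tiny linear policy rather than anything fancy; assumes the gymnasium package, whose reset/step API differs from older gym):

    import numpy as np
    import gymnasium as gym

    def episode_return(env, W, seed=0):
        obs, _ = env.reset(seed=seed)
        total, done = 0.0, False
        while not done:
            action = int(np.argmax(W @ obs))              # tiny linear "network"
            obs, r, terminated, truncated, _ = env.step(action)
            total += r
            done = terminated or truncated
        return total

    env = gym.make("CartPole-v1")
    rng = np.random.default_rng(0)
    W = 0.1 * rng.normal(size=(2, 4))
    best = episode_return(env, W)
    for gen in range(200):
        cand = W + 0.1 * rng.normal(size=W.shape)         # mutate the weights
        score = episode_return(env, cand)
        if score >= best:                                 # keep the better genome
            W, best = cand, score
        if best >= 500:                                   # v1 episodes cap at 500 steps
            break
    print("best return:", best)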


Do you mean that a network trained to imitate a calculator won’t do so accurately or that there is no combination of weights which would produce the behaviors of a calculator?

Because, with ReLU activation, I'm fairly confident that the latter, at least, is possible.

(Where inputs are given using digits (where each digit could be represented with one floating point input), and the output is also represented with digits)

Like, you can implement a lookup table with neural net architecture. That’s not an issue.

And composing a lookup table with itself a number of times lets one do addition, etc.

... ok, I suppose for multiplication you would have to, like, use more working space than what would effectively be a convolution, and one might complain that this extra structure of the network is "what is really doing the work", but I don't think it is more complicated than existing NN architectures?
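
To make the lookup-table point concrete, here is a sketch of a hand-constructed (not trained) ReLU net that adds two decimal digits exactly; the sizes and names are just for illustration, and chaining it across positions with the carry is the "composing the table with itself" step:

    import numpy as np

    def relu(v):
        return np.maximum(v, 0.0)

    # One-hot inputs: 10 dims for digit a, 10 for digit b. Hidden unit k = 10a + b
    # fires (value 1) iff the inputs are exactly (a, b); all others stay at 0.
    W1 = np.zeros((100, 20)); b1 = -np.ones(100)
    W2 = np.zeros((11, 100))              # rows 0-9: one-hot sum digit, row 10: carry
    for a in range(10):
        for b in range(10):
            k = 10 * a + b
            W1[k, a] = 1.0
            W1[k, 10 + b] = 1.0
            W2[(a + b) % 10, k] = 1.0
            W2[10, k] = (a + b) // 10

    def add_digits(a, b):
        x = np.zeros(20); x[a] = 1.0; x[10 + b] = 1.0
        out = W2 @ relu(W1 @ x + b1)
        return int(out[:10].argmax()), int(round(out[10]))   # (sum digit, carry)

    print(add_digits(7, 8))   # (5, 1), i.e. 7 + 8 = 15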


I am talking about training a neural network to achieve calculations. And yes, look-up tables might be fit for addition but not for multiplication. The accuracy would be <90%, which is a joke for any serious use.


Well, the main issue I see is where to put the n^2 memory (where n is the number of digits) when doing multiplication. (Or, doesn’t need n^2 space, could do it in less, but might need to put more structure into the architecture?)

If the weights are designed, and the network architecture allows something to hold the information needed, then there is really no obstacle to having it get multiplication entirely (not just 90%).

Now, would that be learnable? I’m not so sure, at least with the architecture one would use if designing the weights.

But,

I see no reason a transformer model couldn’t be trained on multiplication-with-work-shown and produce text fitting all of those patterns, and successfully perform multiplication for many digits that way.

And, by “showing all work” I don’t necessarily mean “in a way a person would typically show their work”, but in an easier-for-machine way.


>cannot emulate a dumb calculator accurately

Neither can people, for the most part.


Knock knock. Some critic from the 70s has arrived. How's GOFAI going?


Oh yes, it's not GOFAI that won the ARC challenge, it's neural networks, right? Right? https://www.kaggle.com/c/abstraction-and-reasoning-challenge

I have more expertise in deep learning than anyone else here and the delusions of the incoming transformer winter will be painful to watch. In the meantime, enjoy your echo chamber.


Are you Schmidhuber's alt?


> I have more expertise in deep learning than anyone else here

I... I guess it's possible?


> I have more expertise in deep learning than anyone else here

No, you don't. Looking at your experience, there is simply no way that you are the foremost expert in DL on HN.


Haha what do you know of my experience? Here is a glimpse of my unique pedigree https://www.metaculus.com/notebooks/10677/substance-is-all-y...


I have no idea what you think that proves.

What I know of your experience shows a low number of years of experience, a lack of papers, and a lack of true hands-on experience at the small number of companies in the world that have the resources to truly investigate large models. How can you know so much about LLMs without ever having the resources to train one?

I'm obviously not going to dox you, so you can easily just dismiss what I'm saying. But even just reading through your HN comments shows arrogance in your own knowledge (across multiple domains).

A specifically memorable quote is:

> I frequently create unique on the internet [words]

This is very true. Your erudition is apparently only matched by the uniqueness of the words you use when on the internet.


> the delusions of the incoming transformer winter will be painful to watch

Meaning?


Meaning that HN in ten years will mock current HN


> ML et al. are NOT MAGIC but they are treated as if they were.

They're not magic - nothing is, but what are they?

> but it also has fundamental limits that no one talks about seriously

What are these fundamental limits? 20 years ago I imagine skeptics in your camp would have set these "fundamental" limits at lower than DALL-E 2, GPT-3, AlphaStar etc. Or are you talking about limits today? In which case, sure, but I think "fundamental" is the wrong word to use there given they change continuously.

> It's mathematically related to all prior signal processing techniques (mostly a proper superset)

And human brains are what if not signal processing machines?


> They're not magic - nothing is, but what are they?

Emergent magic.


When you have a complex system that produces nth-order effects, the only approach is to treat it as an empirical phenomenon (aka black-box magic), and that is what most research papers in this field do.


In the 80s and 90s it was really common to anthropomorphize spaghetti code.

Just because something is difficult to analyze doesn't mean it has limitless power.


In my wishful thinking, by far the best way to do that would be for the courts to stick companies with full legal liability for the shortcomings of their "machine learning" systems. And if it's fairly easy to demonstrate that GiantCo's ML decision-making system is sexist, racist, ageist... then GiantCo is not just guilty, but also presumed to have known in advance that they were systematically and deliberately on the wrong side of the law.


> And that is in itself a dangerous moral and ethical lapse.

Agreed. Is it an original lapse, or a derivative one, though? When researchers/engineers oversell their story to get the funding they wouldn't otherwise get, where is the lapse? With the engineers/researchers? Or with the forces that built a system where that was the only way forward for them?

When a hungry thief steals to eat, is the thief morally bankrupt? Or are those who engineered the shortage?


Is it the ones who engineered the shortage, or the ones who engineered the system in which the ones who engineered the shortage operated when they were designing the other system?


Supplant "magic" with "not understood"

Suddenly it all becomes a lot more palatable that many don't know how it works.


> And that is in itself a dangerous moral and ethical lapse.

More importantly, it can be a dangerous business lapse.



