Frank Rosenblatt's perceptron paved the way for AI 60 years too soon (2019) (cornell.edu)
184 points by ilamont on Feb 6, 2022 | 56 comments



The answers here may be of interest: https://ai.stackexchange.com/questions/1288/did-minsky-and-p...

The unfortunate thing was not the Perceptrons book but the fact that Rosenblatt died prematurely soon after. He was very well-equipped to defend and carry on work on NNs, I think.



I was reading through the references in https://blogs.umass.edu/comphon/2017/06/15/did-frank-rosenbl... linked in another comment, specifically https://www.gwern.net/docs/ai/1993-olazaran.pdf https://www.gwern.net/docs/ai/1996-olazaran.pdf by a Mikel Olazaran who spent a while interviewing what looks to be almost all the surviving connectionists & Minsky etc.

Olazaran argues that all the connectionists were perfectly aware of the _Perceptrons_ headline conclusion about single-layer perceptrons being hopelessly linear, drafts of which had been circulating for something like 4 years beforehand, and most regarded it as unimportant (pointing out that humans can't solve the parity of a grid of dots either without painfully counting them out one by one) and as having an obvious solution (multiple layers) that they all, Rosenblatt especially, had put a lot of work into trying. The problem was, none of the multi-layer things worked, and people had run out of ideas. So most of the connectionist researchers got sucked away by things that were working at the time (eg the Stanford group was having huge success with adaptive antennas & telephone filters which accidentally came out of their NN work), and funding dried up (both for exogenous political reasons related to military R&D being cut, and just for the lack of results compared to alternative research programs like the symbolic approaches, which were enjoying their initial flush of success in theorem proving and checkers etc). So when, years later, _Perceptrons_ came out with all of its i's dotted and t's crossed, it didn't "kill connectionism" because that had already died. What _Perceptrons_ really did was serve as a kind of excuse or Schelling point to make the death 'official' and cement the dominance of the symbolic approaches. Rosenblatt never gave up, but he had already been left high and dry with no more funding and no research community.

Olazaran directly asks several of them whether more funding or work would have helped, and it seems everyone agrees that it would've been useless. The computers just weren't there in the '60s. (One notes that it might have worked in the '70s if anyone had paid attention to the invention of backpropagation, pointing out that Rumelhart et al doing the PDP studies on backprop were using the equivalent of PCs for those studies in the late '80s, so if you were patient you could've done them on minicomputers/mainframes in the '70s. But not the '60s.)


The programmers at stackoverflow must be amazing if they fix it with stackoverflow offline


I think it often goes understated just how much the emergence of massive datasets led to the successes of neural networks.

I once heard Daphne Koller say that before big data, neural networks were always the second best way to do anything.


Agreed. There’s a question of ownership in that too. It’s frequently brought up in regards to Copilot. If companies had had to pay to license all the photos or code or writing they train on, a lot of these datasets wouldn’t exist, and then neither would the models.

Which is why I wish there were a copyleft analogue for open data: if you train on everyone's public data, your model should have to be just as publicly available.


I hereby request that you submit your brain for... public availability.


I don't understand why these models are not "derived works" under the GPL family of licences.


Licenses don't get to define what counts as a derived work that needs a license from the author; that's defined solely by copyright law. And it's a tricky issue, because the definitions used in copyright law weren't intended for a scenario like this and are rather ambiguous (at least in the US), and IMHO there isn't any reasonable case law to inform which plausible interpretations would be valid and which would not.

It's not clear whether a model trained on some data is copyrightable at all (and if it isn't, it can't be a derived work and gets no protection nor restrictions from copyright law). In general, facts about a work - including things like word-frequency statistics, which were a popular type of trained language model not that long ago (e.g. the n-gram models used in statistical MT systems) - are not copyrightable, and in that case making or distributing them is not an exclusive right of the author and needs no license or permission; that was settled long ago between publishers and e.g. dictionary makers.

There is also the notion that mechanistic transformations or difficult labor can't result in a copyrightable work; there needs to be human creativity involved. In US copyright law (as in Feist Publications, Inc. v. Rural Telephone Service Co.; also see https://www.gutenberg.org/help/no_sweat_copyright.html), the fact that making some work required lots of labor and cost you lots of money does not imply that it deserves copyright protection, no matter how much work was required. So it's irrelevant that someone spent millions of dollars' worth of GPU time, and it could certainly be argued (case law would be useful here!) that the software used to train the model is copyrightable, but the model output by that software is not.

Perhaps case law or some new explicit law will settle otherwise, but currently all the research and industry is proceeding with the assumption that a trained model is not a derived work (in the copyright law sense) from the training data, and as far as I see this assumption is not being challenged in courts.


Especially when you consider that on the one hand some companies forbid their employees to even read GPL source to mitigate the legal risk of accidental copying, and on the other hand there's the common practice of clean-room implementations for more or less the same reason. If brains can be tainted by reading code, how does that not apply to Copilot?


> some companies forbid their employees to even read GPL source

Is this really common enough it's even worth mentioning?


LeCun said that on this recent podcast too: https://www.listennotes.com/podcasts/machine-learning-street...

Before the ImageNet dataset, CNNs weren't worth it.


What was the first best way?


SVMs and Random Forests.


> successes of neural networks

Too early to say that with a straight face yet.

"Deep learning" produced some amazing generative art, and some very flashy research papers, but it failed to drive business decisions. (And not for the lack of trying, that's for sure.)


What about machine translation, speech recognition and transcription, speech synthesis, image classification, segmentation and other computer vision tasks, facial recognition, image and audio denoising, beating world champions at Go along with chess engines, language models for text generation, image editing (e.g. FaceApp)...

You don’t have to fall for the hype and think deep learning will lead us to AGI in the next 5 years, but dismissing it as generative art and flashy papers isn’t any more accurate.


Did you read my post? None of that drives business decisions.

As for Alexa, et al - these things fall into the "generative art" bucket. The search results they give aren't any better than the Eliza-tier expert systems of yore. They just feel much better and more human when you use them.

(Which is also important, but doesn't drive business decisions, except as part of a marketing strategy.)


What do you mean by doesn’t drive business decisions? ~65 million Alexas were sold last year, and the product would be fairly useless without very high performance speech recognition powered by deep learning.

I personally own a startup that uses deep neural networks as part of its core product functionality and our business decisions would be drastically different if it weren’t for modern machine learning. We could technically ship some similar products, but they would either be far inferior or take orders of magnitude longer to engineer by hand.


Most business requires data analysis and planning. Under the hood, regression models and/or some sort of component analysis are used.

Deep learning is a cool feature to differentiate Alexa from competing home electronic gadgets.

But if you want to forecast Alexa sales, or understand market segmentation, or the portrait of a typical Alexa buyer then you need something other than deep learning because neural nets utterly failed in this domain.

The business metrics problem is vastly, vastly more important than the "making cool gadgets" problem, and huge resources were poured into making deep learning a thing in this space. The money was mostly wasted. (A negative result is still a result, but still the misallocation of resources is staggering.)


I get that it didn’t work in your particular field and that obviously that’s what matters the most to you, but I would suggest reading up on its wider applications before dismissing the whole technology. Entire classes of products suddenly being possible is not just a “cool feature”.

FYI, deep learning isn’t a differentiating feature for the Alexa, it powers essentially all modern voice applications.


As the other commenter mentions, neural network-driven speech recognition represents the current state of the art and powers products like Alexa and Google Home.

https://ai.googleblog.com/2020/11/improving-on-device-speech...


How can an article like this talk about the significance of a single layer perceptron and not talk about statistics’s contributions, like regression models? Binary classification with a single layer perceptron paved the way, but logistic regression isn’t worth mentioning?


That's also weird to me. Both Probit (1934) and Logistic regression (1943) models, which are also linear classifiers, precede the Perceptron model (1958) by many years.


The story of Perceptrons---both the idea and the book of that title---is instructive about how science proceeds in practice. The folklore is that this book killed neural net research, but if you read it you'll find it's not as damning as you might expect. Apparently it circulated widely as a manuscript before publication, and the circulating manuscript was much more negative in tone, and this is what shaped people's perceptions.


Multilayer neural networks weren’t really a viable tool until the backpropagation algorithm for determining internal parameters was developed in 1985 (cf https://apps.dtic.mil/sti/pdfs/ADA164453.pdf )


I think you'll find that backpropagation was essentially developed (multiple times) during the 60s in the field of control theory and first implemented in the early 70s.

Ed: To be clear, the idea to use them to adapt the weights of NNs was also from the 70s but only rediscovered and applied to MLPs by at least two independent groups/individuals in the 80s.


> I think you'll find that backpropagation was essentially developed (multiple times) during the 60s in the field of control theory and first implemented in the early 70s.

Indeed. The book Talking Nets: An Oral History of Neural Networks[1] covers a lot of this ground. Read it and you'll see many people who were involved in the early history of NNs mentioning how backprop was discovered and re-discovered over and over again.

[1]: https://www.amazon.com/Talking-Nets-History-Neural-Networks/...


What really made the difference was non-linear activations.

Without a non-linearity depth doesn’t buy you anything.


Truth! Without nonlinearity, each layer is essentially just a matrix multiplication. You could just as well use a single equivalent layer that represents the product of the layers' matrices. Nonlinearity lets a single node partition the input space into two (fuzzy) equivalence classes. That's powerful stuff.
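A minimal NumPy sketch of that point (my own illustration, not from the thread): composing two linear layers is exactly one matrix multiplication, and inserting a nonlinearity between them is what breaks the equivalence.

    # Two linear layers with no activation collapse to a single matrix product
    import numpy as np

    rng = np.random.default_rng(0)
    x = rng.normal(size=(5, 4))      # batch of 5 inputs, 4 features each
    W1 = rng.normal(size=(4, 8))     # first "layer"
    W2 = rng.normal(size=(8, 3))     # second "layer"

    two_layers = x @ W1 @ W2         # no activation in between
    one_layer = x @ (W1 @ W2)        # the single equivalent layer
    print(np.allclose(two_layers, one_layer))   # True

    # With a nonlinearity between the layers the collapse no longer holds
    relu = lambda z: np.maximum(z, 0.0)
    print(np.allclose(relu(x @ W1) @ W2, one_layer))   # False in general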


Interesting counterpoint: an NN is not just activation. It's also learning, and multilayer stacks learn differently depending on depth even if they are entirely linear. This seems to be an artefact of the optimizer in use (SGD, for example). The jury still seems to be out on why this is; the issue has been known for a long time, and I recently saw a short paper Yann LeCun posted about exactly this as well: they stacked a bunch of completely linear layers on top of a "normal" DL stack and got differences in the final system, even though you can collapse all the linear layers at any time into a single linear mixing layer.


It took us long enough to settle on ReLu. Step function activations were brutal.


I seem to recall that before ReLU it was usually tanh.


Tanh belongs to a class of functions called "smooth steps" which I guess was being abbreviated as "step." Obviously the derivative of a step is zero everywhere it's defined so backprop wouldn't work.


An interesting thing is that ReLU (which is absolutely the most common activation function) has a zero derivative over half the space it's defined on, and backprop still works (most of the time :). There are leaky ReLUs to give some slope to the negative side, but it seems that in most normal deep learning networks this isn't required; there are always some neurons that aren't in the off regime to "catch the error".


One of a whole class of functions called sigmoidal, usually the sigmoid or the tanh function. These activation functions dominated the field for a very long time.
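For reference, a sketch of the activations discussed in this subthread (my own NumPy illustration; their derivatives are why step functions were "brutal" and why leaky ReLU exists):

    # Common activation functions (NumPy; illustrative only)
    import numpy as np

    def step(z):                     # Heaviside step: derivative is 0 wherever defined
        return (z > 0).astype(float)

    def sigmoid(z):                  # smooth "step"; gradient vanishes for large |z|
        return 1.0 / (1.0 + np.exp(-z))

    def tanh(z):                     # like sigmoid but zero-centered
        return np.tanh(z)

    def relu(z):                     # zero gradient for z < 0, slope 1 for z > 0
        return np.maximum(z, 0.0)

    def leaky_relu(z, alpha=0.01):   # small slope on the negative side
        return np.where(z > 0, z, alpha * z)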


Yea -- I was taught about this in my first year Cybernetics course in the 1990s; the idea that there was an AI crisis is overcooked but it definitely tipped the entire industry towards expert systems.


It's worth noting that the Perceptron (at least early on) was a system of circuits, not a program as we'd think of it now.

Here's a 1963 paper, "System and Circuit Designs for The Tobermory Perceptron", which includes circuit diagrams. It would be intriguing to build one, a bit like building a working Babbage machine.

https://blogs.umass.edu/brain-wars/files/2016/03/nagy-1963-t...


Great find!

All the PDFs this blog has posted (use the filter): http://web.archive.org/web/*/blogs.umass.edu/brain-wars/*


The Perceptron is also interesting because its classic mistake bound analysis is one of the earliest instances of modern learning theory. The mistake bound states that if the data is linearly separable (i.e. a perfect linear classifier exists), then Perceptron makes no more than (r / m)² misclassifications, where r is the radius of a ball containing the data set and m is the margin between the two classes.

The analysis is modern because 1) unlike traditional statistics, it does not make any assumptions about the distributions of each class, and 2) it is a non-asymptotic guarantee. It is also an online analysis, in the sense that it only compares performance against the optimal fixed classifier for the sequence that was actually observed, rather than against any notion of an underlying distribution.

More details: http://www.argmin.net/2021/11/04/perceptron/
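For concreteness, here's a minimal sketch of the update rule that bound applies to (my own NumPy illustration, not taken from the linked post): on every mistake the weight vector is nudged toward the misclassified example, and on linearly separable data the (r / m)² bound caps how many such updates can ever happen.

    # Minimal perceptron sketch (NumPy; illustrative only, labels assumed to be +/-1)
    import numpy as np

    def perceptron(X, y, epochs=100):
        w = np.zeros(X.shape[1])
        b = 0.0
        mistakes = 0
        for _ in range(epochs):
            updated = False
            for xi, yi in zip(X, y):
                if yi * (np.dot(w, xi) + b) <= 0:   # misclassified (or on the boundary)
                    w += yi * xi                    # nudge the weights toward the example
                    b += yi
                    mistakes += 1
                    updated = True
            if not updated:                         # no mistakes in a full pass: done
                break
        return w, b, mistakes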


I remember taking a class at UCSD in 1971 or 1972 that touched on Perceptrons (among a lot of other things, it was basically a survey course). I wonder sometimes what the world would be like today if they had realized the importance of hidden layers back then.


Data and compute power in the 70s probably would've limited the usefulness and led people to try other things (perhaps not even "probably" - that could be what really happened), though maybe it could've been revisited with success by the late 90s.

Makes you wonder if there are other abandoned techniques that might be worth circling back to nowadays...


A former colleague of mine, Terry Koken, who worked with Rosenblatt at Cornell for a while and is quoted in the article, dropped a comment last year saying pretty much that:

https://blogs.umass.edu/comphon/2017/06/15/did-frank-rosenbl...


k-NN methods always love more data and can be embarrassingly competitive with far fancier algorithms. Should always be considered as a classification baseline:

https://en.wikipedia.org/wiki/K-nearest_neighbors_algorithm
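As a rough illustration of how little code such a baseline takes (scikit-learn; the dataset and k here are arbitrary choices of mine):

    # k-NN as a quick classification baseline (scikit-learn; illustrative defaults)
    from sklearn.datasets import load_digits
    from sklearn.model_selection import train_test_split
    from sklearn.neighbors import KNeighborsClassifier

    X, y = load_digits(return_X_y=True)
    X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

    knn = KNeighborsClassifier(n_neighbors=5)    # k is a tunable choice
    knn.fit(X_train, y_train)
    print("baseline accuracy:", knn.score(X_test, y_test))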


Empirical methods/experimentation have classically taken a back seat to mathematical proofs in computer science.


Perceptron: https://en.wikipedia.org/wiki/Perceptron #History, #Variants

An MLP: Multilayer Perceptron can learn XOR and is considered a feed-forward ANN: Artificial Neural Network: https://en.wikipedia.org/wiki/Multilayer_perceptron

Backpropagation: 1960-, 1986, https://en.wikipedia.org/wiki/Backpropagation
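A tiny sketch of the classic demonstration those links describe (my own NumPy code, hyperparameters arbitrary): a one-hidden-layer MLP trained with backpropagation can learn XOR, which a single-layer perceptron cannot.

    # One-hidden-layer MLP learning XOR via backpropagation (NumPy sketch)
    import numpy as np

    rng = np.random.default_rng(0)
    X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
    y = np.array([[0], [1], [1], [0]], dtype=float)       # XOR targets

    sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))

    W1 = rng.normal(size=(2, 4)); b1 = np.zeros(4)        # hidden layer
    W2 = rng.normal(size=(4, 1)); b2 = np.zeros(1)        # output layer
    lr = 1.0

    for _ in range(5000):
        h = sigmoid(X @ W1 + b1)                          # forward pass
        out = sigmoid(h @ W2 + b2)
        d_out = (out - y) * out * (1 - out)               # backprop, squared-error loss
        d_h = (d_out @ W2.T) * h * (1 - h)
        W2 -= lr * h.T @ d_out; b2 -= lr * d_out.sum(axis=0)
        W1 -= lr * X.T @ d_h;   b1 -= lr * d_h.sum(axis=0)

    print(np.round(out.ravel(), 2))   # should approach [0, 1, 1, 0]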


I was just reading the 1986 book Parallel Distributed Processing (https://mitpress.mit.edu/books/parallel-distributed-processi...) and was quite surprised by the relevance of their arguments about how learning systems should be built and structured, which map quite accurately onto how we actually developed them in the following decades, once we got more computing power, more data and engineering know-how.


The fastai book actually makes a nice comparison between the systems described in PDP and modern deep learning.

> In fact, the approach laid out in PDP is very similar to the approach used in today's neural networks.

From: https://github.com/fastai/fastbook/blob/master/01_intro.ipyn...


Based on Wikipedia, it appears the original definition of "perceptron" was associated with the specific case of a neuron whose activation is a Heaviside step function (which maps the aggregated weighted sum to either 0 or 1). In my writing I've generally referred to neural-network nodes as "neurons", which I believe is a catch-all for any type of activation output, although I've seen others use the term "perceptron" in the same way in modern practice.
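A short sketch of the distinction as I read it (my own code and naming, purely illustrative): the original perceptron unit applies a Heaviside step to the weighted sum, while "neuron" covers a unit with any activation.

    # A single unit with a pluggable activation (NumPy; names are my own)
    import numpy as np

    def heaviside(z):
        return np.where(z >= 0, 1.0, 0.0)   # classic perceptron output: 0 or 1

    def unit(x, w, b, activation=heaviside):
        return activation(np.dot(w, x) + b)

    x = np.array([1.0, 0.5])
    w = np.array([0.4, -0.6])
    print(unit(x, w, b=0.1))                        # perceptron-style step output
    print(unit(x, w, b=0.1, activation=np.tanh))    # same unit with a tanh activation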


Found a short video of the hardware Mark 1 Perceptron machine in action classifying photos. Fascinating

https://youtu.be/cNxadbrN_aI


Well, NNs aren't even remotely similar to a real biological brain. The theory is too simplistic for something as complex as even a single neuron cell.


“Digital logic design isn't even remotely similar to real electrical circuits. The theory is too simplistic for something as complex as even a single CMOS transistor.”

The problem with your argument is that we know too little about higher-level brain operation to make any confident comparisons like that.


How about Numenta's work?


Numenta probably has got some things "right", some things wrong and some things missing. The problem with finding the neural code for training a network is that there are so many functional ways to encode data within a network.


Reverse engineering biology is like studying the Terminator brain chip from Terminator 2.

Regular expressions, perceptron, neocognitron, genetic algorithms, deep learning...


Are you saying regular expressions were inspired by biology? How?




