I would much rather respond to students who ask why they need to know the details by saying "because you registered for this class."
Essentially the question "if you're never going to write backward passes once the class is over, why practice writing them?" is another way of asking "why reinvent the wheel?". And the best answer is "because learning is all about reinventing the wheel". The "reuse code as much as possible" mantra applies when you're "using" a technique to do something else, not when you're "learning" the technique itself. They might as well register for Calculus and ask "why learn integrals and derivatives when Mathematica can do them for you", or take an Aerodynamics class and ask "why learn fluid mechanics and dynamics, heck, Newton's laws, when an airplane can run on autopilot?" I doubt "because calculus is a leaky abstraction" or "because fluid dynamics is a leaky abstraction" is a good answer to that.
If every major language provides an O(n log n) sort function, is it still a leaky abstraction? I'd say no. You can use it without worrying much about the details.
But it sounds like the situation with back-propagation is different, since the internal details of the algorithm affect whether you get a usable answer at all.
A borderline case might be something like a SQL database, where in theory, creating an index shouldn't change query results, but in practice, the performance can change so much that it effectively does.
Unlike with back-propagation, you can tune database performance without worrying about queries returning different results (assuming they don't time out). So it's still a useful guarantee.
There are a lot of details to get right: how are elements compared? Is the sort stable? Is it efficient for small N? Is it efficient for nearly-sorted arrays? Is it efficient when almost all the elements compare equal? Is it guaranteed O(n log n) or average? If average, is there an input that reliably triggers n^2 behavior making it a DDoS vector?
Anything is a leaky abstraction when you care enough.
If the box is only labelled with O(n log n), without specifying constants, then there is a leak.
In much the same way, a lot of the interesting work in machine learning requires deep specialist knowledge, where selecting the best approach means digging beneath the abstraction. Many specific applications have their own quirks, which is where significant gains are often made. (To be honest, I think the last year was more exciting for advances in applied ML than for theoretical work.)
A co-worker wrote a library for distributed training of DNNs for a specialized use case. Because of certain quirks of our use case (think sparsity and consistent patterns in the data), there were certain optimizations he could make to the training process that gave nice linear scaling in the number of training nodes.
*incompetent - those who think they know something and are not willing to learn anything new
And then worst case? Best case? Average case? Which applies to random input? Nearly-sorted input? What's the overhead for a "short sort", can/should I use this to sort small sequences in a tight-ish loop or is it only for large sequences?
I was under the illusion that Big-O was always asymptotic complexity (i.e. worst case) and that other notations (little-o, big-omega, big-theta, etc.) were used for best/average/etc. case. Perhaps I'm wrong, however.
Which applies to random input? Nearly-sorted input? What's the overhead for a "short sort", can/should I use this to sort small sequences in a tight-ish loop or is it only for large sequences?
Indeed. The details matter.
My favourite example is how a naive linear search can perform better than a non-linear search with a better (in O() terms) algorithm if, for example, the linear search can do most of its work in cache (eg small input sizes or otherwise regular access patterns (predictable for prefetch)).
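A sketch of the two access patterns (the actual crossover point depends entirely on hardware and input size, so this shows the shapes of the algorithms rather than timings; the function names are mine):

```python
from bisect import bisect_left

def linear_search(a, x):
    # O(n), but a tight sequential scan over contiguous memory:
    # cache- and branch-predictor-friendly, trivially prefetchable.
    for i, v in enumerate(a):
        if v == x:
            return i
    return -1

def binary_search(a, x):
    # O(log n) comparisons, but the probes jump around memory; on tiny
    # inputs the constant factors can easily dominate the asymptotic win.
    i = bisect_left(a, x)
    return i if i < len(a) and a[i] == x else -1

a = list(range(0, 64, 2))
assert linear_search(a, 40) == binary_search(a, 40) == 20
```

Both return the same answer; which one is faster on a 32-element array is exactly the kind of question the O() label on the box doesn't answer.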
No, asymptotic notation only refers to the behaviour of the function (or algorithm) as you take the limit N->inf, e.g. changes to the size of the input. But the nature of the input often changes the behavior as well.
I was responding to a comment that I felt was needlessly dismissive. The default sort algorithm of any major language's standard library is usually not worth trying to beat, and calling someone "incompetent" for not reinventing the wheel each time they encounter data that needs to be sorted is absurd. Note: I said "usually", not "never". A competent programmer doesn't fix problems that don't exist, and you would absolutely have to profile code before convincing me that the choice of sorting algorithm was the source of a performance problem.
Great example :) You get different results if you use merge sort or quicksort.
The former is stable; the latter is not.
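To make the distinction concrete, here's a small Python sketch (Python's built-in sort happens to be a stable Timsort, so it can stand in for the merge-sort side; the "unstable" result is simulated with a random tiebreak):

```python
import random

# Ties: two pairs of records share the same sort key.
records = [("apple", 2), ("banana", 1), ("cherry", 2), ("date", 1)]

# Stable sort by the numeric key: tied records keep their original order.
stable = sorted(records, key=lambda r: r[1])
print(stable)  # [('banana', 1), ('date', 1), ('apple', 2), ('cherry', 2)]

# An unstable sort is allowed to return ties in any order; a random
# tiebreak simulates one result such a sort could legally produce.
unstable = sorted(records, key=lambda r: (r[1], random.random()))
```

If any downstream code depends on the order of the tied records, the choice of algorithm is observable behavior, not a hidden implementation detail.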
Okay...so you can be more specific with your API and specify whether stability is guaranteed. Eventually you can be specific enough that you essentially characterize the implementation.
Yes: if the performance is still not acceptable and you look into the problem, you might discover that your scenario could benefit from radix sort, or partial sorting, or some kind of incremental sorting, all of which would require investigating the specific case at a "white box" level, ignoring the existence of a black-box O(n log n) sort API call.
I invite you to start sorting data which is often already sorted (or often all identical) and tell me you didn't need to worry about the details.
And neural networks using backpropagation to minimize an error function are definitely the former while the "go to" techniques of mathematics are generally the latter.
I think this is generally acknowledged, in fact.
Sooo it seems logical that students of these modern machine learning techniques should be impelled to "get their hands dirty" with the numerical processes involved, since this will add to the data they use to drive the intuitions which allow them to avoid the multiple pitfalls of neural networks.
This is a non-trivial statement, because there are other things which are not leaky. For example, he's not arguing that deep learning practitioners should also learn assembly programming or go into how CUBLAS implements matrix multiplication. Although these things are nice to learn, you probably won't need them 99.9% of the time. Backprop knowledge, however, is much more crucial to designing novel deep learning systems.
I would argue that it's not just for that. You need to understand what is happening inside a DNN if you want to construct it properly.
The argument from authority is not very convincing when it's your authority over the course material that is being challenged. Maybe there is some irony or deeper truth that's lost on me. Are you perhaps a leaky abstraction?
> The "code reuse as much as possible" mantra applies to when ...
... you are able to reasonably compromise, got it.
Neither is a straw man argument.
He wasn't making an argument from authority. He was essentially saying "because you are learning, not doing".
No, implicitly and very subtly. The essence of the argument was that the student is in a voluntary class. The implications are manifold, but leaving them unstated carries meaning on the meta level: the student's understanding is not expected, yet the student's expectation of an authoritative answer has to be satisfied anyway by stating the obvious, to establish or retain authority.
Maybe schools should only require what's actually practical. That can include things that can be solved by computers, if they teach ideas that have practical value.
Yes, but the logic goes the other way: if they are unable to use the technique manually, then one may conclude they do not understand the idea well enough to use it, whether or not Mathematica is available. The two, manual application and understanding/intuition, go hand in hand and can't be separated. My understanding is that this has even been studied more formally; there is a paper I remember reading by Kahneman and somebody else about the development of intuition and the tension between heuristics-and-biases and naturalistic decision making. In short: don't trust people so much; if they say they understand something but are unable to actually do it, be a bit more skeptical.
 Found it: https://www.ncbi.nlm.nih.gov/pubmed/19739881 On second thought, it's probably not as relevant as I remembered it.
A better analogy might be stable sort versus unstable. If you need a stable sort, an unstable one is wrong.
(Not sure if that counts as an abstraction leak, though! It's just the semantics. Unstable sort doesn't leak that it is unstable; that's its behavior. If a sort is just documented as putting items in order, an actual implementation does leak information about whether or not it is stable.)
No pun intended!
Why do you have to learn to calculate integrals and derivatives in school, or how compilers work internally? Same answer. But seriously, the CS231 class is excellent, and Andrej is an excellent teacher. You can follow along at home (which is what I am doing.) The syllabus (at http://cs231n.stanford.edu/syllabus.html) has the course notes and the assignments. The assignments are self grading, you know when you have it programmed correctly. The lectures are here: https://www.youtube.com/playlist?list=PLlJy-eBtNFt6EuMxFYRiN...
The cs231n course covers a lot of material, including some (or most) of the material from Andrew Ng's Coursera course. You can totally do them in parallel though.
On the other hand, try implementing a convnet in numpy, especially the backprop, and you might start feeling some of its "rawness" :-)
Here's something I've always hated about people who say "why do I have to write a backprop when TF does it for me?"
Here's why: you go to a company, and they want you to incorporate machine learning into their C++ engine. Have fun using numpy there. You said you knew machine learning, right? Implement backprop for me; you can do that, right?
I'm developing a model (not exactly a convnet) that uses a correlation step. Because of the above problem and its resulting pure-python loops, I may have to cythonize or use the NumPy C API for the gradient evaluation. Do you know of any examples I could check out that implement partial derivatives w.r.t. a correlation (or convolution) in "raw numpy"?
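I don't have a library example handy, but for the 1-D single-channel case the partial derivatives are themselves a correlation and a convolution, so you can often keep the loops inside numpy's C code rather than dropping to Cython. A toy sketch (the function names and the valid-mode restriction are my own assumptions, not from any particular codebase):

```python
import numpy as np

def corr_forward(x, w):
    # "Valid" 1-D cross-correlation: y[i] = sum_k x[i+k] * w[k]
    return np.correlate(x, w, mode="valid")

def corr_backward(x, w, dy):
    # dL/dw[k] = sum_i dy[i] * x[i+k]: another valid cross-correlation,
    # this time of the input x with the upstream gradient dy.
    dw = np.correlate(x, dy, mode="valid")
    # dL/dx[j] = sum_i dy[i] * w[j-i]: a full convolution of dy with w
    # (equivalently, a full correlation with the flipped filter).
    dx = np.convolve(dy, w, mode="full")
    return dx, dw
```

Multi-channel and strided cases get messier, but the same "the gradient of a correlation is a correlation" trick usually still lets numpy do the heavy lifting.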
Not to mention that computer science and software engineering are such young fields that it seems unhealthy to take readily available abstractions as absolute givens. Everything stands to improve, even products and concepts that have been around for decades and that everyone uses.
Yes there is: watch someone else do it. In fact there are three ways to learn, as the saying goes: trial and error, copying, and insight. I'd be hard pressed to explain the difference between trial and error and insight, but I wouldn't confuse them either, because only one of them is painful.
This is not as good. There's a lot of evidence from neuroscience that this leads to an illusion of competence. I.e. it's much easier to follow along through a sample solution than to craft a solution yourself. Until you craft the solution yourself, the knowledge won't actually be chunked as firmly in your brain.
I've seen this time and again, as a teacher, and it was mentioned explicitly in Coursera's Learning How To Learn course: https://www.coursera.org/learn/learning-how-to-learn
NOTE: watching someone is a good way to get started, but until you do it on your own, you haven't learned it deeply, and it may be difficult to recall in a real situation.
I agree that there isn't enough time in a life to learn everything you'd want to first hand or from a low level of abstraction, but school should be a place to do as much of it as possible. Just my 2c.
Whether or not you succeed in solving it on your own, it will emotionally invest you in the problem and its solution, while showing you what didn't work and giving you a better handle on the shape of the problem. This lets you get more out of seeing someone else work out the solution.
Wouldn't that suggest that other classes the students take have not been academic enough - i.e. that they focus too much on "things you might use day-to-day" vs "this is how/why things work"?
Error terms represent a sum and product of derivatives. The product of a bunch of terms will tend to get really big or really small.
The rest is detail: are the terms in some interval? Which? How many are we multiplying? How many are we summing over? Do we doctor the sum after we get it?
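You can see the effect in a few lines of numpy: multiplying many sigmoid derivatives (each at most 0.25) drives the product toward zero, which is the vanishing-gradient problem in miniature. (Toy numbers, not from any real network.)

```python
import numpy as np

sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))
dsigmoid = lambda z: sigmoid(z) * (1.0 - sigmoid(z))  # peaks at 0.25 when z = 0

# Best case for a 50-layer chain: every unit sits at the steepest point.
z = np.zeros(50)
grad = np.prod(dsigmoid(z))
print(grad)  # 0.25**50, about 7.9e-31: the error signal has vanished
```

Swap the 0.25s for terms larger than 1 and the same product explodes instead, which is the other half of the detail.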
Backprop isn't just a leaky abstraction. There's not really any evidence yet that biological neural networks use anything like backprop. So it's important that students be taught the low-level aspects of the current state of the art, so that better architectures can be invented in the future.
(Note: at the very end of this comment, I am leaving a link to one hypothesis about how something like backprop may happen at the dendritic level.)
I've been doing a deep dive on ML and bio neurons in the last week or so to try and get up to speed. The feedforward networks with backprop that are doing such awesome work right now have almost nothing in common with actual neurons.
Just some differences: (neuron == bio, node == artificial)
Terms: a neuron's resting potential is about -70mV, and the firing (action potential) threshold is about -55mV.
1) Neurons are spikey and don't have weights. A neuron sums all of its inputs to reach a threshold. When that threshold is reached, an electric current (action potential) is spread equally to all synaptic terminals. A node has a weight value for each input node that it multiplies by the value received from the input node. All of these values are then summed and passed to the connected nodes in the next layer.
2) There are multiple types of neurons, but for simplification there are two types, called "excitatory" and "inhibitory". A neuron can mostly only emit "positive" or "negative" signals via the synapse. There are different neurotransmitters that either add to or subtract from the voltage potential for each dendritic input. The total sums work together to cross the threshold voltage. I say mostly because there is some evidence that a minority of neurons can release neurotransmitters from both groups.
3) Artificial networks are generally summation based with no inclusion of time. Bio networks use summation and summation over time together to determine if they should fire or not. If I recall correctly, a neuron can repeatedly fire about 300 times per second. A dendritic input can last for up to 1.5 milliseconds. So if a neuron gets enough positive inputs at the same time, or collects enough over time, it will fire.
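The summation-over-time behavior in (3) is roughly what a leaky integrate-and-fire model captures. A heavily simplified sketch in Python; the constants are illustrative cartoons loosely based on the figures above, not physiology:

```python
# Leaky integrate-and-fire: a crude cartoon of temporal summation.
V_REST, V_THRESH = -70.0, -55.0  # mV, per the rough figures above
LEAK = 0.9                       # fraction of (v - V_REST) kept per time step

def simulate(inputs, v=V_REST):
    spikes = []
    for i, inp in enumerate(inputs):  # inp: net synaptic input (mV) this step
        v = V_REST + LEAK * (v - V_REST) + inp  # decay toward rest, then add input
        if v >= V_THRESH:                       # threshold crossed: fire
            spikes.append(i)
            v = V_REST                          # reset after the action potential
    return spikes

# Sub-threshold inputs that arrive close together in time sum to a spike:
print(simulate([6.0] * 5))  # fires once the accumulated potential crosses -55mV
```

The same total input spread out over many time steps never fires, which is exactly the temporal dimension that a plain weighted-sum node throws away.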
I haven't yet found any hypotheses or experiments that try to explain how reinforcement learning takes place in neurons in the absence of back propagation. Pretty sure that information is out there, but I haven't run across it.
Overall I think that there will be multiple engineering approaches to AI, just like there are to construction and flight. We understand how birds and bees fly, but we don't build planes the same way.
It's important to remember that cognition is based in physics just like any other physical system. When the principles are well understood, there are multiple avenues to using them.
Here is a short collection of links that I've been finding helpful.
1. Neurons, or more specifically, connections between neurons ("synapses"), absolutely do have weights, and the "strength" of synapses can be adjusted by a variety of properties that act on a scale of seconds to hours or days. At the "semi-permanent" end of the spectrum, the location of a synapse matters a lot: input arriving far from the cell body has much less influence on the cell's spiking. The number (and location?) of receptors on the cell surface can also affect the relative impact of a given input. Receptors can be trafficked to/from the membrane (a fairly slow process), or switched on and off more rapidly by intracellular processes. You may want to read up on long-term potentiation/depression (LTP/LTD), which are activity-dependent changes in synaptic strength. There are a whole host of these processes, and even some (limited) evidence that the electric fields generated by neurons can "ephaptically" affect their neighbors, even without making direct contact, which would allow for millisecond-scale changes.
2. While you can start by dividing neurons into excitatory and inhibitory populations, there's a lot more going on. On the glutamate (excitatory) side, AMPA receptors let glutamate rapidly excite a cell and make it more likely to fire. However, glutamate also controls NMDA channels that, under certain circumstances, allow calcium into a cell. These calcium ions are involved in all sorts of signaling cascades (and are involved--we think--in tuning synaptic weights). GABA typically hyperpolarizes cells (i.e., makes them less likely to fire) and is secreted by cells called interneurons. However, there's a huge diversity of interneurons. Some seem to "subtract" from excitatory activity, others can affect it more strongly in a divisive sort of way or even cancel it completely. Furthermore, there's a whole host of other neurotransmitters. Dopamine, which is heavily involved in reward, can have excitatory or inhibitory effects, depending on whether it activates D1 or D2 receptors.
3. While the textbook feed-forward neural networks certainly have "instant" signal propagation, there are lots of other computational models that do include time. Time-delay neural networks are essentially convnets extended over time instead of space. Reservoir computing methods like liquid state machines also handle time, but in a much more complicated way.
4. I chuckled at the idea of finding a biological analog for reinforcement learning, since reinforcement learning was initially inspired by the idea of reinforcement in psychology/animal behavior. People have shown that brain areas--and individual neurons within them--encode action values, state estimates, and other building blocks of reinforcement learning. Clearly, we have a lot to discover still, but the general idea isn't at all implausible.
Finally, some people are fairly skeptical that the fields have much to learn from each other; Jürgen Schmidhuber said this a lot at NIPS last year. However, other, equally-smart people (e.g., Geoff Hinton) seem to think that there may be a common mechanism, or at least a useful source of inspiration there. But, if you want to work on something like this (and it is awesomely interesting), it really helps to have a solid grounding in both.
There are a couple of standard neurobiology textbooks, like Kandel, Jessel, and Schwartz's Principles of Neural Science, Purves et al.'s Neuroscience and Squire et al.'s Fundamental Neuroscience. These are huge books that cover a bit of everything, and you should know that they exist, but I wouldn't necessarily start there.
If you're specifically interested in computation, I would start with David Marr's Vision. It's quite old, but worth reading for the general approach he takes to problem-solving. He proposes attacking a problem along three lines: at the computational level ("what operations are performed?"), the algorithmic level ("how do we do those operations?"), and the implementation level ("how is the algorithm implemented?").
From there, it depends on what you're interested in. At the single-cell level, Christof Koch has a book called The Biophysics of Computation that "explains the repertoire of computational functions available to single neurons, showing how individual nerve cells can multiply, integrate, and delay synaptic input" (among other things). Michael London and Michael Häusser have a 2005 Annual Reviews in Neuroscience article about dendritic computation that hits on some similar themes (here: https://www.researchgate.net/publication/7712549_Dendritic_c... ), along with this short review (http://www.nature.com/neuro/journal/v3/n11s/full/nn1100_1171...) by Koch and Sergev, and a 2014 review by Brunel, Hakim, and Richardson (http://www.sciencedirect.com/science/article/pii/S0959438814...). Larry Abbott has also done interesting work in this space, as have Haim Sompolinsky and many others. Gordon Shepherd and his colleagues maintain NEURON (a simulation package/platform) and a database of associated models (ModelDB) here: https://senselab.med.yale.edu/ if you want something to download and play with (they also do good original work themselves!)
Moving up a bit, the keywords for "weight adjustment" are something like synaptic plasticity, long-term potentiation/depression (LTP/LTD), and perhaps spike-timing dependent plasticity. The Scholarpedia article on spike-timing dependent plasticity is pretty good (http://www.scholarpedia.org/article/Spike-timing_dependent_p...). Scholarpedia is actually a pretty good resource for most of these topics. The intro books above will have pretty good treatments of this, though maybe not explicitly computational ones.
More to come, however I also just found this class from a bunch of heavy-hitters at NYU: http://www.cns.nyu.edu/~rinzel/CMNSF07/ Those papers are a good place to start!
I probably should have led with this, but there's been a lot of interest in backprop-like algorithms in the brain.
* Geoff Hinton has a talk (and slide deck) about how back-propagation might be implemented in the brain. (Slide deck: https://www.cs.toronto.edu/~hinton/backpropincortex2014.pdf Video: http://sms.cam.ac.uk/media/2017973?format=mpeg4&quality=720p )
* As always, the French part of Canada has its own, slightly different version of things, care of Yoshua Bengio (slide deck from NIPS 2015: https://www.iro.umontreal.ca/~bengioy/talks/NIPS2015_NeuralS... preprint: https://arxiv.org/abs/1502.04156 )
* Here is another late 2015 take on back-prop in the brain by Whittington and Bogacz (http://biorxiv.org/content/early/2015/12/28/035451) This one is interesting because they view the brain as a predictive coding device which is continuously estimating the future state of the world and then updating its predictions. (I think the general predictive coding idea is cool and probably under-explored).
* There's a much older paper by Pietro Mazzoni, Richard Andersen, and Michael I. Jordan attempting to derive a more biologically plausible learning rule (here: http://www.pnas.org/content/88/10/4433.full.pdf). This work is particularly neat because it builds on earlier work by Zipser and Andersen (https://www.vis.caltech.edu/documents/54-v331_88.pdf), who trained a three-layer network (via back-prop) to transform data from gaze-centered ('retinotopic') to head-centered coordinates, and noticed that the hidden units performed transforms that look a lot like the work done by individual neurons in Area 7A. The Mazzoni paper then replaces the backprop with a learning procedure that is more biologically plausible.
For backprop, you need some sort of error signal. Wolfram Schultz's group has done a lot of work demonstrating that dopamine neurons encode something like "reward prediction error." (e.g., this: http://jn.physiology.org/content/80/1/1, but they have lots of similar papers: http://www.neuroscience.cam.ac.uk/directory/profile.php?Schu...). For reinforcement learning, you might also want to maintain some sort of value estimate. There are tons of studies looking at value representation in orbitofrontal cortex (OFC), using mostly humans and monkeys, but occasionally rats. Here's a review from Daeyeol Lee and his postdoc Hyojung Seo describing neural mechanisms for reinforcement learning (http://onlinelibrary.wiley.com/doi/10.1196/annals.1390.007/f... ) The Lee lab has done a lot of interesting value-related things too.
Switching gears slightly, there is also considerable interest around unsupervised learning and related methods for finding "good" representations for things. This is potentially interesting because it would allow for improvements within individuals and even across individuals (e.g., by evolution).
Olshausen and Field kicked this off by demonstrating that maximizing the sparseness of a linear code for natural images produces a filter bank that resembles the way neurons in primary visual cortex process images. (http://courses.cs.washington.edu/courses/cse528/11sp/Olshaus...)
Michael Lewicki has done similar things in a variety of sensory modalities. Here's a recent paper from him looking at coding in the retina (http://journals.plos.org/ploscompbiol/article?id=10.1371/jou...) but he has similar work in the auditory system and building on the Olshausen and Field paper linked above to explain complex cells (and more!) in visual cortex. Bill Geisler has also done a lot of work looking at the statistics of natural scenes and how the brain (and behavior) appears to be adapted for them.
This seems reasonable, but I doubt there is any single simple and general abstraction that can describe the learning algorithms used by physical neurons. It seems more likely that the brain uses many highly specialized algorithms for different regions of the brain, each shaped by a ~billion years of evolution.
Something about this article as a whole, though, does raise a question that I've been wondering about, and I want to present it here for a bit of discussion (maybe it needs its own thread?):
Does anyone else here think that the current approach to neural networks has some fundamental flaws?
Now - I'm not an expert; call me an interested student right now, with only the barest of experience (beyond the ML Class, I also took Udacity's CS373 course, and I am also currently enrolled in their Self-Driving Car Engineer nanodegree program).
I understand that what we currently have and know does work. What I mean by that is the basic idea of an artificial neural network using forward and back-prop, multiple layers, etc (and all the derivatives - RNN, CNN, deep learning, etc). I understand the need and reasoning behind using activation functions based around calculus and derivatives and the chain-rule, etc (though I admit I need further education in these items).
But something nags at me.
All of this, despite the fact that it works and works well (provided all your tuning and such is right, etc), just seems like it is over-complicated. Real neurons don't use calculus and activation functions, nor back-propagation, etc in order to learn. All of those things in an ANN are just abstractions and models around what occurs in nature.
Maybe (probably?) I am wrong - but it seems like what nature does is simpler. Much less power is used, for instance, and the package is much more compact. I just have this feeling that in some manner we may have gone down a path that, while it has produced a working representation, that representation is overly complex, and had we taken another approach (whatever that might be?), our ANNs would look and work much differently - perhaps even more efficiently.
About the only alternatives I have heard about otherwise have been things like spike-train neural networks, and some of the other "closer to nature" simulation (of ion pumps and real synapses, etc). Still, even those, while seemingly closer, also have what appears to be too much complexity.
I'm probably just talking out of my nether regions as a general n00b to the field. I do wonder, though, if there might be another solution, seemingly out in "left-field" that might push things forward, if someone was willing to look and experiment. It is something I plan to look into myself, as I find time and such between lessons and other work for my current learning experience.
This sounds like a (common) failure to understand how abstractions work. Bridges don't do calculus, but the bridge builder uses calculus to describe what bridges do obey (the laws of nature), and thus the calculus abstraction encodes the behavior of bridges. That's why you can model bridges using calculus.
Similarly, neurons are modeled by calculus. Abstractions are abstract precisely because they are not the concrete thing they model: they are necessarily approximations. They give us the power to simplify at the cost of gaining the capacity to be wrong.
The point being this: you can literally use any abstraction you desire to model anything you like. Some will work better than others, and the better they work, the more closely the structure of your abstraction matches the structure of the concretion being modeled.
Some people cannot understand this, and believe that if you can closely fit the output of a process with a neural network that it implies the process itself is in some way related to neural networks.
At the same time, I wonder if there isn't a simpler way of doing all of this - if there isn't a simpler model for the abstraction of a neuron that doesn't require back-propagation or calculus. In other words, have we become so used to the current concepts and models of ANNs that we have become hesitant or resistant to imagining other possibilities?
Yes - what we have works, and it works well. Honestly, from what I know and what I have learned (while I may not fully grasp the mathematical underpinnings of a neural network, I do have a good feel for how both forward and back-prop are implemented), our current general knowledge of ANNs (i.e. "how to implement a simple neural network") isn't super-complex. I only question whether it could be made simpler (and I'm not talking about a network of perceptrons or ReLUs), or if, because of the early work (Pitts and McCulloch mainly), we are going down a less-than-optimum path (to use an ML analogy, we are stuck in a local minimum) - one that fortunately works, but maybe there's something better out there?
Again - I recognize that, as a non-expert in this field (I only consider myself a student and hobbyist so far), I am likely very well off in la-la land. Even so, we know that quite often in engineering, the optimum solution tends to be the simplest solution; sometimes, that solution is simpler than nature's (an airplane vs a bird, for instance). I have a feeling that may be the case with ANNs as well.
I am not wedded to this idea, though - I just want to put it out there for consideration and maybe discussion. As I noted, what we currently have works well, and isn't a complexity nightmare, and is understandable to anyone willing to understand it. It very well could be it -is- the simplest explanation.
Adult humans seem to learn faster when there is some combination of theory, examples, and experience (aka feedback). I'm a scientist and I have very little interest in neural nets for science because it contorts away the kind of equation-based systematics that we rely on to understand our world. The theory component is missing.
I'm more interested in expert system-type AI, where the learning may help us infer more about the world in the systematic way that scientists try to grapple with it--logical rules and frameworks, physics equations, statistics, etc. But these seem to be far less effective or efficient than ANNs, at least as currently implemented.
That being said, I did see a great talk last week on using ANNs to accelerate viscoelasticity calculations--those sorts of things are definitely valuable for science by reducing the time required for simulations by orders of magnitude. And some of the discussion after the talk had to do with how a graph of biases and weights is, philosophically and practically, a fundamentally different way of representing the problem and its solution than a set of differential equations. There is no doubt that it's a useful way of solving the calculations, but it's unclear how we can build upon those weights and biases in a theoretically-meaningful way. How do we learn from machine learning?
Now replace the neurons with some tangentially related abstraction that is massively different from real neurons, well...
I'd love a reference for this. Link, please?
That said, don't underestimate the complexity of what is happening in real neurons. It's true that an individual neuron is fairly simple (more or less! some may perform very complex encodings of sensory data!), but the complexity of the "computations" they perform goes up very quickly, and you have a _lot_ of them that are highly connected, so it may be that ANNs are a good abstraction in the sense that they reduce the complexity of spike trains to a more abstract but more powerful representation.
It's true though that they probably don't perform back-propagation, at least in the mathematical sense; I don't know the literature but I think an inhibitory mechanism is probably a better biological model for how "error detectors" can be used to suppress useless or noisy output.
hmm, I did the cs231n homework and had the opposite experience. It was really easy to complete it by ignoring numpy's provided matrix methods (just write a bunch of for loops in python) but that solution was really slow. If you could use numpy's matrix methods, however, the code executed a lot faster (the loops and control flow now run in C instead of python, so to speak). The hard part of the assignments was expressing my python code, with all its ad-hoc mixture of nested loops and conditionals, in numpy.
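To make the speed difference concrete, here's a toy illustration (my own sketch, not the actual assignment code): the same matrix product written as nested Python loops versus a single numpy call. The loop version runs every scalar operation in the interpreter; the vectorized version dispatches to compiled BLAS code and is typically orders of magnitude faster.

```python
import numpy as np

def slow_dot(A, B):
    """Naive triple loop: every multiply-add runs in the Python interpreter."""
    n, k = A.shape
    k2, m = B.shape
    out = np.zeros((n, m))
    for i in range(n):
        for j in range(m):
            for p in range(k):
                out[i, j] += A[i, p] * B[p, j]
    return out

A = np.random.rand(50, 30)
B = np.random.rand(30, 40)
# A @ B computes the same thing, but the loops run in compiled code.
assert np.allclose(slow_dot(A, B), A @ B)
```

The hard part, as the comment says, isn't writing either version - it's recognizing how to express your ad-hoc loops-and-conditionals logic as whole-array operations in the first place.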
But I think Ng wanted us to understand what was going on under the hood in Octave when you used its in-built vector primitives, and how to think about the problems in such a way to understand how to "vectorise" them so that the solutions would be amenable to using those primitives.
There was a time where I wasn't "getting it" - and proceeded with my own implementation. In time though, it "clicked" for me, and I could put away those routines and use the vector math more fully (and quickly). That said, I wouldn't have wanted it to be left as a "black box" - it was nice to have the understanding of the operations it was doing behind the curtain.
Which is also why I was disappointed that the math wasn't described more in either of those courses; that part was left as a "black box" and only hinted at (i.e., "here's the derivative of the function; you don't have to worry about it, but if you know the math, you might find it interesting").
In this latest course I am taking, though, they are diving right into the math - and I have found that they assume you already know what a derivative is and how it is formed from the initial function. Unfortunately, I don't have the education needed, so I am fumbling through it (and doing what I can to read up on the relevant topics - I even bought a book on teaching yourself calculus which was rec'd for me by a helpful individual).
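For anyone else fumbling through the same gap: a derivative is just the slope of a function at a point, and you can sanity-check any analytic derivative numerically with a finite difference. A minimal sketch (my own example, not from any of the courses mentioned):

```python
def numeric_derivative(f, x, h=1e-5):
    """Central difference: the slope of f over a tiny interval around x."""
    return (f(x + h) - f(x - h)) / (2 * h)

# Calculus says that for f(x) = x^2, the derivative is f'(x) = 2x.
# The numerical estimate at x = 3 should be very close to 6.
approx = numeric_derivative(lambda x: x * x, 3.0)
assert abs(approx - 6.0) < 1e-6
```

This is also how gradient checking works in backprop assignments: compare the analytic gradient you derived against this kind of numerical estimate to catch mistakes.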
In fact, now that I think about it, the first assignment asked us to write a 2-loops-in-python version of some function (batch linear classifier, I think), then a 1-loop version, then a 0-loop version, and I often repeated this procedure on subsequent harder questions.
That was a useful thing I wasn't expecting to learn from the course - how to vectorize code for numpy (including using less obvious features like broadcasting and reshape)
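The 2-loop / 1-loop / 0-loop progression might look like this for pairwise Euclidean distances (a classic version of that exercise; this is my own illustration, not the assignment's code). The 0-loop version leans on broadcasting and the expansion |x - y|^2 = |x|^2 - 2 x.y + |y|^2:

```python
import numpy as np

def dists_two_loops(X, Y):
    # One distance per (i, j) pair, computed row by row in Python.
    D = np.zeros((X.shape[0], Y.shape[0]))
    for i in range(X.shape[0]):
        for j in range(Y.shape[0]):
            D[i, j] = np.sqrt(np.sum((X[i] - Y[j]) ** 2))
    return D

def dists_one_loop(X, Y):
    # Vectorize the inner loop: subtract one row of X from all of Y at once.
    D = np.zeros((X.shape[0], Y.shape[0]))
    for i in range(X.shape[0]):
        D[i] = np.sqrt(np.sum((Y - X[i]) ** 2, axis=1))
    return D

def dists_no_loops(X, Y):
    # Fully vectorized: broadcasting an (n, 1) column against an (m,) row
    # yields the full (n, m) matrix of squared distances in one expression.
    sq = (X ** 2).sum(axis=1)[:, None] - 2 * X @ Y.T + (Y ** 2).sum(axis=1)
    return np.sqrt(np.maximum(sq, 0))  # clip tiny negatives from round-off

X = np.random.rand(5, 3)
Y = np.random.rand(7, 3)
assert np.allclose(dists_two_loops(X, Y), dists_one_loop(X, Y))
assert np.allclose(dists_one_loop(X, Y), dists_no_loops(X, Y))
```

Each step trades readability for speed, which is why writing all three versions is such a good exercise: you see exactly which loop each broadcasting trick replaced.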
But do you think that real neurons are less complicated than artificial neurons? Look at the molecular structure of a single ion channel. It's crazy complicated!
How do we know that neurons don't do calculus? Chemical gradients, summation, etc.
Feynman pretty much settles it.
How many more times do you need to see the same phenomenon under different guises before you stop asking stupid questions? "Hey teach, why do I need to learn how to multiply if I can just use a calculator?"
At that level, they assume you are pretty smart and capable of figuring out something like an API on your own time as needed. They'd rather you know what all these funky things in these APIs are doing at a core level so that you can employ them in an effective manner.
With a framework it's hard to think outside the frame. With a low-level core it's hard to do anything at all. It's a matter of compromise. No one starts out writing asm, although it is interesting, e.g. Nand2Tetris being a famous example.
The other is to learn the foundations of those algorithms so you can best understand how to tune, apply and extend them. This is the path to mastery.
Ironically, in ML mere exposure to data is enough to learn from, so why not here? Although I'm not sure about the pedagogic aspect. I'd assume mental arithmetic gets learned along the way anyway, even if that sends the message that it wasn't needed in the first place. Maybe starting slow is important, because it's that fundamental. But in hindsight, I was already good at it after half the elementary-school training.
I have a similar anecdote: I wasn't good at handwriting and always claimed I wouldn't need it. Now I indeed don't need it, except for exams. But I actually have a hard time with calligraphy, and that's a shame.
Your anecdote is irrelevant because there is no _understanding_ to be gained by handwriting as opposed to typing.
I really can't be bothered with this conversation, sorry.