
Building the Software 2.0 Stack by Andrej Karpathy [video] - fmihaila
https://www.figure-eight.com/building-the-software-2-0-stack-by-andrej-karpathy-from-tesla/
======
Animats
Be afraid. Be very afraid.

If this guy were working on adtech, that would be fine; adtech is very
error-tolerant. But this guy is working on automated driving.

The basic mindset here is to run image classifiers on a scene to identify the
objects in it, then use the classifier output to decide what to do. There's no
geometric analysis. That's scary. Classifiers just aren't that good. See the
earlier article today about adversarial attacks on classifiers. Classifiers
pick obscure details of images and use them to make decisions. Nobody seems to
know yet how to prevent that. This problem shrinks with larger data sets,
where hopefully the irrelevant details cancel out as noise, but, as the
speaker points out, that breaks down when you have few training cases of
certain situations.
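
For a concrete sense of the attack style meant, here is a minimal sketch of
the fast gradient sign method, assuming a PyTorch image classifier (`model`,
`image`, and `label` are placeholders):

    # Minimal FGSM sketch: a tiny step along the gradient sign can flip
    # the prediction while leaving the image visually unchanged.
    import torch.nn.functional as F

    def fgsm(model, image, label, epsilon=0.01):
        image = image.clone().detach().requires_grad_(True)
        loss = F.cross_entropy(model(image), label)
        loss.backward()
        return (image + epsilon * image.grad.sign()).detach()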

The Google/Waymo approach is to get a point cloud with LIDAR and radar,
profile the terrain and obstacles, and figure out where it's physically
possible to go. That's geometry based. In parallel, a classifier system is
trying to tag objects in the scene, which feeds into a system which tries to
predict what other road users are going to do.

With that approach, a classifier result of "not identified" is fine. The
system will detect and avoid it, or stop for it, and make conservative
assumptions about its expected behavior. Chris Urmson, in his SXSW talk,
showed video of a woman in a powered wheelchair chasing a turkey with a broom.
This was not identified by the classifier, but it was clearly an obstruction,
so the vehicle stopped for it. That's essential here. It has to do something
safe with unidentified or mis-identified objects.
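
To make the geometry-first idea concrete, a toy sketch (all parameters
invented for illustration): mark grid cells occupied by LIDAR returns above
the ground plane, with no classifier in the loop:

    # Toy occupancy grid: a cell is blocked if any LIDAR return above the
    # assumed ground plane falls inside it. Purely geometric, no labels.
    import numpy as np

    def occupancy_grid(points, cell=0.2, extent=50.0, ground_z=0.3):
        """points: (N, 3) LIDAR returns in the vehicle frame, in metres."""
        n = int(2 * extent / cell)
        grid = np.zeros((n, n), dtype=bool)
        above = points[points[:, 2] > ground_z]        # drop road surface
        ij = ((above[:, :2] + extent) / cell).astype(int)
        ij = ij[((ij >= 0) & (ij < n)).all(axis=1)]    # clip to the grid
        grid[ij[:, 0], ij[:, 1]] = True
        return grid

An unidentified object still shows up as blocked cells, which is why "not
identified" is a safe answer in this architecture.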

At Tesla, Musk insisted that this could be done with a camera alone because
humans can drive on vision alone.[1] So Tesla has people trying to make
camera-only driving work. Not very successfully so far.

"November or December of this year (2017), we should be able to go from a
parking lot in California to a parking lot in New York, no controls touched at
any point during the entire journey." - Musk, in April 2017. This guy is
saying what Musk wants to hear.

[1] [https://blog.ted.com/what-will-the-future-look-like-elon-mus...](https://blog.ted.com/what-will-the-future-look-like-elon-musk-speaks-at-ted2017/)

~~~
ckastner
> This guy is saying what Musk wants to hear.

Apparently Musk heard it (or something similar), because Tesla is rolling out
the first self-driving features in August:
[https://news.ycombinator.com/item?id=17282006](https://news.ycombinator.com/item?id=17282006)

~~~
chronic288
> Apparently Musk heard it (or something similar), because Tesla is rolling
> out the first self-driving features in August

Self-driving _features?_ Sorry, there's no such thing as _some_ self-driving
features. Either you have full self-driving or you don't. If you don't (and
Tesla doesn't), how the hell can you call it self-driving?

~~~
CYHollander
Your comment appears to contradict itself. Using the term "full self-driving"
implies that you'd recognize something short of that as "partial self-
driving". Such a system would presumably have some, but not all the features
of a "full self-driving" car.

In any case, it's quite easy to imagine a plausible meaning for "some self-
driving features": perhaps these features enable the car to drive itself
[without oversight] in some, but not all, situations (_e.g._ on highways, but
not in cities).

------
dkislyuk
Treating training datasets as dynamic components of ML systems, along with the
corresponding tooling and infrastructure, is one of the most under-appreciated
points in the field today. Part of the cause, I think, is stigma from the
academic side: dataset collection is seen as a low-level problem not
worthy of serious algorithmic investment (good luck submitting a CVPR paper
which improves on labeling speed or taxonomy management in your dataset
warehouse). Datasets are considered a given, which also has the interesting
side effect of massive hyperparameter and architecture overfitting, evidenced
by the recent analysis of CIFAR-{10,100} [1]. In more applied and engineering
settings, though, it's good to see this area getting a lot more investment,
especially on the tooling side.

[1] [https://arxiv.org/abs/1806.00451](https://arxiv.org/abs/1806.00451)

~~~
mousetraps
> stigma coming from the academic side that dataset collection is a low-level
> problem not worthy of serious algorithmic investment

Agreed it needs more attention, but - for academia - I think it's more of an
incentive issue than a stigma issue. E.g. harder to benchmark the performance
of two algorithms if they don't operate on the same dataset. Also to be fair,
research into things like synthetic data mitigates the problem, just in a
different way.

The paper you cited is interesting. Thanks for sharing. Hopefully it spurs
more focus on understanding the subtleties of each dataset. IIRC Kaggle also
had issues around generalizability, but for different reasons.

Anyways it's still early on... but we're currently building tools to help
solve this problem. In particular simplifying the data collection / labeling
process for vision systems. Would love to chat further w/ anyone interested in
providing feedback. Email is sara@viewpointrobotics.com

~~~
avip
It's indeed "an incentive issue", only not the one you mentioned but the one
the OP hinted at. Research is focused on what's publishable, hence tenure-
trackable, and not on what's useful for solving real-world problems (of
course, the two occasionally coincide).

~~~
mousetraps
Why so black and white? There are many incentives at play, and many ways to
contribute to solving real world problems.

I’ve spent time in both academic research and industry.

Research is not supposed to be immediately applicable. The goal is to produce
new knowledge - more importantly shared knowledge. Publishing is not a bad
measure of that. Additionally, ability to secure grants provides incentive to
focus on problems others want solved.

No incentive system is perfect, but I don’t really see how this is any
different from any organization. And I don’t think it’s fair to judge an
entire discipline by the negative examples.

~~~
avip
I didn’t judge anything, nor did I say anything “negative”.

~~~
mousetraps
Okay fair enough, maybe we’re just talking past each other :).

------
hyperbole
The argument the speaker makes is extremely weak: 1.0 software programs are
built from modular building blocks, 2.0 software is machine learning. What
point is being made here?

We still break problems down when solving them with machine learning.
There's no single "drive the car" neural net; rather, the task of driving a
car has been broken down into subcomponents: sign detection, pedestrian
detection, detection of objects in front of the car. Then there's logic that
encapsulates these classifiers, using them as inputs to decide how best to
steer and power the car's drive wheels.
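
Schematically, something like this (all function names are illustrative, not
any real system's):

    # Hand-written "1.0" logic consuming learned "2.0" classifier outputs.
    def drive_step(frame, sign_net, pedestrian_net):
        if pedestrian_net(frame):        # learned: pedestrians ahead?
            return "brake"
        if "stop" in sign_net(frame):    # learned: which signs are visible?
            return "stop_at_line"
        return "follow_lane"             # default hand-coded behaviour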

It's a bit far-fetched to believe that programming has fundamentally
changed, at least so far.

~~~
fredguth
I think the point of the talk was that engineering features is very
different from making features learnable. This is indeed very different from
"software 1.0".

He also points out that the challenges are not where academia is focusing.

------
tejohnso
Software 1.0:

    The car is parked IF it is on the side of the road AND it hasn't moved in X time, AND ... But not if... etc.

Software 2.0:

    The car is parked if the neural net says so.

Okay, if software 2.0 is all about thinking at a higher level, and training
the neural net to deal with the details, why is the focus on detail like "is
the car parked", or "is it raining", or "where is the lane marker?" Why can't
we train "this is good driving" / "this is bad driving".

As a software 1.0 programmer I can see how that seems completely unreasonable,
but it does seem to follow the logical direction of the talk.

~~~
halflings
What you're describing would be called "end-to-end learning", and the way
things have been progressing is that a lot of systems cobbled together (e.g
speech synthesis, translation, image recognition) have been converted to
models (mostly neural nets) that are learned end-to-end. Autonomous driving is
not an exception, and they might still get there, but you have to put your
pragmatic hat on and do whatever works right now. Benefit of this approach
include more interpretable results, potentially improved safety guarantees
(e.g you can limit the failure to a subsystem, things like that)
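
For a sense of what "end-to-end" means here, a minimal sketch (PyTorch; the
architecture is illustrative, not any real system's): one network maps raw
camera frames straight to a steering command, with no hand-built subsystems
in between.

    import torch.nn as nn

    # One model from pixels to a steering angle; no separate sign or
    # pedestrian modules. Layer sizes are illustrative.
    end_to_end_driver = nn.Sequential(
        nn.Conv2d(3, 24, 5, stride=2), nn.ReLU(),
        nn.Conv2d(24, 36, 5, stride=2), nn.ReLU(),
        nn.Conv2d(36, 48, 5, stride=2), nn.ReLU(),
        nn.AdaptiveAvgPool2d(1), nn.Flatten(),
        nn.Linear(48, 1),   # single output: steering angle
    )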

~~~
Gravityloss
Most humans don't learn to drive purely end-to-end, either. They are taught
how the various components of the car work, traffic rules and conventions,
etc.

Most people also do some experimentation and calibration: check how much
empty space is left after parking, drive circles on a snowy parking lot until
you spin, etc.

------
rasmi
One downside of Software 2.0 compared to 1.0, at least as of today: it is
incredibly hard to debug in the conventional sense. The focus of this talk
was mostly on data-labelling challenges. For a company whose software is as
mission-critical as Tesla's, I'm disappointed Andrej did not bring up any of
the practical challenges around debugging complex models.

~~~
icc97
He touched on this at the end where he spoke about trying to write an IDE for
Software 2.0.

He did talk about the problems of complex models. They mostly treat the
models as fairly fixed (see the pie-chart slide of PhD vs Tesla); most of the
challenges are in labelling data.

~~~
rasmi
I watched the whole talk, so I heard the bit about the IDE, but I still
think there's a really fundamental ability, walking through the
"decision-making logic" of your "code" (in this case, the model), that wasn't
touched upon. For example, suppose your model misclassifies a barrier and a
car crashes into it as a result [1]. How do you debug this? You can say,
"Well, it's a data-labelling problem" and go get more data on barriers, but in
the meantime people have died. Model testing and debugging should be an
incredibly high priority for use cases like Tesla's. That means some degree of
interpretability, testing edge cases, simulation, anything to find flaws like
this before they occur in real life.

See here [2] for an example of production ML testing practices. I wonder how
much of this is in place at Tesla? I would argue they should be at the
forefront of work like this. Something tells me they aren't.

[1]
[https://news.ycombinator.com/item?id=17257239](https://news.ycombinator.com/item?id=17257239)

[2]
[https://ai.google/research/pubs/pub46555](https://ai.google/research/pubs/pub46555)
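
For a flavor of what [2]-style testing could look like here, a
pseudocode-level sketch (`load_model`, `load_images`, and the barrier-image
directory are all hypothetical):

    # Hypothetical regression test: a retrained model must still treat
    # every known barrier case as an obstacle before it ships.
    def test_barriers_still_detected():
        model = load_model("candidate.pt")                 # hypothetical
        for image in load_images("edge_cases/barriers/"):  # hypothetical
            assert model.predict(image).label == "obstacle"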

------
telltruth
I love Karpathy, but this is Karpathy’s worst talk. People have been doing
applied ML for a long time in many real-world settings, and everyone who has
done it has gone through the experience of labeling guidelines ballooning,
unclean data, long-tail surprises, and so on. None of this is specific to
deep learning, and nobody called it “Software 2.0”. Most ML practitioners
already know it’s all about data, and so-called IDEs for managing data and
predictions have taken many different forms. The talk would be more
interesting if Karpathy had something new to say, for example, how to avoid
the need for long-tail outlier examples by generalizing over higher-level
concepts.

------
syllogism
If you're interested in implementing this sort of workflow, you might want to
have a look at our product Prodigy: [https://prodi.gy](https://prodi.gy)

Prodigy is an annotation tool that makes it easy to use active learning and
other model-in-the-loop features. It's a downloadable library that can start
a web server on your local network, allowing 100% data privacy. We've just
rolled out experimental image support in v1.5.0.

------
TimTheTinker
I know this might sound like I’m still in the 1970s, but... it occurs to me
that marrying an expert system to these neural systems might help
significantly with the long tail of unusual events.

Fundamentally, the problem with these systems (and note, we sometimes say
this about people too) seems to be a failure to reason logically. Perhaps
expert systems with sufficiently detailed logical rule sets could enable more
complex frameworks for decision-making, and allow systems to dynamically
construct and execute judgment calls with the NN classifier IDs and
confidence levels as input sources.
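
A crude sketch of the sort of hybrid I mean (labels, rules, and thresholds
all invented for illustration):

    # Rule layer consuming (label, confidence) pairs from a neural
    # classifier; falls back to conservative behaviour when unsure.
    def decide(detections):
        for label, confidence in detections:
            if confidence < 0.5:
                return "slow_and_yield"   # low confidence: be conservative
            if label in {"pedestrian", "cyclist"}:
                return "brake"
        return "proceed"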

~~~
colordrops
I'm not involved in automated driving software but that's how I always guessed
it worked (in addition to state machines). Is that not the case? Can anyone
with experience confirm or deny?

~~~
TimTheTinker
A quick Google search shows there are scholarly papers (IEEE, etc.) from
earlier this year about the use of expert systems in automated driving. Ha, I
should have googled the idea first.

------
ccorda
Related blog post from last November:
[https://medium.com/@karpathy/software-2-0-a64152b37c35](https://medium.com/@karpathy/software-2-0-a64152b37c35)

~~~
icc97
It's not just related; the talk is pretty much taken verbatim from that
article, images and all.

The only bits added are the references to Tesla.

------
giacaglia
Thanks for sharing this. This is awesome. It reminds me of something that
Minsky said in the 70s: "Computer languages of the future will be more
concerned with goals and less with procedures specified by the programmer"

------
mosselman
Is 'Software 2.0' something that is introduced in the video or is it some
bullshit-bingo term that I have missed?

~~~
MasterScrat
He introduced it himself at the end of last year in this article:
[https://medium.com/@karpathy/software-2-0-a64152b37c35](https://medium.com/@karpathy/software-2-0-a64152b37c35)

~~~
icc97
Ah, so basically this entire talk, except for the bits about Tesla, was
taken from that article.

------
sheeshkebab
It seems this guy’s Software 2.0 is more a specialized toolset for automated
driving/navigation than a general-purpose shift.

If all we get out of it is fancy data-labeling tools, incapable of learning
anything new by themselves, it’s going to get old real quick.

------
zawerf
Why can't the process of "iterating on datasets" in the second half of the
talk be automated?

For example:

- automatically learning that a trolley is not a great fit for "car" because
they are behaviorally different

- reclassifying that cluster as a new entity even if the system doesn't know
the English term "trolley"

- if it finds the new distinction useful, pinging the human that it needs
more training examples for that situation

Similar to how a human learner can identify that he's bad at something,
figure out what the common problem is, and use that information to focus on
what to practice next.

I am sure he alluded to doing this in his talk but what's the technical term
for it?

~~~
nl
This is what active learning is. Basically, you trace the decision
boundaries of a classification model and provide more examples along those
boundaries.
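
A minimal sketch of its most common form, margin-based uncertainty sampling,
assuming you already have predicted class probabilities for an unlabeled
pool:

    import numpy as np

    def pick_for_labeling(probs, k=100):
        """probs: (N, C) predicted class probabilities; returns the k
        pool indices closest to a decision boundary."""
        top2 = np.sort(probs, axis=1)[:, -2:]
        margin = top2[:, 1] - top2[:, 0]   # small margin = near a boundary
        return np.argsort(margin)[:k]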

~~~
zawerf
Does this work even if the output format has to change? I can see it helping
with class imbalance, but not when a useful class is completely missing or in
the wrong "format".

For example, in the bright-smudges-vs-raindrops case, the model might not
have a label for the sun, but it should be able to identify it as an
important dimension in the cases it is getting wrong. Better yet, something
more abstract like "smudge illuminated by light" or "bright background" that
will be hard to annotate (e.g., how bright is bright?).

~~~
nl
Yes(ish).

There are some practical software issues around not knowing the number of
classes in advance, but those are "just coding".

There is no reason why introducing a new class shouldn't be as simple as
providing additional examples, the same way you would for an existing class.

------
sjg007
Great talk.

I thought their approach to rain sensing was interesting. The vision-based
wiper function seems like overkill when a dedicated sensor can perform the
task almost flawlessly. I'd guess the AI has a dedicated circuit, though, and
is upgradeable, so those are pluses. And the rain AI is a good test case and
learning task for both the humans and the AI: hypothetically, if you can't
recognize raindrops, how can you recognize cars? It sounds like they learnt a
lot trying to make that function work, so hopefully much of the knowledge
from building that system generalizes/translates to the rest.

Besides that, modularity is an important design principle. It would be
interesting to see how people combine different NN modules and integrate them
with 1.0 code. Do you have an NN 2.0 controller? Some kind of self-learning
system that you train? I would imagine you'd want to take feedback into
account at some point, probably in a Bayesian way.

------
meken
Interesting thoughts about Software 2.0 IDEs at the end. A desirable property
of such a tool (I think) is having a short feedback loop. However, it seems to
me that common changes would include re-labeling a significant portion of your
data set (e.g. because you realize you need a new class that you didn't think
of when you began labeling), and retraining the model. It seems like transfer
learning/fine-tuning can alleviate the latter problem somewhat.
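
As a sketch of how fine-tuning limits that retraining cost
(PyTorch/torchvision; the new class count is a placeholder):

    import torch.nn as nn
    from torchvision import models

    model = models.resnet18(pretrained=True)
    for p in model.parameters():
        p.requires_grad = False       # freeze the pretrained features
    # Swap in a new head for the revised label schema (12 is a placeholder);
    # only this layer needs training after a re-label.
    model.fc = nn.Linear(model.fc.in_features, 12)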

Also an interesting take on how complexity has shifted from architecture
selection to labeling data. It makes me wonder if there won't be a "Software
3.0" where most of the complexity shifts from creating a good labeling schema
to, say, deciding on a good evaluation metric (I think the buck stops here as
I can't imagine an AI silver-bullet automatically determining the evaluation
metric). Perhaps unsupervised learning will come to the rescue and free us
from the complexities of label schema design.

~~~
FLUX-YOU
>deciding on a good evaluation metric

Have two or more labeling teams label the same data so you can reach a
consensus, or flag the differences, review them, and figure out why there was
a disagreement.
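
One standard way to quantify the disagreement between teams is
chance-corrected agreement such as Cohen's kappa (toy labels for
illustration):

    from sklearn.metrics import cohen_kappa_score

    team_a = ["car", "car", "trolley", "car"]
    team_b = ["car", "trolley", "trolley", "car"]
    # 1.0 = perfect agreement; low values flag items worth reviewing.
    print(cohen_kappa_score(team_a, team_b))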

Humans will be doing this for a while; I think it's worth having large
companies (as large as Goog/Amzn/MS/FB) dedicated to the task.

------
acoye
This reminds me of a relevant XKCD,
[https://xkcd.com/1838/](https://xkcd.com/1838/)

Given the technique shown for building a state-of-the-art neural net, I
wonder what QA will look like and whether we will be able to reach a
sufficiently low probability of failure.

Five-sigma reliability will be necessary, at least in some fields, before
humans accept relying on these systems (as in autonomous driving).

------
mst
I. um. 52 "required" cookies and video only.

What the hell? 52? Really?

~~~
tzahola
They need training examples for their neural nets. ¯\\_(ツ)_/¯

------
imranq
I see a lot of ML startups on a daily basis, and most of this rings true.
Software 1.0 accomplishes tasks by going from need -> logic -> solution.

Software 2.0 translates human intuition into machine code directly, through
advances in machine learning. How well a dataset has been labeled determines
how well a “software 2.0” program will work, since that’s where the human
intuition lies.

------
w_t_payne
'labelling is an iterative process' -- I learned that the hard way in 2008.

Fortunately, I now have a fairly refined method for managing data, labels
etc...

~~~
meken
I'm glad you found a workflow that works for you. I'd be interested in hearing
about what it looks like.

~~~
w_t_payne
Rigorous configuration management of data and metadata mostly. (Primary
records kept as text in a version control system, with a copy loaded into a DB
for searching).
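
A minimal sketch of that setup, assuming the primary records are plain-text
files under version control (the directory layout is hypothetical):

    import pathlib
    import sqlite3

    # Throwaway, rebuildable search index over the versioned text records.
    db = sqlite3.connect(":memory:")
    db.execute("CREATE TABLE records (path TEXT, body TEXT)")
    for f in pathlib.Path("labels").glob("*.txt"):   # hypothetical layout
        db.execute("INSERT INTO records VALUES (?, ?)",
                   (str(f), f.read_text()))
    hits = db.execute(
        "SELECT path FROM records WHERE body LIKE '%trolley%'").fetchall()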

------
victorai
Machine Learning might not be the answer:
[https://mobile.nytimes.com/2018/06/01/business/dealbook/revi...](https://mobile.nytimes.com/2018/06/01/business/dealbook/review-the-book-of-why-examines-the-science-of-cause-and-effect.html)

------
msoad
If you think about it, an AV system is a bunch of input sensors and
literally two output numbers (steering and acceleration). That makes it a
good candidate for one big black-box deep-learning system.

Google tried end-to-end deep-learning AV systems and failed, for exactly the
reasons he went through at the end of his talk.

------
cicero19
Great talk. Excited to watch what Andrej achieves in his career. Keep up the
great work!

------
rich-w-big-ego
As usual the comments around Tesla software really miss some critical
points:

    You can't debug a neural net

Obviously not, but you also cannot hand-write code to do what neural nets
do, so what's your point? If you can make your neural net 1000x better than a
hand-written algorithm, or if Tesla Autopilot is 1000x better than human
drivers, that objection doesn't matter. It's not "playing with human lives"
if the humans around the car are 1000x safer than they would be around
sleepy, distracted, or violent human drivers.

~~~
dboreham
Kool aid being drunk...

~~~
rich-w-big-ego
I'm drinking Kool-Aid? OK. I'm mostly arguing about the use of Static
Analysis and other typical means of verifying programs, and the
ineffectiveness of those verification methods on neural networks. People use
that as an argument against the safety of NNs.

The set of all computer programs has two subsets:

    S = { s | Static Analysis can be performed on s }
    N = { n | n makes use of a neural network }, with N ⊄ S

Let n ∈ N and s ∈ S. There is a certain set of programming tasks

    T = { t | t can be solved with n but not with s }

Thus any claim that using n to solve t is "unsafe" because you cannot perform
Static Analysis on it is absolute BS, because the programs s ∈ S can't even
solve the damn problem!

~~~
seanhunter
Just using set notation doesn't make an argument precise. You are begging
some questions in your definitions here.

For example, how do we know that a program solves a particular task if we
can't perform static analysis on it? It may give the appearance of working
and then degrade radically under certain conditions. That really matters if
you're using it for safety-critical applications and the problem space is too
large to be exhaustively tested.

------
bsaul
It's a very interesting video, yet am I the only one who thinks the whole
presentation looks extremely naïve and light for something that's playing
with our lives?

I mean, there are definitely huge shortcomings in the "let's have the
computer build the model from examples" approach, and while some of them are
discussed in the video, others are not:

- Rare events are hard to train for (that is discussed). The problem is that
there's a long tail of unusual events.

- The generated models can't be statically analyzed. You can't predict
what's going to work and what's not; you can only hope. One very striking
recent example is in this video:
[https://youtu.be/w2BWmSBog_0?t=220](https://youtu.be/w2BWmSBog_0?t=220).
Here you can see that an AI trained along the lines of AlphaZero managed to
reach a 3223 Elo rating (so, far beyond human), yet it blundered its queen.
And that's just chess, where every rule is written in advance.

- Models don't build human knowledge. That's more of a philosophical point,
but imagine a perfect AI built on neural networks after having read all human
knowledge. What can it teach us? AlphaZero isn't able to provide any clue or
explanation as to why it favors one move over another. You can only learn by
playing against it, and that's all. Not even the developers can tell you what
advances in chess theory AlphaZero has made.

~~~
rich-w-big-ego
The main point you offer is that models can't be statically analysed and
will do unexpected things that result in loss of human life. Here's my
response:

- Humans also do unexpected things, like stepping on the gas instead of the
brake. If Autopilot does unexpected things at 0.01% of the rate that humans
do, then it is a huge safety bonus to use Autopilot.

- How else can we solve image recognition? We must move the needle forward
on our technology. If we do not struggle against the adverse side-effects of
our software and make it better, we will never advance it, and we will be
stuck requiring human drivers for all driving tasks.

~~~
TimTheTinker
> If Autopilot does unexpected things at 0.01% of the rate that humans do,
> then it is a huge safety bonus to use Autopilot.

You’re forgetting a variable: the frequency or likelihood of a situation
occurring. Go far enough down the long tail and neural AI can be far more
deadly than human drivers.

~~~
rich-w-big-ego
It's unclear what your argument is.

~~~
TimTheTinker
I’ll try to clarify. In my opinion, the weakness of neural systems is their
inability to deal with input for which they have comparatively little or no
training. There’s no way around that, except by introducing structures or
systems outside the neural nets themselves that provide _logical_ frameworks
for dealing with rare events. (And I don’t just mean a bunch of logic in
code. Expert systems are one potential approach, for example.)

Your point was that modern autonomous driving systems can drastically reduce
fatalities compared to human drivers, and I agree, but only in circumstances
for which the car’s neural systems have been well trained. The systems ought
to be able to handle ~99.99% of the _types_ of circumstances gracefully
before most of us will trust them to drive us around safely.

------
mrfusion
What is the best practice for handling infrequent data points like the blue
stoplights he mentions?

~~~
isaac_burbank
Two that come to mind are:

- Using data augmentation to turn the small number of examples into enough
samples for appropriate representation within the dataset.

- Adding a weighting coefficient to the model's cost function to make
misclassifying these examples more expensive (see the sketch below).

Note: you can do serious harm to your model with either of these approaches
if you don't know what you're doing. The safest solution is to collect more
examples of the infrequent class.
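
A sketch of the weighting option, assuming PyTorch (the 50x factor and the
two-class setup are illustrative):

    import torch
    import torch.nn as nn

    # Class 0 is common, class 1 is rare: mistakes on the rare class
    # (say, blue stoplights) cost 50x more in the loss.
    loss_fn = nn.CrossEntropyLoss(weight=torch.tensor([1.0, 50.0]))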

------
iosDrone
Such a shame to see someone so talented working for a company that is light-
years behind Waymo and GM in autonomous driving and that is going to go
bankrupt.

I believe the last three directors of autopilot have quit in the last 3 years
or so. And that's in addition to the mass exodus of executives, some of whom
left millions of dollars of stock options on the table.

~~~
justicezyx
Not sure what you mean.

Talent is pricier in places that lack it. Waymo can tap into Google's vast
talent pool, so this talented person would be worth less at Waymo, for sure.

------
outside1234
Saw this at Spark+AI - highly recommended

