
How should we evaluate progress in AI? - amasad
https://meaningness.com/metablog/artificial-intelligence-progress
======
YeGoblynQueenne
>> GOFAI had several defects, but… the main thing is, nearly all of it was
false.

It would be nice to see some substantial examples of how "(nearly all of)
GOFAI was false" accompanying statements like the one above. The problem, of
course, is that those are very hard to come by.

That is so because logic-based AI was _abandoned_. And it was abandoned
because funding was cut repeatedly, not because of its failure to prove this
theory or achieve that aim, but because the ones holding the purse strings
were administrators and military pencil-pushers, who had no way to recognize a
successful, or failed, program if one came up and bit them in the boogies.

And just to substantiate my comment: what, exactly, was "false" about logic
programming, one of the major research subjects in GOFAI? It worked just fine
back then, and it works just fine right now. In very practical, down-to-earth
terms, you can prove a proposition, or a predicate, true or false by automatic
means, sure as you can answer "2 + 2 = ?".
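
For a minimal illustration of "by automatic means" (a toy sketch in Python
rather than a resolution prover like Prolog's; the example formula is mine):
propositional logic, at least, is a fragment where proof really is a
mechanical, always-terminating procedure:

```python
# Exhaustive truth-table checking decides any propositional formula.
from itertools import product

def is_tautology(formula, n_vars):
    """True iff `formula` holds under every truth assignment."""
    return all(formula(*values)
               for values in product([False, True], repeat=n_vars))

# Example: modus ponens as a formula, ((p -> q) and p) -> q.
implies = lambda a, b: (not a) or b
modus_ponens = lambda p, q: implies(implies(p, q) and p, q)
print(is_tautology(modus_ponens, 2))  # True
```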

So, really: more substance and less assertiveness would do a world of good for
those for whom "AI" means everything they read online after 2012, and who may
end up missing a hell of a lot of the history of the field if they take that
sort of "GOFAI failed" statement at face value.

~~~
colorint
As a general matter, you can't prove a predicate true or false by automatic
means. Logic programming is just as "artful" as imperative or functional
programming, because they all run into the same problem: it's impossible in
general to tell the difference between a long-running and an infinite
computation. The question of algorithms for general logic was explicitly
addressed as the Entscheidungsproblem (the "decision problem"), which Turing
and Church independently proved undecidable:

[https://en.wikipedia.org/wiki/Entscheidungsproblem](https://en.wikipedia.org/wiki/Entscheidungsproblem)
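
To make the undecidability concrete, here is the standard diagonalization
argument sketched in Python; `halts` is a hypothetical oracle assumed only for
the sake of contradiction, not a real function:

```python
def halts(program, arg):
    """Supposed total decider: True iff program(arg) terminates."""
    raise NotImplementedError  # no such function can actually exist

def diag(program):
    # Do the opposite of whatever the oracle predicts about the
    # program applied to its own source.
    if halts(program, program):
        while True:  # predicted to halt, so loop forever
            pass
    # predicted to loop, so halt immediately

# diag(diag) terminates iff halts(diag, diag) returns False, i.e. iff
# the oracle says it doesn't; contradiction, so `halts` cannot exist.
```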

~~~
YeGoblynQueenne
>> As a general matter, you can't prove a predicate true or false by automatic
means.

Not in the general case, sure, yet in practice I'm sure we've all written
plenty of code that terminates just fine. [Edit: I'm talking about imperative
as well as logic or functional programming code].

The question then is: what does it mean when a program terminates? In the case
of principled approaches like logic or functional programming, you have a
pretty good idea what that means (e.g. a logic program proves a theory true or
false). When an imperative program terminates, it's a very hairy affair to say
what, exactly, termination means.

[Edit 2: Actually, if you think about it, there's nothing we can really
achieve in the general case (including machine learning; see language learning
in the limit). In practice, on the other hand, we're doing things all right,
by continuously relaxing principles and fudging limits as necessary (see PAC
learning)].
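
[To make the "fudging" concrete: the textbook PAC guarantee for a finite
hypothesis class H trades "exactly correct" for "probably approximately
correct". In the realizable case, with probability at least 1 − δ, a
consistent learner outputs a hypothesis with error at most ε once the number
of training samples m satisfies

```latex
m \;\ge\; \frac{1}{\epsilon}\left(\ln|H| + \ln\frac{1}{\delta}\right)
```

where the accuracy slack ε and the confidence slack δ are precisely the
relaxed principles.]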

------
taeric
I confess I was tempted to take a pass on this one, in large part because I've
grown fatigued with reading about AI and Machine Learning. It didn't help that
this is a long article.

That said, I encourage everyone to give this more than a single pass. There is
irony in the fact that we are, to this day, still quoting and agreeing with
Feynman's Cargo Cult Science piece. It is almost disheartening to see that we
still have a hard time listening to advice from the early 1970s. By and large,
though, this piece does a great job of laying out what makes it so hard to
level most criticisms at AI-related studies. The cross-disciplinary look is
one I wouldn't have thought to take, but it really does explain a lot.

I'm torn, because I'm sympathetic to most of the defenses. It is hard to
really believe we are making meaningful progress, though, even if I am
enjoying many of the small practical improvements we have managed to get out
of things.

~~~
Erlich_Bachman
What would "meaningful progress" entail for you? How would your life change,
how would the life on earth in general change? Great achievements in science
(aided by AI)?

~~~
taeric
That is a different question, though. The question is if there is meaningful
progress in AI. The amount of progress explained solely by percentage
improvements against a benchmark is pretty high.

I don't think this is particularly damning. Nor do I think it should be
halted. However, I agree it is hard to call progress.

~~~
Erlich_Bachman
What specifically would be easy for you to call progress?

~~~
taeric
An equation that could predict a better ML model. Honestly, I think claiming I
want an equation is a touch too much. However, a falsifiable prediction would
be nice.

Imagine if the only way we got accurate ballistics was by requiring faster,
more powerful guns all of the time. "We can hit the target, but only if we
upgrade our guns to railguns and limit ourselves to large targets."
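
To sketch the kind of falsifiable prediction I mean (a hypothetical form, not
an established law): fit an empirical learning curve at small training-set
sizes m and commit in advance to the error at a larger size, e.g.

```latex
\mathrm{err}(m) \;\approx\; a\, m^{-b} + c
```

with a, b, c estimated from small-m runs and the predicted error at 10x the
data published before the big run. Right or wrong, that would be falsifiable.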

------
andyidsinga
As a person who works in software development, trying to make sense of data,
I found the article pretty interesting from the paragraph below onward. The
discussion of science, engineering, and their adjacencies is good to keep in
mind.

> AI researchers often say they are doing engineering. This can sound
> defensive, when you point out that they aren’t doing science: “Yeah, well,
> I’m just doing engineering, making this widget work better.” It can also
> sound derisive, when you suggest that philosophical considerations are
> relevant: “I’m doing real work, so that airy-fairy stuff is irrelevant. As
> an engineer, I think metaphysics is b.s.”

~~~
aisofteng
Any engineer that thinks metaphysics is “b.s.” is uneducated in the schools of
thought that made engineering of any sort possible in the first place.

~~~
j88439h84
Can you explain what you mean by this?

------
stared
Please, no. While he provides some food for thought and asks a few important
questions (and provokes even more), this text is full of
pseudointellectualism. As in: fancy words, but an utter lack of understanding
of the core subjects one writes about. Most importantly: what is science (no,
it is not a trivial question; I recommend going the Ludwik Fleck "Genesis and
development of a scientific fact" route,
[http://www.evolocus.com/Textbooks/Fleck1979.pdf](http://www.evolocus.com/Textbooks/Fleck1979.pdf))
and what is AI (it is a vast field; some of it IS math/CS (as in: proving
things), but that is a small part).

For example, many of the things he says would fit other practical disciplines,
e.g. medicine. Yes, experiments are run on groups of people. Yes, the
criterion "drug X works better than drug Y, but we don't know why" is
sufficient.

> I don’t know data science folks well, but my impression is that they find
> the inexplicability and unreliability of AI methods frustrating.

Well, it is not the main issue (speaking as a data scientist working with AI).
At least he acknowledges his lack of expertise.

> These failures of scientific practice seem as common in AI research now as
> they were in social psychology a decade ago. From psychology’s experience,
> we should expect that many supposed AI results are scientifically false.

Also: no, it is not at the level of psychology when it comes to the
replication crisis. A lot of code (though, unfortunately, not all) is shared
online, by the authors or other contributors, and people do replicate it (or
there is an absence of replication, which also conveys a message).

~~~
Radim
Speaking as both an ML researcher and an applied ML business owner (one who
hired _stared_ at one point — hi Piotr :), I respectfully disagree.

The replication crisis in "AI" may not be as bad as psychology (I wouldn't
know), but it's not great. Sadly, my brain has somehow learned to equate
"SOTA" with "hot-stitched crap, stay away". Too many painful lessons.

On the subject of publishing code: this is useful to the degree that it
removes bad faith as the possible reason for a lack of replicability. But
otherwise it helps little in practical terms. You just get the privilege of
sifting through the bugs and bad design in close-up.

 _" I am afraid you are right. I used to reach ~72% via the given random seed
on an old version of pytorch, but now with the new version of pytorch, I
wasn't able to reproduce the result. My personal opinion is that the model is
neither deep or sophisticated, and usually for such kind of model, tuning
hyper parameters will change the results a lot (although I don't think it's
worthy to invest time tweaking an unstable model structure)."_

= quote [1] from one of the "new SOTA" papers in NLP (WikiQA question
answering), where the replication score came out at 62% instead of the claimed
72%.

I generally call this the "AI Mummy Effect" — looks great but crumbles to dust
on touch.

[1]
[https://github.com/pcgreat/SeqMatchSeq/issues/1](https://github.com/pcgreat/SeqMatchSeq/issues/1)
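
For the record, the usual seed-pinning ritual looks something like the sketch
below (a minimal sketch, assuming PyTorch; none of this is from the paper in
[1]). As the quote shows, even this does not survive a library upgrade:

```python
# A minimal sketch of the usual reproducibility ritual in PyTorch.
# Even with everything pinned, results can still shift across library
# versions and hardware: exactly the failure mode quoted above.
import random

import numpy as np
import torch

def set_seed(seed: int = 42) -> None:
    random.seed(seed)                  # Python's built-in RNG
    np.random.seed(seed)               # NumPy RNG
    torch.manual_seed(seed)            # CPU RNG (also seeds CUDA)
    torch.cuda.manual_seed_all(seed)   # all GPU devices explicitly
    torch.backends.cudnn.deterministic = True  # deterministic kernels
    torch.backends.cudnn.benchmark = False     # disable autotuning
```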

~~~
stared
Hi Radim!

To make it clear, I am not happy with the current state of reproducibility in
AI. Yet it is still better than in all the disciplines I have interacted with
(quantum physics, mathematical psychology). There the standard practice was to
not include any code, even if the paper is based on it.

Vide my answer to "Why are papers without code but with results accepted?"
([https://academia.stackexchange.com/questions/23237/why-are-papers-without-code-but-with-results-accepted/23238#23238](https://academia.stackexchange.com/questions/23237/why-are-papers-without-code-but-with-results-accepted/23238#23238)).

So I was very happy to see that in Deep Learning a lot of code appears on
GitHub (I am happiest when it appears in different frameworks, implemented by
different people).

Dirty code provides limited value. It's hard to learn from it, it's hard to
re-use it, and its performance may depend on the phase of the Moon (and system
settings, software versions, etc.). Yet, IMHO, it is much better than no code.
It is not only about good faith, but about including all the details. Some of
them may seem unimportant (even to the author), yet prove crucial for the
results.

The next level is reasonably well-written code, with a clear environment
specification (e.g. a Dockerfile or requirements.txt), and the dataset.
Otherwise it is hard to guard against "it works on my environment":

> where the replication scores came out 62% instead of claimed 72%

------
laichzeit0
> “This year, we’re getting Z% correct, whereas last year we could only get
> (Z-ε)%” does sound like progress. But is it meaningful?

This is one thing that bothers me a lot when I read published work. I have a
feeling this is a result of everyone using the same benchmark datasets, so it
inevitably becomes more of an _engineering_ exercise than scientific progress.

In NLP, the difference between a publishable result and one that is not is
often a matter of squeezing out an extra ε% by throwing an attention mechanism
and ensembles into your new super-duper improved SOTA architecture.
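
The recipe is almost mechanical. A minimal sketch of the ensembling half (the
model interface is an assumption, scikit-learn style, with a `predict_proba`
per model; nothing here is from any particular paper):

```python
import numpy as np

# Average class probabilities over k independently seeded runs of the
# same architecture, then take the argmax: often worth the extra few %.
def ensemble_predict(models, X):
    """`models`: a list of fitted objects exposing predict_proba(X)."""
    avg = np.mean([m.predict_proba(X) for m in models], axis=0)
    return avg.argmax(axis=-1)
```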

This is the problem of "replicability", which really requires more than just
the same benchmark dataset used over and over again. The author seems to touch
on this point later on.

Then there's the issue of "reproducibility". Very few researchers seem to
publish their code with instructions on how to build it and re-create their
results. An awful lot of time is wasted trying to reproduce results. Here's a
good example:
[https://groups.google.com/forum/#!topic/word2vec-toolkit/Q49FIrNOQRo](https://groups.google.com/forum/#!topic/word2vec-toolkit/Q49FIrNOQRo)

------
joejerryronnie
I am not trained in any type of field that is remotely related to AI research
or engineering. Outside of some basic ML projects at work, I am not well
versed in the practical application of AI technologies. But I do wonder a few
things about AI:

- Have we made real technological progress over the last 50 years, or are we
just leveraging far greater computing power and the ability to collect much
larger data sets to run statistical analysis on?

- Will general purpose AI consist of essentially layers and layers of AIs that
can handle progressively more abstract inputs, models, and patterns? For
instance, the lowest-level AI is what we see today: a very powerful tool but
bound to a specific use case. One layer up may be able to combine inputs from
a dozen first-tier AIs to generalize a tiny bit more beyond the individual use
cases and deal with a tiny bit more ambiguity. One level up will evaluate
inputs from a dozen level-2 AIs, and so on, with the final top layer (perhaps
millions of levels up) resembling general purpose processing similar to a
human brain. What if this model ended up producing true general purpose AI,
but the amount of input synthesizing and modeling required so much processing
power that the speed at which general purpose AI could operate were no faster
than a human brain?

- Can we achieve general purpose AI through purely algorithmic means, or will
we need to implement a hybrid biological model to achieve real breakthroughs?
If we could accomplish this, would we understand the detailed mechanisms of
the biological component of the hybrid, or would it forever remain a black box
that we just tap into?

Anyway, not sure this adds a whole lot to the specific discussion on how best
to measure AI progress, but they're questions I've been pondering lately.

~~~
pnloyd
Ya, that's kind of an interesting question. With Moore's law starting to
approach its physical limitations, it would seem AGI wouldn't be feasible with
current algorithms.

Not to mention that, as you're describing it, those "millions" of layers of
narrow AIs sound like an impossible amount of work to do.

I don't think very many machine learning experts really believe that those
techniques will lead to AGI.

~~~
taeric
I'm curious just how true it is that Moore's law is starting to approach
physical limitations. I just recently listened to some of Feynman's speeches
collected in [1]. One of them was about how to place the entire contents of an
encyclopedia onto the head of a pin.

Did it cover anything we couldn't do today? No. But that was the point. Even
using a somewhat naive view of the physical matter you would be writing on, it
was possible to go quite dense. Imagine if we started going even denser.

Do I suspect we are approaching limits? Certainly. Question is more of just
how much further we can go. And will we need a dramatic shift of any sort
before we could realize some extra distance?

[1] [https://www.audible.com/pd/Science-Technology/The-Pleasure-of-Finding-Things-Out-Audiobook/B00BSU83HI](https://www.audible.com/pd/Science-Technology/The-Pleasure-of-Finding-Things-Out-Audiobook/B00BSU83HI)

~~~
ghaff
>I'm curious just how true it is that Moore's law is starting to approach
physical limitations.

Well, Moore's "Law" in the narrow sense has been, to a large degree, about
CMOS process scaling and that's clearly running into physical limits.

There are other levers to get better economic performance, some of which come
at the cost of extra work in software. For certain workloads, GPUs and TPUs
have been an important workaround. There are almost certainly further
optimizations involving stacking and interconnects, and probably other
application-tailored designs (which then have to have software tailored for
them individually).

But CMOS scaling has been such a powerful lever that there's legitimate
concern that it may not be possible to replicate that kind of advance using
other techniques.

~~~
taeric
Fair. In the original context, it is clear they were talking about CMOS, and
yes, we do seem to be nearing those limits quite quickly.

I'm curious if/when we could/should move off of current CMOS techniques.

------
mlthoughts2018
> “It’s not scientific progress unless you understand where the improvement is
> coming from.”

I don’t agree with this. If you can chronicle improvement, that is progress.
Giving a satisfying linguistic description of that improvement, when possible,
might be _more progress_, but merely documenting it is extremely important
scientific progress in its own right.

Overall this essay was extremely hard to read and should cut about 75% of its
content. The whole wolpertinger thing is nothing but a distraction. Just say
AI is a mixture of disciplines and serves a mixture of outcomes. Acting like
you're being literary or nuanced with the wolpertinger thing only subtracts
from the arguments.

And to boot, after so many words, the final advice is extremely hollow...
literally just saying,

> “And so we should try to do better along lots of axes.”

How should we improve? I guess by “doing better” on “multiple axes.”

The section on “antidotes” is hardly better, saying:

> “I will suggest two antidotes. The first is the design practice of
> maintaining continuous contact with the concrete, nebulous real-world
> problem. Retreating into abstract problem-solving is tidier but usually
> doesn’t work well.”

Except this is already what basically everyone tries to do. Research labs try
to maintain direct contact with state of the art benchmark tasks on a wide
variety of data sets. And often they work extremely hard to produce results
robust across several tasks and several data sets.

And in various other fractured or specific cases, the researchers are very
clear up-front they are solving one particular, ultraspecific problem in the
scope of the paper.

(Unfortunately the second antidote is more “wolpertinger”... ugh.)

~~~
amasad
> And to boot, after so many words, the final advice is extremely hollow...
> literally just saying,
>
> “And so we should try to do better along lots of axes.”
>
> How should we improve? I guess by “doing better” on “multiple axes.”

That's not what the final advice is; the author is suggesting the use of
"meta-rationality":

> "AI is a wolpertinger: not a coherent, unified technical discipline, but a
> peculiar hybrid of fields with diverse ways of seeing, diverse criteria for
> progress, and diverse rational and non-rational methods. Characteristically,
> meta-rationality evaluates, selects, combines, modifies, discovers, creates,
> and monitors multiple frameworks."

Although not expanded on in this essay, it seems like the whole blog is
dedicated to the topic.

~~~
mlthoughts2018
> "That's not what the final advice is, the author is suggesting the use of
> "meta-rationality""

I think you misread that section of the essay, because the whole conclusion
of the meta-rationality section was the quote that I already gave in my
comment, “And so we should try to do better along lots of axes.”

Literally, that is the sum-up of advice in the lone section of the essay that
possibly has any call to action or advice. It gives a fairly quick and
superficial overview of meta-rationality (which is OK), but does not say
anything at all about putting it into practice except for "doing better" on
"multiple axes" (literally, this is all it says).

So when you say the "final advice" is meta-rationality -- that's already what
I was talking about. That's exactly the part where the essay fails to give any
type of actionable payoff at all.

------
sgt101
If the subject is seen as a system rather than a thing, you can make some
progress. On the one hand we have Artificial Intelligence, where insights from
cognition and biology are used to model and explain reasoning and behaviour.
On the other we have AI, where people use the outcomes of Artificial
Intelligence research, together with other pragmatically selected technical
components, to develop technology. I think that there is interdependence and
exchange between the two, but they have different methodologies and processes.
I also think that huge trouble is created when members of one community wear
the other community's clothes and claim its achievements: for example,
Artificial Intelligence researchers talking about the practical impact and
near-term application of their work, and conversely AI researchers claiming
that their work "works like a brain".

------
ggm
Most (human) languages have a set of rules and a variant set of exceptions
("i before e except after c", plus this catalog of things which we pretend
don't exist).

So ML and NLP are cases of things which are amenable to rule-based systems;
and because English corpora exist widely, systems can be tested openly,
against each other and against a common norm of comprehension (for English
speakers).

Generalized AI does not lie here: systems which uncover the grammar rules and
exceptions do not generalize to systems which uncover rules in Law, or Equity,
or financial trading, or other things. Yes, you can train nets. But the
commonality here is that you can train, not that emergent AI is found.

(not an AI person, strongly anti-AI perspective from life experience in
compsci)

~~~
fny
Is there really such a thing as AGI, though, rather than collections of
modules that are trained separately and may have no meaningful mode of
interaction? What does NLP have to do with the mathematics of financial
trading?

AGI has always felt to me like a large-scale
interdisciplinary/intercomputational activity, much in the same way that most
human intelligence derives from years of intergenerational and interpersonal
intellectual development.

AGI will never be one but many: many intelligences and systems interacting to
produce something of utilitarian value.

~~~
ggm
_much in the same way that most human intelligence derived from years of
intergenerational and interpersonal intellectual development._

wow! I had no idea the archeology on early hominids was that good... (yes,
implicit /s)

------
toolslive
For typical problems (playing chess, tissue segmentation, translation, ...)
you have the quality of the solution (sometimes difficult to measure) versus
the cost of achieving it (energy/entropy and maybe time). The cost is
important.
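
One hedged way to make that trade-off explicit is a scalarized objective,
where λ is a problem-specific exchange rate rather than a universal constant:

```latex
\text{score} \;=\; Q(\text{solution}) \;-\; \lambda\, C(\text{solution})
```

with Q the (possibly hard-to-measure) quality and C the energy/time cost.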

------
jmickey
Relevant - AI Progress Measurement page by the EFF:
[https://www.eff.org/ai/metrics](https://www.eff.org/ai/metrics)

------
tim333
This seems to be an article written by a philosophical type who doesn't really
understand contemporary AI from a technical point of view. I'm not sure it's
terribly useful.

~~~
rrherr
About the author:

“I did a PhD in artificial intelligence at MIT. My undergraduate degree was in
math. I’ve also studied cognitive science, biochemistry, Old English and
Ancient Greek literature. None of that qualifies me to write Meaningness, but
it may explain a certain STEM-ish orientation, decorated with occasional
literary jokes. ...

I have founded, managed, grown, and sold a successful biotech informatics
company. That may explain a certain practical orientation, and lack of
interest in philosophical theories that depend on the world being very unlike
the way it appears.”

[https://meaningness.com/about-my-sites](https://meaningness.com/about-my-sites)

~~~
tim333
Ah maybe not then.

~~~
nyolfen
The site this blog is hosted on is really great for explaining a lot of
aspects of postmodernism to engineering-oriented or analytical thinkers. I got
totally sucked in by it over my Christmas vacation two years ago and read the
whole thing.

