
New Top Score on AI2 Reasoning Challenge (ARC) Is 53.84% - alexwg
https://leaderboard.allenai.org/arc/submissions/public
======
vxl
The challenge appears to be to create an AI that answers as many of the
multiple-choice questions correctly as possible.

Here are some of the questions from the dataset:

• Which of these is inherited by a person from his or her parents? (A) short
hair (B) long arms (C) pierced ears (D) scar on the leg

• Which object occupies the greatest amount of space? (A) a galaxy (B) a black
hole (C) a neutron star (D) a solar system

• What is a similarity between sound waves and light waves? (A) Both carry
energy. (B) Both travel in vacuums. (C) Both are caused by vibrations. (D)
Both are traveling at the same speed.

There's an entry called "Guess All" that scored 25%, as you'd expect for
four-way multiple choice.

They provide a list of 14 million science-related sentences (presumably for
training), but there's no requirement to rely solely on them to solve the
challenge. The list was scraped from web search results, so it looks quite
noisy.

~~~
LeifCarrotson
4-choose-1 makes 53% a lot more impressive than when I assumed it was
true/false. Like the other poster, I had assumed it was a sum of random
chance, statistical deviation, and publication bias, but 53% suggests it's
doing a little better than that.

~~~
stakhanov
I oversimplified in my previous post.

You can usually get a result slightly better than random chance by applying
some really-not-all-that-impressive baseline strategy, like looking for
bag-of-words overlap between a question and a known factoid for which you
have a stored answer. This strategy is easy to fool: if you have a factoid
like "Dogs that chase cats are dangerous" and someone asks "Are cats that
chase dogs dangerous?", it might answer "Yes" because the question matches
the stored fact. But the answer will actually _be_ "Yes" in more than 50% of
all cases: since dogs tend to chase cats, and cats tend not to chase dogs,
the concept of cats chasing dogs is less likely to come up in a
question-answering context than the concept of dogs chasing cats. There are
obviously numerous other ways this is going to fail, e.g. "Trump believes
the world is flat" might be a factoid while the question is "Is the world
flat?".
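To make the failure mode concrete, here's a minimal sketch of such a
bag-of-words baseline. The factoids, tokenizer, and 0.8 overlap threshold
are all invented for illustration; this is not the actual ARC or RTE
baseline.

```python
# Minimal sketch of a bag-of-words overlap baseline (illustrative only;
# the factoids, tokenizer, and threshold are made up).

def bag_of_words(text):
    """Lowercase the text and split it into a set of word tokens,
    discarding punctuation and word order."""
    return set(text.lower().replace("?", "").replace(".", "").split())

# Hypothetical stored factoids, each implicitly answering "yes".
FACTOIDS = [
    "Dogs that chase cats are dangerous",
    "Sound waves and light waves both carry energy",
]

def answer_yes_no(question):
    """Answer 'yes' if any stored factoid shares enough of its words
    with the question. Word order is ignored entirely."""
    q = bag_of_words(question)
    for fact in FACTOIDS:
        f = bag_of_words(fact)
        overlap = len(q & f) / len(f)
        if overlap >= 0.8:  # arbitrary threshold for this sketch
            return "yes"
    return "no"

# The baseline cannot tell these apart: both questions reduce to
# exactly the same bag of words as the stored factoid.
print(answer_yes_no("Are dogs that chase cats dangerous?"))  # yes (correct)
print(answer_yes_no("Are cats that chase dogs dangerous?"))  # yes (fooled)
```

Because "dogs chase cats" and "cats chase dogs" produce identical word
sets, the baseline is right for the first question only by accident of
which questions people tend to ask.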

The result I previously oversimplified should more correctly be stated as
follows: answers coming from the "artificial intelligence" systems submitted
to the conference were analyzed to see whether each deviation from the
baseline fixed an error the baseline made, or introduced an error the
baseline wouldn't have made. The submissions were then analyzed for a
pattern whereby more errors were fixed than introduced. What was found is
that, statistically, the deviations looked like random noise added to the
baseline, and systems that scored very badly as a result of such random
deviations simply weren't submitted to the conference. That produces scores
that are consistently more likely to be better than the baseline than worse,
but not in a meaningful way.
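This selection effect is easy to reproduce in a toy simulation. Everything
here (baseline accuracy, question and system counts) is invented to
illustrate the mechanism, not taken from the thesis: every "system" deviates
from the baseline purely at random, yet the submitted subset still averages
above the baseline.

```python
# Toy simulation of "random deviation + publication bias" (all numbers
# invented for illustration).
import random

random.seed(0)

BASELINE = 0.50      # hypothetical baseline accuracy
N_QUESTIONS = 1000   # questions per evaluation run
N_SYSTEMS = 200      # independent "systems" that were tried

def noisy_system_score():
    """A system whose deviations from the baseline are pure noise:
    it fixes a baseline error exactly as often as it introduces one,
    so its expected accuracy equals the baseline."""
    correct = sum(random.random() < BASELINE for _ in range(N_QUESTIONS))
    return correct / N_QUESTIONS

all_scores = [noisy_system_score() for _ in range(N_SYSTEMS)]

# Publication bias: runs that happened to land below the baseline
# are quietly never submitted to the conference.
submitted = [s for s in all_scores if s > BASELINE]

print(f"mean over all runs:       {sum(all_scores) / len(all_scores):.3f}")
print(f"mean over submitted runs: {sum(submitted) / len(submitted):.3f}")
```

The mean over all runs sits at the baseline, while the mean over submitted
runs is reliably above it, even though no system contains any signal.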

To reiterate the source: It can be found on page 43 here:

[http://richard.bergmair.eu/pub/thesis.pdf](http://richard.bergmair.eu/pub/thesis.pdf)

But that applies to the old RTE conference. So it would be interesting to see
if the same holds true here.

~~~
yorwba
This leaderboard tries to prevent that kind of problem:

> The dataset is partitioned into a Challenge Set and an Easy Set, where the
> former contains only questions answered incorrectly by both a retrieval-
> based algorithm and a word co-occurrence algorithm. This leaderboard is for
> the Challenge Set.

Additionally, I don't think they let you submit often enough to have a
meaningful chance of significantly beating the baseline through pure
randomness.

------
stakhanov
A bit reminiscent of the "Recognizing Textual Entailment Challenge (RTE)"
which was run under the "Text Analysis Conference" umbrella and hosted by NIST
until it was discontinued a few years back. An interesting insight from a
qualitative analysis of the deviations of submitted answers from the gold
standard answers is that they can be explained surprisingly well by random
choice minus publication bias. See here:

[http://richard.bergmair.eu/pub/thesis.pdf](http://richard.bergmair.eu/pub/thesis.pdf)
[page 43]

That's what the number "53.84%" sounds like to me.

