
Neural programmer better than Quicksort - nl
https://arxiv.org/abs/2007.03629
======
ghj
It's really hard to write correct code.

Sorting was broken in java and nobody noticed for a long time:
[http://envisage-project.eu/wp-content/uploads/2015/02/sortin...](http://envisage-project.eu/wp-content/uploads/2015/02/sorting.pdf)

The same was true for java's binary search:
[https://ai.googleblog.com/2006/06/extra-extra-read-all-about...](https://ai.googleblog.com/2006/06/extra-extra-read-all-about-it-nearly.html)

So I am not sure I will ever trust an ML algorithm trained on inputs/outputs
only (which is what I think "neural program induction" means). The above bugs
were only hit because the programmers who wrote them didn't think about
overflow. What is this "neural programmer" assuming about its inputs? We'll
never know.
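
Both bugs were the same one-line overflow, per the second link. A minimal
Java sketch (illustrative, not the actual JDK source):

    // The classic midpoint overflow (sketch, not the actual JDK code):
    static int brokenMid(int low, int high) {
        return (low + high) / 2;        // overflows once low + high > Integer.MAX_VALUE
    }

    // Either fix from the blog post works:
    static int fixedMid(int low, int high) {
        return low + (high - low) / 2;  // or: (low + high) >>> 1
    }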

OTOH the standard library sort can be improved if you know the distribution of
the numbers you're sorting (e.g., small numbers can use bucket sort, small
lists can use a sorting network, etc.). If this thing can efficiently
almost-correctly sort because it's better at this kind of pattern matching, we
can just run a final pass of insertion sort to make it useful!
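
The cleanup is cheap precisely when the output is almost sorted, since
insertion sort runs in O(n + d) for d remaining inversions. A sketch in Java
(neuralAlmostSort is a hypothetical stand-in for the learned sorter):

    // Hypothetical repair pass: let the learned sorter do most of the work,
    // then fix any remaining inversions with one insertion-sort sweep.
    static void sortWithRepair(int[] a) {
        neuralAlmostSort(a);                 // hypothetical, may leave a few inversions
        for (int i = 1; i < a.length; i++) {
            int x = a[i];
            int j = i - 1;
            while (j >= 0 && a[j] > x) {     // shift larger elements right
                a[j + 1] = a[j];
                j--;
            }
            a[j + 1] = x;
        }
    }

    // Stub so the sketch is self-contained; imagine the model call here.
    static void neuralAlmostSort(int[] a) { /* learned model goes here */ }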

~~~
nardi
Next step is to prove the ML-produced algorithm is correct. I can see a world
where we generate algorithms and then automatically prove they're correct
based on a formal description of the problem being solved.

~~~
bufferoverflow
It doesn't need to be 100% correct. There's a place for fast sorting
algorithms which are correct most of the time - basically anything non-
critical, like sorting comments by upvotes.

According to the article, they couldn't find a case where it wasn't correct.

~~~
jcelerier
> There's a place for fast sorting algorithms which are correct most of the
> time - basically anything non-critical, like sorting comments by upvotes.

Nothing saddens me more than this trend of the modern web where everything
works semi-probabilistically (even if it's likely for "good" technical
reasons, such as "we wrote our server backend in Ruby, which is slow as
molasses, so now we need 230 CDNs and 800 database instances around the whole
world, and we've transformed our simple centralized problem into a horrendous
decentralized one").

The central reason for me to use computers is that they are (or at least were)
deterministic to a much higher degree than normal life. So many things in the
last 10 years becoming much more non-deterministic, in particular on social
websites, is something that frustrates me every single day, as it makes the
whole experience and process of using computers & the web very unreliable
compared to what it used to be.

~~~
diegoperini
Today's perfections are yesterday's "good enoughs". Don't be sad for the
trend. In 10 years, you will have new perfections to enjoy.

~~~
UncleOxidant
Like the probabilistic bank account balance. "You have between $10 and $1000
with the greatest likelihood being $537 (52% chance)."

~~~
FabHK
There was this thread a while ago with HSBC switching to MongoDB... so, yeah,
distinct possibility :-)

[https://news.ycombinator.com/item?id=23507197](https://news.ycombinator.com/item?id=23507197)

(The article was very light on details though, and it was probably just one
team that consolidated to MongoDB, not the accounts themselves... one hopes.)

------
xiphias2
The code (which is a few percent faster than quicksort) is on the last page:

    
    
    1:  procedure QUICKSORTAGENT(input state)
    2:    Let i = 1, j = 2, l = 3, h = 4
    3:    if FunctionID = None then
    4:      return vh ← Function1(vl ← vl, vh ← vh)
    5:    else if FunctionID = 1 then                      ▷ QuickSort
    6:      if vl < vh then
    7:        if prev = None then
    8:          return vi ← Function2(vl ← vl, vh ← vh)
    9:        else if prev = (vi ← Function2(vl ← vl, vh ← vh)) then
    10:         return AssignVar(j, i)
    11:       else if prev = AssignVar(j, i) then
    12:         return MoveVar(i, -1)
    13:       else if prev = MoveVar(i, -1) then
    14:         if vi > vl then
    15:           return vi ← Function1(vl ← vl, vh ← vi)
    16:         else
    17:           return MoveVar(j, +1)
    18:         end if
    19:       else if prev = (vi ← Function1(vl ← vl, vh ← vi)) then
    20:         return MoveVar(j, +1)
    21:       else if prev = MoveVar(j, +1) and vj < vh then
    22:         return vh ← Function1(vl ← vj, vh ← vh)
    23:       else
    24:         return Return(h)
    25:       end if
    26:     else
    27:       return Return(h)
    28:     end if
    29:   else                                             ▷ FunctionID = 2, Partition
    30:     if prev = None then
    31:       return AssignVar(i, l)
    32:     else if prev = AssignVar(i, l) then
    33:       return AssignVar(j, l)
    34:     else if vj < vh then
    35:       if prev = Swap(i, j) then
    36:         return MoveVar(i, +1)
    37:       else if (prev = AssignVar(j, l) or prev = MoveVar(j, +1)) and A[vj] < A[vh] then
    38:         if vi ≠ vj then
    39:           return Swap(i, j)
    40:         else
    41:           return MoveVar(i, +1)
    42:         end if
    43:       else
    44:         return MoveVar(j, +1)
    45:       end if
    46:     else if prev = MoveVar(j, +1) then
    47:       return Swap(i, h)
    48:     else
    49:       return Return(i)
    50:     end if
    51:   end if
    52: end procedure

~~~
collyw
This reminds me of programming BASIC on an Acorn Electron. Line numbers, and I
don't remember indentation, though it was a few decades ago.

~~~
dspillett
The Electron had a new enough version of BBC BASIC that "proper" procedures
and UDFs were supported (in a limited fashion, but well enough considering the
constraints of the machine), so you could pretty much ignore line numbers. In
fact, by using a text editor and TYPEing the result into BASIC you could write
code without them (the interpreter needed them; the TYPE trick made it put
them in for itself).

But yes, this was uncommon, so line numbers were a major thing.

And indentation was optional: you could add extra spaces to the start of lines
and they would be kept and have no effect on execution, but you would be
wasting precious bytes of RAM, and if using one of the heavier display modes
(0, 1, or 2) you would only have 8.5 KB to play with for _all_ the code and
run-time storage (your variables, BASIC's call stack).

------
rightbyte
Quite dull research wrapped in fancy words.

Essentially they generate functions that map a program state to another
program state, then count those function calls and compare with, e.g., the
number of function calls in quicksort. They "cheat" by feeding in how all
elements compare to their neighbours at each step. Like, if you give the
algorithm that input for free, what's the point?

 _Notably, none of the popular sorting algorithms decide which elements to
swap by looking at the whole input at each execution step. On the contrary,
the decisions are typically made based on local evidence only. ... We
implement this principle for the sorting task by employing a small set of
k (independent of n) index variables, and include in the input the information
about how A[i] compares with A[i+1] and A[i−1] for each index variable i._
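
Concretely, the input they describe amounts to something like this sketch (my
reading of A.1.2; Java, names made up):

    // Sketch of the per-step observation described in the quote: for each of
    // the k index variables i, report how A[i] compares with its neighbours.
    static int[] neighbourDiffs(int[] a, int[] indexVars) {
        int[] obs = new int[indexVars.length * 2];
        for (int k = 0; k < indexVars.length; k++) {
            int i = indexVars[k];
            obs[2 * k]     = (i > 0)            ? Integer.compare(a[i], a[i - 1]) : 0;
            obs[2 * k + 1] = (i < a.length - 1) ? Integer.compare(a[i], a[i + 1]) : 0;
        }
        return obs;  // recomputed by the environment at every step
    }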

~~~
EchoAce
This is absolutely not cheating; every hand-designed algorithm can access and
compare everything in the array!

The real point of the passage you italicized is that giving the net full
access to the array hinders efficiency, so they deliberately restrict it.
Almost the opposite of cheating.

~~~
rightbyte
It is not a pointer to an array I'm concerned about, but the "neighbour diff
vector", or whatever you'd call it, that is provided by the "environment". See
A.1.2.

Doing so many comparisons and storing them has a cost. Also, the model can't
decide when it is done, so at each step the "environment" has to iterate over
the array to see if it is sorted. Are they only counting function calls? I
guess so. The paper is really hard to follow and the pseudocode syntax is
quite maddening.

If I understand the paper correctly, that is; of course I could be wrong.

If so, _"our approach can learn to outperform custom-written solutions for a
variety of problems"_ is bogus.

~~~
YeGoblynQueenne
>> Are they only counting function calls? I guess so.

Oh that, yes, it's true. They're listing "average episode lengths" in tables
1-3, and those are their main support for their claim of efficiency. By
"episode length" they mean the instruction or function calls made during
training by the student agent, which they compare to the instruction/function
calls made by the teacher agent. So, no asymptotic analysis, just a count of
the concrete operations performed to solve e.g. a sorting task.

------
me551ah
According to the limitations stated in the doc:

1\. It runs slower on modern CPUs, and you need neural logic units to see the
speed benefits.

2\. The model needs to be fine-tuned to the data for best performance.

While it may be better than quicksort in certain use cases, it isn't going to
replace it anytime soon.

~~~
sinuhe69
"Model needs the be fined tuned to the data for best performance." -> typical
:D

------
spyckie2
This is pretty cool. Pardon my limited understanding; trying to see if I get
this correctly:

1) They train a model to do sorting, and it actually sorts correctly.

2) They optimize the model for efficiency and it becomes better than many
custom sort functions?

If I remember correctly from school, you can basically speed up any kind of
sorting by using more space (e.g., by using a hash table). Is the neural
network just using the normal sort algo until it sees a pattern in the input
and then skipping the sort algo for the pattern output? Or to put it plainly,
is it just a smarter (or dumber, depending on the size of the neural network)
hash table?

~~~
radicalbyte
The question is how does the algorithm perform on average, what are the
pathological cases and how slow are they?

How do you prove that with an algorithm that no human can reason about?

Basically this paper looks like they've found an algorithm which is efficient
for given sets or given classes of sets. Whether that generalizes is a
different problem.

It's basically a cool automatic heuristic generator.

~~~
seek3r00
Exactly, that’s what I thought. Although, they should be able to look at the
set of instructions that generated the output, which is basically an algorithm
in itself. Then they could try to prove whether that algorithm would really
generalise.

------
woeirua
Eh... Look, I get that everyone wants to publish exciting results, but to call
this unequivocally "better" than quicksort is a big stretch. The profiling
results are unconvincing at best. A small (<10%) improvement could easily, and
much more plausibly, be explained by any number of differences in optimization
between the quicksort implementation they're using and the neural network
library.

Table 3 is very suspicious too. If they're doing what they claim, then why is
Binary Search all of a sudden significantly better than the learned agent for
some problem sizes? It really feels like they're running up against some kind
of inherent optimization boundary inside their neural network library...

------
curiousgal
Better as in a lower number of instructions, not lower wall-clock time.

------
seek3r00
Some interesting extracts:

“The instruction set together with the input representations jointly determine
the class of algorithms that are learnable by the neural controller, and we
see this as a fruitful avenue for future research akin to how current
instruction sets shaped microprocessors.”

“The generalization or correctness we aim for is mostly about generalization
to instances of arbitrary sizes.”

“[...] computing a = f(s) can be more expensive on current CPUs than executing
typical computation employed in the algorithms studied here. We thus hope that
this research will motivate future CPUs to have “Neural Logic Units” to
implement such functions f fast and efficiently, effectively extending their
instruction set, and making such approaches feasible.”

------
richdougherty
Pretty cool.

"We include videos in the supplementary material showing the execution traces
of the learned algorithms compared with the teachers, and observe
qualitatively different behaviors."

Anyone know where we can find these?

~~~
magnio
I think it's this one:
[https://mobile.twitter.com/liyuajia/status/12812613378307112...](https://mobile.twitter.com/liyuajia/status/1281261337830711299?ref_src=twsrc%5Egoogle%7Ctwcamp%5Eserp%7Ctwgr%5Etweet)

------
mabbo
> As highlights, our learned model can perform sorting perfectly _on any input
> data size we tested on_

I'm not going to drop something into my production code that works on just the
inputs you tested with. As ghj pointed out, Java's highly analyzed sort method
had a bug for years and we didn't notice.

It's neat research and I won't discourage it at all, but give me some proofs
of correctness.

------
pmarreck
Can anyone link to experiments in the space of evolving code by mutating it in
a syntactically-correct way, running it, evaluating its results or killing it
if it doesn't complete?

~~~
caballeto
There is a project along these lines at Google, which is used to test graphics
drivers. The generated code is randomly (but equivalently) transformed, and
the result is verified to be the same. If it doesn't match, then there is a
bug. Link to the project:
[https://github.com/KhronosGroup/SPIRV-Tools](https://github.com/KhronosGroup/SPIRV-Tools)
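
The general pattern is metamorphic testing; a minimal Java sketch of the idea
(not the actual SPIRV-Tools pipeline), with a sort standing in for the driver:

    import java.util.ArrayList;
    import java.util.Collections;
    import java.util.List;
    import java.util.Random;

    // Metamorphic testing in miniature: apply an equivalence-preserving
    // transformation to the input and check the outputs still agree.
    class MetamorphicCheck {
        static boolean check(List<Integer> input, Random rng) {
            List<Integer> transformed = new ArrayList<>(input);
            Collections.shuffle(transformed, rng); // must not change the sorted result

            List<Integer> a = new ArrayList<>(input);
            List<Integer> b = new ArrayList<>(transformed);
            Collections.sort(a);                   // reference run
            Collections.sort(b);                   // run on transformed input
            return a.equals(b);                    // a mismatch means a bug
        }
    }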

------
ummonk
Does this work for arbitrary data sorts? If not, beating quick sort is really
easy with bucket sort.
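
For small integer keys, counting sort is O(n + k) versus O(n log n) for any
comparison sort. A quick Java sketch, assuming keys in [0, k):

    // Counting sort for keys in [0, k): O(n + k) time, O(k) extra space.
    static void countingSort(int[] a, int k) {
        int[] counts = new int[k];
        for (int x : a) counts[x]++;          // tally each key
        int pos = 0;
        for (int key = 0; key < k; key++)     // rewrite the array in key order
            while (counts[key]-- > 0)
                a[pos++] = key;
    }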

~~~
ogogmad
> Does this work for arbitrary data sorts?

It's a Comparison Sort, as they say on page 5 - so I think the answer should
be yes.

------
YeGoblynQueenne
Short version:

In summary, this paper makes strong claims of generalisation and efficiency
but does not seem to provide sufficiently strong evidence to back them up;
e.g., it does not perform an asymptotic analysis to support the efficiency
claim, it doesn't show actual program output to allow evaluation of what is
actually learned, and the claim of strong generalisation is based on a very
vaguely described experimental setup.

Longer version below.

Interesting paper. It describes a Reinforcement Learning method to learn
neural network models that imitate the behaviour of agents performing a)
sorting, b) binary search and c) solutions to the NP-complete knapsack
problem. The technique is notable because it learns from program traces also,
not only from input-output pairs. The technique relies on a hard-coded set of
problem-dependent a) input states and b) instruction sets. The latter are what
it sounds like, sets of instructions that form the building blocks of the
learned programs, although in an enhanced version of the technique, functions
are also added to the (less complex?) instructions. Input states basically
include all other information that may be useful to a learner, other than the
instructions and functions from which to build a program. Two different setups
are tested: a) one using the single instruction "swap" (which swaps two
elements), where the input state includes information about the full list, and
b) one with more instructions and localised information that is reported to
generalise better and be more efficient. The latter setup is also augmented
with "functions" which are stated to be more complex than operations, though
the distinction is rather on the vague side (there is some more information in
an appendix but it's not very helpful, I'm still left wondering why an
"instruction" is qualitatively different to a "function").

The paper is poorly organised and requires long appendices to elucidate all
manner of poorly defined concepts in the main paper. This doesn't help when it
comes time to evaluate the two most striking claims in the paper, namely that
the proposed method(s?) a) show strong generalisation and b) outperform the
hand-crafted sorting algorithm quicksort in sorting lists of numbers.

The first claim is difficult to evaluate because the paper presents results of
sorting lists of variable _size_ but does not say anything clear about list
_contents_. For example, while training lists are chosen to have a random
length in [10,20] and with elements in the same range, testing lists have
their length increased, but there is no mention of the values of the elements
in those lists and, going by what is described in appendix C, lists can have
duplicate values. In other words, it's possible that the "strong
generalisation" results refer to sorting lists of variable size with the same
unique elements as in the training lists. In that case, the claim of "strong"
generalisation is not supported.

The second claim is difficult to evaluate for two reasons: a) there is no
example of the training output, i.e. we never get to see what programs,
exactly, the proposed approach is learning; and b) there is no attempt at
asymptotic analysis of the learned models (which would be hard without an
example thereof). Instead, the claim of efficiency seems to rest on two
premises: a) that sorting programs of O(n log n) time complexity are in the
hypothesis space defined by the input states, instruction sets, and functions
for the relevant experimental setup, i.e. the learner _could_ learn an
O(n log n) sorting algorithm; and b) that the number of training steps taken
by the learning agents in imitation of the teacher agents is in the ballpark
of n log n of the input. In other words, the strongest evidence for the
efficiency claim seems to be a count of instruction (or function) calls during
training. So, this very strong claim is also very poorly supported.

~~~
ogogmad
> I'm still left wondering why an "instruction" is qualitatively different to
> a "function"

Functions can be recursive. So the algorithm that the agent generates is able
to use divide-and-conquer.

~~~
YeGoblynQueenne
Yes, but the "functions" in the paper are implemented as two extra types of
_instruction_: one representing a function call and one representing the
"return" point of the function. So again, the difference is not clear. An
instruction is an instruction, but some instructions are function calls (or
returns). Why not just call them all "instructions", or all "functions"?
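
In code terms, my reading is something like this (hypothetical sketch, not the
paper's formalism):

    // Hypothetical sketch of the point above: "functions" are just two more
    // variants in the same instruction set, alongside Swap/MoveVar/AssignVar.
    enum InstrKind { SWAP, MOVE_VAR, ASSIGN_VAR, CALL, RETURN }

    record Instr(InstrKind kind, int arg1, int arg2) {
        boolean isFunctionBoundary() {
            return kind == InstrKind.CALL || kind == InstrKind.RETURN;
        }
    }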

------
distdev89
Coming to a leetcode style interview near you!

------
Marazan
Evolving sorting networks was cool.

This is not cool.

~~~
somjeed
"We don't have a formal proof for this, and only have empirical evidence,
measured on a large number of test instances. More on this in the paper. The
learned algorithm is not entirely opaque. I'd argue it is easier to understand
its behavior than that of a neural net."
[https://mobile.twitter.com/liyuajia/status/12815201991083089...](https://mobile.twitter.com/liyuajia/status/1281520199108308994)

agreed, this is kind of backwards research

------
m3kw9
How will you debug this?

~~~
bokbok8379
With a neural network trained debugger, of course.

------
ultrablack
There are sorting methods faster than quicksort, though: O(n log log n), for
instance, or O(n sqrt(log log n)) randomized.

~~~
ogogmad
Omega(n log n) is the fastest possible for a comparison sort, by a simple
decision tree argument. This is true even for the average case.

Their algorithm is a comparison sort, as they make clear on page 5.
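
Spelled out: a decision tree that sorts n distinct keys needs at least n!
leaves (one per permutation), so its height h, the worst-case number of
comparisons, satisfies:

    2^h \ge n! \implies h \ge \log_2 n! = \sum_{i=1}^{n} \log_2 i
                          \ge \frac{n}{2} \log_2 \frac{n}{2} = \Omega(n \log n)

The average case follows too, since a binary tree with n! leaves has average
leaf depth at least log2(n!).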

