Neural programmer better than Quicksort 282 points by nl on July 10, 2020 | hide | past | favorite | 126 comments

 It's really hard to write correct code.Sorting was broken in java and nobody noticed for a long time: http://envisage-project.eu/wp-content/uploads/2015/02/sortin...The same was true for java's binary search: https://ai.googleblog.com/2006/06/extra-extra-read-all-about...So I am not sure I will ever trust an ML algorithm trained on inputs/outputs only (which is what I think "neural program induction" means). The above bugs are only hit because the programmers who wrote it didn't think about overflow. What is this "neural programmer" assuming about their inputs? We'll never know.OTOH the standard library sort can be improved if you know the distribution of the numbers you're sorting (e.g., small numbers can bucket sort, small lists can use a sorting network, etc). If this thing can efficiently almost-correctly-sort because it's better at these types of pattern matching, we can just run a final pass of insertion sort to make it useful!
 > OTOH the standard library sort can be improved if you know the distribution of the numbers you're sorting (e.g., small numbers can bucket sort, small lists can use a sorting network, etc). If this thing can efficiently almost-correctly-sort because it's better at these types of pattern matching, we can just run a final pass of insertion sort to make it useful!I haven't read this paper, but that strategy was exactly the strategy behind "The Case for Learned Index Structures"[0]. The idea was you could train a model to predict the location of a row based on a primary key. After computing the model, you could then go through every row and compute a bound on how far off the model is from the actual row. To perform a lookup, you start off at the model's guess, then look around based on the bound on the model for a row with the given primary key.
 > If this thing can efficiently almost-correctly-sort because it's better at these types of pattern matching, we can just run a final pass of insertion sort to make it useful!Depends on what kinds of mistakes the almost-correct approach makes. If it just puts elements in the wrong order, your suggestion works. But if it makes mistakes like duplicating elements, or dropping them, or completely making up new entries, no post-processing will help.> OTOH the standard library sort can be improved if you know the distribution of the numbers you're sorting (e.g., small numbers can bucket sort, small lists can use a sorting network, etc).Even better: the standard library sort usually assumes only comparison. For special cases like sorting on integer keys you can conceivably do better.
 > Depends on what kinds of mistakes the almost-correct approach makes. If it just puts elements in the wrong order, your suggestion works. But if it makes mistakes like duplicating elements, or dropping them, or completely making up new entries, no post-processing will help.The algorithm modifies the original list only by swapping entries. As such, it's guaranteed to result in a permutation.
 Yes, in that case the post-processing would work.But it also limits their system to in-place algorithms only.
 Why is this limitation? Do you think the algorithm can be improved if it was doing a copy to a new location?
 If you let the number of objects to sort go to infinity while keeping size of objects in bits capped, then they will behave like integers.
 I thought more of being able to do bit manipulation or hashing etc on your keys. Those things don't necessarily require a fixed key length. (And in some languages, even integers don't have a fixed length.)You are right in some sense, but you'd still have trouble implementing something like bucket sort in your setting, if all you have are comparisons.
 IMO its not about replacing deterministic algorithms used in van Neuman machines (even though that would be a nice side effect).I feel that if we ever want to build an intelligent problem solver machine, we need some facility for abstract reasoning and work like this papet is a major contributor towards that goal. (if its reproducible and valid ofc...)
 > I am not sure I will ever trust an ML algorithm trained on inputs/outputs onlyIt’s doing TDD!
 Imagine a world where all programmers do is write the tests and the generalized learning algos write all the implementations.
 No need to imagine. That would just be Declarative Programming going mainstream?Its just that algorithms like the Prolog’s WAM or the reet used in rules are replaced with neural nets and the like.Rules use the same When Then format acting on a Given fact base. Prologs Horn Clauses can easily be translated into that form.BDD writes all tests in the format Given When Then.It’s already here.
 makapuf on July 10, 2020 [–] Welcome to my company a few years ago. Except they were not programmers but PM and architects, and algos were subcontractors. Now we have real developers (I'm in one of those teams)
 That's such a big pile of yikes, I'm morbidly drawn to see that codebase.
 I'm pretty sure the generated code will be an incomprehensible mess that passes precisely those tests and nothing more, e.g. it would correctly sort arrays [1,3,2] and [3,2,1], but not [2,1,3]; and obviously not [1,4,2], because the specs didn't mention number 4.
 Next step is to prove the ML-produced algorithm is correct. I can see a world where we generate algorithms and then automatically prove they're correct based on a formal description of the problem being solved.
 It doesn't need to be 100% correct. There's a place for fast sorting algorithms which are correct most of the time - basically anything non-critical, like sorting comments by upvotes.According to the article, they couldn't find a case where it wasn't correct.
 > There's a place for fast sorting algorithms which are correct most of the time - basically anything non-critical, like sorting comments by upvotes.Nothing saddens me more than this trend of the modern web where everything works semi-probabilistically (even if it's likely for "good" technical reasons, such as, "we wrote our server backend in Ruby which is slow as molasses so now we need 230 CDN and 800 databases instances around the whole world and transformed our simple centralized problem into an horrendous decentralized one).The central reason for me to use computers is that they are (or at least were) deterministic to a much higher degree that normal life, and so many things in the 10 last years becoming much more non-deterministic in particular on social websites is something that frustrates me every single day as it just makes the whole experience and process of using computers & the web very unreliable compared to what it used to be.
 Jim Keller once said randomness was baked in the modern computer chip design. The execution routes were almost always different even if you ran the same deterministic code for multiple times. But in the end you got the same result almost every single time (“almost” has many 9s after the decimal point). But this layer of operations has been abstracted away from most of us. So don’t feel too sad... maybe the world has been like this for a long time
 Today's perfections are yesterday's "good enoughs". Don't be sad for the trend. In 10 years, you will have new perfections to enjoy.
 Like the probabilistic bank account balance. "You have between \$10 and \$1000 with the greatest likelihood being \$537 (52% chance)."
 There was this thread a while ago with HSBC switching to MongoDB... so, yeah, distinct possibility :-)https://news.ycombinator.com/item?id=23507197(The article was very light on details though, and it was probably just one team that consolidated to MongoDB, not the accounts itself... one hopes.)
 adonese on July 11, 2020 [–] A friend of mine went to a doctor, and after the medical lab the doctor told him you either have malaria or typhoid.This trend seems to be everwhere
 ojbyrne on July 10, 2020 [–] I think the good technical reasons are quite often the CAP theorem.
 I know, but my argument is that a lot of time (not always, of course) the "distributed" issues could just be solved by not building a distributed system in the first place and just having more efficient tech stacks & a few big servers.Stackoverflow was just using 4 db servers in 2016 (https://nickcraver.com/blog/2016/02/17/stack-overflow-the-ar...)
 adrianN on July 10, 2020 [–] We suck really hard at proving code correct that was written explicitly with the goal in mind to be formally verified. The techniques there are still quite immature. I wonder how long it will take until we have a reasonably working pipeline of ML+formal verification. Maybe it's just easier to generate correct code from the spec than to learn it from examples and then verifying it with the specification?
 "We suck really hard at proving code correct that was written explicitly with the goal in mind to be formally verified."That is why I am not a huge fan of proving code in the first place. If you have a really complicated proof of formal correctness, then who guarantees you, there is no misstake in the proof itself? At least this is what I have seen in university, lots of complicated stuff on the whiteboard, with a result. And then someone figured out, it was wrong.So I rather do lots of testing, for all known test cases.
 Your automatic proof checker guarantees that there is no mistake in the proof. That's the easy part! The real question is who guarantees that your specification is what you actually want?
 Automatic proof checkers still have bugs in them. However, one nice characteristic is that, since its algorithms are so general, i.e. about the language rather than about the problem you're solving (e.g. sorting), any bugs in the checker are either so common they hit every problem solution and are easily discovered and fixed, or so rare that it affects no real-world problem solutions.
 The kernel of an automatic proof checker is much easier to test and verify than the programs you verify using a proof checker though.
 +1 verification tends to be the easy part.
 Sure, not all bugs are kernel bugs though.
 louthy on July 10, 2020 [–] > So I rather do lots of testing, for all known test cases.How is this any different? You can still make mistakes in your test code, test data, omit cases that would fail, etc. I think I'd be less confident in a set of unit tests than I would with an automated proof checker.
 Tests are simple to unterstand.Complicated proofs I do not understand, without much effort.And I surely know that you can make mistakes with test cases as well. So I surely do not claim my way to be superior. But it works way better for me.
 Test cases are not necessarily simple to understand.Especially the test cases needed to trigger some deep bug.
 And so aphyr gets paid, for instance. You may not even be able realistically to run a test case. The laws of life say your 10000 core compute job will fail non-deterministically only after a day or so. It would clearly be helpful to have at least some formal guarantees, such as liveness at some level.
 Well designed ML techniques are always correct 99% of the time. Its the 1% that is the problem.
 99% of the time it works all of the time.Like humans.How do we deal with problematic humans?Retraining or replacement.
 No, we deal with 99% accuracy in humans by designing all our human-based algorithms to be resilient against mistake (and still mistakes in the end result happen very frequently). This is completely different from the way we have designed our computer systems, because it's way easier to do this with flexible agents like humans than with inflexible agents like computers.
 So, the mistakes algorithms make are different to the mistakes humans make?That’s... equivocation.Both humans and ML algorithms are flexible. That is the point of “learning”.Adaptation.
 But the problems where we apply human labour are vastly different from the ones where we apply machine labour. In (most) tasks where we apply human labour a few errors are tolerated.
 This seems irrelevant.Neither humans nor ML make zero errors.Ceteris paribus, if an ML algorithm makes fewer errors at a task which with low error tolerance - you would use the algorithm instead of the human, no?
 I would expect that might depend on what sort of errors each make, no?
 That is not practical when you expect the system to be correct 100% of the time and make decisions based on that. There are many situations where this is critical. You would never want your car's safety system to be correct only 99% of the time.
 If you expect 100% correctness you are not a very practical man.Perfect is the enemy of good enough.
 And good enough is the enemy of perfection.I suppose it depends on your end goals.
 Over-achievement of my goals is not my goal.
 fiblye on July 10, 2020 [–] With humans, we can ask what went wrong and fix the problem.With black box algorithms, we throw in some new training data and just hope it's enough.
 > With humans, we can ask what went wrong and just hope we've fixed the problem.> With black box algorithms, we throw in some new training data and just hope it's enough.A small wording change and they're equivalent again.To some extent human beings are also a black box - with some very peculiar failure conditions and side-channel weaknesses.
 No, we don't do anything against the humans who are 99% correct. 99% is good enough in most real world cases.
 what's the promote option in this analogy?
 Call me naive but, wouldn't feeding it random input and expected output sorted by existing and proven (but slower) sorting algorithms do the trick?
 For an exact proof of correctness no, because that would only show the algorithm is correct for a selected set of input-output pairs.However, it is true that validating it for a very large number of inputs and outputs has value. This is not uncommon even in mathematics. For example, Goldbach's Conjecture (https://www.wikiwand.com/en/Goldbach%27s_conjecture) has not been proven but has been shown to hold for all integers less than 4 × 10^18
 exdsq on July 10, 2020 [–] This is done! The existing correct algorithm is called the oracle and it’s used to validate the generated algorithm.There’s some cool work on using neural program induction in compilers to try generate faster implementations based off the ‘oracle’ program that’s being compiled.
 > neural program induction in compilers to try generate faster implementationsCan u provide more details?I am familiar with garden-variety ML-optimization-of-combinatorical-space in compilers.How would u even "measure" the expected performance unless these are gradual greedy changes
 It would be funny if a two phase approach to sorting would result in increased average performance.First sort using the ML algorithm. Then sort the almost sorted result with a traditional algorithm optimized for almost sorted results.In fact, I think my own brain does the two phase approach to many problems.
 Could you train the ML algorithm to select the best traditional algorithm based on the distribution of the items being sorted?!
 > The same was true for java's binary searchI used to ask "implement binary search" as an interview question and gave up on using it because not a single candidate could do it correctly in 45 minutes. This was at a FAANG.
 Ha, I can get an offer from any of the FAANG without even trying and can comfortably say that not only I failed the binary search question in past when interviews were 1 hour long, I would still fail it now.
 I don't believe you. Surely, after a few hints, the FAANG canidates must have been able to figure out the classic gotcha of this question (why does this sometimes overflow and how do you fix {high + low / 2})If they didn't, than I believe you did a poor job of guiding them. I refuse to believe that this is some esoteric, tough question. If you can solve dynamic programming problems, you can easily solve this one.
 I interview tons of candidates at a FAANG company, and I can tell you for sure binary search is way too hard to get right on a whiteboard. Even in a proper coding environment I wouldn't ask it. There are lots of better questions with fewer sharp edges.
 What else is hard besides converting(high + low) / 2ToLow + (high - low) / 2What else about this problem is a "sharp edge"?
 I may have been a little too harsh. Binary search is close to being a good interview question, especially since the algorithm does, rarely, come up in real code. But I prefer questions that are more like what people will be doing in their day-to-day job, and binary search is just a little too fiddly.
 Me: "Show me a binary search in C on the whiteboard."Candidate: bsearch(key, base, num, size, compare_fn);Me: "Your offer should arrive within one to two weeks. Need any help with relocation expenses?"
 100% this. Maybe not everyone knows the specific function you’re looking for, but for the ones that know standard library APIs really well, that’s your hire.
 User23 on July 10, 2020 [–] I wouldn’t say so. In theory you’d expect a FAANG engineer to be able to construct a correctly terminating loop by reasoning about the loop invariant, but in practice every single candidate gets lost in guesswork and messing up the edge cases and termination condition.
 Apparently modern GPUs have sorting-networks implemented in hardware (giving you effectively constant-time sorting for fixed-size input) - which may be preferable for use if you’re processing hostile input which may be contrived to hit QuickSort’s worst-case and the cost of hitting the PCI Express bus is less than the cost of randomizing the input to mitigate QuickSort attacks.But yes - I agree it’s weird that many standard-libraries (looking at you, .NET!) have only a single `Sort` method oh their vector-types with the details of the implementation buried in the documentation instead of shipping with a first-class sorting-algorithms library.
 GPUs having hardware explicitly for sorting sounds a bit odd, when you can efficiently implement sorting networks in software on the same GPU.If you have a source, I would like to learn more.
 > "just run a final pass of insertion sort to make it useful!"(And if this insertion sort won't complete in O(n), say <= 3n, then, fallback to Quicksort, so won't turn into O(n^2) just because the ML alg hit a corner case)
 Possibly better would be Timsort (used in Python, Java, V8, etc.), as it's "designed to take advantage of runs of consecutive ordered elements that already exist": https://en.wikipedia.org/wiki/Timsort
 Haskell's standard library sort, a variant of merge sort, also makes use of runs.It runs in O(n log (number of runs)). It's quite a lot simpler than Timsort.
 >> So I am not sure I will ever trust an ML algorithm trained on inputs/outputs only (which is what I think "neural program induction" means).According to the authors, the proposed approach is notable for training not only on input/output pairs but also on program traces. In addition, reinforcement learning is used to train e.g. sorting agents on the behaviour of "teacher agents" that comprise hand-crafted implementations of well-known sorting algorithms (bubble- insertion- and quick-sort).
 >So I am not sure I will ever trust an ML algorithm trained on inputs/outputs only (which is what I think "neural program induction" means).Pfffft, just wait until they get it writing code with its own proofs of correctness in dependent type theory. It'll be correct and twice as unreadable as incorrect human-written code!
 Why not check in \$n\$ on a sorted if the postcondition holds or raise an exception otherwise. You can even generate a nearly sorted list and run a verified bubblesort. Don't see big problems if ML is used carefully.
 Checking that the output is sorted is the easy part.Checking that the output is a (optionally stable) permutation of the input is the hard part. You can't do that dynamically if you have, for example, overwritten your input by sorting in place. And making a copy of your input is only going to be affordable in very specific and small instances on which you might as well just run a well-understood algorithm.
 It looks like their algorithm can only modify the input list by swapping elements. This is guaranteed to result in a permutation.
 The code (which is a few percent faster than quicksort) is on the last page:`````` procedure QUICKSORTAGENT(input state) 2: Let i = 1, j = 2, l = 3, h = 4 3: if FunctionID = None then 4: return vh ← Function1(vl ← vl, vh ← vh) 5: else if FunctionID = 1 then . QuickSort 6: if vl < vh then 7: if prev = None then 8: return vi ← Function2(vl ← vl, vh ← vh) 9: else if prev = (vi ← Function2(vl ← vl, vh ← vh)) then 10: return AssignVar(j, i) 11: else if prev = AssignVar(j, i) then 12: return MoveVar(i, -1) 13: else if prev = MoveVar(i, -1) then 14: if vi > vl then 15: return vi ← Function1(vl ← vl, vh ← vi) 16: else 17: return MoveVar(j, +1) 18: end if 19: else if prev = (vi ← Function1(vl ← vl, vh ← vi)) t hen 20: return MoveVar(j, +1) 21: else if prev = MoveVar(j, +1) and vj < vh then 22: return vh ← Function1(vl ← vj , vh ← vh) 23: else 24: return Return(h) 25: end if 26: else 27: return Return(h) 28: end if 29: else . Function ID = 2, Partition 30: if prev = None then 31: return AssignVar(i, l) 32: else if prev = AssignVar(i, l) then 33: return AssignVar(j, l) 34: else if vj < vh then 35: if prev = Swap(i, j) then 36: return MoveVar(i, +1) 37: else if (prev = AssignVar(j, l) or prev = MoveVar(j, +1)) and A[vj ] < A[vh] then 38: if vi 6= vj then 39: return Swap(i, j) 40: else 41: return MoveVar(i, +1) 42: end if 43: else 44: return MoveVar(j, +1) 45: end if 46: else if prev = MoveVar(j, +1) then 47: return Swap(i, h) 48: else 49: return Return(i) 50: end if 51: end if 52: end procedure``````
 This reminds me of programming basic on an Acorn Electron. Line numbers and I don't remember indentation, though it was a few decades ago.
 The Electron had a new enough version of BBC Basic that "proper" procedures and UDFs were supported (in a limited fashion, but well enough considering the constraints of the machine) so you could pretty much ignore line numbers, in fact by using a text editor and TYPEing the result into BASIC you could write code without them (the interpreter needed them, the TYPE trick made it put them in for itself).But yes, this was uncommon so line numbers were a major thing.And indentation was optional, you could add extra spaces to the start of lines and they would be kept and would have no effect on execution, but you would be wasting precious bytes of RAM and if using one of the heavier display modes (0, 1 or 2) you would only have 8.5Kbyte to play with for all the code and run-time storage (your variables, BASIC's call stack).
 The source code in the PDF has indentation, it is a bit easier to read.
 Was this "discovered" by their system or was it hand-crafted by the authors from their custom instruction set elements ?
 No, that is the training agent implemented as an instance of quicksort. I don't know why the OP reports that it's "a few percent faster than quicksort" because I can't find where that is claimed in the paper.In fact, as far as I understand it the paper claims that the learned model (i.e. the student agent) has learned to reproduce the behaviour of this teacher agent after training for a smaller number of steps on average than what the teacher needs to sort a list. That is what the entire claim about superior efficiency rests on (there is no example of a learned model and no asymptotic analysis of the program such a model represents).
 As I understand it they designed quicksort like that to be able to train with it. It is quite clear from the video where it is called "quick sort agent" compared to the model one and function1 and function2 is in the stack trace."We found adding the previously executed action to the input state st is sufficient to handle dis-ambiguation for this quick sort implementation. Alg. 8 shows the converted quick sort scripted agent"
 Right! I was hoping to see the body of one of their system induced algorithms. They don't seem to have included any - or maybe I am mistaken and their system is opaque and does not allow a generated algo. to be read out...
 Quite dull research wrapped in fancy words.Essentially they generate functions that map a program state to another program state and counts those function calls and compares with e.g. the number of function calls in quick sort. They "cheat" by feeding how all elements compares the their neighbours at each step. Like, if you give the algorithm that input for free what's the point.Notably, none of the popular sorting algorithms decide which elements to swap by looking at the whole input at each execution step. On the contrary, the decisions are typically made based on local evidence only. ... We implement this principle for the sorting task by employing a small set of k(independent of n) index variables, and include in the input the information about how A[i] compares with A[i+ 1] and A[i−1]for each index variable i.
 >> Like, if you give the algorithm that input for free what's the point.The point, as also explained in the paper, is that even with such information and with a handful of basic operations from which to compose a program, the space of all possible programs is huge and finding a prorgam that does what it should in any reasonable amount of time is extremely difficult.If you are familiar with the literature of program induction (which includes a lot more than neural program induction and goes back a few decades) you will notice that the hypothesis spaces for program learning are truly vast and it is impossible to make any kind of progress without some kind of "dirty trick" to reduce their size. The common approach is to impose a strong inductive bias on the representation language, so as to guide the search to find a correct program. The approach in the article is no different. And of course "inductive bias" here always means domain knowledge.But, amicably, if you think there's any better way to do this, many people, myself included, would be very curious to hear what you have to say. In fact, if you could really innovate in this field you'd be guaranteed an illustrious research career. Learning programs from examples is hard.
 This is absolutely not cheating; every hand designed algorithm can access and compare everything in the array!The real point of what they’re saying in your italicized quote is actually that giving the net full access hinders efficiency, so they actually restrict it. Almost like the opposite of cheating.
 It is not a pointer to an array I'm concerned about but the "neighbour diff vector" or what you should call it that provided by the "environment". See A.1.2.Doing so many comparisons and storing them has a cost. Also the model can't decide if it is done so at each step the array has to be iterated to see if it is sorted by the "environment". Are they only counting function calls? I guess so. The paper is really hard to follow and the pseudocode syntax is quite madding.If I understand the paper correctly of course I could be wrong.If so, "our approach can learn to outperform custom-written solutions for a variety of problems", is bogus.
 >> Are they only counting function calls? I guess so.Oh that, yes, it's true. They're listing "average episode lengths" in tables 1-3 and those are their main support for their claim of efficiency. By "episode length" they mean instruction or function calls made during training by the student agent which they compare to the instructions/function calls by the teacher agent. So, no asymptotic analysis, just a count of concrete operations performed to solve e.g. a sorting task.
 > They "cheat" by feeding how all elements compares the their neighbours at each step. Like, if you give the algorithm that input for free what's the point.How is that "cheating"? The algorithm is still learning how to do a comparison-sort, which has direct applications (assuming it performs well).
 So do they feed such information only when training the system?
 You can generate that input only when needed. So whenever the sorting algorithm asks whether A[i] > A[i+1], you can perform the comparison there and then. So the answer to your question is yes.
 The input is always needed, the neural net is an iterative algorithm that views the "local information" gathered from the array on every time step. It's asking continuously whether e.g. `A[i] > A[i+1]`.
 No, the neural net needs to receive a feature vector before each instruction in order to sort the array. So whenever you want to sort things, you need to create that vector over and over.
 > none of the popular sorting algorithms decide which elements to swap by looking at the whole input at each executionYes. In particular, the O(n log n) lower bound holds for comparison sorts, that is algorithms that rely on pairwise comparison only.
 "all" of a constant number k of such inputs, not "all" of the entire input. It's not possible to feed a neural network an arbitrarily-large number of inputs in one batch.
 My error it seemed to be a reference implementation with all. Is it correct that they feed a vector of 68 such comparisons in the window inferface?
 According to the limitations stated in the doc.1. It runs slower on modern CPUs and you need neural logic units to see the speed benefits2. Model needs the be fined tuned to the data for best performance.While it may be better than quicksort in certain usecases it isn't going to replace it anytime soon.
 "Model needs the be fined tuned to the data for best performance." -> typical :D
 This is pretty cool, pardon my limited understanding, trying to see if I get this correctly:1) they train a model to do sorting, it actually sorts correctly. 2) they optimize the model on efficiency and it becomes better than many custom sort functions?If I remember correctly from school, you can basically speed up any kind of sorting by using more space (basically by using a hash table). Is the neural network just using the normal sort algo until it sees a pattern in the input and then skips the sort algo for the pattern output? Or to put it plainly, is it just a smarter (or dumber, depending on the size of the neural network) hash table?
 The question is how does the algorithm perform on average, what are the pathological cases and how slow are they?How do you prove that with an algorithm that no human can reason about?Basically this paper looks like they've found an algorithm which is efficient for given sets or given classes of sets. Whether that generalizes is a different problem.It's basically a cool automatic heuristic generator.
 Exactly, that’s what I thought. Although, they should be able to look at the set of instructions that generated the output, which is basically an algorithm in itself. Then they could try to prove whether that algorithm would really generalise.
 collyw on July 10, 2020 [–] > How do you prove that with an algorithm that no human can reason about?This is something I see becoming a problem in the future for machine learning.
 The model is given the input and a set of instructions (e.g. swap) for producing the output. Essentially, it’s taking a permutation of the instructions that minimise the running time, based on some underlying patterns in the data.
 Eh... Look I get that everyone wants to publish exciting results, but to call this unequivocally "better" than quick sort is a big stretch. The profiling results are unconvincing at best. A small, < 10% improvement, could easily (and much more likely) be explained by any number of possible differences in optimization between the quicksort implementation they're using and the neural network library.Table 3 is very suspicious too. If they're doing what they claim, then why is Binary Search all of a sudden significantly better than the learned agent for some problem sizes? It really feels like they're running up against some kind of inherent optimization boundary inside their neural network library...
 Better as in lower number of instructions not lower wall clock time.
 Some interesting extracts:“The instruction set together with the input representations jointly determine the class of algorithms that are learnable by the neural controller, and we see this as a fruitful avenue for future research akin to how current instruction sets shaped microprocessors.”“The generalization or correctness we aim for is mostly about generalization to instances of arbitrary sizes.”“[...] computing a = f(s) can be more expensive on current CPUs than executing typical computation employed in the algorithms studied here. We thus hope that this research will motivate future CPUs to have “Neural Logic Units” to implement such functions f fast and efficiently, effectively extending their instruction set, and making such approaches feasible.”
 Pretty cool."We include videos in the supplementary material showing the execution traces of the learned algorithms compared with the teachers, and observe qualitatively different behaviors."Anyone know where we can find these?
 I think it's this one. https://mobile.twitter.com/liyuajia/status/12812613378307112...
 > As highlights, our learned model can perform sorting perfectly on any input data size we tested onI'm not going to drop something into my production code that works on just the inputs you tested with. As ghj pointed out- Java's highly analyzed sort method had a bug for years and we didn't notice.It's neat research and I won't discourage it at all, but give me some proofs of correctness.
 Can anyone link to experiments in the space of evolving code by mutating it in a syntactically-correct way, running it, evaluating its results or killing it if it doesn't complete?
 There is a project along these lines at Google, which is used to test graphic drivers. The generated code is randomly (but equivalently) transformated, and the result is verified to be the same. If it doesn't match, then there is a bug. Link to the project https://github.com/KhronosGroup/SPIRV-Tools.
 Does this work for arbitrary data sorts? If not, beating quick sort is really easy with bucket sort.
 > Does this work for arbitrary data sorts?It's a Comparison Sort, as they say on page 5 - so I think the answer should be yes.
 Short version:In summary, this paper makes strong claims of generalisation and efficiency but does not seem to provide sufficently strong evidence to back them up, e.g. it does not perform an asymptotic analysis to support the efficiency claim, it doesn't show actual program output to allow the evaluation of what is actually learned and the claim about strong generalisation is based on a very vaguely described experimental setup.Longer version below.Interesting paper. It describes a Reinforcement Learning method to learn neural network models that imitate the behaviour of agents performing a) sorting, b) binary search and c) solutions to the NP-complete knapsack problem. The technique is notable because it learns from program traces also, not only from input-output pairs. The technique relies on a hard-coded set of problem-dependent a) input states and b) instruction sets. The latter are what it sounds like, sets of instructions that form the building blocks of the learned programs, although in an enhanced version of the technique, functions are also added to the (less complex?) instructions. Input states basically include all other information that may be useful to a learner, other than the instructions and functions from which to build a program. Two different setups are tested: a) one using the single instruction "swap" (that swaps two elements) and where the input state includes information of the full list, and b) one with more instructions and localised information that is reported to generalise better and be more efficient. The latter setup is also augmented with "functions" which are stated to be more complex than operations, though the distinction is rather on the vague side (there is some more information in an appendix but it's not very helpful, I'm still left wondering why an "instruction" is qualitatively different to a "function").The paper is poorly organised and requires long appendices to elucidate all manner of poorly defined concepts in the main paper. This doesn't help when it comes time to evaluate the two most striking claims in the paper, namely that the proposed method(s?) a) show strong generalisation and b) outperform hand-crafted sorting algorithm quick-sort in sorting lists of numbers.The first claim is difficult to evaluate because the paper presents results of sorting lists of variable _size_ but does not say anything clear about list _contents_. For example, while training lists are chosen to have a random length in [10,20] and with elements in the same range, testing lists have their length incrased, but there is no mention of the values of the elements in those litss and, going by what is described in appendix C, lists can have duplicate values. In other words, it's possible that the "strong generalisation" results refer to sorting lists of variable size with the same unique elements as in the training lists. In that case, the claim of "strong" genearlisation is not supported.The second claim is difficult to evaluate for two reasons: a) there is no example of the training output, i.e. we never get to see what progams, exactly, the proposed approach is learning; and b) there is no attempt to asymptotic analysis of the learned models (which would be hard without an example thereof). Instead, the claim of efficiency seems to rest on two presmises: a) that sorting programs of O(n log n) time complexity are in the hypothesis space defined by the input states and instruction sets and functions for the relevant experimental setup, i.e. the learner could learn an O(n log n) sorting algorithm; and, b) that the number of training steps taken by the learning agents in imitation of the teacher agents is in the ballparck of n log n of the input. In other words, the strongest evidence for the efficiency claim seems to be a count of instructions (or function) calls during training. So, this very strong claim is also very poorly supported.
 > I'm still left wondering why an "instruction" is qualitatively different to a "function"Functions can be recursive. So the algorithm that the agent generates is able to use divide-and-conquer.
 Yes, but the "functions" in the paper are implemented as two extra types of instruction- one representing a function call and one representing the "return" point of the function. So again, the difference is not clear. An instruction is an instruction, but some instructions are function calls (or returns). Why not just call them all "instructions", or all "functions"?
 Coming to a leetcode style interview near you!
 Evolving sorting networks was cool.This is not cool.
 "We don't have a formal proof for this, and only have empirical evidence, measured on a large number of test instances. More on this in the paper. The learned algorithm is not entirely opaque. I'd argue it is easier to understand its behavior than that of a neural net." https://mobile.twitter.com/liyuajia/status/12815201991083089...agreed, this is kind of backwards research
 How will you debug this?
 With a neural network trained debugger, of course.
 There are sorting methods faster than quicksort though. O(n loglog n) for instance. Or for randomized its O(n sqrt log log n).
 Omega(n log n) is the fastest possible for a comparison sort, by a simple decision tree argument. This is true even for the average case.Their algorithm is a comparison sort, as they make clear on page 5.

Search: