
An O(N) Sorting Algorithm: Machine Learning Sorting - akaryocyte
https://arxiv.org/abs/1805.04272
======
titzer
Clickbaity paper title. That raised my hackles a bit.

This is basically bucket sort with machine learning thrown in. It is _not_
O(N). O(N) has a very specific mathematical meaning; do not abuse it. It means
there exists some fixed K such that for every N and every input of size N, the
running time is <= K*N. This algorithm does not have that property, so it is
not O(N). It might have expected linear running time, but it is not O(N).
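To spell that definition out in symbols, with T the running time:

    \exists K > 0 \;\; \forall N \;\; \forall\, \text{inputs of size } N : \quad T \le K \cdot N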

From the conclusion: "For some distributions with less smoothness, it might be
difficult to learn the distribution, and if the training fails...the algorithm
might fail."

Grr.

~~~
madsohm
Also, they end up with an O(N*M) algorithm. They claim M is small, but it
needs to be strictly smaller than log N if they want to beat O(N log N). A
terabyte is 10^12 bytes, so log(N) is about 12. They have 100 hidden neurons,
which is their M. So their algorithm isn't asymptotically faster unless
N > 10^100 with that number of hidden neurons.
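Spelling that out (using the base-10 logarithms above):

    N \cdot M < N \log N \iff M < \log N, \qquad M = 100 \;\Rightarrow\; \log_{10} N > 100 \iff N > 10^{100}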

~~~
hvidgaard
For 32-bit ints that's on the order of 3.6e88 TB... To put that into
perspective, the visible universe contains an estimated 10^82 atoms at most,
so even if every single atom held a petabyte of data and you wanted to sort
all the atoms and all of their data, that's still only about 10^85 TB, well
short of the crossover. I wish they had that discussion in the paper.
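A quick sanity check of those figures, assuming 4-byte ints and binary
terabytes (2^40 bytes):

    # N > 10^100 four-byte ints, expressed in binary terabytes:
    print(f"{4 * 10**100 / 2**40:.1e}")       # ~3.6e+88 TB needed
    # 10^82 atoms holding 1 PB (10^15 bytes) each:
    print(f"{10**82 * 10**15 / 2**40:.1e}")   # ~9.1e+84 TB available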

------
yorwba
This seems like a fairly natural generalization of the idea in "The Case for
Learned Index Structures" [1]: speed up some classical algorithm based on
heuristics by modeling the data with a neural network that can be tuned to the
specific task that the general algorithm is applied to.

That said, the O(N) claim is purely sensational. It all depends on being able
to determine the distribution parameters with sufficient accuracy in O(N)
time, which isn't really guaranteed for all kinds of data you might want to
sort.

[1] [https://arxiv.org/abs/1712.01208](https://arxiv.org/abs/1712.01208)

------
CarolineW
I've only briefly skimmed the paper, but here's something I've come up with
based on what I read. It might or might not be the same as what they've said,
but I think it "has legs":

* Randomly choose some fixed sized subset of the data and sort it;

* Based on that, deduce the distribution of the data;

* Based on that, create a destination array, and copy things from the input to roughly the expected right place in the output;

* Run around and fix things.

When sorting numbers, that feels like it has a chance of running quite
quickly. For other objects with some other comparison function, I'm not sure
how it would work.
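A minimal sketch of that procedure in Python, assuming the sorted sample's
empirical CDF is a good enough stand-in for the learned distribution (the
names are mine, not the paper's):

    import random
    from bisect import bisect_right

    def distribution_sort(data, sample_size=1000):
        n = len(data)
        # 1. Randomly choose a fixed-size subset of the data and sort it.
        sample = sorted(random.sample(data, min(sample_size, n)))
        # 2./3. Use the sorted sample as an empirical CDF to predict each
        # item's rough position in the output, bucketing collisions.
        buckets = [[] for _ in range(n)]
        for x in data:
            cdf = bisect_right(sample, x) / len(sample)  # estimated F(x)
            slot = min(int(cdf * n), n - 1)              # predicted rank
            buckets[slot].append(x)
        # 4. "Run around and fix things": if the estimate was good, each
        # bucket is tiny and tidying it up is cheap.
        out = []
        for bucket in buckets:
            out.extend(sorted(bucket))
        return out

    data = [random.gauss(0.0, 1.0) for _ in range(100_000)]
    assert distribution_sort(data) == sorted(data)

The output is always correct regardless of the estimate, because the predicted
slot is monotone in x; the quality of the estimate only affects how much work
the fix-up step does.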

~~~
avip
What's the asymptotic complexity of running around and fixing things?

~~~
dspillett
That stage is presumably a more traditional sort operation. If the first stage
does a good job of estimating the correct order, this could also be O(n), or
as close to it as makes no odds. If the set isn't modelled properly by the
earlier stages it will still be O(n log n). Obviously that operation needs to
be an algorithm that can take advantage of partially pre-sorted input.
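Insertion sort is the textbook example of such an adaptive algorithm: it runs
in O(n + d), where d is the number of inversions, so a nearly-sorted pass out
of the first stage costs close to O(n). A minimal sketch:

    def insertion_sort(a):
        # Runs in O(n + d), d = number of inversions: each element only
        # shifts left past the items it is actually out of order with.
        for i in range(1, len(a)):
            x = a[i]
            j = i - 1
            while j >= 0 and a[j] > x:
                a[j + 1] = a[j]
                j -= 1
            a[j + 1] = x
        return a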

------
dspillett
Looks interesting after a quick skim, and for simpler datasets "machine
learning" shouldn't be needed, just some more traditional statistical
analysis. I'll have to find time to read it fully and play with the idea.

It'll be interesting to see how the method fares with messy real-world data
over many examples. O(n) is obviously the best case, which is also the best
case of the infamous bubble sort if early exit is implemented; how commonly it
achieves that performance, or close to it, is the key question. It seems to be
a two-stage process: the first stage more-or-less sorts the input using a
simple model of the expected distribution, and the second is a tidying
exercise that is essentially a more traditional sort.
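To illustrate the "traditional statistical analysis" point: if the data is
roughly normal, a plain mean/standard-deviation fit and the closed-form normal
CDF could play the role of the network. A sketch under that assumption (mine,
not the paper's):

    import math
    import statistics

    def normal_rank_predictor(data):
        # Fit a normal distribution with ordinary statistics...
        mu = statistics.fmean(data)
        sigma = statistics.stdev(data) or 1.0
        n = len(data)
        # ...and use its CDF to predict each element's sorted position.
        def predicted_rank(x):
            cdf = 0.5 * (1.0 + math.erf((x - mu) / (sigma * math.sqrt(2.0))))
            return min(int(cdf * n), n - 1)
        return predicted_rank

The predicted ranks would then feed the same bucket-and-tidy second stage as
any other distribution model.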

------
phzhq
I'm the author of this paper. I didn't realize it had caused such intense
discussion here until I heard from my friends. I really appreciate your
advice. I would like to address some questions about the paper.

First, this method is not comparison-based; I am well aware that the lower
bound for comparison sorting is N log N. The idea of this algorithm is that
when we sort big data, the data may have statistical properties precisely
because it is so large, and if we take a sample of it, the sub-dataset should
exhibit the same properties. This is the basic assumption. I know this
assumption is strong, but it is reasonable for big data. As described in the
paper, we transform the sorting problem into a mapping problem: we learn the
cumulative distribution and predict each element's position in the eventual
sorted sequence.
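In symbols: if \hat{F} is the learned cumulative distribution function and N
is the number of elements, the predicted position of an element x in the
sorted sequence is

    \hat{r}(x) = \lfloor N \cdot \hat{F}(x) \rfloor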

Second, I fully agree that this algorithm depends heavily on the quality of
the distribution-function fit. That's why I chose the General Vector Machine
(arXiv:1704.06885), a 3-layer NN trained without backpropagation. It is based
on a Monte Carlo method; you can read that paper if you are interested. GVM is
well suited to function fitting, as shown in figures 2, 5 and 6 of
arXiv:1704.06885. And since it is a 3-layer NN, it keeps the time cost of the
ML step low. If the distribution has regions with sharp features, we can apply
the ML fit piecewise.
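To give a flavour of "Monte Carlo instead of backpropagation" (an illustrative
random-search fit of a 3-layer network, not the actual GVM; see
arXiv:1704.06885 for that):

    import numpy as np

    def monte_carlo_fit(xs, ys, hidden=100, steps=20000, scale=0.1, seed=0):
        # 3-layer net: 1 input -> `hidden` tanh units -> 1 output.
        xs = np.asarray(xs, dtype=float)
        ys = np.asarray(ys, dtype=float)
        rng = np.random.default_rng(seed)
        shapes = ((1, hidden), (hidden,), (hidden, 1), (1,))
        params = [rng.normal(size=s) for s in shapes]

        def predict(p, x):
            w1, b1, w2, b2 = p
            return (np.tanh(x[:, None] @ w1 + b1) @ w2 + b2).ravel()

        best = np.mean((predict(params, xs) - ys) ** 2)
        for _ in range(steps):
            # Monte Carlo step: perturb one weight array at random and
            # keep the change only if the error drops -- no gradients.
            i = rng.integers(len(shapes))
            trial = [p.copy() for p in params]
            trial[i] += scale * rng.standard_normal(trial[i].shape)
            err = np.mean((predict(trial, xs) - ys) ** 2)
            if err < best:
                params, best = trial, err
        return lambda x: predict(params, np.asarray(x, dtype=float))

    # Fit the empirical CDF of a sorted sample, e.g.:
    sample = np.sort(np.random.default_rng(1).normal(size=500))
    cdf_hat = monte_carlo_fit(sample, np.arange(1, 501) / 500.0)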

Third, in the paper I compare M with log N. The theoretical lower bound on M
is 1, which corresponds to knowing the cumulative distribution in advance.
Unfortunately, the number of neurons M is usually larger than 1 (we take
M = 100 in this paper), because we cannot get a good approximation of the
distribution with too few neurons. I think this is worth further study. Also,
with this O(NM) sorting algorithm we do not claim it is faster than
O(N log N) quicksort in all cases. If we can apply GPU/TPU acceleration, it
may show great potential for sorting large data. Current parallel sorting
algorithms spend a large proportion of their time on network communication,
but with our method little communication is needed, so the networking time
cost can be greatly reduced.

If you would like to discuss our paper further, you are welcome to email me at
phzhq@mail.ustc.edu.cn

------
devilsbabe
It would have been much more interesting to see the algorithm tested on real
data.

------
partycoder
You could also claim sleep sort is O(N).
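For the uninitiated: sleep sort "sorts" by sleeping in proportion to each
value, so the loop looks O(N) only if you pretend the waiting is free. A
sketch (non-negative numbers only, and only as correct as the scheduler):

    import threading
    import time

    def sleep_sort(values):
        out = []
        threads = [threading.Thread(target=lambda v=v: (time.sleep(v / 100),
                                                        out.append(v)))
                   for v in values]
        for t in threads:
            t.start()
        for t in threads:
            t.join()
        return out

    print(sleep_sort([3, 1, 2]))  # probably [1, 2, 3]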

