Hacker News new | comments | ask | show | jobs | submit login
[flagged] An O(N) Sorting Algorithm: Machine Learning Sorting (arxiv.org)
30 points by akaryocyte 9 months ago | hide | past | web | favorite | 14 comments

Clickbaity paper title. That raised my hackles a bit.

This is basically bucket sort with machine learning thrown in. It is not O(N). O(N) has a very specific mathematical meaning, do not abuse it. It means there exists some fixed K that for every N and every input of size N, the running time is <= K*N. This algorithm does not have that property, so it is not O(N). It might be expected linear running time, but it is not O(N).

From the conclusion: "For some distributions with less smoothness, it might be difficult to learn the distribution, and if the training fails...the algorithm might fail."


Also, they end up with a O(N*M) algorithm. They claim M is small, but it needs to be strictly smaller than logN if they want to be better than O(NlogN). A terabyte is 10^12. Thus log(N) is 12. They have 100 hidden neurons which is their M. So their algorithm isn't asymptotically faster, unless they go in N>10^100 with that amount of hidden neurons.

For 32bit ints that in the order of 3.6e88 TBs.... To put that into perspective the visible universe contains an estimated of 10^82 atoms at most, so if a single atom contains 1 petabyte of data, and you want to sort all atoms and all or their data - this ML approach would be faster. I wish they had that discussion in the paper.

> A terabyte is 10^12.

It's 2^40, and log2(2^40) is 40.

While i like the novel approach to see if it's possible, I miss some discussion that relates them to other algorithms for known scenarios. Sorting is embarrassingly easy to parallelize. Basically you can Divide'n'Conquer with as many machines as needed with only the networking and storage interface as the limiting factor. Could it be cheaper when using a GPU and ML? Probably, but when the running time is unpredictable and highly dependent on the training and you cannot even guarantee correctness, is it really worth it?

I'd be more interested in seeing the analysis in terms of IO operations; because the CPU will be plentiful when the data set grows beyond what is possible to juggle in the memory.

This seems like a fairly natural generalization of the idea in "The Case for Learned Index Structures" [1]: speed up some classical algorithm based on heuristics by modeling the data with a neural network that can be tuned to the specific task that the general algorithm is applied to.

That said, the O(N) claim is purely sensational. It all depends on being able to determine the distribution parameters with sufficient accuracy in O(N) time, which isn't really guaranteed for all kinds of data you might want to sort.

[1] https://arxiv.org/abs/1712.01208

I've only briefly skimmed the paper, but here's something I've come up with, based on what I read, that might or might not be the same as what they've said, but I think "has legs":

* Randomly choose some fixed sized subset of the data and sort it;

* Based on that, deduce the distribution of the data;

* Based on that, create a destination array, and copy things from the input to roughly the expected right place in the output;

* Run around and fix things.

When sorting numbers, that feels like it has a chance of running quite quickly. For other objects with some other comparison function, not sure how it would work.

It works for datasets whose distribution an underlying machine learning model can successfully derive. Otherwise it would fail:

"it works well for big data sets. Because the big data sets usually have fair statistical properties, and its distribution is usually easy to derive and also has some kind of continuity"

What's the asymptotical complexity of running around and fixing things?

That stage is presumably a more traditional sort operation. If the first stage does a good job of estimating the correct order this could also be O(n) or as close to as makes no odds. If the set isn't modelled properly by the earlier stages it will still be O(n log n). Obviously that operation needs to be an algorithm that can take advantage of partially pre-sorted input.

Looks interesting after a quick skim, and for simpler datasets "machine learning" shouldn't be needed just some more traditional statistical analysis. I'll have to find time to read it fully and play with the idea.

It'll be interesting to see how the method fairs with messy real-world data over many examples [O(n) is obviously the best case, which is also the best case of the infamous bubble sort if early exit is implemented, how commonly it achieves that performance or close to it is key]. It seems to be a two stage process: the first being to more-or-less sort the input using a simple model of the expected distribution, the second being a tidying exercise that is essentially a more traditional sort.

I'm the author of this paper. I didn't realize this paper cause so many intensively discussions here until I heard from my friends. I really appreciate your advice. I would like to explain some questions about this paper.

First, this method is not based on comparsion, I definitely know the lower boundary of comparsion sorting is NlogN. The idea of this algorithm is that when we sort big data, the data may have some statistical properties cause it so large, and if we choose some of them, the sub-dataset should hold the properties, this is the basic assumption. I know this assumption is strong, but it is reasonable for big data. According to the paper, we transform the sorting problem to a mapping problem, so we learn the cumulative distribution and predict the position for a data in the future sorted sequence.

Second, I totally agree with the saying that this algorithm highly depends on the quality of distribution function fitting. That’s why I choose the General Vector Machine(arXiv:1704.06885), it is a 3-layer NN without Backpropogation. It is based on Monte-Carlo method and you can read this paper if you are interested. GVM is suitable for function fitting as the fig 2,5,6 in arXiv:1704.06885. And since it's a 3-layer NN, it can reduce time complexity for the ML. And if the distribution has some sharp shape region, we may apply ML piecewise.

Third, in the paper, I compare the relation between M and LogN. Actually the theoretical lower boundary of M is 1, in this case we assume we know the cummulative distribution in advance. But, unfortunately, the number of neurons M is usually larger than 1(we take M=100 in this paper), we are not able to get a good approximate distribution with too little neurons. Actually, I think it's worth for further study. Also, with this O(NM) sorting algorithm, we do not mean it’s faster than the NlogN quick sorting in all the cases. If we can apply GPU/TPU acceleration, it may show a huge potential for sorting large data. The current parallel computing sorting algorithms spend a large proportion of time in networking, but with our method, little networking is needed, hence we can largely reduce networking time cost.

If you would like to further discuss about our paper, welcome to email me at phzhq@mail.ustc.edu.cn

It would have been much more interesting to report on tests of the algorithm on real data.

You could also claim sleep sort is O(N).

Applications are open for YC Summer 2019

Guidelines | FAQ | Support | API | Security | Lists | Bookmarklet | Legal | Apply to YC | Contact