This is basically bucket sort with machine learning thrown in. It is not O(N). O(N) has a precise mathematical meaning that should not be abused: it means there exists some fixed K such that for every N and every input of size N, the running time is at most K*N. This algorithm does not have that property, so it is not O(N). It may have expected linear running time, but that is not the same thing.
From the conclusion: "For some distributions with less smoothness, it might be difficult to learn the distribution, and if the training fails...the algorithm might fail."
It's 2^40, and log2(2^40) is 40.
I'd be more interested in seeing the analysis in terms of I/O operations, because CPU will be plentiful once the data set grows beyond what can be juggled in memory.
That said, the O(N) claim is purely sensational. It all depends on being able to determine the distribution parameters with sufficient accuracy in O(N) time, which isn't really guaranteed for all kinds of data you might want to sort.
* Randomly choose some fixed-size subset of the data and sort it;
* Based on that, deduce the distribution of the data;
* Based on that, create a destination array, and copy things from the input to roughly the expected right place in the output;
* Run around and fix things.
When sorting numbers, that feels like it has a chance of running quite quickly. For other objects with some other comparison function, not sure how it would work.
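For numbers, the steps above can be sketched in a few lines of Python. The uniform "model" between the sample's min and max is a hypothetical stand-in for whatever distribution fit the algorithm actually uses, and `sample_size` is an arbitrary choice:

```python
import random

def learned_sort(data, sample_size=1000):
    """Sketch of the steps above: sample, fit a crude distribution
    model, scatter elements near their predicted ranks, then fix up
    locally. The uniform model is a placeholder for a real fit."""
    n = len(data)
    # 1. Sort a random fixed-size sample.
    sample = sorted(random.sample(data, min(sample_size, n)))
    lo, hi = sample[0], sample[-1]
    span = (hi - lo) or 1
    # 2. "Model": assume roughly uniform between the sample min and max.
    # 3. Scatter each element into a bucket near its predicted position
    #    (the bucket index is monotone in the value).
    buckets = [[] for _ in range(n)]
    for x in data:
        pos = int((x - lo) / span * (n - 1))
        buckets[max(0, min(n - 1, pos))].append(x)
    # 4. Fix up: sort each small bucket and concatenate.
    out = []
    for b in buckets:
        out.extend(sorted(b))
    return out
```

Because the bucket index is monotone in the value, the fix-up step only ever has to sort within buckets, which stay small when the model is close to the true distribution.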
"it works well for big data sets. Because the big data
sets usually have fair statistical properties, and its distribution is usually easy to derive and also has some kind
It'll be interesting to see how the method fares with messy real-world data over many examples. O(n) is obviously the best case (which is also the best case of the infamous bubble sort if early exit is implemented); how commonly it achieves that performance, or close to it, is key. It seems to be a two-stage process: the first stage more-or-less sorts the input using a simple model of the expected distribution, and the second is a tidying exercise that is essentially a more traditional sort.
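That tidying stage is exactly where the cost hides: it is cheap only when the first stage leaves elements close to their final positions. Plain insertion sort, for instance, runs in O(N + D) time where D is the total displacement of elements from their sorted positions, so it degrades gracefully as the model gets worse. A minimal illustration:

```python
def insertion_sort(a):
    """In-place insertion sort: O(N + D), where D is the total
    displacement. Near-linear when the input is already nearly
    sorted, which is what the first stage is meant to produce."""
    for i in range(1, len(a)):
        x = a[i]
        j = i - 1
        # Shift larger elements right until x's slot is found.
        while j >= 0 and a[j] > x:
            a[j + 1] = a[j]
            j -= 1
        a[j + 1] = x
    return a
```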
First, this method is not based on comparison; I am well aware that the lower bound for comparison sorting is N log N. The idea of this algorithm is that when we sort big data, the data may have some statistical properties precisely because it is so large, and if we choose a subset of it, the sub-dataset should hold the same properties. This is the basic assumption. I know this assumption is strong, but it is reasonable for big data. As described in the paper, we transform the sorting problem into a mapping problem: we learn the cumulative distribution and predict each element's position in the future sorted sequence.
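The mapping from value to predicted rank can be sketched with an empirical CDF built from a sample, standing in for the paper's learned model (the `sample_size` and the bisect-based CDF are illustrative choices, not the paper's):

```python
import bisect
import random

def predict_positions(data, sample_size=256):
    """Turn sorting into position prediction: estimate the CDF F
    from a random sample (an empirical CDF here, standing in for
    the neural-network fit) and predict each element's rank in the
    sorted output as floor(N * F(x))."""
    n = len(data)
    sample = sorted(random.sample(data, min(sample_size, n)))
    m = len(sample)

    def cdf(x):
        # Fraction of sampled values <= x.
        return bisect.bisect_right(sample, x) / m

    return [min(n - 1, int(n * cdf(x))) for x in data]
```

Since the estimated CDF is monotone, larger values always get predicted ranks at least as large as smaller ones, so prediction errors only ever need local correction.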
Second, I totally agree that this algorithm depends heavily on the quality of the distribution-function fit. That is why I chose the General Vector Machine (arXiv:1704.06885), a 3-layer NN trained without backpropagation. It is based on a Monte Carlo method; you can read that paper if you are interested. GVM is well suited to function fitting, as shown in Figs. 2, 5, and 6 of arXiv:1704.06885, and since it is a 3-layer NN, it keeps the time complexity of the ML step low. If the distribution has some sharply shaped region, we may apply the ML piecewise.
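To give a feel for derivative-free training of a small network, here is a toy Monte Carlo fit: perturb one random parameter and keep the change only if the squared error drops. This is NOT the actual GVM algorithm from arXiv:1704.06885, just an illustration of fitting a 1-hidden-layer tanh network without backpropagation; all hyperparameters are arbitrary:

```python
import math
import random

def mc_fit(xs, ys, hidden=8, iters=5000, step=0.2, seed=0):
    """Toy derivative-free fit of a 1-hidden-layer tanh network.
    Accept a random single-parameter perturbation iff the loss
    drops. Illustrative only; not the GVM training procedure."""
    rng = random.Random(seed)
    # Parameters: hidden weights, hidden biases, output weights, output bias.
    p = [rng.uniform(-1, 1) for _ in range(3 * hidden + 1)]

    def predict(params, x):
        out = params[-1]
        for i in range(hidden):
            out += params[2 * hidden + i] * math.tanh(params[i] * x + params[hidden + i])
        return out

    def loss(params):
        return sum((predict(params, x) - y) ** 2 for x, y in zip(xs, ys))

    initial = best = loss(p)
    for _ in range(iters):
        i = rng.randrange(len(p))
        old = p[i]
        p[i] += rng.uniform(-step, step)
        trial = loss(p)
        if trial < best:
            best = trial  # keep the improvement
        else:
            p[i] = old    # revert
    return (lambda x: predict(p, x)), initial, best
```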
Third, in the paper I compare the relation between M and log N. The theoretical lower bound on M is 1, which corresponds to knowing the cumulative distribution in advance. Unfortunately, the number of neurons M is usually much larger than 1 (we take M = 100 in this paper), since we cannot get a good approximation of the distribution with too few neurons. I think this is worthy of further study. Also, by calling this an O(NM) sorting algorithm, we do not mean it is faster than N log N quicksort in all cases. If we can apply GPU/TPU acceleration, it may show huge potential for sorting large data: current parallel sorting algorithms spend a large proportion of their time on networking, but our method needs little networking, so we can largely reduce that cost.
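To put the M versus log N comparison in concrete numbers (using M = 100 from the paper):

```python
import math

# Ignoring constant factors, O(N*M) with M = 100 beats O(N*log N)
# only once log2(N) exceeds 100, far beyond any practical N; the
# practical case therefore rests on hardware parallelism, not on
# the asymptotics alone.
M = 100
for exp in (20, 40, 60):
    n = 2 ** exp
    print(f"N = 2^{exp}: log2(N) = {math.log2(n):.0f} vs M = {M}")
```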
If you would like to discuss our paper further, you are welcome to email me at email@example.com