
An interesting data structure with search time O(sqrt n) - z3phyr
http://forum.dlang.org/post/n3iakr$q2g$1@digitalmars.com
======
nikic
Per the last posts: Number of heaps: O(n^(1/3)). Maximum elements per heap
O(n^(2/3)).

Lookup of an element. Find the heap that contains the element (i.e. the element
lies between the maximum of the previous heap and the maximum of this heap).
Perform a linear search in that heap. Worst case complexity O(n^(1/3) +
n^(2/3)) = O(n^(2/3)).

Insertion of an element. Find the first heap whose maximum element is larger
than the new element. Extract the top element of that heap and insert the new
element instead. Then propagate upwards (i.e. extract the top of the next heap
and insert the top of the previous heap, etc.). Worst case is O(n^(1/3) +
log(n^(2/3)) * n^(1/3)) = O(log(n) * n^(1/3)).

Deletion of an element. Find the heap that contains the element (i.e. the
element lies between the maximum of the previous heap and the maximum of this
heap). Remove the element from it. Extract the top of the next heap and insert
it into this one. Then continue propagating upwards. So this is O(n^(1/3) +
n^(2/3) + log(n^(2/3)) * n^(1/3)) = O(n^(2/3)).

Building the structure. Why is this O(n)? If you have already segregated the
array into segments for the separate heaps, heapifying them would be O(n). How
can the segmentation be done in O(n)?

Edit: Using the suggestion of splitting up according to 1+3+5+...+(2k-1) = k^2
we'd get:

Search: O(sqrt(n) + sqrt(n)) = O(sqrt(n))

Insert: O(sqrt(n) + log(sqrt(n)) * sqrt(n)) = O(log(n) * sqrt(n))

Delete: O(sqrt(n) + sqrt(n) + log(sqrt(n)) * sqrt(n)) = O(log(n) * sqrt(n))
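
For illustration, here's a rough Python sketch of the lookup with that split
(the contiguous layout with each max-heap's maximum at its first slot is an
assumption based on the description above; the helper names are mine):

    # Minimal sketch: the array is split into max-heaps of sizes 1, 3, 5, ...
    # stored back to back, each heap's maximum sits in its first slot, and the
    # maxima are in increasing order.
    def lookup(a, x):
        start, size = 0, 1
        while start < len(a):
            end = min(start + size, len(a))
            if a[start] >= x:  # first heap whose maximum covers x
                return any(a[i] == x for i in range(start, end))
            start, size = end, size + 2  # skip to the next, larger heap
        return False

That's O(sqrt n) heaps to skip plus O(sqrt n) elements to scan, matching the
bound above.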

~~~
mtdewcmu
Deletion can't be done efficiently, as pointed out by Timon Gehr:

> Deletion leaves a hole in one heap, which should be filled by the minimum
> element of the next heap, etc. The minimum cannot be extracted efficiently
> from a max-heap.

~~~
nikic
That's true, I got confused with min and max there. gre's suggestion of using
a min-max heap (if deletion is required) seems a reasonable workaround for
this issue.

~~~
mtdewcmu
Using min-max heaps should be workable; indeed, the resulting data structure
seems to be a special case of the data structure described in section 3 of
[1].

[1]
[http://www.cs.otago.ac.nz/staffpriv/mike/Papers/MinMaxHeaps/...](http://www.cs.otago.ac.nz/staffpriv/mike/Papers/MinMaxHeaps/MinMaxHeaps.pdf)

------
evanpw
Here's a related paper (on HN last month) which tries to find the arrangement
of elements in an array which minimizes search time, measuring wall-clock time
rather than theoretical complexity:
[http://arxiv.org/ftp/arxiv/papers/1509/1509.05053.pdf](http://arxiv.org/ftp/arxiv/papers/1509/1509.05053.pdf).
(Spoiler alert: it's not sorted order with binary search, and the answer has a
lot to do with the cache).

~~~
mtdewcmu
Interesting. One layout they did not consider is the bit-reversal
permutation[1] of the sorted array. Searching is straightforward: to probe
logical index i, you reverse the bits of i, so you can still do a plain binary
search. This layout is similar to Eytzinger, so it should perform at least as
well, and it may have some advantages.

[1] [https://en.wikipedia.org/wiki/Bit-reversal_permutation](https://en.wikipedia.org/wiki/Bit-reversal_permutation)
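
For illustration, a small Python sketch of that search (n is assumed to be a
power of two, and the function names are mine):

    # Sketch: sorted data stored in bit-reversal order, probed by reversing the
    # logical index at each step of an ordinary binary search.
    def bit_reverse(i, bits):
        r = 0
        for _ in range(bits):
            r = (r << 1) | (i & 1)
            i >>= 1
        return r

    def make_layout(sorted_a):
        # place logical index i at physical index bit_reverse(i)
        bits = (len(sorted_a) - 1).bit_length()
        out = [None] * len(sorted_a)
        for i, v in enumerate(sorted_a):
            out[bit_reverse(i, bits)] = v
        return out

    def search(layout, x):
        bits = (len(layout) - 1).bit_length()
        lo, hi = 0, len(layout) - 1
        while lo <= hi:
            mid = (lo + hi) // 2
            v = layout[bit_reverse(mid, bits)]
            if v == x:
                return True
            if v < x:
                lo = mid + 1
            else:
                hi = mid - 1
        return False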

------
psykotic
From some homework I did as a freshman, I remember another neat data structure
with O(sqrt(n)) search:
[https://ece.uwaterloo.ca/~dwharder/aads/Algorithms/Beaps/](https://ece.uwaterloo.ca/~dwharder/aads/Algorithms/Beaps/)

------
abcdabcd987
I'm confused. I can follow the logic for O(sqrt n) searching. Yeah, that's
cool. But how do you maintain the structure when inserting or deleting?

Take insertion, for example: if I want to insert a value that is smaller than
the smallest element in the last heap, then it will have to be inserted into
some heap in the middle, right? And since that heap is in the middle, say heap
K, its size has already reached its upper limit K^2. So where should the new
value go? If I insist on pushing it into heap K, then the heap will violate
the size limit. Should I split the oversized heap into several heaps at some
point? And if it does split, does the O(sqrt n) search time still hold?

Or maybe I just haven't caught the author's idea yet. :-(

BTW, why not a BST, where everything is O(log n)?

Update: @nikic answered my questions. Thanks!

~~~
deciplex
I think for insertion, once you find the heap the new element should be
inserted into, you would remove the max element from the heap, and re-heapify
that subarray with the new element. Then you would insert the max you just
removed into the next heap in the same way, and so on. That sounds like it
could be less expensive than shifting the whole array, but I haven't done the
math (and it seems that Alexandrescu hasn't either, yet).
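
A rough Python sketch of that propagation, under the assumptions below (each
heap is kept as a list of negated values so heapq's min-heap acts as a
max-heap; starting a fresh heap once the last one hits its size limit is left
out):

    import heapq

    def insert(heaps, x):
        # find the first heap whose maximum (-heaps[k][0]) is >= x
        k = 0
        while k < len(heaps) - 1 and -heaps[k][0] < x:
            k += 1
        carry = x
        for h in heaps[k:-1]:
            # drop the carried value in, take the heap's maximum back out,
            # and carry that maximum on to the next heap
            carry = -heapq.heappushpop(h, -carry)
        heapq.heappush(heaps[-1], -carry)  # the last heap simply grows by one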

Note that I'm making a couple of assumptions about the data structure that
were left unstated in the original post:

* Each element of each heap is less than every element of every subsequent heap.

* "When the max value in a heap is less than the searched element" was a typo and "When the max value in a heap is _greater than_ the searched element" was intended.

Maybe I just totally misunderstand this data structure though :-( It's still
morning for me.

~~~
AYBABTME
You'd have to re-heapify every heap after the one that ejected an element,
since all the heaps after it are full except for the very last one.

~~~
deciplex
I said that:

> _Then you would insert the max you just removed into the next heap_ in the
> same way, _and so on._

------
biot
I'm not sure I'm able to visualize this correctly. Let's say we have an array
of strings containing the top 20 movies from Rotten Tomatoes:

    
    
      The Wizard of Oz
      The Third Man
      Citizen Kane
      All About Eve
      Das Cabinet des Dr. Caligari. (The Cabinet of Dr. Caligari)
      Modern Times
      A Hard Day's Night
      The Godfather
      E.T. The Extra-Terrestrial
      Metropolis
      It Happened One Night
      Singin' in the Rain
      Laura
      The Adventures of Robin Hood
      Inside Out
      Repulsion
      Boyhood
      North by Northwest
      King Kong
      Snow White and the Seven Dwarfs
    

What would this data structure look like?

~~~
bpicolo
From my understanding it would be roughly:
[https://gist.github.com/bpicolo/32a7fc775ce1810c88a0](https://gist.github.com/bpicolo/32a7fc775ce1810c88a0)

~~~
mmozeiko
It won't be an array of arrays. It will be just one array.

~~~
bpicolo

      we decompose an array of size 35 into: an array of size 1 followed by arrays of sizes 4, 9, 16, 5
    

Then again, maybe he means `mentally decompose` in terms of pretending they're
separate arrays? I interpreted it as `visualize this data structure`.
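
For what it's worth, a tiny Python snippet that just mirrors the quoted rule
(consecutive squares plus whatever is left over):

    def split_sizes(n):
        sizes, k = [], 1
        while n >= k * k:
            sizes.append(k * k)
            n -= k * k
            k += 1
        if n:
            sizes.append(n)  # the final, partial segment
        return sizes

    print(split_sizes(35))  # [1, 4, 9, 16, 5]
    print(split_sizes(20))  # [1, 4, 9, 6] for the 20 movies above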

~~~
dubroff
I also originally thought he had an array of arrays, but now I think the
run-times he mentioned for insertion and deletion would only work for a single
array, not an array of arrays.

------
davidhariri
From what I understand, this is how Medium organizes the fast retrieval and
editing of their rich text. You might want to read this for more information:
[https://medium.com/medium-eng/why-contenteditable-is-terrible-122d8a40e480#.b5dod0r3n](https://medium.com/medium-eng/why-contenteditable-is-terrible-122d8a40e480#.b5dod0r3n)

I may have missed something, though :-|

~~~
lobster_johnson
That doesn't sound like it at all. Medium's document model is a list of
paragraphs, and that (according to your link) is it.

------
mjevans
The context in which the proposed data structure is used matters a great deal
when evaluating how effective it is at maximizing performance.

This context includes not only the characteristics of the (typical) systems
using the data layout, but also the typical usage of that structure
(read-heavy, write-heavy, or some mix). The cost of memory is also an issue.
For a typical end-user program the bottleneck is almost always going to be the
user. For larger-scale programs, or more performance-intensive areas of a
program (e.g. graphics languages or other large data operations), different
tradeoffs should be evaluated.

------
jegutman
Even just that question: "Why?" and the corresponding answer are worth a read.

~~~
Einherji
Agreed. I love that this came from noticing a "gap" and coming up with
something to fill the hole. I wish I thought like this more often!

~~~
logicallee
"Find a gap and fill it, I always say! My great-grand-uncle is the one who
came up with disposable ketchup packets. Me, I found a data structure with
O(sqrt) search time. Sure it doesn't have all the doohickeys of a Red-Black
Tree, it's not as quick on the insert as a radix - but at the end of the day
it gets the job done. And to some people, that's all that matters..."

------
gamesbrainiac
This is a very interesting discussion, but I'd like to ask a question about
this in particular:

> The short of it is, arrays are king. No two ways about it - following
> pointers is a losing strategy when there's an alternative.

Why is following pointers a losing strategy? If you have a contiguous array (a
vector), then the prefetcher can do its magic and give you a pseudo-cache
level. (I may be wrong about this assumption.)

~~~
tux3
The prefetcher works only if it can predict what you're going to access next.
Chasing pointers means essentially jumping to random addresses in a given
memory region. That's pretty much the worst situation from a caching point of
view; what the prefetcher loves is linear traversal.

Of course things become more complicated when you start having larger amounts
of data, and you want to skip over most of it, but still keep the cache happy.

~~~
Someone
In theory, an advanced prefetcher can detect that you are stepping through an
array of pointers, dereferencing each of them, and start prefetching.

Even that would have its problems, though. Typically, every one of these
dereferences brings in a new cache line. Unless the data that the pointers
point to fills an integral number of cache lines, you are bringing data into
the cache that you will not need.

~~~
eloff
There are costs to this, because you may also dereference things that are not
valid pointers. The kernel used to have software prefetch code for linked list
traversal. On each access it would prefetch the subsequent node, which sounds
like a huge win. However, in practice the most common list size was just a few
elements, and there was a huge penalty for fetching the final, invalid null
pointer. They ended up taking out the prefetching recently (and I think they
saw a small speedup as a result!). Now maybe someone could invent a hardware
prefetcher that does this in a way that's a net gain, but the fact that Intel
hasn't done it yet suggests that maybe it doesn't help much or that the
situation isn't all that common.

------
aheifets
I haven't thought about it closely but, from a cursory read, it sounds like
the SquareList data structure: [http://www.drdobbs.com/database/the-squarelist-data-structure/184405336](http://www.drdobbs.com/database/the-squarelist-data-structure/184405336)

------
Veedrac
Another thing that works is a length-sqrt(n) list of length-sqrt(n) sorted
lists with internal gaps, like with Python's SortedContainers[1].

Something like this with a contiguous allocation would probably require quite
a lot of empty space - 100% reserve for each subarray by necessity[2] and 50%
for the array as a whole, so 200% overhead at worst and probably 90% overhead
on average. Alternatively you can heap allocate each subarray at the expense
of slower operations, but resulting in a more typical 100% worst-case
overhead.

So you'd store it like this:

    
    
        |----------HEADER---------| |-----------------------BUCKETS------------------------|
         idx
        (0, 3) (2, 2) (1, 4) (_, _) [1, 3, 7, _] [14, 15, 16, 19] [9, 12, _, _] [_, _, _, _]
            length
    

Cache efficiency might also suggest you store the minimums of each list in the
tuples.

So to search for a value, you do a binary search over the header to find the
wanted bucket and then over the bucket to find the value. That's O(log n).

Insertion is O(sqrt n) since you need to find the bucket and then do a normal
insertion into it. You might need to reallocate the whole thing (remember to
increase bucket size!) but that's amortized out.

Deletion is even easier since you don't have to deal with overflow (although
you might want to move empty buckets to the end or merge adjacent ones,
maybe). O(sqrt n).

Creating the structure is trivially O(n log n), since it's sorted.

Converting to a sorted list is just compacting the buckets.

You also get access by index at O(sqrt n) speed since you just need to
traverse the counts in the header. Not perfect, but close.

For a cache-efficient design it's pretty neat. Binary searches are fast, after
all.
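
Here's a minimal Python sketch of the two binary searches (the names and the
exact header bookkeeping are mine; it stores per-bucket minimums rather than
(idx, length) tuples, and bucket splitting is omitted):

    import bisect

    # mins[i] is the smallest element of buckets[i]; every bucket is a sorted list.
    def contains(mins, buckets, x):
        i = bisect.bisect_right(mins, x) - 1   # last bucket whose minimum is <= x
        if i < 0:
            return False
        j = bisect.bisect_left(buckets[i], x)
        return j < len(buckets[i]) and buckets[i][j] == x

    def insert(mins, buckets, x):
        i = max(bisect.bisect_right(mins, x) - 1, 0)
        bisect.insort(buckets[i], x)            # the O(sqrt n) shift happens here
        mins[i] = buckets[i][0]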

[1]:
[http://www.grantjenks.com/docs/sortedcontainers/implementati...](http://www.grantjenks.com/docs/sortedcontainers/implementation.html)

[2]: When you split an overfull subarray, each new subarray will be half full.

~~~
Veedrac
After some thought, you can avoid all that reserve by just not having any
reserve buckets (and thus no need for the header either). You only need to do
a full rebuild once every O(sqrt n) operations, and a rebuild costs O(n), so
insertion is amortized O(n / sqrt n) = O(sqrt n). You can lower the constant
factors a bit by doing occasional local reshuffling.

Not sure if it's a great idea, but it's tempting.

------
rntz
I don't understand this bit:

> Now each of these arrays we organize as a max heap. Moreover, we arrange
> data such that the maximums of these heaps are in INCREASING order. That
> means the smallest element of the entire (initial) array is at the first
> position, then followed by the next 4 smallest organized in a max heap,

Just because the maximums of the heaps are in increasing order doesn't mean
that the minimum is at the front. It only means the first element is _smaller_
than the _biggest_ element of the next four.

Consider, for example:

    
    
        1 4 3 2 0
    

1 < 4, certainly, but 1 > 0. What am I missing?

~~~
deciplex
I'm pretty sure there is an additional (unstated) assumption that each element
of each subarray is less than all elements of subsequent subarrays. Otherwise
you could just take the largest sqrt(n) elements, sort them, and then randomly
assign the other elements of the collection to their heaps and it would still
meet the criteria he's laid out, but the search algorithm he's proposed
wouldn't work.

In fact, there's an error in the search algorithm as well:

> _Whenever the maximum element (first element in the subarray) is smaller
> than the searched value, we skip over that entire heap and go to the next
> one. In the worst case we'll skip O(sqrt n) heaps._ When the max value in a
> heap is less than the searched element, _we found the heap and we run a
> linear search among O(sqrt n) elements._

For the unitalicized portion above, he means to say "When the max value in a
heap is greater than the searched element".

------
ithkuil
Relevant algorithms by J. Ian Munro and Hendra Suwanda from 1979

[https://cs.uwaterloo.ca/research/tr/1979/CS-79-31.pdf](https://cs.uwaterloo.ca/research/tr/1979/CS-79-31.pdf)
[http://www.sciencedirect.com/science/article/pii/00220000809...](http://www.sciencedirect.com/science/article/pii/0022000080900379)

------
altonzheng
Is it immediately obvious that this can be constructed in O(n) time? The
author just glosses over it, but I'm still not able to figure it out.

------
chias
Isn't O(log(n)) equivalent to O(sqrt(n))?

~~~
rntz
No.

Just as e^n grows _faster_ (in the limit) than any polynomial of n, such as
n^2 or n^100, so log(n) grows _slower_ than any polynomial of n, such as
sqrt(n) = n^(1/2), or even n^(1/100).

They are both practically pretty slow-growing, though. It may well be that for
practical sizes, constant factors matter more than the asymptotic difference
on present computers.

~~~
JBiserkov
In my parallel programming class the professor told us: "When I was a student,
our professor told us that for practical purposes log(n) = 7."

------
blahblehbluh
Some other algorithms with O(sqrt(n)) runtime: [https://www.quora.com/Are-there-any-algorithms-of-the-order-O-sqrt-n?share=1](https://www.quora.com/Are-there-any-algorithms-of-the-order-O-sqrt-n?share=1)

~~~
necessity
Another one is a uniform-sized carry-select adder. It has O(sqrt(n)) delay,
assuming MUX delay << full adder delay, IIRC.

[https://en.wikipedia.org/wiki/Carry-select_adder](https://en.wikipedia.org/wiki/Carry-select_adder)

------
herewhere
if 4 <= N <= 16, then sqrt(N) <= log2(N); otherwise sqrt(N) > log2(N)

So, in Big-O terms, this is only an optimization for small datasets (4 <= N <=
16).
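
A quick brute-force check of that arithmetic in Python:

    import math

    # the claim: sqrt(N) <= log2(N) holds exactly for 4 <= N <= 16
    print([n for n in range(2, 1000) if math.sqrt(n) <= math.log2(n)])
    # [4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16]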

~~~
teraflop
Big-O is only strictly meaningful in terms of asymptotic behavior; as soon as
you start talking about specific values of N, you have to care about constant
factors.

In this case, the constant factors will probably be quite different because
the goal is to make better use of CPU cache.

