Deduplicate a slice in Go, use sort or a map? (boyter.org)
37 points by nalgeon on April 12, 2023 | 51 comments



I am curious how much of a difference there would be if the author would use

  result := make([]uint64, 0, len(elements))
in the old func. As written, I think the backing array will have to be repeatedly reallocated as the append() calls push past its capacity.

And as others have pointed out, maybe use an empty struct (as opposed to a bool or empty interface) as the map value type. Also TIL about make(map[T]S, len) - didn’t know you could set len, or at least hadn’t seen it in practice, but it was suggested here and I think it would likely help as well.

Edited: I forgot to set len to 0.
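
For illustration, a minimal sketch combining those suggestions (the function name dedupe is mine, not the article's): preallocate the result slice with zero length and full capacity, give the map a size hint, and use struct{} values.

    func dedupe(elements []uint64) []uint64 {
        // Size hint for the map and full capacity for the result slice,
        // so neither needs to grow while we iterate.
        seen := make(map[uint64]struct{}, len(elements))
        result := make([]uint64, 0, len(elements))
        for _, e := range elements {
            if _, ok := seen[e]; !ok {
                seen[e] = struct{}{}
                result = append(result, e)
            }
        }
        return result
    }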


In C++'s STL you can reserve() a specific capacity for the std::vector to use, and at the end of your operation you can trim off the unused memory using shrink_to_fit():

    std::vector<int> elems;
    elems.reserve(max_bound_num_of_elems);
    // do some push_back()'s
    elems.shrink_to_fit();
Is there anything similar in Go?


That can do almost anything (including reallocating), so you can’t really rely on it:

“Requests the container to reduce its capacity to fit its size.

“The request is non-binding, and the container implementation is free to optimize otherwise and leave the vector with a capacity greater than its size.

“This may cause a reallocation, but has no effect on the vector size and cannot alter its elements.”


Yes. You can make a slice with zero length but the required capacity and append to it as you go. You can also make a slice of the estimated length and then reslice to the actual length.

   arr := make([]uint64, estimatedLength)
   
   ...
   
   arr = arr[:actualLength]
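
And a sketch of the first variant (zero length, full capacity, then append); collect and keep are placeholder names, not anything from the article:

    // Appends extend the length in place and only reallocate if the
    // capacity estimate turns out to be too small.
    func collect(input []uint64, keep func(uint64) bool) []uint64 {
        arr := make([]uint64, 0, len(input)) // len(input) as an upper bound
        for _, v := range input {
            if keep(v) {
                arr = append(arr, v)
            }
        }
        return arr
    }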


This would reallocate, wouldn't it?


If you were using append, then yes. Minor nit: when allocating with length (make(type, len)), you’ll get a slice with zero values for n elements.

If you use make(type, len, capacity), then it would only reallocate if your estimated size was under the actual size.

Also I believe that re-slicing would still use the same underlying array, so it would set the length but still keep the capacity.

You’d have to create a new slice of actual capacity and copy to it.
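
Something like this sketch, which is about the closest Go gets to shrink_to_fit (the surplus only goes away once the garbage collector frees the old backing array):

    // Copy into a slice allocated at exactly the final length; the
    // oversized backing array becomes garbage once nothing references it.
    func shrink(arr []uint64) []uint64 {
        trimmed := make([]uint64, len(arr))
        copy(trimmed, arr)
        return trimmed
    }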


Neither approach will reallocate.

In the append case, since the underlying slice has spare capacity, each append just extends the length. If the estimated length was too small, then a reallocation would occur.

In the reslice case, the new slice is a value type that points to the original underlying memory, with a length that is shorter. So no reallocation.


Interesting idea. I am not sure anything like that is available in Go. In Go, with slices, once the backing array has been allocated, I don’t think you can deallocate just the end of it. But I might be wrong.

There is a difference in Go’s slices between length and capacity. Length is the number of elements in the slice, and capacity is the number of elements that can fit in the backing array (so len <= cap).
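
A tiny example of the distinction (reslicing shortens the length but leaves the capacity, and the backing array, untouched):

    package main

    import "fmt"

    func main() {
        s := make([]uint64, 0, 16)
        s = append(s, 1, 2, 3)
        fmt.Println(len(s), cap(s)) // 3 16
        s = s[:1]                   // reslice: shorter length, same array
        fmt.Println(len(s), cap(s)) // 1 16
    }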


The same would be true for the initial map:

Replacing

  encountered := map[uint64]bool{}
with

  encountered := make(map[uint64]bool, len(elements))
Should be faster.

Go hashmaps initially have one bucket holding up to 8 entries. If more elements are added, new buckets are created.

So giving a size hint, so that the right number of buckets is allocated up front, should always be faster.


Cache is one thing, but also you don't need to repeatedly perform hash functions. Array index lookup is of course far faster than map lookup.


also doing those map lookups twice per iteration, for some reason


Seems likely to be optimised by the compiler, not sure though.


Nice article. I discovered this myself recently, but in a slightly different context: Taking the union of two sets represented as arrays. To my surprise, sorting the arrays and iteratively computing their union was faster than a hashset.

Others in this thread have mentioned that computing the hash takes time. This is true. Let n be the size of the array. When n is small, the (constant) overhead of computing the hash >> O(n * log n) sorting. Now, consider what happens as n becomes large: Virtually every access becomes a cache miss. The (constant) overhead of a cache miss is >> O(n * log n). At the time, I did the math, and determined that n would need to be some unreasonably gargantuan quantity (on the order of the number of stars in the visible universe) before log n is big enough to overtake the effects of cache misses.
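
For concreteness, a sketch of that sort-then-merge union in Go (assuming uint64 elements and that sorting the inputs in place is acceptable; uses the standard sort package):

    // Union of two sets represented as slices: sort both, then merge,
    // skipping duplicates. Sequential memory access, no hashing.
    func union(a, b []uint64) []uint64 {
        sort.Slice(a, func(i, j int) bool { return a[i] < a[j] })
        sort.Slice(b, func(i, j int) bool { return b[i] < b[j] })
        out := make([]uint64, 0, len(a)+len(b))
        i, j := 0, 0
        for i < len(a) || j < len(b) {
            var v uint64
            switch {
            case j >= len(b) || (i < len(a) && a[i] < b[j]):
                v, i = a[i], i+1
            case i >= len(a) || b[j] < a[i]:
                v, j = b[j], j+1
            default: // equal: take one, advance both
                v, i, j = a[i], i+1, j+1
            }
            if len(out) == 0 || out[len(out)-1] != v {
                out = append(out, v)
            }
        }
        return out
    }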


> You could probably improve the above by using an interface rather than a bool to reduce the memory impact. If you need a function that preserves the order that might be a better implementation.

An empty interface occupies 16 bytes. Perhaps the author meant an empty struct which consumes nothing.
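
Easy to check with unsafe.Sizeof (results shown are for 64-bit platforms):

    package main

    import (
        "fmt"
        "unsafe"
    )

    func main() {
        var i interface{}
        var s struct{}
        var b bool
        fmt.Println(unsafe.Sizeof(i)) // 16: type word + data word
        fmt.Println(unsafe.Sizeof(s)) // 0
        fmt.Println(unsafe.Sizeof(b)) // 1
    }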


It's weird; I've heard of the author plenty of times (and frequently use https://github.com/boyter/scc), so I would've thought they knew about using an empty struct to emulate a set with no overhead in Go.


That’s what I get for turning throwaway Slack posts into blog posts without looking them over. Didn’t expect it to end up being read. It was more “I wrote this on Slack, I should save a copy”. Indeed, it should be a struct for lower overhead. I’ll fix it tomorrow once my ego recovers from the inevitable comments that suggest I should give up programming and never write any technical post again.


Sorry, the internet is full of people with low empathy.

Mistakes on the internet are corrected in very harsh ways, and often never forgiven.

But hey, keep writing, keep learning, keep reflecting and taking feedback to improve. Everyone does this.


FWIW, I’ve taken some concrete lessons away from this post and the ensuing discussion.


I appreciate the effort you've put into sharing your learnings. My intention was not to criticize. I just wanted to point it out for everyone's benefit. I apologize if my comment came across as harsh.


Not at all. I annoyed the fediverse a while back and that was what I would call harsh. Well that and some reddit comments.

I actually appreciate comments like yours. I did indeed make a mistake. As mentioned, I never intended this post to end up here. Hence the spelling errors and problems like this.


BTW thanks for scc. Got to know from the other comment that you are the author. I use it on my solo projects to get a sense of how much work I am actually doing.


I look forward to the beating I get when I update it to have accurate gitignore file processing and the post about doing that!


> You could probably improve the above by using an interface rather than a bool to reduce the memory impact. If you need a function that preserves the order that might be a better implementation.

Maps don't preserve order. The benchmark is also using 10-element slices. That is likely too small to get performance improvements via map lookup, and I think the benchmark would still favor unsorted slices over maps.

I'm afk but I suggest using map[uint64]struct{} to reduce memory usage over map[uint64]bool and I recommend benchmarking different sized source slices. 10 items is nothing. Bump that up a few orders of magnitude. There is no way 10 element slices are what they are seeing in production that is causing memory pressure. Still, if the problem is memory, an in-place sort is better than a full copy into a map. And as the author suggests, CPU caching should help with the slice operations.
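
A rough sketch of what such a benchmark could look like (dedupeMap and dedupeSort are placeholders for the two implementations under test; needs fmt, math/rand and testing):

    func BenchmarkDedupe(b *testing.B) {
        for _, n := range []int{10, 1000, 100000} {
            input := make([]uint64, n)
            for i := range input {
                input[i] = rand.Uint64() % uint64(n/2+1) // force some duplicates
            }
            b.Run(fmt.Sprintf("map/%d", n), func(b *testing.B) {
                for i := 0; i < b.N; i++ {
                    // copy so an in-place implementation can't mutate the shared input
                    dedupeMap(append([]uint64(nil), input...))
                }
            })
            b.Run(fmt.Sprintf("sort/%d", n), func(b *testing.B) {
                for i := 0; i < b.N; i++ {
                    dedupeSort(append([]uint64(nil), input...))
                }
            })
        }
    }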


> I'm afk but I suggest using map[uint64]struct{} to reduce memory usage over map[uint64]bool and I recommend benchmarking different sized source slices. 10 items is nothing.

The author covers both of these points in the article (though they don't disclose how large their "larger" slice is). (Edit: actually I misread you, sorry. You're right about empty structs vs interfaces.)

> Maps don't preserve order

The author's map-based approach preserves order in the result; they are comparing it to the sort solution, which does not preserve order, not claiming that maps themselves preserve order.


> One thing that stood out to me when doing so was a call I made to remove duplicate values from an array/slice of uint64. A fairly simple function that appeared often enough in the output.

But do those arrays need to contain duplicates, or can the code be changed elsewhere to not generate them?


The values come from the result of many hash functions.

I actually need them sorted as well, hence they are added to an array and deduplicated.

I had a look and didn’t see any obvious way to eliminate them, hence the initial function and the subsequent faster one.


Only you can decide that, but it's always worth the ask IMO.


Small nitpick:

    if !encountered[elements[v]] == true {
is a double negative and hard to understand. I'd write it as

    if !encountered[elements[v]] {
or if you must use the bare bool,

    if encountered[elements[v]] == false {



OP had memory pressure; compact reuses the same slice. This is a good solution.
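
Assuming this refers to slices.Compact (golang.org/x/exp/slices at the time, in the standard library since Go 1.21), the whole thing can be sketched as:

    // In-place dedupe: sort, then drop adjacent duplicates. The result
    // reuses the original backing array, so there is no second allocation.
    func dedupeInPlace(elements []uint64) []uint64 {
        slices.Sort(elements)
        return slices.Compact(elements)
    }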


What is missing are benchmarks that use the tested functions. The difference between them might disappear again.


Added them in for you.


Creating the map with an appropriate capacity would probably speed up that implementation significantly.

Having to rehash several times is nasty.


I’ve never actually seen a length passed to make() when creating a map - does it work?

Edit: yep - https://go.dev/ref/spec#Making_slices_maps_and_channels


[From that link]

> map of type T with initial space for approximately n elements

That "approximately" isn't great there. Nobody wants "approximately". If "exactly" is difficult - which it might be in many conceivable designs for this data structure - then what you want to offer is "at least".

If I'm a programmer and I know I want to put all 623 of these Doodads in the data structure, offering me a way to get a data structure with space for exactly 623 is great. One with space for at least 623 is fine, whereas approximately 623 is barely useful. Should I try asking for 624 in case that helps? Do I need to dig into the detailed implementation to figure out, OK, 700 will definitely ensure 623 will fit?


I’m guessing the amount of space occupied by N elements will vary depending on how the hashing plays out for the given keys (and probably key insertion order too). So ‘approximately’ is probably the best you can do without making enormous pessimal overallocations.


Blergh, I spent a while digging and you're right, which is pretty sad. Go's map appears to end up as an array of buckets where each bucket holds eight items but can form a linked list of overflow buckets as needed.

So if we tell Go we want to store 623 Doodads in the map, Go will try to guess how many buckets (of up to 8 Doodads) it should use for the map, and I think it'll try 128 here? Although maybe 96.

128 buckets means when we actually put our 623 Doodads in the map, on average the buckets are just over half full, but there's a fair chance that by accident one of our buckets fills completely, requiring an overflow, which allocates even though we told the map we had 623 Doodads.

There are a lot of design choices in Go that I don't sympathise with and this is one of them.


That's just how hashtables work, isn't it? Do you know of any hash table implementations that allow this kind of 'at least' allocation?


Most modern hash map designs don't do this weird shuffle with buckets and linked lists because pointer chasing is super expensive.

https://doc.rust-lang.org/std/collections/struct.HashMap.htm...

Because it's documenting the actual API in the standard library, this even spells out that the result has "at least the specified capacity".

https://github.com/facebook/folly/blob/main/folly/container/...

F14 is a linear map, but I couldn't immediately find actual API documentation; however, it should have the same property: if you ask for an F14 with a specific capacity, or you reserve enough capacity, that's an "at least" promise, not an approximate one.


I believe Rust uses hashbrown as the underlying implementation now. This just calculates the number of buckets based on the number of items requested:

https://github.com/rust-lang/hashbrown/blob/009969a860290849...

In the worst case all the items you insert will go in one bucket, and you'll have to rehash, which requires allocation. I'm not sure this is any different from Go in that respect.

I suspect that the inherently probabilistic nature of a hashtable makes it impossible to guarantee capacity without allocating way more space than would be required in typical cases. If there is a clever way to avoid this it would certainly be interesting to read about it.

Edit: Also, Go's hashtable implementation does not use linked lists (though it is a chaining implementation).


Your initial observation is correct: Rust's HashMap is these days a HashBrown, which is an implementation of Google's Swiss Tables.

But the use of the "Bucket" nomenclature in that file has probably misled you; the buckets it is talking about hold exactly one item each, and they're just stored as one huge growable array (like a Vec). Suppose we do as I mentioned before:

   let mut doodads: HashMap<Doodad, ()> = HashMap::with_capacity(623);
The calculation will do 623 * (8/7) which is 712, and then it'll round up to the next power of two [ultimately because shifts are cheaper elsewhere in the arithmetic] which is 1024. So it allocates 1024 buckets sized to store one Doodad each.

The Swiss Tables algorithm would work up until 1023 Doodads are in the table (one must be empty, but it doesn't matter which one) however performance trade offs mean you should resize before that, HashBrown does so after 1024 * (7/8) = 896 items, which you'll observe is indeed "at least" 623 items.

> In the worst case all the items you insert will go in one bucket, and you'll have to rehash

In the worst case - which is incredibly unlikely with a good hash, and you'll notice Rust supplies a good hash by default - we'll insert 623 items which have the same hash, and so they'll prefer to ideally be in the same bucket, but they're instead taking up 623 contiguous buckets and all our searches are now linear, so our performance is poor with average probe length 312. But we don't in fact grow; even if you could argue we should grow, this is so unlikely that it's not catered for, and you won't find any code to detect this scenario, let alone do anything about it.

> If there is a clever way to avoid this it would certainly be interesting to read about it.

All the modern Open Addressed hash tables avoid this, by just not caring about the incredibly unlikely "But what if all my data hashes to the same value?" case. This means if you did hit that case their performance is abysmal, but you won't so who cares. They just degrade gracefully as we get progressively unluckier, whereas Go's approach is forced to allocate as soon as we get unlucky enough.

It might seem like Go's approach is great if we've got plenty of memory and don't mind allocations, since at least we don't degrade performance when we're unlucky. Unfortunately that non-degraded performance isn't very good. Go would choose 128 buckets with space for 8 Doodads in each bucket, but this means most of our buckets have 4 or 5 Doodads in them, and we always try them in order, so look-up actually probes considerably more Doodads than with Open Addressing normally. The Swiss Tables have to get pretty unlucky before they are that bad, and whenever they're not unlucky they're doing much better...


> In the worst case - which is incredibly unlikely with a good hash, and you'll notice Rust supplies a good hash by default - we'll insert 623 items which have the same hash, and so they'll prefer to ideally be in the same bucket, but they're instead taking up 623 contiguous buckets and all our searches are now linear, so our performance is poor with average probe length 312.

Yeah, this is the nub of the issue. You’re avoiding allocation by giving up amortized O(k) lookup. So there’s a trade off here. Roughly, you have a choice between getting slow in unlikely scenarios or allocating in unlikely scenarios. In typical Go code, neither option is obviously better or worse in general. In the context of Rust, it makes sense to avoid allocations.

Preallocating hashtables of fixed size seems like a rare use case to me, though I don’t doubt that it might come up from time to time.


> You’re avoiding allocation by giving up amortized O(k) lookup

What's "k" here ?

Here's an experiment, I know how to write it in Rust, but not in Go, so let's make it a thought experiment. We have 623 custom Doodads which we've arranged (this is the part I know how to do in Rust but not Go) to have identical hashes despite being different. This clearly could happen (though it's fantastically unlikely) for real data but we're doing it artificially.

With Swiss Tables, and several other techniques that come to mind, we end up with 623 contiguous entries, and a 312 average probe length which means our performance sucks, but we didn't allocate beyond the original 1024 single item "buckets".

What happens with Go's approach? I think it ends up with all buckets empty except one; that one is full and has 77 overflow buckets connected. In this case we've also got a 312 average probe length and the performance sucks even worse. We have to allocate those 77 overflow buckets, and if we make the mistake of trying to re-size we allocate for that and improve nothing at all.

ETA: Oh, and pre-allocating makes sense when we know up front how many things will go in the hash table, which happens really often when we're building it once, using it, and then throwing it away. It will usually make sense to pre-allocate even if the table will later grow, because we can skip all the little sizes that we'd otherwise have to grow past.


I think you’re looking at the truly pathological case where the entire value returned by the hash function is identical for all keys. I’m thinking of the more realistic case where it’s identical modulo the number of buckets (for a small number of buckets). In that case, rehashing is likely to help (since if you double the number of buckets, you can use one extra bit of the hash value). In this scenario I think Go will rehash whereas Rust won’t. At the very least, Go in principle has the option of rehashing in this scenario, as it makes no guarantee regarding allocations.

Preallocating makes sense, yes, but you can successfully preallocate in Go most of the time. Scenarios where you need guaranteed preallocation strike me as rare. At least, it’s not something I’ve ever found myself worrying about.


If it's happening because we genuinely got unlucky, the strategy you describe is no more likely to help than anything else. If it's happening because your hash function is rubbish, your strategy works, but using a hash function which isn't rubbish also works for the good hash table designs and unlocks far better performance in the process.


If it also causes re-allocations when the capacity is set too low, à la C++, the speedup will be more than significant.


It necessarily does. That is also the case where it might cause a rehash; however, that's less certain, as many hashmaps will trade memory for storing the hashes to avoid recomputing them over and over. I’d assume they can use these hashes directly when reallocating, though they still need to perform collision resolution.

Either way, properly sizing the hashmap is a good idea. Though the load factor needs to be taken into account, and I’m not sure whether the average hashmap scales the requested capacity to account for the maximum load factor, in order to ensure there’s actually space for the number of items intended. I’d hope so, but…



> TL;DR: If you don’t need to preserve order of the elements, sort the slice first and dedupe.

And if you need to preserve order of elements, there is always sort.Stable!


I don't see how a stable sort would help preserve input order when deduplicating uint64. Stable sorting preserves the order of "identical" elements. It's only useful if the elements have other distinguishable properties which are not taken into account for sorting order.

For a list of plain uint64, sort and stable sort would produce the same output.

Even if you attached some other information to distinguish them, for the purpose of deduplicating you want the values with the same key to be in sequence. Stable sort could help you to choose whether you keep the first or last of the duplicate, but the initial order of unique elements would be lost just like with regular sort.
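
To illustrate that last point, a sketch with a hypothetical record type (not from the article), where a stable sort keeps the first occurrence of each key (uses the standard sort package):

    type record struct {
        key  uint64
        data string
    }

    // Stable sort keeps equal keys in their original relative order, so
    // compacting afterwards keeps the first occurrence of each key.
    func dedupeByKey(records []record) []record {
        sort.SliceStable(records, func(i, j int) bool {
            return records[i].key < records[j].key
        })
        out := records[:0]
        for _, r := range records {
            if len(out) == 0 || r.key != out[len(out)-1].key {
                out = append(out, r)
            }
        }
        return out
    }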


Yes, sometimes you want to de-duplicate pieces of information by a part of the piece and preserve order. The article's TL;DR sounded like you can only de-duplicate by first sorting with package sort if you don't have to preserve order. However, as I just wanted to point out, it works just as well if you have to preserve order.

Edit: I see now that the original article meant preserving output order, and I had not clarified that I meant input order.



