Hacker News new | past | comments | ask | show | jobs | submit login
Show HN: HyperLogLog in Zig (github.com/axiomhq)
154 points by seiflotfy on Jan 9, 2023 | hide | past | favorite | 34 comments



One change to make this library more idiomatic would be to make HyperLogLog avoid forcing heap allocation on the user.

So from this (simplified code):

    pub fn HyperLogLog(comptime p: u8) type {
        return struct {
            dense: []u6,

            const Self = @This();
            pub fn init(allocator: Allocator) !Self {
                return Self{
                    .dense = try allocator.alloc(u6, 0),
                };
            }
        }
    }
To this:

    pub fn HyperLogLog(comptime p: u8) type {
        return struct {
            dense: [1<<p]u6;

            const Self = @This();
            pub fn init() Self {
                var s = Self{};
                for (s.dense) |*x| x.* = 0;
                return s;
            }
        }
    }
The advantage to this API is that the user can choose where to place the HLL (stack, heap, global memory). Also note how the new init can't fail anymore.

That said, in the code that I've conveniently omitted there's one big complication: when the HLL has only few items in it, it doesn't use the dense representation and instead uses a `std.AutoHashMap` to keep counts.

Unfortunately `std.AutoHashMap` is not a type that can be reasonably disentangled from allocation.

I was about to drop my idea of commenting when I realized that a very neat solution would be to also provide the user with the choice (and admittedly extra responsibility) of when to switch from a hash map to a proper HLL.

    pub fn init(start: std.AutoHashMap) Self {
        var s = Self{};
        for (s.dense) |*s| s.* = 0;
        self.addFromHashMap(start);
        return s;
    }
There you go, allocation-free HyperLogLog :^)

I've also come up with my own funky interface when I implemented Cuckoo Filters in Zig, for those who are interested:

https://github.com/kristoff-it/zig-cuckoofilter


Or alternatively one could make good use of the existing buffer and use that to encode the sparse representation, like Antirez did in Redis:

https://github.com/redis/redis/blob/unstable/src/hyperloglog...


Why is Zigs syntax so … foreign to me? What’s the hype behind it?


You're looking at a piece of code that's doing a type of metaprogramming mostly unique to Zig. Additionally I'm using a for loop capture syntax that is not common in other languages (closest thing I can think of are Ruby blocks).

You're just looking at a piece of code that is understandable for a newcomer to not find familiar.


Take a look at Rust and you'll see where a lot of it is coming from.


I used to hate Rusts syntax, but a YT channel in the name of ”No boilerplate” made me see the beauty in it.


Actually i found zig easier to read than rust… something that i found more appealing!


This is awesome... question though: ```pub fn HyperLogLog(comptime p: u8) type { return struct { dense: [1<<p]u6;

            const Self = @This();
            pub fn init() Self {
                var s = Self{};
                for (s.dense) |*x| x.* = 0;
                return s;
            }
        }
    }```
doesn't this allocate 1<<p upfront though. If yes then the HLL of size 16384 bytes upfront which kinda beats the purpose of having a sparse representation no?


> doesn't this allocate 1<<p upfront though

Yes, it does. The idea (see the last code snipped in that post), is that the user delays the creation of the HLL until they are ready to switch to a dense representation. Before then, they just use a std.AutoHashMap directly.

Or, alternatively, the HLL could use the same buffer for both dense and sparse representation (see the Redis code).


Awesome! Does Axiom use Zig? Most of your projects seem to be in Go. Why did you choose to do this in Zig?


I'm also very curious to know this. HyperLogLog is written in Go:

https://github.com/axiomhq/hyperloglog

I would expect V to be a more natural choice for a port than Zig.


Author here, will publish my vlang take on it in the next couple of days.


Looking forward to it :).

It seems the V port should be pretty straightforward (the repo is 880 LOC and doesn't have any dependencies).


As a notice to others reading this user's comments, they are a recently made account that seems to exist mostly to post about V and post negatively about Zig. Extremely suspect given the history of poor behavior from the V project.


You made this account just to insinuate I'm a shill? Not cool.

I work professionally with Go and I follow news about V, yes. Also I'm interested in discussions involving Zig and V because even though there is an overlap of the design goals of the languages (e.g. zero-cost interoperability with C, high performance), there aren't many comparisons between them online.

My advice is listen to the criticism of Zig to improve the language instead of making throwaway accounts to accuse people.


I agree that the criticism is unfair because you've posted about other things as well. However, it's true that you've posted in the programming language flamewar style in the past; we're trying to avoid that here, so please don't do that in the future.

https://news.ycombinator.com/newsguidelines.html


Thank you, Dan.

I meant to ask my question in the spirit of HN (intellectual curiosity). I have found this community insightful when it comes to programming languages. Examples:

Vale

https://news.ycombinator.com/item?id=31786487

Lobster

https://news.ycombinator.com/item?id=19567160

I think this question will also be raised eventually since there have been other examples of Zig ports of Go projects:

https://news.ycombinator.com/item?id=31993429#31994394

I'm looking forward to OP's update. I believe it will be objective and insightful (surprisingly Google returns 0 results of "vlang vs. zig").


[flagged]


[flagged]


[flagged]


Please do not do programming language flamewar on HN. Even if someone else's comments are provocative, or you feel they are, reacting this way only makes things worse, and is just what we're trying to avoid here.

https://news.ycombinator.com/newsguidelines.html


> Autofree does not work and can lead to double free and use after frees currently

Unless you can give an example, I'm going to have to say you're still spreading lies.

The issues with autofree were fixed in June 2022. From V's homepage (https://vlang.io/):

> (June 9, 2022) As of today, programs built with the V compiler no longer leak memory by default.

Your previous comment was from July 2022:

https://news.ycombinator.com/item?id=31995339

I don't think HN is the place for football tribalism. We're discussing tools. Please don't get so emotionally attached to a tool.

I mentioned a language with a Go-like syntax in a discussion about a port of a Go project. I haven't been critical of Zig anywhere in the thread.


You did exactly what I asked you to stop doing, just after I asked you! This is a way to get banned here. Please don't do this in the future. It would be best to just avoid this topic and focus on other things you find interesting. I'm not going to ban you right now, because from your more recent posts it looks like you've started to do that instead, which is good.


It's true that users here should avoid programming language flamewar drama (or any flamewar drama), but you shouldn't be attacking another user like this. Between that and the name of the account, I think we have to ban this one.

https://news.ycombinator.com/newsguidelines.html


Looks cool. What are some applications of a cardinality estimator?


Memory bound alternative for the COUNT(DISTINCT ...) aggregate function in SQL.

ClickHouse[1] has a few functions for that, with different tradeoffs in memory usage, precision, and performance:

    - uniqExact - the same as COUNT(DISTINCT ...) - the exact version, using a hash table;
    - uniqCombined - a combination of a linear array of small size in memory arena, a hash table, and a HyperLogLog - this data structure is similar to HyperLogLog++; the log2 of the number of cells is controllable by the user;
    - uniq - a cardinality estimator based on adaptive sampling by hash, lower quality to memory usage compared to HyperLogLog, but beats it in CPU efficiency;
    - uniqTheta - uses the Theta sketch algorithm, whatever it means;
    - uniqUpTo(N) - my favorite, uses a linear array to give the precise answer up to N, and always answers N + 1 when the number of distinct elements is larger than N;
    - groupBitmap - an exact calculation using Roaring Bitmaps, makes sense for dense integer sets;
[1] https://github.com/ClickHouse/ClickHouse/

What is often forgotten in designing a data structure for a cardinality estimator - is that it should work well not only for a few large states but also for a large number of small sets.

For example, in a query like follows:

    SELECT URL, COUNT(DISTINCT UserID) FROM pageviews GROUP BY URL
You should expect that there are a large number of URLs, but most of them will get just 1..2 unique UserIDs. Obviously, it's not practical to allocate even a 4 KB HyperLogLog for every resulting record. That's how complex combined data structures are justified. They should start with ~16..64 bytes of automatic memory in arena or stack, and only then upgrade to a HyperLogLog.


Various aspects of query optimization like join planning and predicate selectivity estimation, e.g. as discussed in https://arxiv.org/abs/2010.00728


Rate-limiting is one.


How does one use Hyperloglog for ratelimiting?


Do you have an example?


For LSM trees to estimate effectiveness of a compaction


hashtable sizing for aggregation


Is it possible to use a single fn to replace the beta4..beta18 fns[0], so that the polynomial expressions are generated at comptime instead of being hand-written?

[0] https://github.com/axiomhq/zig-hyperloglog/blob/main/src/bet...



Nice!


What is the point of Zig? I mean, this looks approximately like code I would write in about a dozen other languages. So, what is this?


It's C level speeds.




Guidelines | FAQ | Lists | API | Security | Legal | Apply to YC | Contact

Search: