A neat XOR trick (mattkeeter.com)
363 points by jmillikin on Dec 13, 2022 | 171 comments



Something about seeing fun bit tricks solving higher level problems is exciting. As someone who writes higher level software in general using golang/rust/python to solve "boring business" problems, seeing someone using XORs to accomplish something feels like leaving my house and going camping for the weekend.


One fun repository of logic operations like this is https://graphics.stanford.edu/~seander/bithacks.html


This is vaguely reminiscent of the xor swap procedure, although it seems more practical.

https://en.wikipedia.org/wiki/XOR_swap_algorithm
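
For reference, a minimal sketch of the swap in Rust (my illustration; as noted further down the thread, it falls apart if both operands are the same memory location):

    fn main() {
        let mut a: u32 = 0b1010;
        let mut b: u32 = 0b0110;
        a ^= b; // a now holds a ^ b
        b ^= a; // b now holds the original a
        a ^= b; // a now holds the original b
        assert_eq!((a, b), (0b0110, 0b1010));
        println!("swapped: a = {a:#06b}, b = {b:#06b}");
    }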


Right? Things like fast inverse square root feel like total black magic.


Very curious what you're up to on camping trips that involves black magic.


The moral is that POPCNT is critical to zillions of massive optimizations, and omitting it from base instruction sets, as has been done over and over and then later corrected, each time at great expense, is extremely foolish. The latest offender was RISC-V, but the overwhelming majority of x86 code is still compiled to a target version lacking it.


Well, actually, in this case, no, POPCNT is unnecessary, since you don't actually have to count anything. You only need to know whether the bitmask changed when you added a new character ;)

But otherwise, yes.

I spent quite a while optimizing GCC's bitmap operations years ago, including implementing some new sparse bitmap types, and the sparse bitmap types I implemented ended up dependent on the speed of popcnt and friends, so yes.


In practice, if you want the popcount of anything larger than a couple of registers, you'll want to use vector instructions anyway. There are a lot of operations that will speed up certain applications if made into their own instruction; not all of them need to be.


It's not sufficient that the newest character in the window is unique? The older characters need to be unique too.


I posted how to do it in another comment on the thread.


If you're using OR (unlike the OP), how are you removing the left side of the window? And still, even if the newest character is unique, previous characters might not be.


Input: "abccdef"

Your algorithm will report a match at "bccd" since the "a" wasn't removed from the bitmask.


Yes, you are correct, you would have to xor them back out.


I swear I have seen this conversation before. A glitch?


If you XOR them together, you get back to the original conversation.


No - he responded on both parts of the thread where I commented. I started to respond here as well, then updated it to redirect to the other comment so we don't end up with two long threads.


It is good to remember that the POPCNT instruction was first introduced into a computer's instruction set by Alan Turing, as early as 1949, in the Ferranti Mark I (as "sideways add").

Unfortunately, few other computers included it before the Cray 1, which made it well known under its current name.


We had it on the CDC 6600 in 1963 as 'Bi NXj' if I remember rightly.


Sideways add makes a bit of sense for the name of this operation. But I can't figure out why Cray named it "population count." It just seems odd to refer to the set bits as part of a population that you're counting.


Legend says that popcnt was the "NSA instruction": it was heavily used for cryptographic analysis and was kept out of common instruction sets for a long time to give the NSA an advantage.

It is probably just a legend though.


The usefulness of popcnt, et al, in cryptography was known at least as far back as Alan Turing. It wasn't 'kept out of' ISAs, if for no other reason than not all computer manufacturers were (are) American, so the NSA wouldn't have had much leverage to keep, say, Ferranti or Hitachi from including it in their computers.

The legend you're probably misremembering is the one where the NSA approached Seymour Cray at CDC while he was designing the 6600 super and 'suggested' that if he included a popcnt instruction in the ISA, the NSA would certainly look favorably on purchasing some. He did and they did (quite a few). This story is also possibly apocryphal.


> The usefulness of popcnt, et al, in cryptography was known at least as far back as Alan Turing

Ok. Why??????????????

@gpderetta is correct, at least in quoting Hacker's Delight, where it was also said to be rumoured the NSA wanted popcount but it was unclear to HD's author why they wanted it.


IIRC the Colossus machine built during WW2 was basically counting the number of bits resulting from doing various boolean operations on the input. It was used to crack the Lorenz cipher, which XORed the plaintext with a pseudo-random keystream (generated using mechanical rotors!).

Cryptography has advanced since then - and I'm not an expert - but there may still be statistical weaknesses that could be found by counting bits?


> Why??????????????

It accelerates the calculation of the Hamming weight of a vector or string. Hamming weight is useful in lots of places in crypto, like helping with frequency analysis, or observed power consumption attacks against crypto systems. It's useful in many other disciplines as well.


I think it is generally taken as useful in, specifically, cryptanalysis, presumably for estimating Hamming distance.


Oh right! Hacker's Delight might be where I read the story, and likely I misremember the details.


I never think of POPCNT because it's not a language primitive in any languages I use.

I would do it by storing the product of the counts; divide by the outgoing count before decrementing, multiply by the incoming count after incrementing. If the product is one they're all unique. Scaling should be comparable to POPCNT.
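
A rough sketch of that product idea, as I read it (my own illustration; note the running product works out to the product of factorials of the per-character counts, but it still equals 1 exactly when no character appears more than once):

    fn find_unique_window(s: &[u8], w: usize) -> Option<usize> {
        let mut counts = [0u64; 26];
        let mut product: u64 = 1; // product of the (factorials of the) counts
        for i in 0..s.len() {
            if i >= w {
                let out = (s[i - w] - b'a') as usize;
                product /= counts[out]; // divide by the outgoing count before decrementing
                counts[out] -= 1;
            }
            let inc = (s[i] - b'a') as usize;
            counts[inc] += 1;
            product *= counts[inc]; // multiply by the incoming count after incrementing
            if i + 1 >= w && product == 1 {
                return Some(i + 1);
            }
        }
        None
    }

    fn main() {
        // Example string from elsewhere in the thread; prints Some(10).
        println!("{:?}", find_unique_window(b"nznrnfrfntjfmvfwmzdfjlvtqnbhcprsg", 4));
    }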


>RISC-V

Doesn't B have a POPCNT?

B is mandatory in RVA22.


These are the cpop and cpopw instructions.

https://github.com/riscv/riscv-bitmanip/blob/main/bitmanip/i...


Can you buy any RISC-V embodiment that implements such an instruction? It seems to be extremely optional.


B is pretty recent, but the SiFive P650 will have it

https://www.sifive.com/cores/performance-p650

Here's another:

https://www.andestech.com/en/2022/12/08/andes-technology-unv...

I expect all future Linux cores to have it


>I expect all future Linux cores to have it

All RVA22 profile cores will, as RVA22 requires B extension.

>Sifive p650 will have it

Current versions of U74 also have it. And a U74 with B is already shipping in VisionFive 2.


It is part of B extension, so it should be available in the VisionFive2 (now being shipped), which has the newer version of SiFive U74 that has B.


Am I the only one bothered that he didn't first optimize the HashSet solution to O(N)? When sliding the window, you increase the counter for the new element, decrease the counter for the leaving element, and keep track of how many elements have their counter set to 1.

That weakens the case for using popcount a bit, since popcount is actually O(W/wordsize), which happens to be O(1) in this case.


You don't even have to do this.

If HashSet.insert ever returns false, the character was already in the set, and the window is not unique, so you can early exit and move on. If you get through the window without insert returning false, it is unique. You don't need to do anything else (like check the length of the hash set at the end)


I think he meant to use a hash map plus a numOfOnes counter instead of a hash set. With a hash map, you don't need to build the entire window every loop. It's only a constant amount of work per iteration, achieving O(N).
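
A rough sketch of the hash-map-plus-counter version (my illustration, not the commenter's code; `ones` tracks how many characters currently appear exactly once in the window, and the return convention matches TFA's `i + 1`):

    use std::collections::HashMap;

    fn find_unique_window(s: &[char], w: usize) -> Option<usize> {
        let mut counts: HashMap<char, usize> = HashMap::new();
        let mut ones = 0usize;
        for i in 0..s.len() {
            // Character entering the window
            let e = counts.entry(s[i]).or_insert(0);
            *e += 1;
            match *e {
                1 => ones += 1, // newly unique
                2 => ones -= 1, // no longer unique
                _ => {}
            }
            // Character leaving the window
            if i >= w {
                let e = counts.get_mut(&s[i - w]).unwrap();
                *e -= 1;
                match *e {
                    1 => ones += 1, // back to unique
                    0 => ones -= 1, // gone from the window
                    _ => {}
                }
            }
            if ones == w {
                return Some(i + 1);
            }
        }
        None
    }

    fn main() {
        let s: Vec<char> = "nznrnfrfntjfmvfwmzdfjlvtqnbhcprsg".chars().collect();
        println!("{:?}", find_unique_window(&s, 4)); // Some(10)
    }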


Yes. You can slide a hash map and a counter, as we went through below. You can also replace the hash map with a bitmask to store the set.

You actually don't need anything extra at all if you want, as you can just keep a single count: you don't need to look back at old windows, and can keep moving forward to the last non-unique point, making it O(N).

I'm sure this being HN, someone will come along and post how to do this.

I already got caught out once by trying to code it too fast so not gonna do it ;)


I think you can use a queue of size w.

Something like this... this is pseudocode. You iterate through the string once. You pop an element off of the queue at most once. In this implementation you might add it to the queue twice.

def find_uniq_window(long_string, w):

    queue = Queue()
    hash_set = HashSet()

    for char in long_string:
        if queue.size() == w:
           old_char = queue.pop()
           hash_set.remove(old_char)
        queue.push(char)
    
        if char not in hash_set and queue.size() == w:
            # success
            return queue
        if char in hash_set:
            #empty the queue and hash set and start over
            queue = Queue()
            hash_set = HashSet()   
            queue.push(char)
            hash_set.add(char)
  
    # We never found the window of unique characters, so return false
    return False


Instead of a queue, you could just use two indices/pointers to the start and end of the current window.

Also I think on `char in hash_set` you'd want to advance the start of the window until uniqueness is restored, rather than moving start to the end.

E.g. if the string is "abac", when you get to the second "a", instead of restarting the window at the second "a" you want to instead move the start to "b". You might even be able to do something boyer-moore-esque if the hashset is instead a hashmap of chars to index. Then you can directly advance your window by how far in the mismatch is, which probably won't affect asymptotic complexity since you still need to remove elements but allows you to do it in bulk rather than one at a time.


Yes, you are right about advancing the start of the window until uniqueness is restored. So, basically just advance it by 1.

You could definitely use two indexes instead of a queue, and using indexes over an array will be faster than using a queue.


Thank you. I was wondering the same thing and I think the article is overfocusing on "cool bit tricks" while this simple approach works just fine.


Without the cool bit trick, would it be worth publishing the article at all? I found it a novel approach to explore.


You are right. The algorithm you described is very similar to this XOR one. However, I did some benchmarks[0] and the non-XOR version was 40% slower than the XOR version when compiled with target-cpu=native. Unfortunately, without `target-cpu=native` the XOR version was 40% slower and plain old arrays won!

[0]: https://github.com/mpawelski/adventofcode2022_day06/blob/mas...


It's pretty well-known but my favourite XOR trick is XOR linked lists. [1]

It's possible to store a doubly-linked list using only a single pointer per node. For entry B, &B = &A ^ &C. You can walk the list in either direction as long as you have pointers to two neighbouring nodes in the list. Half as much memory overhead compared to an implementation where each node stores two pointers.

[1] https://en.wikipedia.org/wiki/XOR_linked_list
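
A minimal, deliberately unsafe Rust illustration of the traversal (my sketch, with the links wired up by hand; obviously not how you'd build a real list):

    struct Node {
        value: u32,
        link: usize, // addr(prev) ^ addr(next), with 0 standing in for null
    }

    fn main() {
        let mut a = Box::new(Node { value: 1, link: 0 });
        let mut b = Box::new(Node { value: 2, link: 0 });
        let mut c = Box::new(Node { value: 3, link: 0 });
        let (pa, pb, pc) = (
            &mut *a as *mut Node as usize,
            &mut *b as *mut Node as usize,
            &mut *c as *mut Node as usize,
        );
        a.link = 0 ^ pb;  // prev = null, next = b
        b.link = pa ^ pc; // prev = a,    next = c
        c.link = pb ^ 0;  // prev = b,    next = null

        // Walk forward: keeping the previous node's address lets us recover the next one.
        let (mut prev, mut curr) = (0usize, pa);
        while curr != 0 {
            let node = unsafe { &*(curr as *const Node) };
            print!("{} ", node.value);
            let next = node.link ^ prev;
            prev = curr;
            curr = next;
        }
        println!(); // prints "1 2 3"
    }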


This is basically the prefix sum technique in disguise: the sum over an interval is the difference between the prefix sums to the endpoints. The summation can be substituted with any group operation, in this case modulo-2 addition of 26-long vectors.

Many leetcode-type problems are amenable to O(n) speedups using this technique.
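
A tiny illustration with XOR as the group operation (my example): the XOR over any slice s[l..r] is prefix[r] ^ prefix[l], so one linear pass answers arbitrary interval queries in O(1).

    fn main() {
        let s = [3u32, 5, 7, 11, 13];
        let mut prefix = vec![0u32; s.len() + 1];
        for i in 0..s.len() {
            prefix[i + 1] = prefix[i] ^ s[i]; // prefix[k] = xor of s[0..k]
        }
        // xor over the interval s[1..4]
        assert_eq!(prefix[4] ^ prefix[1], 5 ^ 7 ^ 11);
        println!("ok");
    }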


This trick is called Zobrist hashing in chess/go programming. It allows for incremental calculation of the board hash, which is particularly useful because all of the game variations you're exploring are nearly identical to one another, saving a lot of compute.


It's reminiscent of Zobrist hashing in that it's an incremental hash calculated with xor. But Zobrist hashing has the crucial difference of using a random k-bit number to be xor'ed into the hash (k is a constant, typically 64 these days) for each member of the set that is present, instead of an N-bit string where N is the number of possible elements in the set.

For example, in a chess engine: if the hash value for "white knight on f3 square" is 0x8BADF00D and the hash value for "white king on e4 square" is 0x1BADB002, then a chess board containing only "white knight on f3 and white king on e4" would be hashed as (0x8BADF00D xor 0x1BADB002) = 0x9000400F.

The upside of this trick is that if your set can contain N different objects (e.g. 768 combinations of 12 different chess pieces on each of 64 squares), you don't need to use an N-bit hash value, which would be pretty big. In the particular problem described in this blog, this isn't an advantage as it's using a set containing only up to 26 letters, which neatly fits in a u32 even if you dedicate a bit to each letter.

However there are downsides to Zobrist hashing - it's very hard to guarantee that you don't get hash collisions in this manner, as AFAIK you'd have to try all combinations of valid sets, which is prohibitively expensive. So every algorithm relying on this hash value either has to be robust to hash collisions, or it accepts a small probability of failure.

Most importantly in this case, Zobrist hashing doesn't let you test whether a particular element is present in the set, nor does it let you count the number of elements in a set. So it wouldn't work as a solution to the problem in this blog, which requires counting how many unique letters are present in the hash value.
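
A toy sketch of the incremental update (my illustration; the piece indices, square numbering, and the way the key table is filled are arbitrary choices here):

    fn main() {
        // Fixed pseudo-random 64-bit key for each (piece kind, square) pair.
        let mut keys = [[0u64; 64]; 12];
        let mut state = 0x9E3779B97F4A7C15u64;
        for piece in 0..12 {
            for sq in 0..64 {
                // xorshift-style mixing, just to fill the table reproducibly
                state ^= state << 13;
                state ^= state >> 7;
                state ^= state << 17;
                keys[piece][sq] = state;
            }
        }

        let (white_knight, white_king) = (1usize, 5usize); // arbitrary piece indices
        // Board hash = xor of the keys of all pieces on the board.
        let mut hash = keys[white_knight][21] ^ keys[white_king][28]; // knight on f3, king on e4

        // Moving the knight from f3 to g5 is just two xors: out with the old, in with the new.
        hash ^= keys[white_knight][21];
        hash ^= keys[white_knight][38];
        println!("{hash:#018x}");
    }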


Zobrist hashing is different. With Zobrist you have a table of random numbers. You look each key up in the table and xor the values.


Of all the combinational logic functions, XOR is the coolest. Even its name is cool! It sounds like an evil super hero, or an ancient god.

ChatGPT did offer a useful suggestion I'd never heard of when I asked it to describe The Mighty XOR:

>As I mentioned in my previous responses, XOR is a logical operation and does not have any physical form or abilities, so it cannot be a superhero. Therefore, it is not possible for a character named "The Mighty XOR" to be part of the Marvel universe or any other fictional universe, as they would not exist in reality. If you are interested in characters from the Marvel universe with powers related to digital logic or computing, you might consider the character of The Calculator from DC Comics, who has the ability to use his super-genius intellect to perform complex calculations and hack into computer systems. However, this character is not part of the Marvel universe and does not have the name "The Mighty XOR".

https://en.wikipedia.org/wiki/Calculator_(character)

>Calculator (Noah Kuttler) is a supervillain appearing in American comic books published by DC Comics. Originally introduced as one of many villains in Batman's rogues' gallery, the character was later redeveloped in the 2000s as a master information broker, hacker, and tactical supervisor to other supervillains, and foil to Batman's partner Oracle.

>[...] Calculator suffers from severe obsessive-compulsive disorder, unbeknownst to his peers (even though this was hinted at when he was in charge of monitoring Supergirl), and initially controlled this with medication.


It's also overlooking the fact that there is an actual hero named X-OR, which is the French name of Space Sheriff Gavan, a somewhat popular (at least in France) Japanese show of the prolific "heroes in shiny suits and rubbery monsters from space" genre.

https://fr.m.wikipedia.org/wiki/X-Or


I fed the unoptimized version of the code to ChatGPT and it was able to optimize it similarly to OP's solutions.

I was quite impressed, here is the ChatGPT version:

fn run(s: &[char], window_size: usize) -> usize {
    let mut unique_chars = 0;
    for i in 0..window_size {
        unique_chars |= 1 << (s[i] as u32 - 'a' as u32);
    }
    if unique_chars.count_ones() as usize == window_size {
        return window_size;
    }

    for i in 1..s.len() - window_size {
        let prev = s[i - 1] as u32 - 'a' as u32;
        let next = s[i + window_size - 1] as u32 - 'a' as u32;
        unique_chars ^= 1 << prev;
        unique_chars |= 1 << next;
        if unique_chars.count_ones() as usize == window_size {
            return i + window_size;
        }
    }

    panic!("No unique window found");
}

// NOTE: my prompt was to make the code O(N)


I may be missing something, but isn't the obvious answer just to scan-and-compare 4 chars at a time? (Noting that != is XOR.)

    #include <stdio.h>

    int main(void) {

        char *c = 0, *data = "nznrnfrfntjfmvfwmzdfjlvtqnbhcprsg";

        for (
            c = data; 
            !((c[1] != c[0]) &&
            ((c[2] != c[1]) && (c[2] != c[0])) &&
            ((c[3] != c[2]) && (c[3] != c[1]) && (c[3] != c[0])));
            c++
        );

        printf("%ld", c-data+4);
    }


Sure, but now do 11 at a time with a similar algorithm (which was part 2 of the problem).


I've just submitted in order to see the second part, which just says "now look for 14 unique".

I don't see the problem here. If you want a notation to express it, a macro for something like,

    0 != 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13
    1 != 0, 2, 3, 4, 5, 6, 7, ...
is straightforward, and could be parameterised on the window size.


That's fine, but it's O(n) in the window size whereas the xor trick is O(1).


Part 1 and part 2 are the same, only the window size varies. But within a part, the window size is constant. For most, it should have just meant that, if you hard-coded the window size during part 1, you needed to make it a parameter.

But for both parts, the naïve solution should be O(N * W), where N is the input length, and W is the window size. Like a sibling says, since within a part the window size is fixed, it is acceptable to call it O(N).

Edit: ah, I see what his solution is doing. It is correct. I need to adjust my definition of naïve, I guess. His is O(N * W²). You can still reasonably consider W² constant, though, I think. And he gets to what I would call the "naïve" solution.

(Previously.) ~The "naïve solution" in the OP does not appear to be correct. Or at least, if it does work,~ it appears to do far more computation than it needs to.

The "xor trick" is O(N), too. (There cannot be a more efficient solution, as the entire input must be considered in the worst case. TFA is correct that its final form omits W, though.)


The window size is always constant in the problem, so it's O(1)

There shouldn't be a loop over the window size in solutions to this (particular) problem.


TFA's implementation is using a loop over the window size to compare each character to all the other characters, to see if they're unique. So, the article's O(N * W²) is correct, but I agree, the window size is fixed and small; it's fair to boil it down to just O(N).

(I never even considered doing a for loop like that to determine if the window is unique; I just histogram the window, and then check if all the values in the histogram are 1. If yes, unique. If I had been cleverer, I wouldn't rebuild the histogram as the window shifts, but there was no need for that. This is essentially the next solution TFA presents.)


If you consider W the window size, and N the input size, this is still O(N * W²), as TFA states. (TFA's implementation is generic over W, yours is constant and essentially -funroll'd in place. But it's the same, for big-O; as you mention in your reply later, you could even make a macro, generic over W.)


2nd part is a window of length 14.


FTR: This can be done with "only" two loops (as opposed to the "naive" 3 in the article) -- without any extra data structures. It's sufficient to keep track of how many characters have been unique so far. For each "new" character, check whether it's different from all of them. If so, increase the count. Otherwise, reset the count to the distance to the match.

  def main():
    count = 1
    for i in range(1, len(SIGNAL)):
      for j in range(0, count):
        if SIGNAL[i - j - 1] == SIGNAL[i]:
          count = j
          break

      count = count + 1
  
      if count == 4:
        print(i + 1)
        break
If we consider the word length a constant this should be O(n).


I think you can also speed this up by incrementing i to the next candidate window end (increment by window_length - count). I.e. given "abcdefggqwertyu", if your previous window was "[abcdefg]" and you see that "g" conflicted so you reset count to 0, the next possible window is "[gqwerty]". So you don't have go through the intermediate windows like "[efgqgq]" because you know that count can never be 4 within there.

Edit: I think your solution already effectively does this; it's just that because you maintain the invariant that the window always contains unique chars (to avoid any additional data structure), once the window start gets reset to "[g]" it needs to build the intermediate state back up one char at a time to work its way back up to [gqwerty]. So the best case is approximately linear if count gets reset often; the worst case is O(NW). The method I was thinking of directly tests the next window starting from its endpoint backwards; to do this you need a hash set. I think that would achieve a best case of roughly O(N/W), effectively sublinear, though the worst case is still O(NW).


That was my conclusion too, this is a super easy problem with a fixed window length. The bit hashing is neat but I was too confused by what seemed to be an over-complication of the problem to really appreciate it. Did you and I both miss something here?


I don't think so. As much as I love using bit operations: in this case I'd actually prefer a table of character counts for a "true" O(n) solution, as bit counting isn't guaranteed to be a "native" operation.


The article started with 3 loops, took it down to 2 then one. Why are you talking about 2 loops?


The article took it down to 2 by adding an extra hashtable. It's possible to go down to 2 without using an extra data structure.


One issue with this technique is when the set of characters grow above 64, so you can no longer give each one a unique bit.

This could happen if we are instead interested in a window of unique words in a long document.

Amazingly there is still a solution based on XOR!

It is a variant of Bloom Filters, which I know everyone loves, but it uses XOR instead of OR to combine things.

Basically each word is hashed to three locations and you flip those bits. Then you have to do some probabilistic analysis.

https://arxiv.org/abs/2211.03683

Edit: Actually the technique in this paper would take linear time to determine uniqueness. I wonder if there's a way to do it in constant time...

I guess you could just implement linear probing or cuckoo hashing in bits. But then update time would be amortized constant only.


>One issue with this technique is when the set of characters grow above 64, so you can no longer give each one a unique bit.

You can xor arrays of integers.


Right, but if the universe of possible characters grows much larger than the window size (as it does in many applications), we want a method that only pays in terms of window size, not universe size.


So you use a sparse bitmap, i.e. a hash set. The size of the hash set is bounded by the size of the window, so this should be pretty efficient.

If you have a huge string, a huge alphabet, and also a huge window, then probabilistic techniques start to make sense. But let's not run while we can successfully walk!


Sure. We can always use a hash set, but then you need to use chaining or some other collision resolution policy. The whole point of this exercise is to avoid that and get something really fast and truly constant time :-)


In that case, you might still want to use a similar trick as a first pass (like a bloom filter), and go over the window that is potentially unique again with the HashSet. This should still be faster than using the HashSet for all windows.

https://en.wikipedia.org/wiki/Bloom_filter


This article’s technique can definitely expand into dynamically allocated bitsets.


In that case the complexity becomes O(N * A) where A is the size of the alphabet, since at every step you have to go through the whole bitset in order to count how many bits are set.

Edit - actually it should be possible to update the count incrementally, so it should still be O(N) I think.


Say we don't want to allocate memory dynamically though. We just want to allocate O(W) bits at the start of the program. Can you still do it?


This seems a bit overcomplicated to me.

Using a bitmap to track letters you've seen is a great idea. But the parity and counting gives me a headache. How about you start with a zero-size window at the start of the string, then iteratively try to grow it by moving the end forward until it's long enough, and if the character added would be a duplicate, moving the start forward until it's no longer a duplicate? Sort of inchworm-style movement.

I am not smart enough to write Rust, so here it is in Java:

    private static int run(String input, int windowSize) {
        int windowStart = 0; // start at the start
        int windowEnd = 0; // the window is initially zero-size
        int lettersSeen = 0; // a bitmap of letters we have seen

        while (windowEnd - windowStart < windowSize) {
            if (windowEnd >= input.length()) throw new IllegalArgumentException("No unique window found");
            int letterToAdd = 1 << input.charAt(windowEnd) - 'a';

            while ((lettersSeen & letterToAdd) != 0) {
                // duplicate, slide the start until we drop the original
                int letterToRemove = 1 << input.charAt(windowStart) - 'a';
                assert (lettersSeen & letterToRemove) == letterToRemove; // sanity check that we have seen this letter
                lettersSeen ^= letterToRemove;
                ++windowStart;
            }

            // add the new letter
            assert (lettersSeen & letterToAdd) == 0; // sanity check that we have not seen this letter
            lettersSeen ^= letterToAdd;
            ++windowEnd;
        }

        return windowStart;
    }

This is not quite optimal in terms of bit operations - I mask the bitmap in the inner loop condition, but you could just test letterToRemove against letterToAdd and break if they're the same. I think that's a bit less readable though.

Note that although this has two nested loops, they don't both independently range over the whole string, so this is not O(n^2). The outer loop moves windowEnd over the whole string, and the inner one moves windowStart over the whole string bit by bit. Each index visits each position in the string at most once.


This is clever: you let the window size vary such that you never allow two copies of the same character in your window, and just stop when your window size reaches the target size. Seems like it should work well. But:

> i mask the bitmap in the inner loop condition, but you could just test letterToRemove against letterToAdd and break if they're the same

You still need to check ((lettersSeen & letterToAdd) != 0) once to know whether you need to enter that loop at all. Worst case (a long string of the same letter) you do this twice per letter in the input anyway.


I asked ChatGPT for a Rust port, https://play.rust-lang.org/?version=stable&mode=debug&editio...

I had to make zero changes, it appears to work.


Although input.chars().nth() is not good - because of how strings work in Rust, that has to iterate over the string, rather than indexing directly, so O(n) rather than O(1). I think the idiomatic efficient translation would be to use a CharIndices iterator to track each end of the window.



You can just use __builtin_popcount or equivalent, which maps to a single instruction on most platforms.


On a related note, almost any task around shuffling or shifting bits can be improved with xor. For example the straightforward way to remove bit i is

    mask = -1 << i;
    return (x & ~mask) | ((x >> 1) & mask);
but with xor we can do it in one less instruction:

    mask = -1 << i;
    return ((x ^ (x >> 1)) & mask) ^ x;
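
A quick check that the two formulas agree (my test harness; the values of x and i are arbitrary):

    fn main() {
        let x: u32 = 0b1101_0110;
        let i = 3;
        let mask = u32::MAX << i; // the "-1 << i" above
        let straightforward = (x & !mask) | ((x >> 1) & mask);
        let with_xor = ((x ^ (x >> 1)) & mask) ^ x;
        assert_eq!(straightforward, with_xor);
        println!("{:#010b}", with_xor); // 0b01101110: bit 3 removed, higher bits shifted down
    }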


I think the claim at the end is supposed to say `O(D)` running time rather than `O(N)` (or I can't see where `N` is defined).

Does this technique catch letters duplicated more than once?

In a Python 3 shell:

  >> 2 ^ 4
  6
  >> 2 ^ 4 ^ 2
  4
  >> 2 ^ 4 ^ 2 ^ 2
  6
Yes I guess it does, because inputs are used up in order to turn bits on and off. Nice!


The insight is that the only way to get W bits set is for the last W letters to all have been unique. Duplicate letters toggle bits on and off, but never get you any extra 1 bits set.

If you wanted to detect V unique characters from a window of W characters where V < W, this trick wouldn't work. But you could still have a rolling window, it would just be a map counting "how many of this character in the window", which you can increment the relevant character for as it enters the window, and decrement as it leaves.


> I think the claim at the end is supposed to say `O(D)` running time rather than `O(N)` (or I can't see where `N` is defined).

Good catch; this should be fixed (using N for input length and W for window size consistently in the writeup).


I had considered this approach, but the nature of AoC means it rewards real-world time to solution rather than any kind of computation time, and a naive solution still runs fast enough that it more than makes up for it in time spent programming.

There are actually whole classes of problems which would be more interesting if the naive solution weren't fast enough; for example, even on day 7 (or was it 8?) naively exploring to every edge from every square would be O(N^3) but still execute just fine.

I'm a couple of days behind but so far this year hasn't yet had lanternfish style problems where the naive solution blows up completely, but hopefully they will come as they are a lot more interesting.


Arguably day 11 part 2 is such a "lanternfish" problem, although it essentially tells you to watch out for it.


I definitely ran into that. I read "impractically large number" and thought "well the language I'm using has arbitrary size integers so I should be fine". So, I changed the code from 20 iterations to 10000 iterations and suddenly all 16 GB of my RAM was filled with Santa's worry counters.


Worse, I ported my solution to use BigInt and then found that was neither necessary nor sufficient.


I love the Sunday quarterbacking of AoC, more than the competition itself (I'm not even doing it this year). And I agree with you that a quick reminder that some problems are "naively intractable" is a good wake-up call once in a while.


If you aren't already familiar with it, you might find projecteuler.net fun!


codeforces.com or one of those is slightly more likely to be fun for programmers.

projecteuler.net is great, but it _very_ quickly becomes deep math.


One small thing:

        // Turn on bits as they enter the window
        set ^= 1 << (s[i] as u32 - 'a' as u32);

        // Turn off bits as they leave the window
        if i >= window_size {
            set ^= 1 << (s[i - window_size] as u32 - 'a' as u32);
        }
"turn on" and "turn off" actually mean "flip", right? Each bit is not a present/not present flag, it's the parity of the count of number of occurrences. Which still works, because of the "counting how many characters appear an odd number of times in a window" thing.


I wonder how long before this question starts popping up in interviews, with an immediate reject if one does not come up with the magic XOR solution in 5 minutes on the whiteboard. EDIT: neat solution nonetheless!!


This was basically part of a public set of Google interview questions around 2009 already ;)


W cannot be larger than the size of the character set, so even the first algorithm runs in O(N) time.

It's a cool trick, but it didn't make the algorithm asymptotically more efficient as the OP suggests.


You could imagine a much larger character set though. For example you might be interested in a sliding window of unique words instead.


Yes, but then the last algorithm runs in the same time as the one before it. Using xor here really only gives you a constant factor improvement (which is still nice!).

Reminds me of the problem “find the only missing number in a scrambled list of 1..n”. You can get an “O(n)” solution using xor, but once bit size becomes a concern, there are less hacky solutions that are just as fast.
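
For reference, the XOR version of that missing-number problem looks like this (illustrative sketch): xor the expected values 1..=n with the values actually present, and everything cancels except the missing one.

    fn missing(nums: &[u32], n: u32) -> u32 {
        let mut acc = 0;
        for i in 1..=n {
            acc ^= i; // xor in every expected value
        }
        for &x in nums {
            acc ^= x; // present values cancel out
        }
        acc
    }

    fn main() {
        assert_eq!(missing(&[5, 3, 1, 2], 5), 4);
        println!("ok");
    }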


"Bit tricks" are worth a factor `log n`. Let me explain: If you have an array of `n` integers in {0,1}, and you want to count the number of 1s, you need `n` time. But if you have an array of `n` bits, with bit level operations you can do `n/W` time, where `W` is your word size. We often have W=32, 64 or 256 on some architectures, but you may still argue it's just a constant.

The thing is that we usually say you have random access to all of your input data in constant time. But if your input size is `n`, your pointer size must be `log n`. So for standard algorithm analysis to work, you need to allow a word size of at least `log n`.


You're using n as a parameter of the machine's specs, which is a big no-no. You can't just come up with a bigger machine for each consecutive value of n. If that were allowed, all algorithms would trivially be O(1).

We usually assume the pointer size to be infinite (e.g. RAM machine), to avoid having to deal with access time. But if we don't, then it must be a constant size (and we have to adjust the big O formulas accordingly). In neither of those cases do you get free arbitrary length xor operations. Just like you don't get e.g. free arbitrary length multiplications either.

One nice property of xor, however, is that you can parallelize the heck out of it. So you can get arbitrary constant factor improvements by adding a bunch of CPUs.


Perhaps we are both right. I usually think of the Word RAM Model in which W is assumed to be at least logn. See https://en.m.wikipedia.org/wiki/Word_RAM#Model

This is a nice model because you don't have to assume things like "infinite pointer sizes", and because the algorithms that work well in practice (such as using bit tricks) also work well in the theory.

If you don't work in the word ram model you also wouldn't be able to do things like hash maps with constant time queries, since that requires a hash function which can't be computed in O(1) bit operations.


We're still at O(N * W) time. It might be possible to implement xor using only +, - and bit shifts (though I'm not seeing any obvious way to do so), but count_ones stays a bottleneck; I'm pretty sure you can't do that in O(1) even on a word RAM machine.

Not to mention, a machine where addition takes O(1) time is a massive cheat - at least definitely not what most people have in mind when talking about computational complexity. AFAIK, O(...) usually means "on a classic RAM machine" unless otherwise stated. With hash maps/sets, it really depends on what is expected to grow. Bigger input size doesn't always mean bigger keys.


> but count_ones stays a bottleneck; I'm pretty sure you can't do that in O(1) even on a word RAM machine

It's actually easy to do logn bit operations in O(1) time on the kind of machine you are talking about too. Just make a table with the result of all (lg n)/2 bit inputs. Such a table only takes sqrt(n) memory and time to create, and since you seem to accept O(1) table lookups you now have count_ones in two lookups.

Similarly it's easy to make small tables that allow you to do arbitrary binary operations on (lg n)/4 bit strings.

> at least definitely not what most people have in mind when talking about computational complexity.

I'm pretty sure if you look in CLRS or any standard textbook of algorithms, they allow lgn bit operations in constant time. Ever heard people saying "Sorting takes O(n logn) time"? They are clearly assuming comparing two numbers can be done in constant time.

Also look at any lecture notes from actual CS researchers, like this: http://www.cs.cmu.edu/~odonnell/toolkit13/Lecture05.pdf

> Doesn’t that imply that we need w ≥ log n?

> Answer: Yes! And that’s a standard assumption! The first time you see this it may seem totally weird: you’re assuming the computer hardware size depends on the input size?! That seems to make no sense. But once you calm down, it’s actually quite logical and cool. Of course you want a single pointer to fit into a word. You should just think of w as a parameter of the model, and w ≥ log n as an assumption.


> Every heard people saying "Sorting takes O(n logn) time"? They are clearly assuming comparing two numbers can be done in constant time.

Where n is the number of items. Not the number of digits! And we don't normally talk about "time" when discussing sorting algorithms in the first place. We talk about number of comparisons (precisely because we have no clue how big the items are!).

> Just make a table with the result of all (lg n)/2 bit inputs

We're interested in the number of ones in a C-bit integer, where C is the size of the character set (and incidentally the maximum acceptable value of the window size W). Simply counting the ones bit by bit costs O(C) time. Making a lookup table for all (C/2)-bit numbers - as you seem to suggest - costs O(2^(C/2)) time and space. And the algorithm we wanted to improve runs in O(N * C) time.

Look, I'm trying to decipher what your point is, but you seem to be confusing the size of the input space with the size of the input. So AFAIC, this is as much attention as this algorithm deserves. It's a nice party trick, but nothing more.


> And we don't normally talk about "time" when discussing sorting algorithms

Sure you might be a purist and talk about comparisons, but I'm pretty sure if you Google it most people would talk about time. And "n logn" is exactly the amount of time it takes to sort n word-length integers in Word RAM.

> We're interested in the number of ones in a C-bit integer

You're not really engaging with the point that you can make a table in sqrt(n) time that allows counting the bits in logn bit strings in constant time. That means the final algorithm runs in O(NC/logN) time.

This is definitely not a party trick, as you would know if you've ever written a program using these methods. Now with AVX instructions it's more important than ever.

You also didn't engage with O'Donnells lecture notes. I can find you many more text books explaining why this is the right assumption to make, in case you don't like the reasons I've given above.


If ASCII characters aren't guaranteed, would this solution still be able to scale? I wonder how the overhead from scanning the input to find the domain to use for the bitset compares to the overhead of allocating a bunch of sets.

Edit: More generally, I feel like these sorts of tricks are getting less relevant over time. With the prevalence of Unicode text, how many clever interview-question optimizations that assume a character is just another way of saying u8 are no longer relevant?


> There's no moral to the story, other than "xor is cool"

The moral is that knowledge is king: no matter how much iron you throw at a problem, a well-designed algorithm will beat it.


…a simple histogram (the second proposed solution in TFA) runs "instantly", for both parts.

Part of the meta-game to AoC is knowing that you can limit your answer to only the requirements in the combination of the question and input given. If the naive solution runs fast enough, and is vastly quicker to implement, that's the one you want.


Cache misses, TLB flushes, and pipeline stalls would beg to differ with you.


Not really, no. For plenty of problems, the most efficient known algorithms will easily outperform a naive algorithm, even if running on drastically inferior hardware. 'Drastically' in this context can mean the slower approach taking billions of times longer.

Wikipedia gives an example. [0] Many courses on algorithms and/or complexity theory emphasise such examples in their introductions.

That said I don't know why jhoechtl thought that was the moral of this blog post, which doesn't illustrate that point at all.

[0] https://en.wikipedia.org/wiki/Computational_complexity#Use_i...


> 'Drastically' in this context can mean the slower approach taking billions of times longer.

Standupmaths - "Someone improved my code by 40,832,277,770%": https://www.youtube.com/watch?v=c33AZBnRHks (OK, 41 billion percent is only 0.41 billion times better, but still.)


Mea culpa.

All I was trying to say is that beyond a point, it's not _just_ the algorithm, you need to pay attention to mechanical sympathy.

Do well-designed algorithms outperform naive algorithms? Almost always. Is designing the algorithm well sufficient? I'd think not.


Note that for enumerable domains of fewer than 64 values, using a long plus bit ops has been a standard Set implementation for quite a while. For example, Java's EnumSet uses a long if the enum has fewer than 64 values:

https://github.com/frohoff/jdk8u-jdk/blob/master/src/share/c...

Where add() uses `|= bitmask`, remove() uses `&= ~bitmask`, and size() uses a count of the 1's in the long.

Adding XOR as an efficient toggle would be interesting, but unnecessary to keep this O(n), if I understand correctly. It's just toggling the value, so (albeit with an extra branch), you could implement it as:

    if (!set.contains(val)) {
      set.add(val);
    } else {
      set.remove(val);
    }


I had a very similar solution for my AoC in Toit, but I used a Deque instead of just having a trailing index into the input string. https://github.com/erikcorry/advent-of-code-2022/blob/main/d...

Then a friend solved it with regular expressions, but I found a sicker regex. After all, that's what regexes are about.

https://mobile.twitter.com/erikcorry/status/1600524753596456...


'e' ^ 'f' ^ 'g' == 0x64 == 'd'. Uh-oh.

Edit: I misread the post; my bad. The post effectively uses a bit vector to store the last N chars in a window, and the bit vector happens to fit in a single machine word. Also, XOR happens to be a good way to update the bit vector, because it turns out it's sufficient to store how many times each character appears in the window mod 2.

So to be clear, my "demonstration" above only works because the ASCII representations of e, f, and g are not linearly independent with respect to xor. However in the article, the representations of all of the characters are chosen to be a linearly independent set.


Ah OK, the article is actually suggesting making a list of Booleans whose length equals the number of possible characters. It just happens that it's assuming 26 allowed characters which fit in the bits of a 64 bit number.

The running time does not depend on the window length, but does depend on the number of possible characters. If it's all of Unicode, for example, you'd be stuffed: you could fit the vector of values in memory (currently there are approx. 150,000 Unicode code points), even if you used a byte per value, but counting the number of true values would require iterating over the whole vector.

Even just going from 26 Latin letters to 256 byte values makes this trick quite messy unless your language has a really nice bit vector type (admittedly many do).

This comment was helpful for me to understand what was going on, even though it's a mistake followed by a correction.


The example code in the article left out the lookup table to convert each character into a single bit representation. All the tricky xor and popcount stuff could have been a 26 byte array just as easily and been O(n).


There's no lookup table. The example code in the article does the conversion using this expression:

    1 << (s[i + j] as u32 - 'a' as u32)


Yes, for some additional explanation for those that might use it, that's "ASCII math": it assumes all lowercase ASCII Latin letters (which most Unicode encodings including UTF-8 and UTF-32 inherit). ASCII was intentionally designed so that the lowercase letters a-z are in English alphabet order and so subtracting 'a' from the letter gives you an alphabet position number from 0-25. (The same applies in ASCII for the upper case range A-Z and the numerical range 0-9, though doing math between ranges is less fun.)

Then you've just got a standard single 1 left shifted 0-25 positions. (So 'a' is 1 and 'z' is 1*2^25.)


You don't need to count ones at all, because you can stop looking at a window as soon as you discover it's not unique. As a result, you only have to detect whether the bitmask changed when adding a new character, not how many one bits there are in it. (This is true in the hash-set version as well - if insert returned false you can move on).

  old bit mask = current bit mask
  current bit mask = current bit mask OR new character
  if old bit mask == current bit mask
    window is not unique, move to the next window
  (otherwise window is so far unique)


OR wouldn't work with a sliding window since it can't be inverted?

This should work:

  init bit mask and count of bits to 0
  for each new char:
    old = bit mask
    bit mask = bit mask XOR old char
    if bitmask > old then count++ else count--
    old = bit mask
    bit mask = bit mask XOR new char
    if bitmask > old then count++ else count--
    if count == window length: return match
The idea is that each XOR will always set or clear exactly one bit, so we can maintain a count of set bits.

But if there is a POPCNT instruction it would probably be faster to use it.


My guess is, it's still faster to just check whether hashset.insert returns false. Since most windows are not unique, early exit will likely beat faster processing.


I doubt the hashmap would beat the XOR method from the article. Hashmaps mean allocations, and they mean hashing. Hashing is going to be at least as much work as XORing a couple of values into the mask. Hashmaps also mean chasing pointers all over memory and ruining your cache.

You can add an early return to the OP's XOR method by just checking if the char is already in the bit mask. This will be faster if the word size is large.


the domain/range is fixed, so you don't have to allocate, actually, because you can have a fixed-size table and a perfect hash :)

That said, I agree you can make the bit munging faster in the end; I'm just saying I don't think the speed-up is anywhere near the improvement from early-exiting.


Even if the hashmap library knew that the only valid keys were 'a' .. 'z', it wouldn't be magically faster. The best it could do is use basically the same code as a hand-rolled implementation.

Bit operations and shifts take a single clock cycle, and the mask can be stored in a register throughout the entire loop.

If "early exit" brings any improvement, I don't see why the best wouldn't be to combine the two solutions instead of choosing one or the other.


> Even if the hashmap library knew that the only valid keys were 'a' .. 'z', it wouldn't be magically faster. The best it could do is use basically the same code as a hand-rolled implementation.

This is known as a perfect hash[1]. Knowing that you will never have collisions does allow for a faster implementation. The hash map can be backed by an array which will never need to be resized, and you don't have to fiddle with linked lists to chain collisions.

You're correct though, that this is something you will have to implement yourself. Library hashmaps are going to trade performance for general usefulness.

[1] https://en.wikipedia.org/wiki/Perfect_hash_function


For small windows it is possible that an early exit might be counterproductive as it exercises the branch predictor.

One would have to test to see where the cutoff is, but often doing extra work is faster.

Edit: but this is discussed elsethread.

A fixed size run also helps with vectorization.


You can combine the two XORs:

   new_mask = old_mask XOR old_char XOR new_char
   count += signum(new_mask - old_mask)
Where signum(x) = (x > 0) - (x < 0)


If the old and new char are different, two bits will have changed state and comparing the old and new mask will have unreliable results.

e.g.

old mask = 011100 new mask = 001111, numerically less even though it has one more bit set


Dangit! I concede…


And how do you remove the character leaving the window on the left? I don’t think this solution works the way you think it does…


Yes, I wrote it too quickly; you would have to xor characters in and out instead. I doubt it is worth it vs popcnt at that point.

It doesn’t change the fact that the problem is incremental, and most importantly you can early-exit windows as soon as you discover a single non-unique character. You don't have to wait till the end of the window. They don’t implement this for the hash set, for example.

Since most (n>1) windows are not unique (well, more accurately the probability of the window being unique decreases as the size of the window approaches the number of possible characters), early exit from windows will likely beat bit munging except for small windows, until you get into SIMD solutions.


Bit operations are fast, branches are slow.


Well, yes, but most of these branches can be if-converted into conditional moves anyway if you want.

So they are data dependencies and not control ones. There are still some control ones.

Beyond that, let me be super clear: Imagine the following six versions:

1. One window at a time processing. No early exit, no bit munging.

2. One window at a time processing. No early exit, bit munging.

3. One window at a time processing. You don't use direct bit munging, but you early exit the window when you hit the non-unique character, and skip the window forward to the first instance of that non-unique character.

IE given "hmma<something>", n=4, you early exit at the second m, and skip processing windows forward to right after the first m (since no window prior to that can be unique, as they will all hit the double m)

4. One window at a time processing, bit munging, same otherwise as #3

5. Sliding window processing, no bit munging

6. Sliding window processing, bit munging.

The speedup between 1-2 vs 3-6 is much greater than between 5 vs 6 or 3 vs 4.

You could probably make 3 pretty darn competitive with 6, and 4 could probably beat 6 with SIMD (IE processing multiple windows in parallel really fast, and maybe wasting work, vs guaranteeing you only do the minimum work to find a unique window)


I don't think you are correct. Certainly not with a window size of 4. For very large window sizes, the early return will be a big win. But you are really under-rating how expensive the hashmap will be vs "bit munging".

You could roll your own perfect hash, but you would end up with a solution that looks almost identical to TFA. If you use a stdlib hash, you are going to be chasing pointers all over memory to do a single insert / lookup. De-referencing a single pointer not in cache costs 50 - 100 clock cycles on a modern system. By the time you do one insert, TFA will have XOR'd at least 32 chars into its bit mask. And your cache won't be as nice as in TFA, slowing you down even more.

Given the two scenarios: 1) bit munging, no early return 2) hash lookup, early return

I would guess scenario 1 wins until the window size gets above ~256. And obviously, we can add an early return to scenario 1 to make it unquestionably the fastest.


Isn't the memory just O(Unique) since the window size is bounded by the max number of unique characters (in this case 26)? In the general case, the hash would need to store one bit for each unique element.

If you use a bit mask, you allocate all the memory up front. In a set you allocate through runtime, but could just use set.add(char - 'a') for a similar memory bound. But both need to be able to store every unique element. They are both O(Unique), it just happens that 26 <= num_bits(u32).


Yes. It's a neat trick, but from an algorithmic point of view it's exactly the same as the obvious solution, replacing a set with a hash-set. I did this with a counter in python:

  from collections import Counter

  def first_diff(s, n):
      cnt = Counter()

      for i, c in enumerate(s, 1):
          cnt[c] += 1
          if i > n:
              top = s[i - n - 1]
              cnt[top] -= 1
              if cnt[top] == 0:
                  del cnt[top]
          if len(cnt) == n:
              return i


The true insight here is not using XOR, it's sliding the window efficiently. Here is pseudocode not using XOR.

    members = [0]*len(alphabet)

    for i in 0...input length:

        members[input[i-window_size]] = 0 if bounds ok
        members[input[i]]             = 1

        if sum(members) == window_size:
            return i
If `members` is a bit vector and `sum()` is a popcount, this code is equivalent.


Isn't the real performance gained by eliminating the triple nested for loops and memoizing over a sliding window, rather than the use of XOR? You would get the same order-of-magnitude improvement if you used a map combined with iterate-only-once approach.

I.e., it's undeniable that the final program works much faster, but the article should really be called "a smarter memoization trick."


I did something similar and counted from the end, to make much bigger jumps!

https://github.com/timvisee/advent-of-code-2022/blob/master/...

Maybe a bit less neat in the binary sense, but definitely very fast.


That's a nice low-level solution to the problem - I have to confess that I just wrote a regular expression for it :)


> I just wrote a regular expression

Running time of O(MG)


A neat trick indeed, using bitsets in a sliding window. Contest programming makes heavy use of some such tricks.


I've seen this use of constants inside big O notation in leetcode and 'informal' discussions. Is it pedantic to say that O(NM) == O(N) if M is a constant (in this case since it's bound by 26)? Or is this the current and expected usage?


It's correct to omit constants from big O, but in the article W is a distinct input variable so the usage is valid. The size of the window W is bound by N, not a constant 26.


But it's impossible for W to be bigger than 26 since you can't have 27 letters in a row that are all different.

Normally for a search algorithm the size of the alphabet is assumed to be constant.


You can also swap two variables without using a temp variable thanks to XOR

https://stackoverflow.com/a/19618810


Unless they are the same memory location!

http://underhanded-c.org/_page_id_16.html

x86 CPUs have a dedicated instruction to swap two registers, or a register with memory.


A note of caution: using that trick on modern hardware and/or languages is likely to be slower than just using the temporary variable. Because compilers and processors can see through the latter better than the XOR.


Can someone explain the 1 << stuff? I’m very new to bit fiddling but I did follow the blog.

“ fn run(s: &[char], window_size: usize) -> usize {
    let mut set = 0u32;
    for i in 0..s.len() {
        // Turn on bits as they enter the window
        set ^= 1 << (s[i] as u32 - 'a' as u32);

        // Turn off bits as they leave the window
        if i >= window_size {
            set ^= 1 << (s[i - window_size] as u32 - 'a' as u32);
        }

        // Check the current window and see if we're done
        if set.count_ones() as usize == window_size {
            return i + 1;
        }
    }
    panic!("No unique window found");
} “


The "stuff" assigns an integer value to the char. 'b' - 'a' = 1, for example. This assigns each char a unique integer value. By shifting 1 up that many times, you assign a unique bit position to each char.


For those new to bitshifting, here are the results of the "stuff" operation that @dahfizz describes:

  'a' - 'a' = 0, so take 1 and shift it left 0 times:
  00000000000000000000000000000001 = 'a'

  'b' - 'a' = 1, so take 1 and shift it left 1 time:
  00000000000000000000000000000010 = 'b'

  'c' - 'a' = 2, so take 1 and shift it left 2 times:
  00000000000000000000000000000100 = 'c'
(I'm not sure why the post's author chose to use 10000000000000000000000000000000 as their example for 'a' rather than the above which IIUC is how the code actually works.)


'1 << stuff' = 'bit shift left 1 by stuff many positions'. Say you have a byte, 8 bits, '1' is 00000001 - if we shift it left 3 bits we have 00001000 (8).

'Stuff' there is just any expression (to understand separately) that evaluates to the number of positions to shift; `<<` is the operator for bit-shifting (left, cf. `>>`).

In brief `(s[i - window_size] as u32 - 'a' as u32)` is finding the character that just left the window on the left, represented as a number, starting from 'a' as 0 so that all 26 fit inside 32 bit positions.


Thank you all who kindly took to answering my question. It makes sense now!


1 << n sets a 1 bit at the nth bit position. << is the left shift operator.


I am wondering why the bit masks are Big (not Little) Endian? I'm sure there's a reason the example is done that way, I just don't know what it is.


I wanted to put the 1 visually close to the equal sign, so you didn't have to scan all the way to the end of 0000000000000000000000000000000 to find it. Perhaps this is misleading, because the code places the set bit at the LSB!


I think my intuition tells me that this is analogous to the convolutions you do in neural network image processing. Just that this is one dimensional.


I did it using sets. I realise now I was using the HashSet optimization


gamozolabs benchmarked this kind of stuff in a recent stream: https://www.youtube.com/watch?v=ru97E9xA0s4


One can improve this even further, I think. First, using an iterator instead of indexing into a UTF-8 string. Second, using the codepoint value directly as a bitmask (why is the author realigning the bit mask to "a" in the first place?)


I assume the author is realigning to get only one bit for each letter. This relies on having an alphabet of 26 characters where 26 is less than the 32 available in a u32. In most real-world problems, you need to allow for the full ASCII or Unicode range and this wouldn't be possible.

Using codepoints directly, there is overlap: 'f' ^ 'd' gives you the same bit pattern as 'g' ^ 'e', for example. You could keep an xor value for each window size smaller than the full window but that effectively brings back the inner loop that using xor is avoiding and you could just use equality. With codepoints, there may be a solution similar to a bloom filter to efficiently determine whether a duplicate is possible but I've not thought through that fully.


You are right. Total brainfart on my side. I skimmed the article too fast and completely misunderstood the approach.


The author passed the input as &[char] not &str, so indexing is no problem. That's weird in its own right, but obviously not the core of the post.


Looking at ASCII decimal values, a-z correspond to 97-122 so if you want to use a 32 bit bitmap you remap 97-122 to 0-25 by subtracting 97 ('a').

If you can use a 128 bit bitmap for the same cost then you could indeed directly index by ASCII values.

You can also get rid of the 'if' on window size in the main loop by partially unrolling it and taking those cases (start and end of string) outside.
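
A sketch of that partially unrolled variant, based on the article's code as quoted elsewhere in the thread (assuming lowercase ASCII input, as in TFA): fill the first window before the main loop so the hot loop has no `if i >= window_size` branch.

    fn run(s: &[char], window_size: usize) -> usize {
        let mut set = 0u32;
        // Prologue: the first window, with no branch on the index.
        for i in 0..window_size.min(s.len()) {
            set ^= 1 << (s[i] as u32 - 'a' as u32);
        }
        if set.count_ones() as usize == window_size {
            return window_size;
        }
        // Main loop: always xor in the entering char and xor out the leaving one.
        for i in window_size..s.len() {
            set ^= 1 << (s[i] as u32 - 'a' as u32);
            set ^= 1 << (s[i - window_size] as u32 - 'a' as u32);
            if set.count_ones() as usize == window_size {
                return i + 1;
            }
        }
        panic!("No unique window found");
    }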


The point is that each character represents a single bit in the mask, so it's easy to count the whole number of characters using popcount. If you XOR in the characters directly you can no longer easily count which characters are in the set.


Very cool! ^


Note that you could also solve this with AND/OR, with just a little more complexity. To “turn on” the bit for the new leading character as you slide the window forward, you of course OR it with the bitmask for the window, and to “turn off” the bit for the old trailing character, you AND its complemented bit-value with the bitmask. Not as elegant as XOR, but still works, and shows that XOR properties aren’t essential to the solution.


If the same character occurs twice in the window, the second OR will have no effect but the AND will remove it.

The trick is that XOR will leave the bit set if there is an odd number of occurrences, and the only way to have the bit count equal the window size is if every char is unique.


This is also essentially the trick applied in Kadane's Algorithm for the "maximum contiguous sum" search, albeit using masks to "undo" the set-of-characters check whereas Kadane's uses subtraction to "undo" the addition for maximum sum check.

https://www.geeksforgeeks.org/largest-sum-contiguous-subarra...


Kadane's algorithm (whose name sounds cooler than what it does) doesn't do any masks; it basically just keeps extending the window until the sum becomes negative, at which point it starts a new window.
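
A minimal sketch of Kadane's algorithm as described (my illustration): extend the running sum, remember the best, and restart whenever the running sum goes negative.

    fn max_subarray_sum(xs: &[i64]) -> i64 {
        let mut best = i64::MIN;
        let mut current = 0;
        for &x in xs {
            current += x;
            best = best.max(current);
            if current < 0 {
                current = 0; // a negative running sum can't help any later window
            }
        }
        best
    }

    fn main() {
        assert_eq!(max_subarray_sum(&[-2, 1, -3, 4, -1, 2, 1, -5, 4]), 6);
        println!("ok");
    }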



