"Bit tricks" are worth a factor `log n`. Let me explain: If you have an array of `n` integers in {0,1}, and you want to count the number of 1s, you need `n` time. But if you have an array of `n` bits, with bit level operations you can do `n/W` time, where `W` is your word size. We often have W=32, 64 or 256 on some architectures, but you may still argue it's just a constant.
The thing is that we usually say you have random access to all of your input data in constant time. But if your input size is `n`, your pointer size must be `log n`. So for standard algorithm analysis to work, you need to allow a word size of at least `log n`.
You're using n as a parameter of the machine's specs, which is a big no-no. You can't just come up with a bigger machine for each consecutive value of n. If that were allowed, all algorithms would trivially be O(1).
We usually assume the pointer size to be infinite (e.g. RAM machine), to avoid having to deal with access time. But if we don't, then it must be a constant size (and we have to adjust the big O formulas accordingly). In neither of those cases do you get free arbitrary length xor operations. Just like you don't get e.g. free arbitrary length multiplications either.
One nice property of xor, however, is that you can parallelize the heck out of it. So you can get arbitrary constant factor improvements by adding a bunch of CPUs.
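For example (a rough sketch with scoped threads; the thread count and chunk split are arbitrary assumptions):

```rust
use std::thread;

// XOR-reduce a slice by splitting it across a few threads and combining
// the per-chunk results. XOR is associative and commutative, so the
// split doesn't affect the result. Assumes `threads >= 1`.
fn parallel_xor(data: &[u64], threads: usize) -> u64 {
    let chunk = ((data.len() + threads - 1) / threads).max(1);
    thread::scope(|s| {
        let handles: Vec<_> = data
            .chunks(chunk)
            .map(|c| s.spawn(move || c.iter().fold(0u64, |acc, &x| acc ^ x)))
            .collect();
        handles.into_iter().map(|h| h.join().unwrap()).fold(0, |a, b| a ^ b)
    })
}

fn main() {
    let data: Vec<u64> = (0..1_000).collect();
    assert_eq!(parallel_xor(&data, 4), data.iter().fold(0, |a, &b| a ^ b));
}
```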
This is a nice model because you don't have to assume things like "infinite pointer sizes", and because the algorithms that work well in practice (such as using bit tricks) also work well in the theory.
If you don't work in the word RAM model, you also can't do things like hash maps with constant-time queries, since those require a hash function, which can't be computed in O(1) bit operations.
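To make that concrete, here's a hedged sketch: the classic multiply-shift hash is a single word operation per key, but it reads every bit of the word, so it's only constant time if the model charges O(1) for a word-size multiplication (the constants below are arbitrary):

```rust
// Multiply-shift hashing: one multiplication and one shift per key,
// i.e. O(1) *word* operations, but Theta(w) bit operations.
// `a` should be a random odd 64-bit number; `bits` (1..=63) is the
// log of the table size.
fn multiply_shift(x: u64, a: u64, bits: u32) -> u64 {
    a.wrapping_mul(x) >> (64 - bits)
}

fn main() {
    let a: u64 = 0x9E37_79B9_7F4A_7C15; // arbitrary odd constant
    println!("{}", multiply_shift(42, a, 10)); // bucket index in 0..1024
}
```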
We're still at O(N * W) time. It might be possible to implement xor using only +, - and bit shifts (though I'm not seeing any obvious way to do so), but count_ones stays a bottleneck; I'm pretty sure you can't do that in O(1) even on a word RAM machine.
Not to mention, a machine where addition takes O(1) time is a massive cheat - at least definitely not what most people have in mind when talking about computational complexity. AFAIK, O(...) usually means "on a classic RAM machine" unless otherwise stated. With hash maps/sets, it really depends on what is expected to grow. Bigger input size doesn't always mean bigger keys.
> but count_ones stays a bottleneck; I'm pretty sure you can't do that in O(1) even on a word RAM machine
It's actually easy to do operations on log n bits in O(1) time on the kind of machine you are talking about, too. Just make a table with the result for all (lg n)/2-bit inputs. Such a table only takes sqrt(n) memory and time to create, and since you seem to accept O(1) table lookups, you now have count_ones in two lookups.
Similarly, it's easy to make small tables that allow you to do arbitrary binary operations on (lg n)/4-bit strings.
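A concrete instance of the table idea, taking lg n = 32 so the table covers all 16-bit inputs (the sizes are illustrative assumptions, not part of the argument):

```rust
// Precompute the popcount of every 16-bit value: 2^16 entries, i.e.
// "sqrt(n)" space and time when lg n = 32. Built with the recurrence
// popcount(i) = popcount(i >> 1) + (i & 1), no count_ones needed.
fn build_table() -> Vec<u8> {
    let mut t = vec![0u8; 1 << 16];
    for i in 1..1 << 16 {
        t[i] = t[i >> 1] + (i & 1) as u8;
    }
    t
}

// count_ones of a 32-bit (i.e. lg n-bit) word in two table lookups.
fn popcount32(x: u32, table: &[u8]) -> u32 {
    table[(x & 0xFFFF) as usize] as u32 + table[(x >> 16) as usize] as u32
}

fn main() {
    let table = build_table();
    assert_eq!(popcount32(0xF0F0_00FF, &table), 0xF0F0_00FFu32.count_ones());
}
```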
> at least definitely not what most people have in mind when talking about computational complexity.
I'm pretty sure if you look in CLRS or any standard textbook on algorithms, they allow operations on lg n bits in constant time. Ever heard people saying "Sorting takes O(n log n) time"? They are clearly assuming comparing two numbers can be done in constant time.
> Answer: Yes! And that’s a standard assumption! The first time you see this it may seem totally weird: you’re assuming the computer hardware size depends on the input size?! That seems to make no sense. But once you calm down, it’s actually quite logical and cool. Of course you want a single pointer to fit into a word. You should just think of w as a parameter of the model, and w ≥ log n as an assumption.
> Ever heard people saying "Sorting takes O(n log n) time"? They are clearly assuming comparing two numbers can be done in constant time.
Where n is the number of items. Not the number of digits! And we don't normally talk about "time" when discussing sorting algorithms in the first place. We talk about number of comparisons (precisely because we have no clue how big the items are!).
> Just make a table with the result of all (lg n)/2 bit inputs
We're interested in the number of ones in a C-bit integer, where C is the size of the character set (and incidentally the maximum acceptable value of the window size W). Simply counting the ones bit by bit costs O(C) time. Making a lookup table for all (C/2)-bit numbers - as you seem to suggest - costs O(2^(C/2)) time and space. And the algorithm we wanted to improve runs in O(N * C) time.
Look, I'm trying to decipher what your point is, but you seem to be confusing the size of the input space with the size of the input. So AFAIC, this is as much attention as this algorithm deserves. It's a nice party trick, but nothing more.
> And we don't normally talk about "time" when discussing sorting algorithms
Sure, you might be a purist and talk about comparisons, but I'm pretty sure if you Google it most people would talk about time. And "n log n" is exactly the amount of time it takes to sort n word-length integers in word RAM.
> We're interested in the number of ones in a C-bit integer
You're not really engaging with the point that you can make a table in sqrt(n) time that allows counting the bits in log n-bit strings in constant time. That means the final algorithm runs in O(NC / log N) time.
This is definitely not a party trick, as you would know if you've ever written a program using these methods. Now with AVX instructions it's more important than ever.
You also didn't engage with O'Donnell's lecture notes. I can find you many more textbooks explaining why this is the right assumption to make, in case you don't like the reasons I've given above.