Multiplying together the first 26 primes requires storing a number as large as 232862364358497360900063316880507363070. log2 of that number is about 127.5, so you need 128 bits of storage.
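For anyone who wants to check the arithmetic, here's a quick sketch in Python (arbitrary-precision ints make it painless):

```python
def first_n_primes(n):
    """Naive trial-division prime generator; fine for n = 26."""
    primes = []
    candidate = 2
    while len(primes) < n:
        if all(candidate % p for p in primes):
            primes.append(candidate)
        candidate += 1
    return primes

product = 1
for p in first_n_primes(26):
    product *= p

print(product)               # 232862364358497360900063316880507363070
print(product.bit_length())  # 128
```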
Essentially, the correct solution is to sort both lists and then do a single pass through them, checking that every item in one list appears in the other. Once you have this solution, you can make some micro-optimizations:
- note that the set of symbols is limited, so you can use a counting sort (or another non-comparison-based "sort")
- note that the sorting function doesn't need to be a full sorting algorithm, and may discard duplicates.
If you apply these two optimizations, you should end up with the bit-field approach (or maybe a slightly more memory-hungry one that doesn't store things in individual bits... but algorithmically it doesn't bring anything new to the table).
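Concretely, the bit-field version of that degenerate "sort" looks something like this (a sketch, assuming lowercase ASCII a-z, i.e. the limited symbol set mentioned above):

```python
def contains_all(big, small):
    """Return True if every letter of `small` occurs in `big`.
    Assumes lowercase ASCII a-z, so 26 bits fit in one machine word."""
    def mask(s):
        m = 0
        for ch in s:
            m |= 1 << (ord(ch) - ord('a'))  # set the bit for this letter
        return m
    # `small`'s letters are a subset of `big`'s letters iff
    # small's mask has no bits set outside big's mask.
    return mask(small) & ~mask(big) == 0

contains_all("supernova", "nap")  # True
contains_all("supernova", "zap")  # False
```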
For some reason, a minority of people then think that a further optimization is to use a less efficient data structure for tracking whether or not a letter has been seen. That gives you more expensive reads, more expensive writes, and more memory usage, with no benefit in return.
A standard counter-intuitive example (from the game-development domain) is checking whether an array of integers contains a zero. The answer, when all things are considered, is to scan the entire array, doing one simple arithmetic operation per item and a single comparison at the end, rather than comparing each element to zero.
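One way to realize that trick (a sketch; the comment doesn't say which arithmetic the game developers use, so this one leans on the two's-complement identity that `x | -x` has its sign bit set iff `x != 0`):

```python
MASK = 0xFFFFFFFF  # simulate 32-bit two's-complement integers

def has_zero(values):
    """Branch-free scan: one AND per element, one comparison at the end.
    For any nonzero 32-bit x, (x | -x) has the sign bit set; for x == 0
    it doesn't. So the accumulated sign bit survives iff no element was 0."""
    acc = MASK
    for x in values:
        acc &= (x | (-x & MASK)) & MASK
    return (acc >> 31) & 1 == 0

has_zero([3, 7, 0, 5])  # True
has_zero([3, 7, 5])     # False
```

Whether this actually beats a compare-per-element loop depends on the compiler and the hardware; the point is the shape of the loop, not this particular identity.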
The same may happen with your example. Filling in a 32-bit mask might be more expensive than flipping an element in a boolean array. And that's not even considering that the array option allows early termination of the second loop.
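The boolean-array variant with that early exit would look something like this (a sketch, again assuming lowercase a-z; whether it actually beats the bitmask is exactly the kind of thing you'd have to measure):

```python
def contains_all_bool(big, small):
    """Assumes lowercase ASCII a-z. Marks the letters seen in `big`,
    then bails out of the second loop on the first miss."""
    seen = [False] * 26
    for ch in big:
        seen[ord(ch) - ord('a')] = True
    for ch in small:
        if not seen[ord(ch) - ord('a')]:
            return False  # early termination: no need to scan the rest
    return True
```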
Consider the following strings, where one of the spaces in the main string is a non-breaking space (U+00A0):
Main string: "Counter example:  _à²"
Shorter string: "ಠ_ಠ"
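If I'm reading this counterexample right, the trap is byte-level thinking: "ಠ" encodes in UTF-8 as the three bytes 0xE0 0xB2 0xA0, and those code points are exactly "à", "²", and the non-breaking space. So a byte-oriented 256-entry table says the shorter string is contained, while a character-level check says it isn't. A sketch (the exact main string here is my reconstruction):

```python
small = "ಠ_ಠ"
big = "Counter example: \xa0_à²"  # reconstructed: NBSP, underscore, à, ²

# Character-level: the letter ಠ never appears in `big`.
char_level = set(small) <= set(big)

# Byte-level: every UTF-8 byte of `small` appears among `big`'s
# Latin-1 bytes, because ಠ encodes to 0xE0 0xB2 0xA0 = à, ², NBSP.
byte_level = set(small.encode("utf-8")) <= set(big.encode("latin-1"))

print(char_level)  # False
print(byte_level)  # True
```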
The array of bools, the bitmap (remember, it fits in a 32-bit int!), the hash map: all are the same basic deal, easy-to-understand code, etc. Any would probably be fine, imho.
Hash map would be my initial implementation, though, if only because I don't know where it would be used, and the map will be immediately understood.
Well, I guess not everyone is an amazing genius like you.
A Bloom filter is usually used where representation size is important, because it can be orders of magnitude smaller than a hashtable. It doesn't perform very quickly though, because of the hashing operations needed during insert.
A Bloom filter is great for P2P search applications, for example. Peers can pass Bloom filters around, and an initial search can happen locally. If it succeeds, the search can ask the remote host whether the file actually exists. How often those remote requests fail depends on the false-positive rate the Bloom filter is configured for (i.e., a function of its size).
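A minimal Bloom filter sketch, to make the size/false-positive trade-off concrete (the hash choices here are illustrative, not what any particular P2P system uses):

```python
import hashlib

class BloomFilter:
    def __init__(self, num_bits, num_hashes):
        self.num_bits = num_bits
        self.num_hashes = num_hashes  # illustrative scheme supports up to 8
        self.bits = 0  # a Python int doubles as a bit array

    def _positions(self, item):
        # Derive k positions from one SHA-256 digest (4 bytes per hash).
        digest = hashlib.sha256(item.encode("utf-8")).digest()
        for i in range(self.num_hashes):
            chunk = int.from_bytes(digest[4 * i:4 * i + 4], "big")
            yield chunk % self.num_bits

    def add(self, item):
        for pos in self._positions(item):
            self.bits |= 1 << pos

    def might_contain(self, item):
        # False means definitely absent; True means only "probably present".
        return all(self.bits >> pos & 1 for pos in self._positions(item))
```

The representation-size point is visible here: 1024 bits of filter can summarize a file list that would take kilobytes as a hashtable, at the cost of a SHA-256 per insert and lookup, and a tunable chance of false positives.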
Characters can repeat, so a boolean array doesn't work.
Plus, the story says I mumbled for a while that, given the characters were limited to alphabetic (his original specification), I could use an array instead of a hashtable for some constant-factor savings, but that was about it.
In this case the question asks whether all the characters in the smaller string are in the larger string, so yes, I think you're right: the repeats don't matter here.
The hash table solution is O(n+m): 24 operations on the example. The array solution is O(n+m): 24 operations. The same.
Your intuition tells you that the array is faster because subconsciously you're making assumptions about implementation details. What if you're in a language like PHP where arrays are implemented as hash tables? What if you're in C, but the hash table implementation uses a resizable array and happens to choose an initial size of 26?