
Angry because neither the hash table nor the prime multiplication would be as fast as a boolean array indexed by the char value. As an added bonus, the boolean array actually makes the most intuitive sense.

Even faster is to fill a 32-bit mask with "seen" characters for each string. AND them together and compare the result to the mask from the second string.
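A rough sketch of the mask approach, assuming a 26-letter lowercase alphabet (the function names are mine, not from anyone's actual code):

```python
def letter_mask(s):
    """Fold the distinct letters of `s` into one 26-bit integer mask."""
    mask = 0
    for ch in s:
        mask |= 1 << (ord(ch) - ord('a'))
    return mask

def contains_all(big, small):
    """True iff every letter of `small` also occurs in `big`."""
    m_small = letter_mask(small)
    return letter_mask(big) & m_small == m_small

print(contains_all("counterexample", "meal"))  # True
print(contains_all("counterexample", "quiz"))  # False
```

The AND-and-compare at the end is the whole test: if the big string's mask covers every bit set in the small string's mask, ANDing leaves the small mask unchanged.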

Maybe. You need 256 bits for all the possible chars, and this solution is perhaps a little less intuitive. Also, the shift + and + cmpz is not necessarily going to be significantly faster than an address lookup + cmpz, because of the latency of getting the array from memory or cache. It will use less cache space, but compared to the size of modern caches that's not enough of a win to matter much, I would guess.

I was assuming a 26 letter alphabet. At that point the shifting, anding and comparing would take place in registers.

True, although if there are lower case letters too it could become a problem. At any rate, I'm not sure that the register aspect would improve speed significantly because of the latency of getting the string from main memory. You're still going to get pipeline stalls as you reach portions of the string that are not in the cache.

Yeah, the registers will only be marginally faster than manipulating some sort of bool array. Reading the original string (with the associated cache stalls) will be necessary either way.

Doing it this way will use the optimal amount of memory (in the worst case) (assuming an O(n+m) answer). There are exactly 26 bits of state that need to be tracked (where "bit" is used in the information theoretic sense of the word). If you choose to store this state as 26 actual bits, then you win.

Multiplying together the first 26 primes requires that you can store a number up to 232862364358497360900063316880507363070. log2 of that number is about 127.5. So you need 128-bits of storage.
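That worst case is easy to check mechanically; a quick sketch (in Python, since it has arbitrary-precision integers):

```python
import math

def first_n_primes(n):
    """Trial-division prime generator; fine for n = 26."""
    primes = []
    candidate = 2
    while len(primes) < n:
        if all(candidate % p for p in primes):
            primes.append(candidate)
        candidate += 1
    return primes

# Product of the primes assigned to 'a'..'z' -- the signature of a
# string containing all 26 letters, which is the worst case.
worst_case = math.prod(first_n_primes(26))
print(worst_case)                # 232862364358497360900063316880507363070
print(worst_case.bit_length())   # 128
```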

Essentially, the correct solution is to sort both lists, and then do a single pass through, checking for unique items in one list. Once you have this solution, you can do some micro-optimizations:

- note that the set of symbols is limited, so you can use a counting sort (or another non-comparison-based "sort")

- note that the sorting function doesn't need to be an actual sorting algorithm, and may discard duplicates
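A sketch of the base solution, before the micro-optimizations (names are mine; `sorted(set(...))` plays the role of the duplicate-discarding sort):

```python
def chars_covered(big, small):
    """True iff every distinct character of `small` appears in `big`."""
    b = sorted(set(big))
    s = sorted(set(small))
    i = 0
    for ch in s:                 # single merge-style pass
        while i < len(b) and b[i] < ch:
            i += 1
        if i == len(b) or b[i] != ch:
            return False         # ch is unique to `small`
    return True

print(chars_covered("hello world", "low"))   # True
print(chars_covered("hello world", "quiz"))  # False
```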

If you apply these two optimizations, you should end up with the bit-field approach (or maybe a slightly more memory hungry one that doesn't store stuff in individual bits... but algorithmically it doesn't bring anything new to the table).

For some reason, a minority of people then think that a further optimization is to use a less efficient data structure in the sorting algorithm for storing whether or not a letter has been seen. This makes reads and writes more expensive and uses more memory, while providing no benefit in return.

If you're marking used letters in a mask, why sort first?

It would depend on the hardware used.

A standard counter-intuitive example (from the game development domain) is how to check whether an array of integers has a zero in it. The answer, when all things are considered, is to scan through the entire array, performing a simple arithmetic operation on each item and one comparison at the end, rather than comparing each element.
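One way to sketch that trick (my reconstruction, not necessarily the exact game-dev version; in Python, where integers don't overflow, the running product is zero iff some element is zero):

```python
def has_zero_branchy(xs):
    for x in xs:
        if x == 0:        # one comparison (and possible branch) per item
            return True
    return False

def has_zero_scan(xs):
    acc = 1
    for x in xs:
        acc *= x          # simple arithmetic per item, no branching
    return acc == 0       # a single comparison at the end

print(has_zero_scan([3, 1, 4, 0, 5]))  # True
print(has_zero_scan([3, 1, 4, 1, 5]))  # False
```

In C with fixed-width integers you'd need a wrap-proof accumulator instead (e.g. AND together `x | -x` and test the sign bit at the end), since a product of nonzero values can overflow to zero.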

The same may happen with your example. Filling up a 32-bit mask might be more expensive than flipping an element in a boolean array. That's not even considering that the array option allows for early termination of the second loop.

Early termination is the biggest advantage of the array version. The 32-bit mask is probably the fastest possibility since it will reside in a register.

The array is a pretty obvious solution (to me, anyways). But the hash table has an added benefit.

Consider the following string, where   is a non-breaking space:

   Main string: "Counter example: ಠ_ಠ"
   Shorter string: "ಠ_ಠ"

Note: "Say you have one string of alphabetic characters"

The array of bools, bitmap (remember 32-bit int!), hash map; all are the same basic deal, easy to understand code, etc. Any would probably be fine, imho.

Hash map would be my initial implementation, though, if only because I don't know where it would be used, and the map will be immediately understood.

I come from a compiler background. When I hear "alphabet" I don't think a-z - I think of an arbitrary set of distinct symbols that can be arranged into a string.

Yes, but even in compiler land, you don't say "alphabetic" to mean a symbol taken from your particular alphabet (since that is practically a given... a symbol outside the alphabet might as well not exist).

The array is a pretty obvious solution (to me, anyways).

Well, I guess not everyone is an amazing genius like you.

I recently used this algorithm for speeding up an iTunes style search function. Originally I did a strstr over every item in the database, but it wasn't quite fast enough. I precomputed a 32-bit mask - 1 bit per alphabetic char, 5 bits per digits, and the last bit for special characters - for every database item, and used that as an initial search filter. I only needed to use the more costly strstr on items that made it through this filter.
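Something like the following, I'd guess (the exact digit-to-bit mapping below is my invention; the layout — 26 bits for letters, 5 shared bits for digits, one bit for everything else — follows the comment above):

```python
def item_mask(s):
    """Precomputed 32-bit signature used as a cheap search filter."""
    mask = 0
    for ch in s.lower():
        if 'a' <= ch <= 'z':
            mask |= 1 << (ord(ch) - ord('a'))   # bits 0-25: letters
        elif '0' <= ch <= '9':
            mask |= 1 << (26 + int(ch) % 5)     # bits 26-30: digits, shared
        else:
            mask |= 1 << 31                     # bit 31: everything else
    return mask

def search(query, items, masks):
    """Mask filter first; the costly substring test only on survivors."""
    q = item_mask(query)
    return [item for item, m in zip(items, masks)
            if m & q == q and query.lower() in item.lower()]

items = ["Abbey Road", "Let It Be", "Revolver"]
masks = [item_mask(i) for i in items]
print(search("road", items, masks))  # ['Abbey Road']
```

The filter can pass false positives (digits share bits, and all special characters collapse to one bit), but the substring test catches those; it can never produce a false negative, which is what makes it a safe pre-filter.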

That's pretty clever. It can be tempting in such a situation to immediately go for the big guns (patricia trees or whatever), but often a simple hack like that gives a lot more bang for the buck.

A Bloom filter (http://en.wikipedia.org/wiki/Bloom_filter) might work well in this case.


A Bloom filter is usually used where representation size is important, because it can be orders of magnitude smaller than a hashtable. It doesn't perform very quickly though, because of the hashing operations needed during insert.

A Bloom filter is great for P2P search applications for example. Then peers can pass Bloom filters around, and an initial search can happen locally. If it succeeds then the search can ask the remote host if the file actually exists. The frequency those remote requests will fail depends on the number of false positives the Bloom filter is configured to give (ie, a function of its size).

I haven’t tried a Bloom filter, but I think for my application the simple bit array might give better performance. The nice thing about the bit array is that it works great in the worst case. After the user has typed 3 or 4 characters of their query, the search space has been narrowed enough that performance isn't an issue. It’s the first or second character, when the search space is large, that gives problems. In the single-character case, the 1-to-1 mapping of characters to bits gives the result directly.

the boolean array indexed by char effectively is a hashtable isn't it? where the hashing algorithm is index = ord(theChar).

This is not quite true. It's a mapping, but it's not a hash table because the mapping is bijective (while hash tables just project the keys into the table space). So there is no possibility of collision, the number of keys is fixed, etc. It's just a table, like any other, and most of what is implied when talking about hash tables does not necessarily apply here.

that was a genuine question. I think I'm right, but I might not be. Judging by the upvotes it looks like I am, but I'd like someone to come forward and say "Yes, that's correct" explicitly.

Yes, that's correct.

thanks :)

Exactly. When your domain is restricted (characters or small ints), then why use a hashtable? Use an array. If initializing an array is a concern, use a smartarray.

Your anger is misplaced.

Characters can repeat, so a boolean array doesn't work.

Plus, the story says I mumbled awhile that given the characters were limited to alphabetic (his original specification) that I could use an array instead of a hashtable for some constant time savings but that was about it.

It doesn't matter if characters are repeated. Let's assume ASCII characters, so 'A' = 65 in decimal. If you make an array T[], where T[65] = 1 if 'A' exists in the first string and 0 if it doesn't, you can run through the 2nd string and check whether T[65] is true (1) or false (0), telling you whether each character in the 2nd string appeared in the first. Same thing as the hashtable. I think that's what was meant by boolean array.
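In other words (a sketch, assuming 8-bit character codes):

```python
def appears_in(first, second):
    """True iff every character of `second` occurs somewhere in `first`."""
    seen = [False] * 256
    for ch in first:
        seen[ord(ch)] = True                 # e.g. seen[65] for 'A'
    return all(seen[ord(ch)] for ch in second)

print(appears_in("HACKER", "ACE"))   # True
print(appears_in("HACKER", "NEWS"))  # False
```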

I guess it depends how the question is presented. Often it is known as the "Ransom Note" question (a set of magazines and a given ransom message: can you make the message from the magazines?).

In this case the question says find out if all the characters in the smaller string are in the larger string, so yes, I think you are right - the repeats don't matter here.

sorry, i don't get how characters repeating affects things

If there are 3 As in the first string and 4 in the second, it fails. A boolean value can't count to 3.

Maybe in a different problem it would fail, but this is merely a presence test, so booleans are fine. At any rate, an array of int32s instead of bools would defeat the objection for reasonably long strings.

This is that different problem. The author's description is muddy, but you do need the counts.
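For the counting variant, a boolean per character isn't enough; a sketch of the version with counts (my names; 8-bit character codes assumed):

```python
def can_compose(message, supply):
    """True iff `supply` has enough copies of every character of `message`."""
    counts = [0] * 256
    for ch in supply:
        counts[ord(ch)] += 1
    for ch in message:
        counts[ord(ch)] -= 1
        if counts[ord(ch)] < 0:   # needed more copies than were available
            return False
    return True

print(can_compose("AAA", "AAAB"))   # True
print(can_compose("AAAA", "AAAB"))  # False
```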

Oh really? How much faster would that be?

Roughly the length of the second string times the difference between an address lookup and a hash table lookup.

Which is how much? I wanted you to try and quantify the difference, because the process of doing so would lead to the flaw in your argument.

The hash table solution is O(n+m). 24 operations on the example. The array solution is O(n+m) and 24 operations. The same.

Your intuition tells you that the array is faster because subconsciously you're making assumptions about implementation details. What if you're in a language like PHP where arrays are implemented as hash tables? What if you're in C, but the hash table implementation uses a resizable array and happens to choose an initial size of 26?
