What is a subsequence, then? I thought you were looking for words that show up in the long string as substrings; I visualize sliding each word across the string until it lines up with the same characters.
The subsequences of a string X are all strings Y such that Y is X with zero or more characters dropped, with the remaining characters kept in their original order. As an example, the string "help" has 16 subsequences: ["", "h", "e", "l", "p", "he", "hl", "hp", "el", "ep", "lp", "hel", "hep", "hlp", "elp", "help"]. Note that the set of subsequences is in bijection with the power set of the character positions (treating repeated characters as distinct), so a string of length n with all-unique characters has 2^n subsequences.
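To see the 2^n count in code, a quick sketch using itertools (each choice of kept positions is one subsequence):

```python
from itertools import combinations

def subsequences(s):
    # Each subset of character positions, kept in original order, is one
    # subsequence, so a length-n string yields 2^n of them.
    return ["".join(combo) for r in range(len(s) + 1)
            for combo in combinations(s, r)]

print(len(subsequences("help")))  # 16
```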
Okay, gotcha. So for each word in the set, sorted longest first, walk the string "picking up" characters as you go. The first word you complete is your answer.
I'm sure there's a faster or more mathematical answer. But that would be my napkin python.
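Something like this, say (a rough sketch of that greedy walk; the iterator trick is just one way to do the "picking up" check):

```python
def is_subsequence(word, haystack):
    # Walk the haystack left to right, "picking up" each character
    # of the word in order; True only if we pick up all of them.
    it = iter(haystack)
    return all(ch in it for ch in word)

def longest_subsequence_word(words, haystack):
    # Check longest words first so the first hit is the answer.
    for word in sorted(words, key=len, reverse=True):
        if is_subsequence(word, haystack):
            return word
    return None

print(longest_subsequence_word(["apple", "ale", "monkey", "plea"],
                               "abppplee"))  # apple
```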
You probably don't need a dict; just a list of candidates. And to be clear: sort your list by word length, descending, and keep another list of indexes tracking how far into each word you've matched.
EDIT: You shouldn't sort the list beforehand. Instead, compile the list of matched candidates after walking through the whole string and pick the longest match (or just track matches in a parallel list of bools). Your answer might only complete in the last few characters of the haystack, so you have to walk its entire length anyway.
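A sketch of that single pass over the haystack, with a per-word progress index as described above (names are mine):

```python
def longest_match_single_pass(words, haystack):
    progress = [0] * len(words)        # next needed character per word
    matched = [not w for w in words]   # empty words match trivially
    for ch in haystack:                # one pass over the whole haystack
        for i, word in enumerate(words):
            if not matched[i] and word[progress[i]] == ch:
                progress[i] += 1
                if progress[i] == len(word):
                    matched[i] = True
    # Only after the full walk do we pick the longest completed word.
    done = [w for w, ok in zip(words, matched) if ok]
    return max(done, key=len, default=None)

print(longest_match_single_pass(["apple", "ale", "monkey", "plea"],
                                "abppplee"))  # apple
```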
EDIT2: This solution also makes a bunch of assumptions, and you can get a better runtime in certain cases. For example, if you have far more candidate words than haystack characters, it might be worthwhile to generate all O(2^n) subsequences of the haystack (n being its length) and store them in a hash set or Bloom filter; then each word becomes an (almost) O(1) lookup. This would also lend itself well to a distributed compute approach: shard across k machines, with the first machine computing and storing the first 2^n / k subsequences, the second the next 2^n / k, and so on. Then you can either 1) in the hash-set case, fan out each lookup to every machine, succeeding if any machine succeeds, or 2) merge the Bloom filters on a central machine (assuming the same hash functions and filter size) and do O(1) lookups there. (Copying a hash set is O(data stored inside); merging Bloom filters is O(filter size).)
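A single-machine sketch of the precompute half of that idea (only viable when the haystack is short, since the set holds up to 2^n entries):

```python
from itertools import combinations

def build_subsequence_set(haystack):
    # One-time O(2^n) cost; the set dedupes repeated characters.
    return {"".join(combo) for r in range(len(haystack) + 1)
            for combo in combinations(haystack, r)}

def longest_match_precomputed(words, haystack):
    subs = build_subsequence_set(haystack)
    # Each word is now an (almost) O(1) set lookup.
    hits = [w for w in words if w in subs]
    return max(hits, key=len, default=None)

print(longest_match_precomputed(["apple", "ale", "monkey", "plea"],
                                "abppplee"))  # apple
```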
I guess it depends on profiling real-world behaviour. If you only walk the string a few times because a long word matches early, that beats walking the string once but scanning your whole set of words at each character. And if most words fail to match early on, you'll cull the candidate set pretty quickly anyway.
Yeah, I'm convinced I wouldn't bother with anything beyond my basic idea until I profile.