
The Deceptive Anagram Question - gamesbrainiac
http://nafiulis.me/the-deceptive-anagram-question.html
======
OhHeyItsE
Author sounds pretty darn smart. But, hey, they couldn't come up with the
correct solution in the allotted time, under pressure, with a marker in
hand - must not be good enough. Hope the candidate they took on meets all of
their expectations.

~~~
Grue3
He needed help from a friend to get this right. I'd argue that at the time
of the interview he wasn't ready. I'd also argue that the basic linear
solution for this problem is straightforward, doesn't require any clever
tricks, and that a qualified candidate would have no problem solving it.
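
A minimal sketch of that linear solution (an illustration, not Grue3's code):
bucket the words by their letters in sorted order, then emit, in the original
order, every word whose bucket holds more than one entry.

    from collections import defaultdict

    def find_anagrams(words):
        # first pass: anagrams share the same letters once sorted,
        # so each group of anagrams lands in a single bucket
        buckets = defaultdict(list)
        for w in words:
            buckets[''.join(sorted(w))].append(w)
        # second pass: preserve input order, keep only words whose
        # bucket contains at least one other anagram
        return [w for w in words if len(buckets[''.join(sorted(w))]) > 1]

    print find_anagrams(["pool", "loco", "cool", "stain", "satin",
                         "pretty", "nice", "loop"])
    # ['pool', 'loco', 'cool', 'stain', 'satin', 'loop']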

------
rasz_pl
> A Vector Approach to Hashing

is something the author didn't bother to test at all. His word list includes
words with apostrophes (hex 0x27), like "basic's"; his code accounts only for
ASCII letters and crashes on that list :)

> its O(26) for list creation

Hmm, no? Unless Python people create lists one element at a time... tested,
and yes they do :o :D Every additional byte takes ~10ms on my test computer
(110K-element loop), hilarious :))) It's cheaper to create two lists outside
the main loop and simply copy the empty list over the work list, instead of
creating a new list every time (or resetting a global list manually); we are
talking ~10% of the whole algorithm's time. Python is one hilarious way of
benchmarking algorithms: a simple x=1 in your main loop will cost tens of
milliseconds.
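
A minimal sketch of that hoisting trick (list sizes and identifiers here are
illustrative, not the original benchmark):

    import timeit

    def fresh_each_time(words):
        for w in words:
            counts = [0] * 26        # a new list allocated on every iteration
            for c in w:
                counts[ord(c) - 97] += 1

    EMPTY = [0] * 26

    def reuse_template(words):
        counts = [0] * 26
        for w in words:
            counts[:] = EMPTY        # copy the empty template over the work list
            for c in w:
                counts[ord(c) - 97] += 1

    words = ["pool", "loco", "cool", "stain", "satin"] * 20000
    print timeit.timeit(lambda: fresh_each_time(words), number=10)
    print timeit.timeit(lambda: reuse_template(words), number=10)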

>and its O(26) for creating the tuple

Ehhh :/ Yes, sorting is faster in Python, ONLY because the sort function is
highly optimized low-level code, while anything you write yourself will be
extremely slow due to the interpreted nature of the language. You can easily
write an O(n) hash routine in a low-level language (assembler, C) and beat the
n log n sort.

>But, I think this time, we can actually use collections.Counter correctly

Correctly? Yes. Faster? NO, ~10% slower. Again it seems the author simply
assumed something was going to work and didn't test.

All in all, the main lesson from all this should be: do not use Python to
gauge algorithm speed, unless you are being hired to write speed-critical
Python code (ahahahaha).

~~~
gamesbrainiac
> is something the author didn't bother to test at all. His word list includes
> words with apostrophes (hex 0x27), like "basic's"; his code accounts only for
> ASCII letters and crashes on that list :)

Oh, I know it crashes. I had to remove a total of ~20 words that had
apostrophes from the original word list. I just added this since it was an
interesting take on the problem that I hadn't thought of. Instead of assuming
that you have only alphabetical letters, you can assume that you will be
getting any ASCII character, and hence make an array of 256 slots (one per
byte value), not 26, which only adds to the problem.
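
A minimal sketch of that wider vector (an illustration; it takes 256 slots to
cover every byte value):

    def char_vector(word):
        # one counter per possible byte value, so an apostrophe in a word
        # like "basic's" is counted instead of breaking the index math
        counts = [0] * 256
        for c in word:
            counts[ord(c)] += 1
        return tuple(counts)  # hashable, so it can key a dict of anagram groups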

> Ehhh :/ Yes, sorting is faster in Python, ONLY because the sort function is
> highly optimized low-level code, while anything you write yourself will be
> extremely slow due to the interpreted nature of the language. You can easily
> write an O(n) hash routine in a low-level language (assembler, C) and beat
> the n log n sort.

No argument there. In fact, I tried alternatives in Python which were
algorithmically (is that a correct word?) superior, but sucked when put to the
test. I even tried array.array, but unfortunately, that takes a list as an
initialiser. So, if you wanted to initialize a fixed-size array, you would
actually be creating the list, and then creating the array, which makes the
process even slower.
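
As an aside (my addition, not from the post): a fixed-size zeroed byte buffer
can also be built without the intermediate list by using bytearray:

    from array import array

    via_array = array('b', [0] * 26)  # builds the 26-element list, then the array
    via_bytes = bytearray(26)         # 26 zero bytes, no intermediate list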

> Correctly? Yes. Faster? NO, ~10% slower. Again it seems the author simply
> assumed something was going to work and didn't test.

You insist on saying that I didn't test it (your other comments seem to
reflect a similar level of disdain for other people's intelligence as well).
It is indeed slower, but the reason I added it is to showcase a bit of stdlib
magic. Initially, I was using Counter the wrong way, and now there's a place
for it (the whole story ties in at the end). But on the bright side, you tried
out the code yourself, and hopefully learned something new (I know that's hard
considering how impeccably smart you most certainly are). To be honest, I
didn't really think people would read it that far.

------
rikkus
I stopped reading when I scrolled as far as the point where he mentioned
having trouble, and thought I'd try it out myself. Here's my initial solution:

      new [] {"pool", "loco", "cool", "stain", "satin", "pretty", "nice", "loop"}
      .Aggregate (
        new { Index = 0, Found = new Dictionary<string, List<WordAndIndex>>() },
        (acc, word) =>
          {
            acc.Found
              .FindOrCreate(word.Sort())
              .Add(new WordAndIndex(word, acc.Index));

            return new { Index = acc.Index + 1, Found = acc.Found };
          }
      )
      .Found
      .Where(f => f.Value.Count() > 1)
      .SelectMany(f => f.Value)
      .OrderBy(wordAndIndex => wordAndIndex.Index)
      .Select(wordAndIndex => wordAndIndex.Text)
    

(some helper methods omitted for brevity)

I was fairly happy with this, but curious to see what he'd come up with in the
end. I like it - especially the way it uses indices and zip. Here's my
translation to C# (no helpers added here!):

      IEnumerable<string> Anagrams(IEnumerable<string> words)
      {
        Func<string, string> sortLetters =
          (s) => new string(s.ToCharArray().OrderBy(c => c).ToArray());

        var sortedWords = words.Select(sortLetters);

        var sortedWordsLookup = sortedWords.ToLookup(s => s);

        return words.Zip(
          sortedWords,
          (word, normal) => new { word, normal }
        )
        .Where(
          (wordAndSortedWord) =>
            sortedWordsLookup[wordAndSortedWord.normal].Count() > 1
        )
        .Select(wordAndSortedWord => wordAndSortedWord.word);
      }

------
shred45
Can someone explain the comment about how quadratic time complexity does not
scale?

~~~
crimsonalucard
Good question. I just want to mention that this is a fact that a student with
a degree in computer science should absolutely know.

~~~
shred45
I'm getting the impression that the number of facts that I should be ashamed
of not knowing is O(n!) where n is the Unix timestamp. That definitely
doesn't scale.

But more seriously, I think I worded my question poorly. I understand why
O(n^2) is not ideal. The author's statement was: "for more computing power we
throw at this the slower it gets per computer." Now I'm not sure how he would
seek to throw computing power at the algorithm, but I can immediately think of
a parallel implementation of this algorithm with efficiency O(1), at least for
p <= n.

So my question is, why does a quadratic algorithm necessarily imply
diminishing returns when you add more computing power? That is how I
interpreted his sentence, and it is not clear to me.

~~~
aetherson
I think that the author phrased himself clumsily or (less likely)
misunderstands O-notation. My guess is that he intended to say, "As the
problem gets bigger, throwing more computing power at it gets less and less
efficient," and got tripped up in the words.

~~~
shred45
Ah, this makes much more sense, thanks.

~~~
gamesbrainiac
I am terribly sorry for the failure of the illustration I provided. I'm sorry
I didn't see this sooner, or I would've jumped at a second chance to explain.
I assure you, I'll try to do better in the future.

------
nine_k
The key take-away: keep thinking about 'borked' interview questions. This can
make you a better programmer.

~~~
gamesbrainiac
Agreed. I feel the key takeaway from any failure is to learn from it, and
thus the mistakes that we make become earnest ones.

------
maxerickson
I think it is better to think of sorting the words as standardization (or, to
look up the proper term, canonicalization). A typical goal when hashing is to
avoid collisions; this algorithm is seeking a certain type of collision.
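
A minimal sketch of that framing (illustrative): the "hash" here is really a
canonical form, chosen so that exactly the anagrams collide.

    def canonical(word):
        # anagrams share one canonical spelling: their letters in sorted order
        return ''.join(sorted(word))

    print canonical("stain") == canonical("satin")  # True: the collision we want
    print canonical("stain") == canonical("slain")  # False: different letters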

~~~
gamesbrainiac
Thank you so much for pointing this out. When Alexander used the term
"normalization" instead of "hashing", I felt that his term was more
appropriate. If you know of any place where I can learn more about these other
types of normalization/standardization/hashing, please feel free to provide
links. I would be truly grateful for the opportunity to learn more.

------
father_of_two
Here's a variant of the last solution the author presented, using a Counter of
each word, instead of a sorted list of the word's chars, as the normalization.

      cnt = Counter(tuple(Counter(w).iteritems()) for w in words)
      print [w for w in words if cnt[tuple(Counter(w).iteritems())] > 1]

This is just code golfing; I don't even think it's clear. The author's last
solution is nice, though.
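
One caveat (my addition): tuple(Counter(w).iteritems()) leans on dict
iteration order, which Python does not guarantee to be identical even for
equal dicts; a frozenset of the same items is order-independent and still
hashable:

      cnt = Counter(frozenset(Counter(w).iteritems()) for w in words)
      print [w for w in words if cnt[frozenset(Counter(w).iteritems())] > 1]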

~~~
gamesbrainiac
This is actually quite nice. A little hard to see, but I see what you did
there. You'd get an earful from the algorithm-nazis though ;)

------
chatman
"This is wrong on so many levels, I don't even know where to begin."

^ This was remarked about a _correct_, but slow, algorithm.

I couldn't figure out the deception part either.

------
jack9
This problem is used in most wordster/boggle/Wordy Birdies (from the makers of
Angry Birds) style games.

"In the same order" is an unnecessary additional complexity. Asking for an
O(1) solution demonstrates the exact same thought quality, so it's a bad
(overly complicated and longer to answer) interview question for the purpose
intended.

------
nartz
You only need to go through the array one time; here you are doing it
multiple times...

Also, you can define an order-independent hashing function that, say, assigns
a prime number to each letter and multiplies them together (sums can collide;
prime products are unique by factorization)...

Instead:

    hashed_words = {}
    anagrams = []

    for word in words:
        h = compute_hash(word)
        w = hashed_words.get(h)
        if w is True:
            # we've already encountered other anagrams with this hash; just add it
            anagrams.append(word)
        elif w:
            # both w and word are anagrams, so append both to the list,
            # and set the value to True so we can short-circuit this in the future
            anagrams.append(w)
            anagrams.append(word)
            hashed_words[h] = True  # flag indicating we've seen this word before
        else:
            # first time we are seeing this hash, so let's just record the word
            hashed_words[h] = word

    return anagrams
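
compute_hash is left undefined above; a minimal sketch of the prime idea
(assuming lowercase ASCII words, with 'a' mapped to the first prime):

    PRIMES = [2, 3, 5, 7, 11, 13, 17, 19, 23, 29, 31, 37, 41,
              43, 47, 53, 59, 61, 67, 71, 73, 79, 83, 89, 97, 101]

    def compute_hash(word):
        # a product of per-letter primes is order-independent, and by unique
        # factorization two words share a product only if they are anagrams
        h = 1
        for c in word:
            h *= PRIMES[ord(c) - ord('a')]
        return h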

~~~
vilhelm_s
I think this would output ["loco", "cool", "stain", "satin", "pool", "loop"]
but the question asks for ["pool", "loco", "cool", "stain", "satin", "loop"]?

~~~
nartz
Ah, you are right, I overlooked that.

------
meggar
Isn't the O(n) version actually O(n log n) because of the sort?

~~~
cousin_it
I think it only sorts the characters in each word, not the list of all words.

~~~
meggar
right, but the sort was described as being O(n) for some reason.

~~~
aetherson
So the time involved in this algorithm is roughly some constant K * (length of
the list) * (time taken to sort the average word in the list).

If we were likely to see variants of the problem that scaled up both in the
length of the list and the length of the words in the list, then we would
properly describe the performance of the algorithm as O(n * m * log(m)), where
n is the length of the list and m is the length of the longest word in the
list.

If, on the other hand, we assume that m is bounded and will never exceed a
relatively short length (for example: because these are english words, not
arbitrary character strings), then we can say that m*log(m) is essentially a
constant, and the performance of the entire algorithm will vary meaningfully
only based on the length of the list.

That seems a reasonable assumption, so we say the performance of the algorithm
is O(n).

~~~
gamesbrainiac
This is correct. You certainly are better at explaining this sort of stuff
better than I do. If you have a blog or if you write for any org, please do
give me a link, I'd sure like to improve.

Let me just re-iterate your point. The longest word in the English dictionary
is 45 characters long. The average is 5 and the average of the word list I've
used is 8. So, when it comes down to it, saying that its constant time is a
fair assumption because your (as you eloquently put it) the length of a word
is bounded.

