

Embedly Challenge Results - screeley
http://blog.embed.ly/embedly-challenge-results

======
dustingetz
> a text of 900 unique words
    
    
      >>> words = [2520/i for i in range(1, 900)]
      >>> len(words)
      899
    

hmmm.. anyway:

    
    
      # 2520 // (n+1): floor division, matching Python 2's integer "/"
      occs = [2520 // (n + 1) for n in range(900)]
      assert 900 == len(occs)
      assert [2520, 1260, 840, 630, 504] == occs[:5]
    
      num_words = sum(occs)
    
      for guess in range(100):
          count = sum(occs[:guess])  # countWords was undefined; sum() does it
          if count >= num_words // 2: break
    
      assert 21 == len(occs[:guess])

~~~
screeley
Updated accordingly. We used floats instead of ints, which means it should
have been 21, not 22.

~~~
Terretta
It _would_ be unusual to have non-integer occurrences of words in the text.
But given the question didn't specify how to round, both 21 and 22 could be
valid answers even with only whole-number counts of words. Rounding down gives
21, while rounding half up gives 22.
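
As a quick sketch of how the rounding choice flips the answer, here is the
thread's 2520/n occurrence model computed both ways (the helper name is mine,
not from any of the submitted gists):

```python
# Occurrence counts for 900 unique words under the 2520/n model.
ints = [2520 // n for n in range(1, 901)]    # round down (floor division)
floats = [2520 / n for n in range(1, 901)]   # fractional occurrences

def words_for_half(occs):
    """Smallest k such that the k most frequent words cover half the text."""
    half = sum(occs) / 2
    total = 0
    for k, count in enumerate(occs, 1):
        total += count
        if total >= half:
            return k

print(words_for_half(ints), words_for_half(floats))  # 21 22
```

Rounding down shrinks both the total and the per-word counts just enough to
move the halfway point one word earlier.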

Interesting to see how few of the proposed answers used an HTML parsing
library (simplistic matching of potentially unknown document syntaxes is a
notoriously brittle approach), and surprised how few counted depth relative to
the article tag.

Given embedly's business and the setup discussion, it seems a valid solution
should work with any arbitrary HTML page containing an article tag and
paragraphs within it, while many of the gist entries either counted P depths
by hand (!) or were hard-wired to that one particular document.

If the <article> tag, the <div> beside it, or the <p> tags had had so much as
a space before the closing angle bracket (never mind classes or styles), most
of them would have failed. For the most part, only the solutions pulling in an
external parsing lib would have still worked. Python's lxml.soupparser comes
to mind (or lxml.etree for this task), and I was happy to see several similar
libs invoked.
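
To sketch why a real parser shrugs off the syntax variations that naive string
matching trips over, even the stdlib html.parser tolerates a stray space
before the closing bracket (the class and the sample document here are
illustrative, not from any submission):

```python
from html.parser import HTMLParser

class ArticleParagraphCounter(HTMLParser):
    """Count <p> tags that appear inside an <article> element."""
    def __init__(self):
        super().__init__()
        self.in_article = 0  # nesting depth of open <article> tags
        self.count = 0

    def handle_starttag(self, tag, attrs):
        if tag == "article":
            self.in_article += 1
        elif tag == "p" and self.in_article:
            self.count += 1

    def handle_endtag(self, tag):
        if tag == "article":
            self.in_article -= 1

# The stray spaces in "<article >" and "<p >" would defeat a naive
# doc.count("<p>"), but the parser normalizes the tag names.
doc = ('<html><body><article ><p >one</p><p class="x">two</p>'
       '</article><p>outside</p></body></html>')
parser = ArticleParagraphCounter()
parser.feed(doc)
print(parser.count)  # 2
```

Tracking depth relative to the article tag also means the trailing `<p>`
outside the article is correctly ignored.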

Interesting that you had to replace the document with a cleaned up one to get
more successful answers.

Thanks for sharing the results.

------
johno215
Living Under a Rock Question:

What is up with all the startups using a .ly domain or "ly" in their name? I
can understand using a foreign top-level domain in order to find available
domain names, but that does not explain why we don't see ones from all the
other international TLDs.

~~~
Tossrock
I think it's just a naming fad, along the lines of the [word]r naming fad.

