

Solving ITA's Word Numbers Puzzle - NathanWong
http://nathan.ca/2011/12/ita-word-numbers/

======
psykotic
This is a fun puzzle. Here's my take on it. I stopped reading the article
after the introduction so I could attack the problem fresh, so hopefully this
isn't the same as what the author did.

The trick is working backwards. We want to quickly count the number of
occurrences of each letter in the list of number words. Knowing how to do
that, we just have to do a linear scan over the cumulative counts for A, B, C,
... to find the interval into which the query index fits. If it helps prime
your intuition, my inspiration is counting sort with the counts evaluated
partly analytically.
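A rough Ruby sketch of that linear scan follows. The `counts` hash is made-up data and `letter_at` is a hypothetical helper name, not something from the article; positions are assumed to be 1-based.

```ruby
# Find which letter occupies a 1-based position once all letters are laid
# out in alphabetical order, given only how many times each letter occurs.
def letter_at(counts, index)
  running = 0
  counts.sort.each do |letter, count|
    running += count                  # last position covered by this letter
    return letter if index <= running
  end
  nil                                 # index lies past the last letter
end

counts = { "c" => 4, "a" => 3, "b" => 2 }  # illustrative counts only
puts letter_at(counts, 4)  # b (positions 4..5 belong to "b")
```

The scan touches at most 26 entries, so all the real work is in producing the counts.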

Number words can be divided into a fixed set of recurring components: one,
two, three, four, ..., ten, eleven, twelve, thirteen, ..., twenty, thirty,
..., hundred, thousand, million, billion. Build an inverse index that maps
each letter to the set of all component words that contain the letter, paired
with an occurrence count. For example, the inverse index would map T to {(two,
1), (three, 1), (ten, 1), (twelve, 1), (thirteen, 2), ...}.
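One way such an index might be built in Ruby, as a minimal sketch: `COMPONENTS` and `build_inverse_index` are names chosen here for illustration, and the word list simply spells out the components named above.

```ruby
# Component words that number names are assembled from.
COMPONENTS = %w[
  one two three four five six seven eight nine
  ten eleven twelve thirteen fourteen fifteen sixteen seventeen eighteen nineteen
  twenty thirty forty fifty sixty seventy eighty ninety
  hundred thousand million billion
].freeze

# Map each letter to { component word => occurrences of the letter in it }.
def build_inverse_index(words)
  index = Hash.new { |hash, letter| hash[letter] = {} }
  words.each do |word|
    word.chars.uniq.each do |letter|
      index[letter][word] = word.count(letter)
    end
  end
  index
end

index = build_inverse_index(COMPONENTS)
p index["t"]["two"]       # 1
p index["t"]["thirteen"]  # 2
```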

With this in hand, the problem is reduced to counting the number of each
component word in the string.

Let's consider that subproblem for the component 'nine' ('nineteen' and
'ninety' are separate components) for numbers with up to n digits. Numbers
with the digit pattern xxx...xxx9 each contribute one occurrence of 'nine',
and there are 10^(n-1) such numbers ('... nine'), less the 10^(n-2) that end
in 19 and read 'nineteen' instead. Likewise each xxx...xxx9xx contributes an
occurrence, and there are again 10^(n-1) such numbers ('... nine hundred
...'). The pattern xxx...xxx9xxx has 10^(n-1) members ('... nine thousand
...'), again less those that read 'nineteen thousand', and so on.

This same analysis applies to all ones components (one, two, three, ...,
nine), and an analogous analysis applies to other components.
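To sanity-check the counting argument for 'nine', here is a small Ruby sketch that compares a brute-force count against a per-digit-position formula. All names are hypothetical. The formula treats a 9 in a group's ones place that is preceded by a 1 as 'nineteen' rather than 'nine', and the speller assumes numbers below one billion with no "and".

```ruby
ONES  = %w[one two three four five six seven eight nine].freeze
TEENS = %w[ten eleven twelve thirteen fourteen fifteen sixteen seventeen
           eighteen nineteen].freeze
TENS  = %w[twenty thirty forty fifty sixty seventy eighty ninety].freeze

# Component words for one three-digit group, 0..999 (zero yields none).
def group_components(n)
  parts = []
  hundreds, rest = n.divmod(100)
  parts << ONES[hundreds - 1] << "hundred" if hundreds > 0
  if rest >= 20
    parts << TENS[rest / 10 - 2]
    parts << ONES[rest % 10 - 1] if rest % 10 > 0
  elsif rest >= 10
    parts << TEENS[rest - 10]
  elsif rest > 0
    parts << ONES[rest - 1]
  end
  parts
end

# Component words for 0..999_999_999.
def components(n)
  parts = []
  [[1_000_000, "million"], [1_000, "thousand"]].each do |scale, name|
    group, n = n.divmod(scale)
    next if group.zero?
    parts.concat(group_components(group)) << name
  end
  parts.concat(group_components(n))
end

# Count 'nine' components by spelling out every number below 10**n.
def brute_nine_count(n)
  (0...10**n).sum { |k| components(k).count("nine") }
end

# Count 'nine' components one digit position at a time (0 = rightmost).
def analytic_nine_count(n)
  (0...n).sum do |pos|
    case pos % 3
    when 2 then 10**(n - 1)              # hundreds digit: 'nine hundred'
    when 0                               # ones digit of a group
      pos + 1 < n ? 9 * 10**(n - 2) : 10**(n - 1)
    else 0                               # tens digit reads 'ninety'
    end
  end
end

puts brute_nine_count(3)     # 190
puts analytic_nine_count(3)  # 190
```

The brute-force version only exists to check the formula; the formula itself never spells out a number, which is what makes the approach fast.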

------
strags
I have a ruby solution that takes 20s to run on a machine that's about 5 years
old.

I just brute-force generate the strings from 1-999999, as well as "million-
prefix" strings of the form "(1-999)million". These are then sorted in the
same array. Also compute the sum of the length of the strings from 1-999999
while we're at it.

Then, run through the sorted array, keeping a running total of the length.
Each time you hit a "million-prefix" string, it's easy to compute the total
length of all the numbers that start with the prefix - it's just the sum of
the string lengths from 1-999999 plus 999999*prefix-length.

If this subtotal doesn't push you past the 51B limit, then keep going. If it
does, then run through the 1-999999 numbers only, adding each one to the total
individually, until you reach the magic 51B mark.

The practical upshot is that you can skip millions of numbers at a time.

Ugly code here: <http://codepuppies.com/ben/ita/t2.txt>
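
Independent of the linked code, the skipping idea can be sketched in Ruby at a smaller scale: numbers 1-999999 with "thousand-prefix" strings instead of 1-999999999 with "million-prefix" strings, so the whole thing can be checked by brute force. The names `spell`, `BASE`, `ENTRIES`, and `letter_at` are invented for this sketch, and the spelling assumes no spaces and no "and".

```ruby
ONES  = %w[one two three four five six seven eight nine].freeze
TEENS = %w[ten eleven twelve thirteen fourteen fifteen sixteen seventeen
           eighteen nineteen].freeze
TENS  = %w[twenty thirty forty fifty sixty seventy eighty ninety].freeze

# Spell 0..999 with no spaces (zero becomes the empty string).
def spell(n)
  parts = []
  hundreds, rest = n.divmod(100)
  parts << ONES[hundreds - 1] << "hundred" if hundreds > 0
  if rest >= 20
    parts << TENS[rest / 10 - 2]
    parts << ONES[rest % 10 - 1] if rest % 10 > 0
  elsif rest >= 10
    parts << TEENS[rest - 10]
  elsif rest > 0
    parts << ONES[rest - 1]
  end
  parts.join
end

BASE       = (1..999).map { |n| spell(n) }.sort.freeze
BASE_TOTAL = BASE.sum(&:length)
# Plain strings and "thousand-prefix" strings, sorted in the same array.
ENTRIES = (BASE.map { |s| [s, false] } +
           BASE.map { |s| ["#{s}thousand", true] }).sort_by(&:first).freeze

# Letter at 1-based position `target` in the sorted, concatenated spellings
# of 1..999_999, or nil if the position is past the end.
def letter_at(target)
  total = 0
  ENTRIES.each do |text, prefix|
    return text[target - total - 1] if total + text.length >= target
    total += text.length
    next unless prefix

    # All numbers that extend this prefix: 999 suffixes, each carrying it.
    block = BASE_TOTAL + BASE.size * text.length
    if total + block < target
      total += block                    # skip 999 numbers in one step
      next
    end
    BASE.each do |suffix|               # target is inside this block
      length = text.length + suffix.length
      return (text + suffix)[target - total - 1] if total + length >= target
      total += length
    end
  end
  nil
end

puts letter_at(1)  # e (the sorted list starts with "eight")
```

Skipping is valid because every number that extends a given prefix sorts directly after the prefix itself, in the same relative order as the plain 1-999 strings.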

------
BruceJillis
Cool! A very nice read, thanks for sharing. I actually looked at that puzzle
but decided Strawberry Fields was more interesting. My solution comes up with
reasonably efficient answers within a minute or so, even on an Eee PC 1000H :) I
don't have a nice writeup like yours, but the README contains a brief explanation
(in accordance with the rules for the puzzle):
<https://github.com/BruceJillis/Strawberry-Fields> I decided to post this
because you mention constraints and my solution is written in a constraint
logic programming language (ECLiPSe); it might be interesting to translate
your solution and compare.

------
mukyu
[http://conway.rutgers.edu/~ccshan/wiki/blog/posts/WordNumber...](http://conway.rutgers.edu/~ccshan/wiki/blog/posts/WordNumbers1/)
I read this write-up of the problem years ago that takes a different approach.

~~~
darkane
Interesting that this and the OP's solutions result in different answers.

~~~
lurker17
That's because you missed the hidden Part 4 where Shan&Thurston finish their
solution:

[http://conway.rutgers.edu/~ccshan/wiki/blog/posts/WordNumber...](http://conway.rutgers.edu/~ccshan/wiki/blog/posts/WordNumbers4)

~~~
darkane
Ah, good call. Very easy to miss that link.

------
andrewcooke
couldn't you just use "sort" to sort the file on disk and then offset into the
file to find the answer? i would expect sort to work fine for files larger
than memory (and if it doesn't i bet there is some tool that does - merge sort
with tapes is not so old...)

[edit: you'd need to add, then strip, carriage returns between numbers, i
guess]

~~~
lurker14
You could, but that's incredibly slow.

~~~
andrewcooke
really? for what value of incredibly? i bet it's way faster than writing all
that code.

[edit: to save people reading below: i missed the sum requirement, which this
wouldn't handle, and a rough back-of-the-envelope calculation suggests it would
take about as long as the code took to write. so i was wrong, sorry...]

~~~
NathanWong
I didn't have 70 GB of disk space free (two small SSDs in my laptop), but I
just started running it on my old machine and will let you know the results.
The grammar, parser, and traversal functions took me somewhere between two and
three hours in total to think up and write; my gut feeling is that writing and
sorting 70 GB of data on a spinning disk would be slower, though I may be
wrong.

Another issue with using sort is that you still need the sum of the numbers. You
could store this as "wordnum;realnum", but now you've got over 100 GB of data
to sort, and you still need to parse this out when you do your single pass
after to get to the correct byte count. You could spend some time optimizing
the sorting format (again using tokens), but now you're spending time doing
work that's getting you even closer to an actual solution in code.

The coded solution is also scalable, although you could argue the necessity of
scaling in this context. If a company were to introduce this puzzle looking
for some byte in the first trillion numbers sorted (instead of billion), you'd
be looking to sort several terabytes of data on disk; the constant memory
solution, once tweaked to include the billions case in the same way the
millions case is handled, would take roughly 15 minutes to run.

~~~
andrewcooke
ah, sorry - i had missed the sum of numbers. also, after reading the sort man
page, for fast sorting you may need to write out the data in a set of files
(each of which fits in memory), sort each separately (well, sort before
writing), and then use sort --merge to join them.

i'm impressed at how quickly you wrote the code. i guess it would have been
better to say "i bet it would be quicker for me...".

[edit] actually, you can probably work it out. say you get 10MB/s to a disc.
you need to write 10 files (each 7GB) which takes 700s or about 10 minutes
each. so it's about 100 minutes for writing them (sorted in memory first) and
then about 100 minutes reading them in the merge (i'm thinking you can filter
the output to find the answer without writing again). so you'd expect around 3
hours, or about the time the code took to write.

[edit 2 to avoid yet more posts] thanks + good luck with the job application
(i did one of their questions a year or two back and, while it was really
interesting, all i got as a reply was "we've finished hiring this year"...
although in their defense - and perhaps like you - i was doing it more for fun
anyway)

~~~
NathanWong
Upvoted you for the interesting discussion.

After I saw your first comment, I started writing out all billion numbers. It
appears to be getting exponentially slower, which I suppose is to be expected
as the file gets too big to fit in available contiguous blocks (there was only
100 GB free before it started).

It's been running for about an hour and is only half done, although writing 7
GB files would definitely be much faster; I don't know if sort --merge creates
one big file at the end or not, but there are ways around that since it would
be the bottleneck. It does sound like it could be done with sort in a semi-
reasonable amount of time, although there'd still be more work to do once it's
sorted.

In terms of writing the code itself, the parser is pretty similar to a toy
computer algebra system I had written in the past (it's much simpler), so I
had a head start on it in that sense, and the traversal is fairly
straightforward given the grammar. I chose a problem that I thought I could
solve in an afternoon because I knew I had real work to do the following
Monday and still had to write the blog post, and Parkinson's Law may have
helped me push through it a bit too. :)

Edit to reply to edit: This puzzle is actually retired from ITA's fleet, and
I'm not interested in applying to ITA/Google anyway. I'm happily employed, and
actually this is sort of the reverse: I'm looking to hire remote software devs
at BuySellAds.com and was hoping this would pique the interest of the
qualified segment of devs looking for a job.

~~~
np1782
I think this problem is complex regardless. It's impressive that you could
solve it all in one afternoon.

Are there any blogs or programming books that you'd recommend to help someone
improve their programming skills?

Thanks

~~~
NathanWong
I subscribe to the school of thought that you learn by doing, so my best
advice is to put yourself out there and write some code. I've read lots of
blogs, and a small handful of programming books (K&R, parts of SICP, parts of
CLRS, Programming Pearls), but at each step I took it upon myself to actually
do what the author was doing. Reading code or watching lectures isn't going to
make you a better programmer any more than watching tennis will make you a
better tennis player. You presumably already know the rules, you just need to
practice to get to the next level.

Try writing anything that sounds fun. Write your own JSON parser, or a trie,
or a B+ tree, or an implementation of the travelling salesman or knapsack
problems. Wikipedia will get you started on all of these, and from there you
can write the code. Too often people promote the idea that "Well, JSON
libraries exist, why write your own?", but that misses the point. For one,
it's fun. Professional tennis players exist too, and I am definitely never
going to be as good as they are, but that doesn't mean it's not worth playing
tennis if I enjoy doing so. More importantly, though, if you've written your
own JSON parser, then the next time a new protocol or file format comes out
and your language of choice isn't officially supported, you'll be able to
write your own and be ahead of the game. All of these skills are transferable.

An added bonus is that when you go to apply for programming jobs, your
portfolio of fun side projects will speak volumes about your ability to code.
Don't worry that you don't have the best, most efficient, most popular JSON
parser. The important part is that you spent your personal time bettering
yourself, and that you are interested in being great at what you do.

------
anonymoushn
This is pretty cool. It's a shame they haven't retired the problem I solved to
get an interview there.

~~~
lurker14
Why is it a shame that they haven't retired something cool?

~~~
anonymoushn
If they had retired it, then we could discuss it :)

------
danielharan
Hint: Gauss. Sum of numbers 1 to 100.

Using Gauss's insight, Ruby can solve this 10,000 times faster than a naive C
solution.

I solved this a while back, while high on drugs my dentist gave me. Blog post
is down, but code is here: <http://refactormycode.com/codes/8-integer-puzzle>

~~~
lurker14
What of the magic numbers in this solution?

ThousandRange:

      def child_values
        @thousands * 1_000_000 + 499500
      end

      def child_sizes
        NumberWriter.write(@thousands * 1_000).size * 1_000 + 18440
      end

MillionRange:

      def child_values
        @millions * 1_000_000_000_000 + 499_999_500_000
      end

      def child_sizes
        NumberWriter.write(@millions * 1_000_000).size * 1_000_000 + 44_872_000
      end

~~~
danielharan
They're the total size of the strings for that range of numbers. So, e.g., the
ten strings from "twenty" to "twentynine" together contain the length of
"twenty" (here, 6) * 10, plus the "one" in "twentyone", etc. The "one", "two",
"three", etc. have the same length whether they're preceded by "twenty" or
"thirty".

Account for the repeated strings "hundred", "thousand" or "million" and what
follows them and you'll get the magic numbers above.
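
The four constants can be re-derived with a short Ruby sketch. This is not the linked code: `spell`, `sum_up_to`, `sizes_below_thousand`, and `sizes_below_million` are names made up here, and the spelling assumes no spaces and no "and".

```ruby
ONES  = %w[one two three four five six seven eight nine].freeze
TEENS = %w[ten eleven twelve thirteen fourteen fifteen sixteen seventeen
           eighteen nineteen].freeze
TENS  = %w[twenty thirty forty fifty sixty seventy eighty ninety].freeze

# Spell 0..999 with no spaces (zero becomes the empty string).
def spell(n)
  parts = []
  hundreds, rest = n.divmod(100)
  parts << ONES[hundreds - 1] << "hundred" if hundreds > 0
  if rest >= 20
    parts << TENS[rest / 10 - 2]
    parts << ONES[rest % 10 - 1] if rest % 10 > 0
  elsif rest >= 10
    parts << TEENS[rest - 10]
  elsif rest > 0
    parts << ONES[rest - 1]
  end
  parts.join
end

# Gauss: 0 + 1 + ... + m = m * (m + 1) / 2.
def sum_up_to(m)
  m * (m + 1) / 2
end

# Total letters in the spellings of 0..999.
def sizes_below_thousand
  (0..999).sum { |n| spell(n).length }
end

# Total letters in the spellings of 0..999_999. Every thousands group t
# (0..999) is paired with every remainder u (0..999), so each group spelling
# appears 1,000 times on each side, and "thousand" appears whenever t > 0.
def sizes_below_million
  1_000 * sizes_below_thousand +
    999 * 1_000 * "thousand".length +
    1_000 * sizes_below_thousand
end

puts sum_up_to(999)         # 499500
puts sizes_below_thousand   # 18440
puts sum_up_to(999_999)     # 499999500000
puts sizes_below_million    # 44872000
```

The `child_values` constants are Gauss sums of the offsets within a range, and the `child_sizes` constants are the letters contributed by everything after the prefix.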

