
Python code to solve xkcd 1313 by Peter Norvig - weslly
http://nbviewer.ipython.org/url/norvig.com/ipython/xkcd1313.ipynb
======
temuze
This is a great article. It's pretty fun to play around with this heuristic:

    
    
      lambda c: 3*len(matches(c, uncovered)) - len(c)
    

Here's a trivial way to explore it: say we generalize the heuristic to H(a,
b).

    
    
      H(a,b) = lambda c: a*len(matches(c, uncovered)) - b*len(c)
    

The original heuristic is H(3,1) under this definition. Then we can play
around with a and b to see if we'd get smaller results.

    
    
      def findregex_lambda(winners, losers, a, b):
          "Find a regex that matches all winners but no losers (sets of strings)."
          # Make a pool of candidate components, then pick from them to cover winners.
          # On each iteration, add the best component to 'cover'; finally disjoin them together.
          pool = candidate_components(winners, losers)
          cover = []
          while winners:
              best = max(pool, key=lambda c: a*len(matches(c, winners)) - b*len(c))
              cover.append(best)
              pool.remove(best)
              winners = winners - matches(best, winners)
          return '|'.join(cover)
    
      >>> findregex_lambda(starwars, startrek, 3, 1)
      ' T|E.P| N'
      >>> findregex_lambda(starwars, startrek, 3, 2)
      ' T|B| N| M'
    

Or, to automate this:

    
    
      def best_H_heuristic(winners, losers):
          d = {(a,b) : len(findregex_lambda(winners, losers, a,b)) for a in range(0,4) for b in range(0,4)}
          return min(d, key=d.get)
    
      >>> best_H_heuristic(starwars, startrek)
      (3,1)
    

Looks like H(3,1) is pretty good for this case. What about the NFL teams?

    
    
      >>> best_H_heuristic(nfl_in, nfl_out)
      (3, 2)
      >>> findregex_lambda(nfl_in, nfl_out, 3, 1)
      'pa|g..s|4|fs|sa|se|lt|os'
      >>> findregex_lambda(nfl_in, nfl_out, 3, 2)
      'pa|ch|4|e.g|sa|se|lt|os'
    

Not the best heuristic there. H(3,1) wins or ties for the boys/girls set,
left/right set and drugs/cities set, which just goes to show you that picking
a heuristic off a gut guess isn't such a bad approach.
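
(If you want to run these snippets outside the notebook: `matches` comes from
Norvig's code; a minimal stand-in, assuming its "subset of strings the regex
matches" semantics, is:)

```python
import re

def matches(regex, strings):
    # Subset of `strings` the regex matches somewhere; this mirrors the
    # notebook helper that the snippets above rely on.
    return {s for s in strings if re.search(regex, s)}

subtitles = {'The Phantom Menace', 'Attack of the Clones', 'Return of the Jedi'}
```

With that, `matches(' J', subtitles)` returns `{'Return of the Jedi'}`.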

You could also explore heuristics of different forms:

    
    
      M(a,b,d,e) = lambda c: a*len(matches(c, uncovered))**b - d*len(c)**e
    

Or trying completely different formats:

    
    
      L(a,b) = lambda c: a*log(len(matches(c, uncovered))) - b*len(c)
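
One wrinkle with the log form: once the uncovered set shrinks, a pool
component can match zero remaining strings, and `log(0)` raises a ValueError.
A self-contained sketch with a guard (the `matches` stub and `uncovered` set
here are stand-ins for illustration, not Norvig's code):

```python
import re
from math import log

# Stand-ins so the sketch runs on its own.
def matches(regex, strings):
    return {s for s in strings if re.search(regex, s)}

uncovered = {'foo', 'food', 'bar'}

def L(a, b):
    # log(1 + n) instead of log(n): a component matching nothing scores
    # a*log(1) - b*len(c) rather than raising on log(0).
    return lambda c: a * log(1 + len(matches(c, uncovered))) - b * len(c)

h = L(3, 1)
```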

------
throwaway_yy2Di
I don't know why Randall's regex incorrectly (?) matches "Fremont", but it's
worth noting Wikipedia's primary spelling has an accent aigu "Frémont":

[https://en.wikipedia.org/wiki/John_C._Frémont](https://en.wikipedia.org/wiki/John_C._Frémont)

~~~
fwenzel
Frémont still matches "[rn]t", so the problem persists regardless of this
spelling.

------
j2kun
A couple of things not mentioned in this article:

1. The greedy algorithm has an O(log(n)) approximation ratio, meaning it
produces a regex guaranteed to use a number of terms within a multiplicative
O(log(n)) factor of the optimal regex.

2. Unless P = NP, set cover cannot be approximated meaningfully better than
the greedy algorithm does. In other words, any general solution you find
(unless you're using some special insight about how regular expressions cover
sets of strings) will improve on the greedy algorithm's produced regex size by
at most a constant factor.

That being said, regexes (especially disjunctions of small regexes) do not
cover arbitrary sets. So this problem is a special case of set cover, and may
well admit efficient exact solutions.
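
The connection is concrete: the notebook's loop is exactly the textbook
greedy set-cover algorithm, which is where the ln(n) guarantee comes from.
Stripped of the regex details (a sketch):

```python
def greedy_set_cover(universe, subsets):
    # Classic greedy: repeatedly take the subset that covers the most
    # still-uncovered elements; this is the ln(n)-approximation.
    uncovered = set(universe)
    cover = []
    while uncovered:
        best = max(subsets, key=lambda s: len(uncovered & s))
        if not uncovered & best:
            raise ValueError('subsets do not cover the universe')
        cover.append(best)
        uncovered -= best
    return cover
```

In the notebook, the "universe" is the set of winners and each candidate
component's "subset" is the winners it matches, minus a length penalty.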

------
blt
I love Norvig's Python posts. He really gets the spirit of the language and
has fun with it.

~~~
mkesper
Yes, I love expressions like this one:

    
    
      winners, losers = (winners - losers), (losers - winners)

~~~
echion
Was that in an earlier version? I can't find that in the article now.

~~~
mikeash
Yep, he originally threw out all names that appeared on both lists, while he's
now arbitrarily assigning them to the winners' set. Not sure about the
reasoning behind the change, but there it is. The quoted line computed new
winning and losing sets that excluded all commonalities.

~~~
e12e
I actually also really liked that line, it helped flesh the article out to be
an impromptu tutorial on using sets in python (which it still is, to a certain
extent). I remember reading that sets overloaded some sensible operators a
while back, but I really appreciated this reminder.

In other words, very nice of you to highlight it in a comment, as new readers
wouldn't have seen it otherwise (as it turns out).

------
ddebernardy
This was posted a few days ago on Code Golf:

[http://codegolf.stackexchange.com/questions/17718/meta-
regex...](http://codegolf.stackexchange.com/questions/17718/meta-regex-golf)

That link includes a Perl 10-liner to do the same.

~~~
jamesaguilar
I dunno if calling a pre-implemented solution is quite the same as writing one
yourself.

------
haberman
I thought it was going to be meta-meta-regex golf, and couldn't imagine how
that would be possible. But meta-regex golf is an interesting exercise, and is
far more tractable. :)

~~~
siddboots
"Write the shortest regex that only matches programs that successfully
generate the shortest regex given lists of inclusions and exclusions."

~~~
Laremere
Such a regex cannot exist:

1. Take an existing solution for finding the shortest regex given a list of
inclusions and exclusions.

2. Take another, unrelated, arbitrary program.

3. Combine the two: when the arbitrary program terminates, use the existing
solution to solve the problem.

4. Such a regex would have to solve the halting problem to know whether the
arbitrary program terminates and hence whether the combined program solves
the problem.

5. Since the halting problem cannot be solved, no such regex can exist.

~~~
JoshuaDavid
So long as the unrelated arbitrary program is of finite length, and the regex
golfer / arbitrary program composite is running on a machine with a finite
amount of state, the halting problem can be solved by running each program for
2^(bits of state) steps or until a state is repeated.

So as long as you can show that a program that is not finite in length will
not halt in a finite amount of time (I think this is the case, but am not
actually positive), I think that a regex-finding-program-finding regex should
be possible to write, at least in principle (though not in practice).
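
In miniature, the finite-state argument looks like this (a toy sketch; the
dict-encoded machines and the 'HALT' sentinel are made up for illustration):

```python
def halts(step, state):
    # For a deterministic transition function over a finite state space,
    # run until HALT or until a state repeats (a repeat means it loops forever).
    seen = set()
    while state != 'HALT':
        if state in seen:
            return False
        seen.add(state)
        state = step(state)
    return True

countdown = {3: 2, 2: 1, 1: 'HALT'}   # toy machine that halts
looper = {0: 1, 1: 0}                 # toy machine that cycles
```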

~~~
Laremere
Yes, the halting problem can be solved for any program which can only exist
in a finite number of states, but most common definitions of a program
include things such as Turing machines, which can hold an infinite number of
states. Really it comes down to the semantics and bounds of the problem.

However I think the most interesting result is mine :P

------
firegrind
When I read 'subtitles', I wondered about the .srt files of the movies.

~~~
StavrosK
Same here, and was blown away.

~~~
Groxx
"subtitles" ? I'm not seeing that anywhere here nor on norvig.com. link?

~~~
cubicle67
first panel of the cartoon

~~~
Groxx
d'oh. thanks :)

------
tlarkworthy
Exercise for the reader: write a regex to distinguish random noise from
English.

EDIT: possibly down-voted because someone thought it was sarcastic???

I was actually thinking of this problem before the xkcd comic, for detecting
hashes on hard drives efficiently...

~~~
jxf
Almost any regex containing a common English phrase of a few words, like /of
course/, would give you a very high accuracy rate on a large enough dataset
and a very low false positive rate. English has low entropy per character.

~~~
tlarkworthy
That strategy would return results for any phrase not containing "of course",
which would be rather a lot of results in any document. I am envisioning a
regex that plucks hashes from emails, for example: some parts English, some
parts cryptographic "noise". Pick the noise out from the English.

(note the inspiration was a nefarious regex scanner for finding bitcoin
hashes. I have no intention of building such a thing, but the idea of a random
detector regex intrigued me. Is it possible?)

My idea was more like: calculate the statistics of character n-grams in
English (all of which exist in random noise too), and count the most unlikely
occurrences until you hit some probabilistic threshold that marks a
well-thought-out decision boundary.

~~~
thaumasiotes
I had to recognize English for a cryptography project in college. This was my
entire strategy, which the professor advised for rejecting non-English (it
turns out recognizing non-English is easier than recognizing English), and
works well:

1. Parse the full text of some long book available on Project Gutenberg.
Record every trigram that occurs. Frequency is irrelevant; we just want to
know whether a trigram occurs or not.

2. Go through your text, counting the number of trigrams that didn't exist in
the sample text. If the number of strange trigrams exceeds a threshold (for my
project I used a threshold of 3, but you can tune this), reject the text as
non-English.

Given the constant finite threshold I used, that does amount to a regex, but I
don't recommend trying to write it out explicitly.
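
For the curious, that recipe is easy to reproduce. A self-contained sketch,
with a tiny inline corpus standing in for the Gutenberg book and the
threshold of 3 mentioned above:

```python
def trigrams(text):
    text = text.lower()
    return {text[i:i + 3] for i in range(len(text) - 2)}

def looks_non_english(text, known, threshold=3):
    # Count trigrams never seen in the training corpus; too many strange
    # trigrams -> reject as non-English.
    strange = sum(1 for t in trigrams(text) if t not in known)
    return strange > threshold

# Tiny inline corpus; the comment above uses a full Project Gutenberg book.
corpus = ('the quick brown fox jumps over the lazy dog '
          'and then some more english text')
known = trigrams(corpus)
```

With a real book as the corpus, the threshold can stay small because English
text almost never produces unseen trigrams.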

~~~
pbhjpbhj
How effective was your algo?

Presumably it would false-negative texts with neologisms and spelling errors
and such.

There must be trigrams which simply don't appear in dictionary English. Do
you have results posted somewhere?

------
a3_nm
Interestingly, finding a minimal-size regexp satisfying a set of positive and
negative examples (words that should match, and should not match) is NP-hard.
Here is a nice discussion: [http://cstheory.blogoverflow.com/2011/08/on-
learning-regular...](http://cstheory.blogoverflow.com/2011/08/on-learning-
regular-languages/)

------
z-e-r-o
Can someone explain what this line means and why he uses it as a heuristic?

    
    
      key=lambda c: 3*len(matches(c, uncovered)) - len(c)

~~~
natch
IMHO, sadly, that part is serving as a shibboleth.

The clue is how he explains all the simple parts of the script with comments,
but leaves this part unexplained, making some of us feel left out of the cool
club. Implying: if you don't grasp it immediately you must not be up to it.

Overall it's a great post, but as to the cryptic parts of it, it's
disappointing to see this kind of thing on the part of someone we all look up
to. It would be more generous of spirit if he had written this in a readable,
self-explanatory way, like the rest of the code... as Python should be
written.

Sigh. Norvig has a posse, but I thought it had to be said.

    
    
      >>> import this
      ...
      Readability counts.
      ...
      Sparse is better than dense.
      ...
      If the implementation is hard to explain, it's a bad idea.

~~~
magicalist
First, it is explained in the text above it.

Second, seriously?

A line of code that isn't as clear as it could be in an ipython notebook he
probably dashed off in an hour is somehow a reflection on his generosity of
spirit? I think you need to recheck your expectations.

~~~
dded
> First, it is explained in the text above it.

A minor correction: the explanation for this heuristic was added later. (I
first opened this in a tab, and then didn't read the tab till much later. I
also wondered about the heuristic and didn't think to hit reload till I saw
your reply.)

But you're right, there is certainly nothing ungenerous from Norvig here.
Quite the opposite.

------
joyofpi
I think it fails for: findregex(set(['abc']), set(['abcd']))

~~~
xentronium
Pretty sure that's impossible within the given constraints (only disjunctions
of positive regexes are allowed).

~~~
jameshart
Trivially fixable by including start-of-string and end-of-string as tokens in
the initial string breakdown: instead of analyzing {abc, ab, bc, a, b, c} as
candidate regexes, you start out analyzing {^abc, abc$, ^ab, abc, bc$, ^a,
ab, bc, c$, ^, a, b, c, $}, which would rapidly home in on c$ as an optimal
solution.

This would fail against a 'must fail' target of abcabc, but then you fix that
by extending the maximum allowable regex fragment length from 4 to 5 and
it'll find ^abc$. More generally, you extend the maximum allowable fragment
length to the length of the longest 'must match' string plus 2, and it'll
always succeed, even if it has to create a regex consisting of
^word1$|^word2$|^word3$...
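
A sketch of that candidate generation (hedged: Norvig's actual
`candidate_components` differs in detail, and here the anchors don't count
toward the length limit):

```python
import re

def anchored_parts(word, maxlen=4):
    # All substrings of `word` up to maxlen characters, plus versions
    # anchored with ^ and/or $ where the substring touches that end.
    parts = set()
    n = len(word)
    for i in range(n):
        for j in range(i + 1, min(i + maxlen, n) + 1):
            sub = word[i:j]
            parts.add(sub)
            if i == 0:
                parts.add('^' + sub)
            if j == n:
                parts.add(sub + '$')
            if i == 0 and j == n:
                parts.add('^' + sub + '$')
    return parts
```

For `anchored_parts('abc')` the pool then contains `c$`, which matches 'abc'
but not 'abcd', exactly the case discussed above.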

------
donniezazen
Is Python Peter Norvig's preferred language (along with Lisp, I suppose)?

~~~
dekhn
[http://norvig.com/python-lisp.html](http://norvig.com/python-lisp.html)

practical tradeoffs, but yes, his preferred language is Python.

------
fwenzel
I am not sure why Norvig omits President Obama. That said, "[mtg]a" does
match him, so at least Munroe tries.

~~~
RyJones
His source page doesn't list Obama, for whatever reason.

[http://www.anesi.com/presname.htm](http://www.anesi.com/presname.htm)

~~~
stbullard
It's just out of date. It says there have been 53 presidential elections since
1789; there have been 56.

------
gwern
It's too bad he didn't try to tackle the optimal regexp problem and settled
for approximations - it may be an NP-hard problem, but all the example
solutions are short enough that the instances might still be tractable.
Would've been nice to know for sure.

~~~
abecedarius
I coded a brute-force one before:
[https://github.com/darius/sketchbook/blob/master/regex/find_...](https://github.com/darius/sketchbook/blob/master/regex/find_min_re.py)

Though it cares about the size of the AST rather than the concrete syntax. I
can't try running it now, I'm on an iPad.

------
josephlord
If you just want to play regex golf, this site appeared before Christmas and
there was quite a discussion [1], although there are a few more levels now:
[http://regex.alf.nu/](http://regex.alf.nu/)

I'm still not happy with my 214 on Alphabetical including one false match (I
was 202 or something with everything correctly matched).

[1]
[http://news.ycombinator.com/item?id=6941231](http://news.ycombinator.com/item?id=6941231)

------
j2kun
What tool does Norvig use to create this json file? Does iPython have this as
a feature (somehow allowing formatted text)?

~~~
jofer
By "this json file", do you mean the ipython notebook itself? (It's saved as
JSON behind-the-scenes. That's what you'll get if you click "download
notebook".)

If so, yes, that's what a saved ipython notebook is. See:
[http://ipython.org/notebook.html](http://ipython.org/notebook.html) for an
overview of ipython notebooks.

You can export it to html, latex, etc using "ipython nbconvert --to
<format_you_want>". See: [http://ipython.org/ipython-
doc/dev/interactive/notebook.html...](http://ipython.org/ipython-
doc/dev/interactive/notebook.html#notebook-documents)

Basically, the website you're seeing (nbviewer) hosts ipython notebooks (json
files) and converts them to static html for viewing.

~~~
dded
What's the best way to install ipython on a Mac? Mine's from Macports, and I
have ipython notebook, but when I try to use nbconvert I'm told I'm missing
pandoc. A 'port install pandoc' wants to install yet another copy of gcc (I
have two--including one from Macports for another ipython dependency), ghc,
and about three dozen packages prefixed with 'hs'.

Is there any nice bundle of everything you need for ipython, for the Mac,
that's both easy to install and uninstall?

~~~
jofer
If you just want to install pandoc so you can convert things, see:
[http://johnmacfarlane.net/pandoc/installing.html](http://johnmacfarlane.net/pandoc/installing.html)
(There's an OSX binary there)

Seeing as how you already have a functioning install of ipython and a setup
you're otherwise happy with, this is probably the easiest route.

Also, just so it's clear what's going on, ipython notebooks and (especially)
format conversion are optional (but very useful) parts of ipython. They're not
part of the core functionality, so `pandoc` isn't a strict requirement to
build ipython.

Otherwise, as long as you don't mind switching the python interpreter, etc as
well, look into Anaconda or Canopy. I'm not 100% sure if either comes with
pandoc, but they're stand-alone scientific python distributions that make it a
good bit easier to get ipython, numpy, matplotlib, etc set up.

~~~
dded
Thanks for the pandoc link, I'll try that tonight. My ipython already has
notebook, numpy, matplotlib, and more set up; I didn't realize I was missing
anything till I tried nbconvert.

But I've just recently installed all this, and have only kicked it around a
little. Mostly, I wanted plotting.

------
shdon
With the given set,

    
    
      /M | [TN]|B/
    

is suboptimal, but could be

    
    
      / [TMN]|B/
    

But that (and the article) leaves out the subtitle for Star Trek 1: "The
Motion Picture". For that, Randall's original expression works.

------
sushirain
What would be a use for finding a minimal discriminating regex? Perhaps
understanding the difference between boys' and girls' names?

~~~
bjourne
Performance. For example if you are syntax highlighting a programming language
with hundreds of keywords, then using a regexp like "(kwd1|kwd2|...|kwdn)" is
not very efficient. An optimized regexp can do the same matching much faster.
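
For instance, one standard optimization is to share common prefixes through a
trie, which can cut backtracking in alternation-heavy regexes (a sketch, not
a production lexer):

```python
import re

def trie_regex(words):
    # Build a trie of the keywords, then emit a regex that shares common
    # prefixes, e.g. ['if', 'int', 'import'] -> 'i(?:f|mport|nt)'.
    trie = {}
    for w in words:
        node = trie
        for ch in w:
            node = node.setdefault(ch, {})
        node['$'] = {}  # end-of-word marker

    def emit(node):
        end = '$' in node
        alts = [re.escape(ch) + emit(child)
                for ch, child in sorted(node.items()) if ch != '$']
        if not alts:
            return ''
        body = alts[0] if len(alts) == 1 else '(?:' + '|'.join(alts) + ')'
        return '(?:' + body + ')?' if end else body

    return emit(trie)
```

The resulting pattern matches exactly the same set of keywords as the plain
alternation, but never re-scans a shared prefix.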

~~~
jules
You would be much better off just generating the DFA of the straightforward
regex and minimizing that DFA. This is simpler _and_ faster to generate _and_
faster to execute. Furthermore, in a parser for a programming language you do
not want to match the things in one list but not in some other list; you want
to match the things in this list _and nothing else_.

------
thewarrior
Could this be used as an alternative to a Bloom filter?

~~~
gwern
If you know everything ahead of time, you could probably just use something
like a
[https://en.wikipedia.org/wiki/Perfect_hash_function](https://en.wikipedia.org/wiki/Perfect_hash_function)

------
LambdaAlmighty
<meta-comment on meta-golf>

Judging by the amount of fawning here, this guy must be an HN celebrity.
Interesting post nevertheless!

I can only hope that, one day, I'll be writing and publishing joyful little
hacks like this, to such general applause, instead of eking out a living. I
have to say I'm a bit envious here!

Well done to the dude. An inspirational post, in many ways.

