
My favorite regex of all time - cleverjake
http://www.catonmat.net/blog/my-favorite-regex/
======
noonespecial
As someone who makes much of his living rehabilitating old perl scripts,
please, if you must use such things, use them like this:

[ -~] #match only printable characters

It takes 5 seconds longer and with regexes, just knowing what the damn thing
is trying to do is half the battle. When you use a regex, use a comment. It's
the civil thing to do.
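
For readers who work in Python rather than Perl, the same habit can be sketched with the `re.VERBOSE` flag, which lets the comment live inside the pattern. The name `PRINTABLE_ASCII` is made up for illustration.

```python
import re

# re.VERBOSE ignores whitespace and "#" comments in the pattern,
# except inside a character class, so the space in [ -~] still counts.
PRINTABLE_ASCII = re.compile(
    r"""
    [ -~]+   # one or more characters from space (0x20) to tilde (0x7E)
    """,
    re.VERBOSE,
)

print(bool(PRINTABLE_ASCII.fullmatch("Hello, world!")))  # True
print(bool(PRINTABLE_ASCII.fullmatch("tab\there")))      # False
```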

~~~
hnriot
Google is by far the best "comment"

~~~
citricsquid
Are you saying people should google regular expressions? In my experience
(correct me if I'm wrong) that doesn't work; I've never been able to get
Google to return relevant results, even with quotation marks.

~~~
hnriot
I'm saying that comments are usually either wrong or out of date. Developers
write a regex, comment it, then fix a bug later and don't update the comment,
so there's a discrepancy between the comment and the code. It's nearly always
easier to just google the code and see what it does if (as in this case) it's
not obvious.

~~~
cjfont
Your response doesn't address what citricsquid said: googling for a regex will
almost never return helpful results.

~~~
hnriot
Google regex and you'll find plenty of resources, including tools for testing
patterns. You won't find much for any specific pattern, but read the docs and
it will be apparent what this regex does. Familiarity and competence with regex
is a basic component of being a developer.

------
jimwise
This will not only miss non-ascii printing characters, but it's not even much
shorter than typing

    
    
      [[:print:]]
    

to use the explicit character class.

~~~
DrCatbox
The [[:print:]] will match any printable characters like åä, while the [ -~]
will not.
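
A rough Python sketch makes the difference concrete. Python's `re` module has no `[[:print:]]`, so `str.isprintable()` stands in for it here; that substitution is an assumption, and the POSIX class itself depends on the active locale.

```python
import re

ascii_printable = re.compile(r"[ -~]")

for ch in ["a", "~", "å", "ä", "\x07"]:
    in_ascii_range = ascii_printable.fullmatch(ch) is not None
    # str.isprintable() uses Unicode character properties, not the locale.
    print(repr(ch), in_ascii_range, ch.isprintable())
```

The accented letters fall outside the space-to-tilde range but still count as printable by the Unicode-based check, while the bell character fails both.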

I used this once as another safeguard against pushing binary data into the
database. It was a poor system to begin with where you even have that
possibility... and it happened at least once before the fix and my safeguard
was in place.

~~~
e12e
"å" is perfectly valid text input in my locale.

~~~
EvilTerran
There will be situations where you need to check specifically for 7-bit ASCII
printable characters only. I've worked with APIs that require everything
outside that range to be escaped/encoded into it.

Email could be an example, I guess, although I haven't worked with it enough
to know whether the whole "7-bits only" thing is still an issue these days.

------
jack-r-abbit
Jeepers... cut the guy some slack. He didn't say this is the bullet proof way
of doing everything YOU want to do in all situations, every time, forever. He
said "I thought I'd share my favorite regex of all time". And then explained
what it does. Why does everyone have to poop on his favorite thing?

------
boyter
My favorite regex is the following,

/^1?$|^(11+?)\1+$/

Which finds prime numbers. Although, I can't for the life of me think of a
reason for using it.

<http://stackoverflow.com/questions/3296050/how-does-this-regex-find-primes>
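
A small Python sketch of how the expression is used. Strictly speaking it matches the numbers that are not prime, written in unary (n is the string "1" repeated n times), so a number is prime when the match fails. The names `NOT_PRIME` and `is_prime` are illustrative only.

```python
import re

# The pattern matches unary strings whose length is NOT prime:
# either zero or one "1", or a group of two or more "1"s repeated
# at least twice (that is, a length with a non-trivial divisor).
NOT_PRIME = re.compile(r"^1?$|^(11+?)\1+$")

def is_prime(n: int) -> bool:
    """Return True when the unary form of n fails to match NOT_PRIME."""
    return NOT_PRIME.match("1" * n) is None

print([n for n in range(30) if is_prime(n)])
# [2, 3, 5, 7, 11, 13, 17, 19, 23, 29]
```

The engine finds divisors by backtracking over candidate group lengths, so this gets slow quickly as n grows.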

~~~
lubutu
I do dislike people calling that expression a "regex", because it isn't:
regular expressions cannot contain backreferences, and must be computable in
linear time, whereas primality tests are polynomial.

~~~
baddox
I'm not a big fan of your explanation. To be more precise, true "regular
expressions" are computationally equivalent to deterministic finite automata,
which indeed can test an n-character string in O(n) time.

~~~
MileyCyrax
NFAs and DFAs both recognise the regular languages (and only them).
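
To illustrate the linear-time point, here is a hand-written DFA in Python for the language "one or more printable ASCII characters". The state names and the helper functions are invented for this sketch; a real regex engine builds its automaton from the pattern.

```python
from enum import Enum

class State(Enum):
    START = 0    # nothing read yet
    ACCEPT = 1   # only printable ASCII seen so far
    DEAD = 2     # some other character seen; can never accept again

def step(state: State, ch: str) -> State:
    """Transition function: one decision per input character."""
    if state is State.DEAD:
        return State.DEAD
    return State.ACCEPT if " " <= ch <= "~" else State.DEAD

def accepts(text: str) -> bool:
    """Run the DFA over the whole string, like [ -~]+ as a full match."""
    state = State.START
    for ch in text:  # exactly one transition per character: O(n)
        state = step(state, ch)
    return state is State.ACCEPT
```

No backtracking is possible here, which is why a backreference such as `\1` cannot be expressed in this model.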

------
lambada
Are people seriously still deliberately using ASCII-reliant code?

~~~
csense
Every time I've had to deal with Unicode and internationalization, it's been a
problem.

For example, a few years ago I grabbed a source tarball from somewhere, I
forget what or where. It had the author's name in a comment, which included an
O with dots over it. That was the only non-ASCII character in the source code.
No matter what I did, both Eclipse and command-line javac refused to compile
the source.

Finally I wrote a script to delete his name from every source file manually.
It compiled flawlessly.

Then there's the time I found some text files with two characters of binary
junk at the beginning, followed by completely normal text. Again, I forget
what I was doing, but some program was refusing to process them correctly. It
was something internationalization-related called the BOM. Eventually I ended
up writing a script to walk a directory and remove the first two bytes of
every file. (This can probably be done with dd and xargs on UNIX, but I was
using Windows at the time, which means that something like this will require
spending an hour or so in your favorite programming language.)
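
For comparison, a minimal Python sketch that removes a byte order mark only when one is actually present, instead of cutting the first two bytes from every file. Two bytes would be consistent with a UTF-16 mark, but that is a guess; the function name `strip_bom` is hypothetical.

```python
import codecs

# Longest marks first: the UTF-32 LE mark begins with the UTF-16 LE mark.
KNOWN_BOMS = (
    codecs.BOM_UTF32_LE,
    codecs.BOM_UTF32_BE,
    codecs.BOM_UTF8,
    codecs.BOM_UTF16_LE,
    codecs.BOM_UTF16_BE,
)

def strip_bom(data: bytes) -> bytes:
    """Return data without a leading byte order mark, if one is present."""
    for bom in KNOWN_BOMS:
        if data.startswith(bom):
            return data[len(bom):]
    return data
```

Note that this only drops the mark. If the file really is UTF-16, the remaining bytes are still UTF-16 and need decoding before an ASCII-only tool can read them.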

These experiences lead me to believe that, for bootstrapped USA startups at
least, you shouldn't worry about a market outside the English-speaking world.

If you need to worry about junk like accented characters or moon runes
(Chinese/Japanese/Korean characters), it means you're big enough to afford to
hire someone specifically to address the problem.

~~~
e12e
I assume this is a not very subtle troll? Java source is unicode? (The offhand
reference to dd and xargs is a bit too much).

How do you define "English-speaking world", btw? Those too ignorant to have
heard of non-ascii-characters (ie: excluding Canada, as anyone doing business
there should at least have heard of French)?

Anyway, for anyone actually burnt by something similar on a GNU system try
looking up recode(1).

------
skrebbel
It's clever, but it's also completely unreadable for anyone who didn't read
this article. Regexes have serious maintainability issues as it is; let's not
make it worse by putting clever tricks in them.

~~~
Terretta
I don't understand why this is "completely unreadable".

What else could this have been besides match the character range from space to
tilde?

~~~
mikeash
Most people would have to check an ASCII table to know what that range is,
though.

~~~
slavak
Which takes for granted the fact that your input stream is even ASCII to begin
with. I'm too lazy to check, but I'm pretty sure this isn't going to catch all
printable Unicode characters, for example - and then you're left scratching
your head over what the hell the original author was trying to achieve.

------
ceejayoz
My favourite is this one:
<http://www.ex-parrot.com/pdw/Mail-RFC822-Address.html>

~~~
jballanc
Ah, this is my favorite also. If seeing this doesn't make you second guess
using a RegExp when a parser is more appropriate, well...you might be a Perl
programmer?

------
carlio
This seems to be a T-shirt advert, why am I reading this on HN?

~~~
pkrumins
I'm sorry that it sounds like it. It's really not. I commented about it on
this thread <http://news.ycombinator.com/item?id=4775100>.

------
sk5t
I suppose a single regex can be both "favorite" and "worst" at the same
time... it's only slightly interesting to know where ~ appears in the ASCII
character set, and while someone might recall that space is kinda near the
beginning but after the control characters, is it the first helpful printable
character? Who knows?

~~~
NegativeK
> I suppose a single regex can be both "favorite" and "worst" at the same
> time...

We definitely aren't the only ones who appreciate horrible things.

INTERCAL comes to mind here.

------
gpvos
My favourite regex is actually:

[^ -~]

Not to be used in a serious program, but only in an editor (or maybe one-shot
data massage perl scripts), to find possible errors or unexpected stuff.
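
The same one-shot check can be sketched in Python. The function name `find_suspects` and the sample text are made up for illustration.

```python
import re

NOT_PRINTABLE_ASCII = re.compile(r"[^ -~]")

def find_suspects(text: str):
    """Yield (line_number, column, character) for each unexpected character."""
    # Split on "\n" only, so other line-break characters are reported too.
    for line_number, line in enumerate(text.split("\n"), start=1):
        for match in NOT_PRINTABLE_ASCII.finditer(line):
            yield line_number, match.start() + 1, match.group()

sample = "plain line\nna\u00efve caf\u00e9\ntab\there"
for line_number, column, ch in find_suspects(sample):
    print(line_number, column, repr(ch))
```

Tabs are flagged as well, since the class starts at space.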

~~~
tathastu
Also it's more interesting to put unprintable characters on a t-shirt.

------
IsTom
This works for ASCII only, use unicode character classes instead.

~~~
csense
That only matters if you need to process Unicode.

See my comments [1] [2] [3] for why Unicode / internationalization should be
avoided.

[1] <http://news.ycombinator.com/item?id=4369323>

[2] <http://news.ycombinator.com/item?id=4541039>

[3] <http://news.ycombinator.com/item?id=4775440>

~~~
kolinko
So, do you propose that u.s. bootstrapped startups have a disclaimer on the
registration page saying: "you cannot put foreign characters anywhere in our
system"?

Even if you focus on u.s., you will have problems. If you're doing a CRM, even
u.s. users will put in foreign names from time to time. If you're building a
CMS, users may want to put in a quotation in french, or will simply use
copy&paste from Word, which replaces "-" with "—"...

I honestly have a hard time finding a u.s. centric startup which could afford
to ignore unicode. The support requests, the fires caused by errors, and the
disclaimer that you'd have to put on the registration page, would cost much
more than simply learning how to code the f'n utf.

Building MVP is good practice in Lean. Saying "I'm bootstrapping hence I don't
have the time to learn the programming tools" is just ignorance and
incompetence. It's not like Unicode gives you extra work, it just requires you
to learn a few basic concepts. If you try to build a site which doesn't
support Unicode, you'll have to put lots of safeguards everywhere to cover
up for your incompetence.

------
unjinxable
I like to use a variation of this in vim to quickly see if an html doc I'm
working on contains weird characters that I might want to replace with &html;
entities:

    
    
      /[^ -~]
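
Outside an editor, the replacement step can be sketched in Python with the standard `xmlcharrefreplace` error handler, which produces numeric character references rather than named entities. The function name `to_ascii_html` is hypothetical.

```python
def to_ascii_html(text: str) -> str:
    """Replace every non-ASCII character with a numeric character reference."""
    # "xmlcharrefreplace" is a standard codec error handler: characters the
    # ASCII codec cannot encode become &#NNNN; references.
    return text.encode("ascii", "xmlcharrefreplace").decode("ascii")

print(to_ascii_html("caf\u00e9 \u201cquoted\u201d"))
# caf&#233; &#8220;quoted&#8221;
```

ASCII control characters pass through unchanged, because the ASCII codec can encode them; the `[^ -~]` search would still be needed to spot those.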

------
Mahn
[\p{L&}] <- Unicode version, in case you were wondering.

~~~
e12e
No, you don't need '[' and ']' ? Also you don't need the '&' to get the
equivalent of the above, but might have to add support for spaces?

    
    
      > \p{L} or \p{Letter}: any kind of letter from any language.
    

vs

    
    
      > \p{L&} or \p{Letter&}: a letter that exists in lowercase and uppercase variants (combination of Ll, Lu and Lt).
    

Along with:

    
    
      > \p{Z} or \p{Separator}: any kind of whitespace or invisible separator.
    

Considering the op matches everything printable, including whitespace (or
actually just space, not tab), numbers and punctuation, I think the equivalent
would be "\X" ?

All this based on glancing at:

    
    
       http://www.regular-expressions.info/unicode.html
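
One way to approximate a Unicode counterpart in Python, where the built-in `re` module does not support `\p{...}`, is to test the general category directly with `unicodedata`. Which categories count as "printable" is an assumption made for this sketch, and the function name is hypothetical.

```python
import unicodedata

# Major categories treated as visible here: Letter, Mark, Number,
# Punctuation, Symbol. "Zs" (space separator) is added because the
# original class includes the space character.
VISIBLE_MAJOR_CATEGORIES = set("LMNPS")

def is_printable_unicode(ch: str) -> bool:
    """Rough Unicode counterpart of [ -~] for a single character."""
    category = unicodedata.category(ch)
    return category[0] in VISIBLE_MAJOR_CATEGORIES or category == "Zs"

for ch in ["a", "å", "5", "~", " ", "\t", "\x7f"]:
    print(repr(ch), unicodedata.category(ch), is_printable_unicode(ch))
```

Letters alone (`\p{L}`) would miss digits, punctuation and the space, which is why several categories are combined.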

------
tripzilch
Wouldn't it be smarter, in that case, to match something like [\x20-\x7F]
(unsure if that's valid regex, but you get the point)? It's more explicit that
way, being very obvious about what characters are included, as well as
immediately being clear (to me) about the intention of the character class.
"0x20 to 0x7F" triggers the idea of "Printable ASCII" a lot sooner than <space>
to <tilde>.
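
A quick Python check of the hex form. The syntax is valid in Python's `re`, though the upper bound matters: tilde is 0x7E, and 0x7F is DEL, a control character.

```python
import re

by_literals = re.compile(r"[ -~]")
by_hex = re.compile(r"[\x20-\x7E]")

print(hex(ord(" ")), hex(ord("~")))  # 0x20 0x7e

# The two classes agree on every 7-bit code point.
assert all(
    bool(by_literals.fullmatch(chr(c))) == bool(by_hex.fullmatch(chr(c)))
    for c in range(128)
)

# A range ending at \x7F also admits DEL; one ending at \x7E does not.
print(bool(re.fullmatch(r"[\x20-\x7F]", "\x7f")))  # True
print(bool(by_hex.fullmatch("\x7f")))              # False
```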

------
kamakazizuru
what a plug - obviously the only real purpose of this was to sell those
t-shirts.

~~~
pkrumins
I'm sorry, but that was not the purpose. I just wanted to share this regex trick
that I had in my mind. I added the shirts only later when I saw that the
article is getting very popular. People really seemed to like my previous tees
and it's great to make a little extra money and continue doing what I love -
coding and writing blog posts.

~~~
bhanks
Well, obviously people are interested in it. Can't fault you for trying to make
a little scratch.

------
kaokun
Not all strings are ASCII! :-(

------
jervisfm
Can anyone explain how this regex [ -~] matches ASCII characters?

~~~
boyter
It's pretty simple, assuming you know regex... I'm going to assume you don't,
since you are asking.

The bracket expression [ ] matches a single character. You can put more than
one character inside it, and any one of them will match.

    
    
      [a] matches a
      [ab] matches either a or b
      [abc] matches either a or b or c
      [a-c] matches either a or b or c. 
    

The - allows us to define the range. You can just as easily use [abc], but for
long sequences such as [a-z], consider it shorthand.

In this case, [ -~] means every character between <space> and <tilde>, which
just happens to be all the ASCII printable characters (see chart in the
article). The only bit you need to keep in mind is that <space> is a character
as well, and hence you can match on it.

You could rewrite the regex like so (note I haven't escaped anything in
this, so it's probably not valid)

    
    
      [ !"#$%&'()*+,-./0123456789:;<=>?@ABCDEFGHIJKLMNOPQRSTUVWXYZ[\]^_`abcdefghijklmnopqrstuvwxyz{|}~]
    

but that's not quite as clever or neat.
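
The expansion of a range can be checked with a few lines of Python. The helper `expand_range` is invented for this sketch; it simply walks the character codes from the first character to the last.

```python
def expand_range(first: str, last: str) -> str:
    """List every character a bracket range first-last stands for."""
    return "".join(chr(code) for code in range(ord(first), ord(last) + 1))

print(expand_range("a", "c"))  # abc

printable = expand_range(" ", "~")
print(len(printable))          # 95
print(printable)
```

The range works because space (code 32) and tilde (code 126) are the first and last printable characters in the ASCII table, with nothing unprintable in between.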

------
hsiaobrandon
I don't want to be "that guy" and it's probably just my own stupidity, but
what's so special about this for it to frontpage HN?

