
5 Regular Expressions Every Web Programmer Should Know - soundsop
http://immike.net/blog/2007/04/06/5-regular-expressions-every-web-programmer-should-know/
======
silentbicycle
Instead of learning $bullet_point_count regular expressions by rote, just
learn _how they work_ and make your own as you need them. They really aren't
hard to use, they're just incredibly compact, which makes them look
unintelligible at first.

_Mastering Regular Expressions_ is the obvious classic, but a decent intro to
Perl or Python will probably get to them eventually. (If you use Emacs, it's
particularly easy to learn them by experimenting with M-x re-builder.) Also:
Not everything uses the same RE implementation, and some use non-standard
extensions.

Don't become too drunk with power yet, though! There's a lot they're just not
capable of doing (most notably, balancing tags around arbitrarily nested
expressions). Once you have REs down, learning a lexer/parser is the next
step. :)
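A quick Python sketch of that limitation (toy example, obviously not from the article): a regex has no counter, so it can't pair up nested tags correctly:

```python
import re

# A naive regex for "a <b>...</b> pair": it has no notion of depth,
# so it pairs the FIRST <b> with the FIRST </b> it finds.
pair = re.compile(r'<b>(.*?)</b>')

flat = "<b>hello</b>"
nested = "<b>outer <b>inner</b> rest</b>"

print(pair.search(flat).group(1))    # hello
print(pair.search(nested).group(1))  # outer <b>inner  -- wrong pairing
```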

~~~
gruseom
_If you use Emacs, it's particularly easy to learn them by experimenting with
M-x re-builder._

Hey! I did not know about this, and it looks really useful. I've used
[http://www.newartisans.com/blog_files/regex.tool.for.emacs.p...](http://www.newartisans.com/blog_files/regex.tool.for.emacs.php)
and it's good, but not as well integrated.

I wish I knew a better way to get a survey of what's available in the Emacs
world - something that would have clued me in to this command, for example...
any tips?

Edit: I want to comment on this as well:

 _learning a lexer/parser is the next step_

Recently I've had occasion to write a few parsers. In the past, I'd always
used parser generators (yacc and antlr) on the assumption that they made
things easier. For this project that wasn't an option, so I bit the bullet and
did the recursive descent thing. To my surprise, it turned out to be way, way
easier than I expected. Moreover, in at least one case (a complex one), the
hand-written parser code turned out _shorter_ than the (cl-) yacc version, as
well as handling more cases and reporting errors better.

The morals of the story are: (1) Don't assume something is hard without trying
it; (2) It's not hard to write a classical recursive-descent parser and the
skill is a lot more useful than you may realize. Wish I'd learned that a long
time ago.
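For the curious, the shape of a recursive-descent parser fits in a screenful. Here's a toy Python one for arithmetic (my own illustration, not gruseom's parser): one small function per grammar rule, with operator precedence encoded in which rule calls which:

```python
import re

# Toy recursive-descent parser for "1 + 2 * (3 + 4)" style arithmetic.
# Precedence falls out of the call structure: expr -> term -> factor.
class Parser:
    def __init__(self, text):
        self.toks = re.findall(r'\d+|[-+*/()]', text)
        self.pos = 0

    def peek(self):
        return self.toks[self.pos] if self.pos < len(self.toks) else None

    def eat(self, tok=None):
        cur = self.peek()
        if tok is not None and cur != tok:
            raise SyntaxError(f"expected {tok!r}, got {cur!r}")
        self.pos += 1
        return cur

    def expr(self):                      # expr := term (('+'|'-') term)*
        val = self.term()
        while self.peek() in ('+', '-'):
            op, rhs = self.eat(), self.term()
            val = val + rhs if op == '+' else val - rhs
        return val

    def term(self):                      # term := factor (('*'|'/') factor)*
        val = self.factor()
        while self.peek() in ('*', '/'):
            op, rhs = self.eat(), self.factor()
            val = val * rhs if op == '*' else val / rhs
        return val

    def factor(self):                    # factor := NUMBER | '(' expr ')'
        if self.peek() == '(':
            self.eat('(')
            val = self.expr()
            self.eat(')')
            return val
        return int(self.eat())

print(Parser("1 + 2 * (3 + 4)").expr())  # 15
```

Note how error reporting comes almost for free: `eat` knows exactly which token it expected and where.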

~~~
mark_h
_I wish I knew a better way to get a survey of what's available in the Emacs
world - something that would have clued me in to this command, for example...
any tips?_

<http://planet.emacsen.org/> often has useful stuff.

The emacs wiki (<http://emacswiki.org/>) is often where you end up after
googling and has a heap of information (although even it isn't complete!). You
could try <http://www.emacswiki.org/cgi-bin/wiki/RandomPage> when you're bored
:)

------
BrandonM
The first tip is okay, but the latter 4 are just horrible. Any decent language
will have libraries for parsing HTML or e-mail addresses. A regex is sure to
come up short and be very fragile.

I once had to maintain some screen-scraping code that was written in Python
using regular expressions. By the time I inherited it, half of the
functionality no longer worked. It would have been much better off using a
library like BeautifulSoup, both in terms of development time and
maintainability.

BeautifulSoup alone takes care of REs 2 and 3, and there are standard
libraries in Python that take care of 4 and 5. Why reinvent (less robustly, I
might add) the wheel when a simple API already exists in many languages?
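For example, even Python's standard library alone gets you most of the way (a sketch; BeautifulSoup is nicer, but this shows the parser-over-regex idea):

```python
from html.parser import HTMLParser
from urllib.parse import urlparse

# Pull href attributes out of HTML with a real parser instead of a regex.
class LinkExtractor(HTMLParser):
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == 'a':
            for name, value in attrs:
                if name == 'href':
                    self.links.append(value)

p = LinkExtractor()
p.feed('<p>See <a href="http://example.org/x">this</a> and '
       '<A HREF = "/relative">that</A>.</p>')
print(p.links)  # ['http://example.org/x', '/relative']

# And let the stdlib pick a URL apart rather than a hand-rolled regex:
u = urlparse('http://example.org/x?y=1#z')
print(u.scheme, u.netloc, u.path)  # http example.org /x
```

The parser copes with case, whitespace around `=`, and attribute quoting for free; a regex breaks on each of those.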

------
Erwin
I wouldn't use a single regexp for complete username validation -- if it
fails, all you can display is a generic "username is not valid, it must obey
rules X, Y & Z" message. I'd check min and max length separately and display
an appropriate error message for that.

Also ignore leading/trailing spaces; or you'll end up puzzled why you have two
"bob@example.org" users in your database even with appropriate database
constraint, and bob mails you saying he can't login on the account he just
paid for.
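Something like this Python sketch (the length limits and character rule here are made-up example policy):

```python
import re

# Per-rule checks so the user gets a specific message, plus a strip()
# so " bob@example.org " and "bob@example.org" can't both register.
def username_errors(raw):
    name = raw.strip()          # ignore leading/trailing spaces
    errors = []
    if len(name) < 3:
        errors.append("username must be at least 3 characters")
    if len(name) > 20:
        errors.append("username must be at most 20 characters")
    if not re.fullmatch(r'[A-Za-z0-9_]*', name):
        errors.append("username may only contain letters, digits and _")
    return errors

print(username_errors("  bob  "))  # []
print(username_errors("a b"))      # ['username may only contain letters, digits and _']
```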

------
antirez
5 Regular Expressions Every Web Programmer _should know how to write_ ... if
he claims he knows regexps. The problem is that, in my experience, only a
minimal fraction of web developers really know regexps.

------
tocomment
meh, disappointed. Could any of you guys give me one-liners for email and URL
validation?

~~~
hs
i use this for email:

/[-\\.\\\w]*@\\\w+(\\.\\\w+)+/

~~~
thwarted
DO NOT USE THIS REGULAR EXPRESSION FOR ANYTHING.

This is a terrible example. First off, you're using double-quoted syntax, but
it's not in double quotes. Additionally, you're confusing when regular
expression metacharacters need to be escaped because they're metacharacters
and when they need to be escaped because they are in a double-quoted string.
The PHP documentation is particularly terrible in this regard (telling you to
put regular expressions in double-quoted strings rather than single-quoted,
because PHP doesn't have a regular expression type).

[-\\.\\\w] means match a dash, a dot, a backslash and the w character.

Secondly, it's not anchored to the start or end of the input.

Thirdly, the LHS can be empty according to this regular expression.

Lastly, we've finally uncovered who it is that's keeping everyone from using +
on the LHS to do sendmail style +folder references.
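All three problems are easy to demonstrate in Python, reading the pattern with single backslashes:

```python
import re

pat = re.compile(r'[-.\w]*@\w+(\.\w+)+')

# Not anchored: a match anywhere in garbage input counts as "valid".
print(bool(pat.search('lol this is not an address a@b.c trust me')))  # True

# Empty LHS: the * quantifier happily matches zero characters.
print(pat.search('@example.com').group())  # @example.com

# Anchoring with fullmatch fixes the first problem...
print(bool(pat.fullmatch('a@b.c')))       # True
print(bool(pat.fullmatch('x a@b.c y')))   # False
# ...but '+folder' style addresses are still rejected:
print(bool(pat.fullmatch('user+folder@example.com')))  # False
```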

~~~
hs
actually i didn't use PHP, the code i submitted is 'ported' using /regex/ ...
i haven't tested it in PHP (i don't have it) ... but that's what i might use
if i were forced to use PHP

the actual code is in newlisp:

<code>
(set 'p1 (regex-comp {[-\.\w]*@\w+(\.\w+)+})) ;compile
(map (lambda (f)
       (replace p1 (read-file f) (push $0 E) 0x10000))
     (directory "." "php|htm|html"))
(set 'E (unique E))
(save "email.db" 'E)
</code>

my newlisp code is not used for validation, it's for scraping emails from
~30000 html pages totalling 870 MB:

<pre>
~/dl $ find . | wc
  29546   29548 1423727
~/dl $ du -h
870M
</pre>

I used more complex regexes (slow) but got only marginal improvement (< 0.1%)
and a lot of noise

This code is pretty fast, almost an order of magnitude faster than a more
complex, complete email regex

[-\.\w]* means ... match - or \. or a word (a-zA-Z0-9) character, ad nauseam
(you can put more chars inside [] but the regex is slower)

i didn't use it for validation, so no anchoring needed ... but it's an easy
modification

anyway, this is the code i use in the real world - i won't pretend that it's
perfect for every corner case, but hey, it works fast for me :D short code
solving 90+% of cases
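A rough Python equivalent of that loop, for anyone who doesn't read newlisp (same loose pattern; the extensions are taken from the snippet, the rest is assumed):

```python
import os
import re

# Rough Python port of the newlisp snippet: walk a directory of saved
# pages, collect every match of the loose pattern, dedupe, return sorted.
pat = re.compile(r'[-.\w]*@\w+(?:\.\w+)+')

def scrape_emails(root='.'):
    found = set()
    for dirpath, _, filenames in os.walk(root):
        for fn in filenames:
            if fn.endswith(('.php', '.htm', '.html')):
                path = os.path.join(dirpath, fn)
                with open(path, encoding='utf-8', errors='replace') as f:
                    found.update(pat.findall(f.read()))
    return sorted(found)
```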

~~~
thwarted
I don't know about the lisp code (where you use \w), but \\\w matches a
backslash and a w, as the first \ escapes the second. I wasn't saying you were
using PHP, but I know the PHP docs were confusing on how to escape things
because backslash is overloaded for string escaping and for regular expression
escaping and character classes.

I can't see how a more complete (complex?) correct regular expression would
produce MORE noise in your output. The very fact that you allow the empty
string on the LHS means that you risk getting invalid addresses.

Continue to use that busted regular expression for parsing email addresses out
of these 30,000 HTML files and please remove all my email addresses from the
list you generate while you're at it.

~~~
hs
don't worry, the htmls are ad pages where people deliberately post phone
numbers and emails as contact info (public info)

the correct regex can indeed produce noise; consider this string in an ad
where the poster uses '+' to join info: "please contact the owner, her info:
4041234567+owner@gmail.com ... thx!"

the correct 'boilerplate' regex will happily take 4041234567+owner@gmail.com
... which is not what i want; thus, noise

the less pedantic regex i use will just take owner@gmail.com, which is what i
want

similarly with the "?" char (legal in the local part): the correct regex will
catch "sendIM?owner@gmail.com", which again is not what i want ...

the 'busted' regex will only take 'owner@gmail.com' ... which is what i want
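This behaviour is easy to reproduce in Python (single-backslash reading of the patterns; "strict" here just adds + to the class):

```python
import re

loose = re.compile(r'[-.\w]*@\w+(?:\.\w+)+')    # no + in the class
strict = re.compile(r'[-+.\w]*@\w+(?:\.\w+)+')  # + allowed (it's legal in a local part)

text = "please contact the owner, her info: 4041234567+owner@gmail.com ... thx!"

print(loose.search(text).group())   # owner@gmail.com
print(strict.search(text).group())  # 4041234567+owner@gmail.com
```

The loose pattern can't cross the '+', so the match restarts after it; the strict one swallows the phone number too.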

that's the main reason for the omission of +, ? and other legal chars in the
local part in my email regex; the second reason is speed

for email validation, where what you get is just a string of less than 100
chars, it's fine to use a complete email validation regex; however, for email
scraping of close to 1gb of data, boilerplate won't cut it

i'm willing to sacrifice completeness for speed

my regex doesn't allow empty strings; there's the metachar '\S' for
non-space, equivalent to [^ \t\n]

and i was wrong, '\w' is [a-zA-Z0-9_] ... i forgot '_' is included in the
metachar '\w'

thx, your criticism forced me to reopen my regex book

