
Learn Regex The Hard Way  - karlzt
http://regex.learncodethehardway.org/?
======
jdthomas
Not to sound like a pretentious CS guy ... but how can you have a book on
regular expressions without even mentioning DFA/NFA? Maybe it's just me, but
(back in my day) I found learning about finite automata made regex just click.

Edit: _ahh, I see now that the intro says "I'm going to be very practical
and straight forward about it. No NFA to DFA conversion. No crazy explanations
of push down finite state automata." :) I still argue that thinking of regex
as state machines is a very useful tool._
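
To make the state-machine view concrete, here is a minimal sketch of a hand-written DFA for the regex `ab*c`. The pattern, the state numbering, and the `dfa_accepts` helper are illustrative assumptions, not anything taken from the book; the sketch only checks that the table-driven machine agrees with Python's `re` module on a few inputs.

```python
import re

# Hypothetical DFA for the regex ab*c.
# State 0 = start, state 1 = seen 'a' plus any number of 'b', state 2 = accept.
TRANSITIONS = {
    (0, "a"): 1,
    (1, "b"): 1,
    (1, "c"): 2,
}
ACCEPTING = {2}


def dfa_accepts(text):
    """Run the DFA over text; a missing transition means the input is rejected."""
    state = 0
    for ch in text:
        state = TRANSITIONS.get((state, ch))
        if state is None:
            return False
    return state in ACCEPTING


for sample in ["ac", "abbbc", "abca", "bc", ""]:
    # The hand-written machine and the regex engine should agree on every input.
    assert dfa_accepts(sample) == bool(re.fullmatch(r"ab*c", sample))
    print(repr(sample), dfa_accepts(sample))
```

Each character consumes exactly one transition, which is why a DFA-style matcher never backtracks.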

~~~
mekoka
I'm not a CS graduate. I first learned of state machines from an article posted
here on HN a few months ago.

I learned regular expressions from the regular-expressions.info site and from
the first 4 chapters of Jeffrey Friedl's book over 5 years ago. I don't recall
an introduction to NFA/DFA from them either. I'm pretty comfortable with regex
and use them regularly (no pun intended) and so far, I don't feel that I've
missed much.

I'm not disagreeing with your view, but I think it's worth pointing out, as
someone who developed an understanding of regex from a different
perspective, that NFA/DFA might not be such a strong prerequisite and people
should not feel like they're going in at a disadvantage.

~~~
spp
Chapter 4 of Jeffrey Friedl's "Mastering Regular Expressions" goes in depth
about NFA/POSIX NFA/DFA. It even has car analogies!

~~~
mekoka
It must have been the first 3 chapters then. At the time, they were enough for
me to understand and "get going" with regex. Various other online resources
have filled little gaps here and there over time. This thread makes me want to
revisit the book and resume my reading of the next 3 chapters, a long overdue
promise to myself. I'm glad that the book will cover the topic of FA then.

------
scrame
That's pretty awesome. Regex's the tool (vs. the concept of "regular
expressions" that comes up in interpreters/compilers) is a supremely useful
utility when used sparingly, and surprisingly portable between languages.

Regex's are in a (not-)sweet spot of practical languages that aren't taught in
school. For all the hype of ruby, et al. being able to make "DSL's", it's
surprising how mystical an _actual_ DSL that is portable between languages
(through PCRE) is held to be.

Unfortunately, like many other tools, it can only be used sparingly after a lot
of practice writing crappy regex's, and having to eat your own dogfood (so you
learn it's not an "ultra-fast parser", or that you shouldn't chain
substitutions on user-generated input).

A rote tutorial -- in the style of LPTHW -- is excellent.

~~~
itmag
What are some other examples of DSL's that are useful in the same way as
regexes?

I am thinking LINQ might qualify. Also, JQuery is strictly not a DSL, but it
kinda feels DSLy.

~~~
lmkg
Personally, I classify any sort of sequence comprehension as a DSL, including
LINQ and python's list and generator comprehensions. JQuery is basically as
close to a LINQ-like DSL as you can get in a language like JavaScript that
doesn't have metaprogramming support. Related to sequence comprehensions,
there is Common Lisp's controversial LOOP macro, which is basically an
iteration comprehension DSL.

I'm somewhat on the fence about whether format control strings qualify as
DSL's. They don't as much, but they tend to have very complicated and
specialized syntax.

I feel that Monads are a DSL in Haskell, but whenever I see "Monads in Blub!"
I don't feel the same way about them. Part of it is probably that Haskell
provides not only infix operators, but do-notation for supporting monads. Part
of it is also that monads don't do as much in other languages, because their
type systems aren't rich enough to fully express them, nor restrictive enough
that their capabilities are useful.

~~~
itmag
A DSL doesn't have to be Turing-complete IIRC so format strings could probably
qualify.

------
delwin
What do you guys think of Zed Shaw's teaching style? As I'm not a beginner
programmer, I can't really evaluate correctly; to me it seems like an odd
approach to programming pedagogy, but perhaps it works. Any non-programmers
want to shed light on why his style works?

~~~
AgentConundrum
I'm about 75% through his Python book, and I have to say I like it. He assumes
you're an absolute beginner, and I'm not, so I can't really give it a fair
appraisal from the perspective of its target audience.

That said, the book definitely makes learning fun. For the first half of the
book, you have a really quick learn-reward cycle. The exercises are short and
to the point, and he makes you type in the lesson and run it. This means that
you type a few lines, then immediately get to see what it did, on your
computer. It very much reminds me of when I was in grade 10 and we were doing
simple stuff in QBasic. It would definitely grab and keep the interest of an
absolute beginner dipping his or her toes into the programming pool for the
first time.

From the perspective of an experienced programmer - a term I apply to myself
with some trepidation - things move a bit slow, but the experienced programmer
is definitely not Zed's target with the book. Even being on the slow side, the
exercises are quick enough that you can knock out a bunch of lessons quickly
if you're so inclined, and I've definitely skipped over a bunch of the "Extra
Credit" stuff because of that.

I actually did spend the "recommended" week doing Exercise 37, which is sort
of a "learn at your own pace" exercise. You're given a list of operators,
string formats, keywords, etc. and asked to spend some time looking into them,
defining them for yourself, and playing with them. I spent a good amount of
time reading into the way Python's floor/integer division and modulo
operations work / differ from other languages and why (if you're interested,
Python floors towards negative infinity, not zero, and performing a modulo on
negative numbers takes the sign of the divisor, not the dividend, in contrast
to some other languages), and spent time toying around with decorators and
lambdas. I kept a spreadsheet of everything I was supposed to learn and my
description of it. My interpretation of that lesson is probably a lot
different from a beginner's, but I think it's a nice example of "you get out of
it what you put in."
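
The division and modulo behavior described above can be checked directly in an interpreter. This is a small sketch with arbitrarily chosen operands (`-7`, `2`, `-2`); `math.fmod` is used only as a stand-in for the C-style convention that some other languages follow.

```python
import math

# Floor division rounds toward negative infinity, not toward zero.
floored = -7 // 2          # -4
truncated = int(-7 / 2)    # -3, truncation toward zero for comparison

# The result of % takes the sign of the divisor.
mod_pos_divisor = -7 % 2   # 1
mod_neg_divisor = 7 % -2   # -1

# math.fmod follows the C convention: the result takes the sign of the dividend.
c_style = math.fmod(-7, 2)  # -1.0

# The two operators stay consistent with each other: a == (a // b) * b + (a % b)
q, r = divmod(-7, 2)
assert q * 2 + r == -7

print(floored, truncated, mod_pos_divisor, mod_neg_divisor, c_style)
```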

Zed also likes to shove you into the deep end and let you learn to swim. The
extra credit assignments don't have answers, they're all about making you play
around and learn on your own terms. The lesson above is a good example of
"here's a bunch of terms, now go play on your own for a while", which was an
interesting change of pace. Also, he gives a solid emphasis on reading other
people's code. There have been at least two lessons so far dealing with that,
and I've read over a bunch of Reddit's code and learned a good amount from that
- this was during the first "go find some code" exercise, where you only know
the bare minimum of conditionals and looping, so there were a lot more scary
"i don't know what that code is doing, but I'm going to find out!" moments,
which I found helpful.

~~~
zedshaw
> It very much reminds me of when I was in grade 10 and we were doing simple
> stuff in QBasic.

That's pretty much the feel I was going for when I wrote it.

~~~
AgentConundrum
It really shows. A friend of mine asked me about it, and my abbreviated review
was just "it makes programming fun again."

I hesitated to use that in my earlier comment because some might use it to
paint a bad picture of me, contrasting it with the idea that all good
programmers love their craft and are always writing code because it's always
fun for them. I just think that as you get into meatier works and start
architecting larger applications, there's always going to be some hair pulling
and you're going to fight wars with your compiler/interpreter/whatever. LPTHW
isn't like that. You just sort of cut through the bullshit and get on with the
fun stuff, and it makes the learning process easier.

So thanks. I look forward to reading your other works as they finish.

------
mturmon
The regex shell is a neat idea.

In many situations I'd like to generate strings accepted by the regex, to
check that my regex is tight enough.

I know this is hard in general (for lots of reasons, basically all reducing to
"there are a lot of strings and no easy way to iterate through them in a
satisfying way").

But, has anyone made any progress on this for the "easy" cases?
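
For the easy cases, one brute-force approach is to enumerate every short string over a small alphabet and keep the ones the pattern accepts. The sketch below assumes that approach; `accepted_strings`, the alphabet `"0A-"`, and the length limit are all made up for illustration, and the cost grows exponentially with the length limit.

```python
import itertools
import re


def accepted_strings(pattern, alphabet, max_len):
    """Yield each string over alphabet, up to max_len characters, that the pattern fully matches."""
    compiled = re.compile(pattern)
    for length in range(max_len + 1):
        for chars in itertools.product(alphabet, repeat=length):
            candidate = "".join(chars)
            if compiled.fullmatch(candidate):
                yield candidate


# One digit, one letter, and one character the pattern should reject.
print(list(accepted_strings(r"[0-9]+|[A-Z]+", "0A-", 2)))
# ['0', 'A', '00', 'AA']
```

Swapping `fullmatch` for whatever matching mode the real application uses would show what that mode lets through.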

~~~
scorpion032
Your wish has been granted: <http://txt2re.com/>

I use this all the time. No wonder I don't know how to hand-write a regex.

~~~
haraball
Here's another tool I use when testing my regex: <http://regexpal.com/>

Nice tool for testing the regex on a given data set.

------
leeoniya
just decided to randomly click a chapter. in 12.1 wouldn't

^[0-9]+|[A-Z]+$ in fact need to be written as

^([0-9]+|[A-Z]+)$ or non-capturing

^(?:[0-9]+|[A-Z]+)$

i doubt the intent was to alternate NL/EOL assertion. seems like a novice
oversight and does not instill confidence in the rest of the material.

or am i missing something?
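
A quick way to see the difference is to run both patterns under search semantics. This sketch uses Python's `re` module and some made-up test strings; it is not taken from the book or from Regetron.

```python
import re

# Without grouping, the pattern reads as "(^[0-9]+) or ([A-Z]+$)".
ungrouped = re.compile(r"^[0-9]+|[A-Z]+$")
# With a non-capturing group, both anchors apply to both alternatives.
grouped = re.compile(r"^(?:[0-9]+|[A-Z]+)$")

samples = ["123", "ABC", "123abc", "abcXYZ", "12!<>AB"]
results = {
    text: (bool(ungrouped.search(text)), bool(grouped.search(text)))
    for text in samples
}

for text, (loose, strict) in results.items():
    print(f"{text!r}: ungrouped={loose} grouped={strict}")
```

The ungrouped pattern accepts all five samples, while the grouped one accepts only `123` and `ABC`.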

~~~
zedshaw
Hmm, well the book is being written so there's potential for some errors, but
I believe you are wrong here. You're confusing match with search semantics.
Your above works because by default Regetron searches. If you turn on match it
doesn't match your proposed test below. Try this:

<http://codepad.org/uPpwKPJS>

Notice when you turn on !match it doesn't find your test.

Also, keep in mind that this is just introducing the concept of alternating.
Captures are covered later.
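
The match versus search distinction can be sketched with Python's `re` module, assuming Regetron's match mode behaves like `re.match` (that mapping is an assumption here, and the test strings are made up). `re.match` only anchors the start of the string, so it rejects one of the problem inputs but not the other.

```python
import re

pattern = re.compile(r"^[0-9]+|[A-Z]+$")

outcomes = {}
for text in ["abcXYZ", "123abc"]:
    outcomes[text] = (
        bool(pattern.search(text)),     # may start matching anywhere
        bool(pattern.match(text)),      # must start matching at position 0
        bool(pattern.fullmatch(text)),  # must consume the whole string
    )
    print(text, outcomes[text])
```

Under `re.match`, `abcXYZ` is rejected but `123abc` still gets through via the first alternative; only `fullmatch` (or explicit grouping) rejects both.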

~~~
spicyj
I agree with the GP --

<http://codepad.org/YmsaEtJS>

~~~
zedshaw
Great, another example of regex engines not doing anything you tell them. I'll
look at changing it, but () isn't going to work there; it has to be taught
later.

~~~
leeoniya
the solution seems pretty easy. simply leave out the beginning/end-of-line
assertions. [0-9]+|[A-Z]+ would suffice for that example. i agree that
introducing capture groups IS too early, but you can add the parentheses
without mentioning capture groups, even if they do capture. grouping still has
great value as a precedence indicator which is taught in grade-school
arithmetic.

it's rather unfortunate that the syntax to capture is "(" while the less
complex no-capture is "(?:", i'm not sure who thought that through, but here
we are.

~~~
zedshaw
Yes, I _hate_ that (?:) syntax. Who the hell thought that crap up.

But, I will point out that you attributed the error to NL/EOL assertion, when
actually it was order of precedence: | binds more loosely than $ and ^. It's a
simple nearly 1-2 character mistake, not a "novice" mistake that discredits
the entire book.

~~~
leeoniya
yeah i apologize, perhaps i was a bit harsh. i _did_ understand what the issue
was and never claimed that it was the newline/EOL assertion itself. i said
that your newline/EOL assertion is being treated as PART of the alternation,
which is exactly what happens - yes because of pipe's lower operator
precedence when no explicit grouping is defined.

novice or not though, 2 misplaced chars in a regex can make the diff between a
security feature and a security hole, in this case matching a plethora of
inject-able, potentially malicious characters which appear nowhere in the
regex, not even as a wildcard ".", i would never make the claim that a 2-char
mistake which matches this far beyond its intent is a minor oversight in a
real-world, public facing application - and isn't that the ultimate goal?

the only issue in the context of a book for a regex beginner is that the regex
gives the appearance of newline/EOL assertion hugging an alternation, but does
something quite different. removing those assertions would clear things up for
the better.

------
suivix
The (?i) is a confusing part of that example expression, and I don't see why
it's necessary.

~~~
zedshaw
Which one? That might be an error.

~~~
rcthompson
I think it's the one on the "cover page" (the one linked to from HN).

