Learn regular expressions in about 55 minutes (qntm.org)
277 points by melloclello on Mar 10, 2014 | 78 comments



You will learn about regular expressions in 55 minutes. To learn how to use regular expressions, however long that takes you, it might be better to:

Read the Wikipedia page, which has a great description of what regular expressions are and why you would use them: https://en.wikipedia.org/wiki/Regular_expression

Read the JS docs: https://developer.mozilla.org/en-US/docs/Web/JavaScript/Guid...

Test your regexes: http://regexpal.com/

Visualize your regexes: http://www.regexper.com/

Try some challenges: http://callumacrae.github.io/regex-tuesday/

And remember: writing regular expressions can be very difficult, reasoning about your regular expressions can be even more so, and defining your problem can be the most difficult of all. Think before you regex.


The most difficult part of regex isn't defining the problem. I'd say that one's easy. The hardest part is figuring out your regex months or years after you've written it!


Hopefully you at least wrap the regex in a function; then it's easier to reason about, given the function name. Regexes randomly placed in long functions are harder to grasp.


I comment them at the time of creation and add them to my snippets file with the comment.


extended regexes with comments (aka /x) should be the default and enforced behaviour in all newly designed languages.

Sadly, the only one that does is perl6.


One trick I've used that is a lifesaver when designing and reviewing regexes, and that makes them readable even for later programmers who know nothing about regex, is to chunk them into smaller bits with meaningful names.

For example, using C, file-local, pre-processor:

#define MATCH_NAME_RE "[a-zA-Z]+"

#define MATCH_SPACES_RE "\\s+"

#define CAPTURE_NUMBER_RE "(\\d+)"

And then:

   re = QRegex( MATCH_NAME_RE MATCH_SPACES_RE CAPTURE_NUMBER_RE );

It documents what each part does, makes the parts easier to debug and change later on, and makes them grokable. It only breaks down when you're doing very complex matching, but most of the time an RE can be broken into smaller pieces.
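
For readers who don't write C, the same chunking idea can be sketched in Python. This is only an illustration: the constant names and the helper `number_after_name` are made up here and mirror the macros above, they are not from the original comment.

```python
import re

# Hypothetical named pieces, mirroring the C macros in the comment above.
NAME = r"[a-zA-Z]+"
SPACES = r"\s+"
NUMBER = r"(\d+)"  # the only capturing group

# The full pattern reads as a sentence: a name, some spaces, a number.
NAME_THEN_NUMBER = re.compile(NAME + SPACES + NUMBER)


def number_after_name(text):
    """Return the digits that follow a name, or None if nothing matches."""
    match = NAME_THEN_NUMBER.search(text)
    return match.group(1) if match else None


print(number_after_name("alice   42"))
```

Raw strings (`r"..."`) avoid the doubled backslashes that C string literals need, and wrapping the compiled pattern in a named function gives later readers a second hint about its intent.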


Python has them:

    a = re.compile(r"""\d +  # the integral part
                       \.    # the decimal point
                       \d *  # some fractional digits""", re.X)
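
A quick sketch of how that verbose pattern behaves, plus one gotcha: in `re.X` mode unescaped whitespace in the pattern is ignored, so a literal space has to be written as `[ ]` or `\ `. The names `DECIMAL` and `TWO_WORDS` are placeholders chosen for this example.

```python
import re

# The pattern from the comment above, compiled in verbose mode.
DECIMAL = re.compile(r"""\d +  # the integral part
                         \.    # the decimal point
                         \d *  # some fractional digits""", re.X)

# A bare space would be ignored in verbose mode, so the literal space
# between the two words is spelled as a one-character class.
TWO_WORDS = re.compile(r"""\w+   # first word
                           [ ]   # exactly one literal space
                           \w+   # second word""", re.X)

print(bool(DECIMAL.fullmatch("3.14")))
print(bool(TWO_WORDS.fullmatch("hello world")))
```

Note that `\d +` here means `\d+`: the space between the token and its quantifier is dropped along with the comments.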


CoffeeScript has similar functionality. From the CoffeeScript docs:

    OPERATOR = /// ^ (
      ?: [-=]>             # function
       | [-+*/%<>&|^!?=]=  # compound assign / compare
       | >>>=?             # zero-fill right shift
       | ([-+:])\1         # doubles
       | ([&|<>])\2=?      # logic / shift
       | \?\.              # soak access
       | \.{2,3}           # range or splat
    ) ///

When turned into JavaScript it becomes:

    var OPERATOR;
    
    OPERATOR = /^(?:[-=]>|[-+*\/%<>&|^!?=]=|>>>=?|([-+:])\1|([&|<>])\2=?|\?\.|\.{2,3})/;


yes, any decent PL nowadays has it, my point was that it should be the default.


Good list of links.

I had a look through the regex tuesday list. It looks like fun for people who know their way around regexes, but if you're just learning I'd recommend starting with something simpler.

These look like a more reasonable starting level: http://regexone.com/


I'd recommend perlretut as well: http://perldoc.perl.org/perlretut.html

It goes into more detail about some things that were glossed over in the OP, which may be overwhelming for a newbie but necessary for advancing.


Looks like an interesting set of examples. I couldn't run the first one in http://perldoc.perl.org/perlretut.html#Named-backreferences (it seems to need parens and = assignment).


http://debuggex.com/ is another fantastic resource for visualising regexes.


The tutorial I learned regular expressions from, which is longer but more detailed than this one, was http://www.regular-expressions.info/. It’s free, thorough and well-organized. Its only flaw is that its section on support in various languages is out of date. It was written to sell a Windows-only regex tool, but it’s very non-pushy with the advertisements.

You can see a list of online regular expression testers for various languages at the Stack Overflow “regex” tag wiki, in the “Online sandboxes” section: http://stackoverflow.com/tags/regex/info. For JavaScript regexes, http://regexpal.com/ is easy to use.


Another excellent resource, which I believe was highlighted a few years ago here on HN: http://regexone.com/


For a non-online version of checking and analysing the structure of regexes, see http://www.weitz.de/regex-coach/

It's excellent, and together with regular-expressions.info it's a way to learn regexes that at least worked for me.


regular-expressions.info has been a good reference site for me. Another useful tool is an online regex checker. Out of a handful or so options, I find this the best : http://regex101.com/?


Yeah, I mainly learned from this one, too


Alternative titles:

"How to make your code unreadable in 55 minutes"

"How to make your code hard to debug in 55 minutes"

"How to introduce bugs by copying a regex that you don't understand and neither did the author"


That's a stupid reply. Regular expressions are very useful.


Regular expressions are useful, but on a much smaller set of problems than people typically use them for.

For example Chrome's script skipper in the debug tools completely unnecessarily uses regexes rather than a simple list of 'contains'.

It's definitely one of the worst "everything's a nail" hammers a lot of programmers wield. And quite often they've been holding it upside down for years without even realizing it.


Another situation where regexes are possibly abused - syntax highlighting. Yes, there are a lot of languages where you can sanely highlight /most/ code with regexes, but there's a non-zero number where you need a little more context, such as that provided by an AST (which would also help with autocompletion anyway if the highlighting is being done in the context of an editor).


> For example Chrome's script skipper in the debug tools completely unnecessarily uses regexes rather than a simple list of 'contains'.

That is your example of overkill? More like, exactly the sort of scenario where you'd want to be able to specify a regexp.


I'm consistently infuriated every time a search box in a GUI lacks regex support (which is very nearly all of the time).


Maybe the next fight on HN about the interview process should be whether we ought to expect the candidate to know about regexps.

I know them, so of course every good developer should know them. (My proof: I am a good developer, and I know them, therefore good developers should know them!)

(NB: In case it's not obvious this comment is slightly tongue-in-cheek.)


For the author: I've been learning regex on and off for years. I always come back to them, and I'm always annoyed that I forget them if I don't use them for a long time.

I mostly use cheat sheets now, but still, your tutorial was the best I ever read, and I actually read it. It reads better than a "learnXinYminutes.com" and I actually learned some new stuff.

Notes:

validating email addresses: most MVC frameworks have a function that does that. I know that PHP has that natively as well.


Handy. This does gloss over some of the notable differences between implementations (not everything has non-greedy matches or identical {m,n} or {m,} syntax), but it's still by far the best tutorial introduction I've seen for regular expressions.


Would you mind linking the best tutorial introduction that you've seen for regular expressions?

I found the parent link incredibly useful because it listed common, useful expressions and examples of what you do (or don't) get when you use them.

I don't think of it as a tutorial at all since it wasn't really -- more of a reference or a quick guide.


In my comment above, I said this is the best tutorial introduction I've seen, even with its few limitations regarding differences between regex implementations.


Whoops! Sorry for misreading that.

I completely misread "Handy" as "Hardly" and then saw "Hardly. This does gloss over some of the notable differences . . ." which changed the entire tone of your comment for me.

My bad.


Here's a regex tutorial http://www.grymoire.com/Unix/Regular.html There are also tutorials on sed and awk as well


> \w means the same as [0-9A-Za-z_]

I've got news! There are some characters outside ASCII


To be fair, this depends on the implementation. By default, Java, JavaScript and PCRE will only match ASCII characters with \w.
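
Python 3 is an example of an implementation that goes the other way: for `str` patterns, `\w` is Unicode-aware unless you pass `re.ASCII`. A small sketch (the sample word is just an illustration):

```python
import re

word = "na\u00efve"  # "naive" spelled with a precomposed i-diaeresis (U+00EF)

# Default for str patterns in Python 3: \w also matches non-ASCII letters.
unicode_match = re.fullmatch(r"\w+", word)

# With re.ASCII, \w means the same as [a-zA-Z0-9_].
ascii_match = re.fullmatch(r"\w+", word, re.ASCII)
ascii_pieces = re.findall(r"\w+", word, re.ASCII)

print(bool(unicode_match), bool(ascii_match), ascii_pieces)
# prints: True False ['na', 've']
```

So the tutorial's definition of `\w` holds for some engines and modes, and it is worth checking the documentation of the one you are actually using.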


I found http://regexcrossword.com/ to be a good learning tool.

Here's one if you really want to test your skills: http://i.imgur.com/qLh2gcK.jpg


SPOILER ALERT: answer to regex crossword referenced by the 2nd link: http://www.mit.edu/~puzzle/2013/coinheist.com/rubik/a_regula...


Really like this idea!

Off topic, but, I don't see why people don't make alternatives to this

> Notice! In order to save your progress you have to login with Facebook.

Let me save by cookie as well, please.


i'm not sure if this puzzle is meant to be a joke, but it pretty much looks like it.

things like (O|RM|HHM)* which means match anything. who the hell would ever write (.)(.)(.) when even capturing those is pointless in a crossword.

meh, i say.


I'm not entirely sure what you are saying, but I actually find this puzzle quite interesting.

(O|RM|HHM)* doesn't match "anything", but rather "nothing" or something quite specific. [0]

There are multiple rows in the puzzle where things _can_ match anything, but they are pretty quickly limited by the opposing row/column.

[0]: http://www.regexper.com/#%28O|RM|HHM%29*


yeah sorry, wrote that in a hurry. i meant, if it can match nothing and still be valid, what's the point? you may as well write `.*`


Well, remember that it also has to fit in the row/column. If there's three spaces available, the string has to have three characters. So (O|RM|HHM)* can only really match ORM, RMO, and HHM.


And OOO.


Forgot about that one.
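
The list of three-letter fits can be checked by brute force. This is a rough sketch, assuming the row only ever contains letters that appear in the pattern; the helper name `fits` is made up for the example.

```python
import re
from itertools import product

PATTERN = re.compile(r"(O|RM|HHM)*")
ALPHABET = "HMOR"  # assumption: only letters used by the pattern itself


def fits(length):
    """Return every string of the given length that the whole pattern matches."""
    candidates = ("".join(letters) for letters in product(ALPHABET, repeat=length))
    return sorted(c for c in candidates if PATTERN.fullmatch(c))


print(fits(3))
# prints: ['HHM', 'OOO', 'ORM', 'RMO']
```

`fullmatch` matters here: with `search` or `match`, the empty repetition would succeed on every candidate, which is the "match anything" behaviour complained about earlier in the thread.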


The time I finally _really_ learned regex was the time I _really_ needed them for a project that was beyond something I could stack overflow. I think having some problem in front of oneself, and testing over a ton of input data is a great way to learn.


I find regular expressions intensely annoying. I come across situations where they are a perfect fit, but, as I've never got around to learning the syntax I spend ages looking for a perfect, or sometimes not so perfect, answer on SO that gives me the arrangement. More often than not, in order to tweak the answer to make it fit I find myself in the position of having to learn the syntax in order to do so. Catch 22.

It seems there are no shortcuts and you can't practice cut and paste.

Note to self. Learn the syntax.


I had to learn RegEx because of Google Analytics. The folks from LunaMetrics wrote a handy guide - http://www.lunametrics.com/regex-book/Regular-Expressions-Go...

It's still just basic RegEx, but if anyone is looking for stuff with more context related to GA, this is the one for you.


This is awesome. It would be even awesomer if it came with some practice problems after each section!


Yes, you are right. Learning needs interaction with the material.

Any experts here want to set an exercise for each section? We can argue about the best solution in true HN style, and the result will be a good resource.


Done.


To really know and understand regular expressions, write a regular expression pattern matcher.
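
As a starting point for that exercise, here is a minimal sketch in Python, modelled on the small backtracking matcher Rob Pike wrote for "The Practice of Programming". It only supports literal characters, `.`, `*`, `^` and `$`; everything else (classes, groups, alternation) is left out, and the function names are this sketch's own.

```python
def match(pattern, text):
    """Return True if pattern matches anywhere in text."""
    if pattern.startswith("^"):
        return match_here(pattern[1:], text)
    # Try every starting position, including the empty suffix.
    for start in range(len(text) + 1):
        if match_here(pattern, text[start:]):
            return True
    return False


def match_here(pattern, text):
    """Return True if pattern matches at the very beginning of text."""
    if not pattern:
        return True
    if len(pattern) >= 2 and pattern[1] == "*":
        return match_star(pattern[0], pattern[2:], text)
    if pattern == "$":
        return text == ""
    if text and pattern[0] in (".", text[0]):
        return match_here(pattern[1:], text[1:])
    return False


def match_star(char, pattern, text):
    """Match zero or more copies of char, then the rest of pattern."""
    i = 0
    while True:
        if match_here(pattern, text[i:]):
            return True
        if i < len(text) and char in (".", text[i]):
            i += 1
        else:
            return False


print(match("a*b", "aaab"), match("^ab$", "abc"))
# prints: True False
```

This version tries the shortest repetition first and backtracks by recursion, which is easy to read but can be slow on adversarial patterns; adding `+`, `?` or character classes is a reasonable next step.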


I have problems with linux and escape characters.

grep '".*\?"' test.txt

"test1" test2 "test3"

This matches the entire string, while the article says that ? should force it to match as little as possible. (All of "test1", test2 and "test3" are highlighted in red.) It works correctly on http://regexpal.com/. What am I doing wrong?


This is when you learn that not all implementations are the same, and realize it's better to read the documentation for your particular implementation.

I'm guessing your grep defaults to POSIX EREs, but always check the manual.

  The behavior of multiple adjacent duplication symbols ( '+', '*', '?', and intervals) produces undefined results.
http://pubs.opengroup.org/onlinepubs/9699919799/basedefs/V1_...


You don't need the \ (because it's in single quotes), and as mentioned by clarry all regex implementations are slightly different.

   grep '"[^"]*"' test.txt

should be about the same, and not reliant on less-standard features.
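
To see the difference outside grep, here is a small sketch using Python's `re` module, which does support the lazy `*?` quantifier. The sample line is the one from the question above.

```python
import re

line = '"test1" test2 "test3"'

# Greedy: .* runs to the last quote on the line.
greedy = re.findall(r'".*"', line)

# Lazy: .*? stops at the first closing quote it can.
lazy = re.findall(r'".*?"', line)

# Negated class: no laziness needed, so it is portable to engines
# that lack *? altogether.
negated = re.findall(r'"[^"]*"', line)

print(greedy)
print(lazy)
print(negated)
```

The lazy and negated-class versions both return `"test1"` and `"test3"` as separate matches here, while the greedy one returns the whole line as a single match.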


I don't particularly need a regex course, but I have to say, this site is so nicely designed, so easy for the eye.

A good way to practice regex is to use one of the many online regex tools to validate your understanding. My current favorite is http://www.gethifi.com/tools/regex# because it shows the group.


That's how I "learned" regex.

https://developer.mozilla.org/en-US/docs/Web/JavaScript/Guid... and https://www.debuggex.com and lots of patience :)


RegEx 15 minutes to learn, a lifetime to master.


Where mastering hopefully means to just avoid them whenever possible.


Why is it that I find it much easier to just build a lexer/parser than I do using regex? Something about my brain, every time I see them I usually do everything in my power to not have to try to understand them.

On the other hand, I've worked with some regex ninjas who appear to just innately get it.

Maybe it's like cilantro, separating us by genetics.


Well, I think you're on the "good" side of the discussion. Sometimes when people learn regular expressions, they try to solve everything with them, but for any medium-sized problem or DSL, writing a custom parser is a better option.


IIRC yacc's documentation advocates its use for tasks where people would go regex only. I guess in the old days yacc got more light, now it's an obscure beast for compiler writers ... not F5 to refresh the webpage things.


I love the summary at the bottom. Concise and to the point... He could have written how to learn regex in 55 seconds.


I do some online training - I'll refer people over to this for a good refresher/intro to regex. Thanks.



That was really concise and well written. I particularly appreciated the clear warnings and subtle gripes about the confusing parts of the syntax.


Might be nice to include the %r{} style in the "Excessive backslash syndrome" section.


The most important things to learn about regex are greed and boundaries.

Nobody cares about 'cat'; for that you use a basic replace function in your preferred language. So when you learn about masks and patterns, you need to understand when to start and when to stop.

Finding everything between quotes ".*" is not gonna give you what you expect.


I'm still waiting for "Reg-ex considered harmful".



We will need more than a funny sig before the regex madness is ended ! ;-)


"...in about 55 minutes"

about 55 minutes, meaning +/- 30 years.


Wish I had this resource when I had to learn regex.


"...and now you have two problems."


This is a very nice overview, but knowing the rules of chess doesn't make you a chess master.


Agreed, but 55 minutes isn't going to make anyone a master of anything. The title is honest.


I love this!


i've seen many regex sites, this one is DEFINITELY my favorite

i especially loved the page - loaded SUPER fast, no distractions - GREAT job!

love it!


Commenting to also save some of the other recs HNers have provided.


This is not reddit so you get downvoted for comments like this because you are not contributing anything to the discussion. In the future you can upvote the story and then click on your profile and there is a link to saved submissions. Any submission you upvote will be saved in the saved submission list. Furthermore on reddit you can just click the "save" link to save things. I never understood why it was acceptable to post comments like this at reddit but that is just one of many why does reddit put up with X questions.


Am not a reddit user, nor did I know this -- but thanks (now I know).



