Hacker News
Learn Regex the Easy Way (github.com)
323 points by shubhamjain 6 months ago | 66 comments



My biggest issue with regular expressions is remembering the exact syntax of each of the different common regex engines: JavaScript, Perl, `grep`, `grep -E`, vim, `awk`, `sed`, etc.

Each one seems to have slightly different syntax, requires different characters to be escaped, and has different defaults (is global search enabled by default? Multiline? What about case sensitivity?). Some don't support certain lookarounds, grouping works differently, and so on.
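
To make the "different defaults" point concrete, here is a minimal sketch of how one engine, Python's `re` module, answers those three questions. The sample text and variable names are made up for illustration, and other engines will answer differently.

```python
import re

text = "Error: disk full\nerror: retrying"

# Python's re module is case-sensitive unless re.IGNORECASE is passed.
case_sensitive = re.findall(r"error", text)
case_insensitive = re.findall(r"error", text, flags=re.IGNORECASE)

# Without re.MULTILINE, ^ anchors only at the start of the whole string.
single_line = re.findall(r"^\w+", text)
multi_line = re.findall(r"^\w+", text, flags=re.MULTILINE)

# There is no "global" flag: re.search returns the first match,
# while re.findall returns every non-overlapping match.
first_only = re.search(r"\w+:", text).group(0)

print(case_sensitive, case_insensitive, single_line, multi_line, first_only)
```

In JavaScript the same behaviors are selected with the `i`, `m`, and `g` flags instead, which is exactly the kind of difference that is hard to keep in your head.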


Completely agree. Emacs too.

It's a real pain, especially when you want a quick one-off regex. In that case the learning curve changes the economics of what the right tool is for the job. Often I'll just end up writing a program using a tool I already know, even though I'm aware that it's a less efficient choice—at least its inefficiency is predictable. Of course if you do that enough times then you've spent more than the original learning curve would have cost! I have done this in painfully many contexts. Regexes are an obvious case, probably because they're so obviously doing the same thing, just differently enough to waste your time.

Some people actually like searching for and paging through documentation to learn how, e.g., regex format X does character escaping. And they tend to remember such things, too. I don't and don't.


Relevant XKCD: http://www.xkcd.com/1205

This is a very big issue and one I recently faced in the workplace. Needed a script to parse some log files to generate CSV reports for the business users. I knew jq (https://stedolan.github.io/jq/) and hence was able to write less than 10 lines of jq to do it combined with some preprocessing using sed.

I then realised that NOBODY on the team knew jq apart from me and I had to rewrite it in Python which took me 4 days to do correctly and handle everything that jq did for me.


There are three types of regex as far as I know: basic (aka GNU), extended and Perl. Grep uses GNU as the name implies. egrep or grep -E uses extended. Perl is used elsewhere, like JavaScript. Typically you'll see PCRE, which is the library for Perl Compatible Regular Expressions.


The name "grep" is unrelated to GNU. According to The Jargon File, the etymology is:

from the qed/ed editor idiom g/re/p, where re stands for a regular expression, to Globally search for the Regular Expression and Print the lines containing matches to it

In fact, grep predates the GNU project by almost a decade.

Also, it's the first time I hear about "GNU" regexps.


Thanks, just learned something new. grep predates GNU by about a decade it seems.


a more accurate term would be "BRE with GNU extensions", e.g. \(\)


\(\) are part of BRE syntax. They are not GNU extensions.

Source: http://pubs.opengroup.org/onlinepubs/7908799/xbd/re.html#tag...


er, I meant \{\}.


\{n\}, \{n,\} and \{n,m\} are all in POSIX.

\{,m\} is a (pretty obscure) GNU extension.


hm, I could've sworn it wasn't. http://www.regular-expressions.info/gnu.html says that \?, \+, and \| are GNU extensions though.


There are many differences between "compatible" regular expressions. For (many) examples, read the end of the section "Important Notes About Lookbehind" on [1], or compare 24 dialects on [2].

[1] http://www.regular-expressions.info/lookaround.html [2] http://www.regular-expressions.info/refadv.html


Erlang requires doubling up on escape backslashes, which generally causes me grief even after I flail about and finally remember it. Subtleties abound.


Another strange one is the (old) Visual Studio syntax with its :i (actually often quite helpful) and $1 instead of \1 (even in the new syntax).

https://msdn.microsoft.com/en-us/library/2k3te2cs(v=vs.110)....


I wonder if there is a tool that lets you enter a regular expression from one engine and get back the equivalent regular expressions for other engines?

Some of the look arounds might be too powerful to do in some engines so it would not always work, but it would still be quite useful if it just handled the differences in escaping, specifying modes like case insensitivity or multiline matching, and referencing match groups in replacements.


Not exactly, but that's a(n internal) feature of my Strukt[1]. It's how the optimizer is able to take any of {exact string, glob, ICU regex} as user input, and convert them into efficient queries for various data sources.

For example, if you do ListFiles[folder=~/Downloads] -> FilterString[field=filename, regex=\.pdf$], it will parse the regex, verify that it only uses features which are available in Spotlight's query syntax (roughly, a glob with slightly funny syntax), and rewrite that adjacent pair of operations into a single API call behind the scenes:

    kMDItemFSName LIKE[c] "*.pdf"

I can report, having written parsers for a few common regex dialects, that there are a ton of obscure regex features, with different semantics everywhere. 100% conversion is almost never possible.

A lot of the work in deciding if a translation is possible is in identifying if something is good enough, e.g., SQLite can't do case-insensitive search (in general), but if your regex happens to be /([0-9]+)/ then case-sensitive search will work just fine. Fortunately, for Strukt, if a conversion is impossible, I can just run the operations as written: it's much slower, but still correct.

I've thought about breaking this part of Strukt off into a tool just for regex editing, but that always seemed rather esoteric. Would there be any use or demand for that, do you think?

[1]: https://freerobotcollective.com


That's an interesting idea - interesting enough that it would be fun to at least throw together a little proof-of-concept...


Also replacement (sed, emacs, idea, …), is it $1, is it \1, is it something else, does it support named groups?
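
As one data point, here is a small sketch of how Python's `re.sub` answers those questions. The date string and variable names are only illustrative, and sed, Emacs, and IDE search boxes each have their own rules.

```python
import re

line = "2018-03-14"

# Numbered backreferences in the replacement use \1, \2, ... (not $1).
swapped = re.sub(r"(\d{4})-(\d{2})-(\d{2})", r"\3/\2/\1", line)

# Named groups: (?P<name>...) in the pattern, \g<name> in the replacement.
named = re.sub(
    r"(?P<year>\d{4})-(?P<month>\d{2})-(?P<day>\d{2})",
    r"\g<day>/\g<month>/\g<year>",
    line,
)

# "$1" has no special meaning in a Python replacement; it is inserted literally.
literal_dollar = re.sub(r"(\d{4})", "$1", line)

print(swapped, named, literal_dollar)
```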


    >  remembering the exact syntax of each of the different common regex engines
I have the same problem. There's no way around those little differences other than to suffer them. It simply isn't feasible to remember all the nuances of the different flavors.

I just know one really well and rely on getting it wrong for the others: guess, test, revise, repeat.


Try http://www.regexplanet.com/ for testing. It supports a bunch of different regex engines & switching between them.


My go to resource for this issue is:

* http://www.regular-expressions.info/tools.html

It covers the nooks and crannies of pretty much every major flavor and variant, prepared as an easy reference.


I wanted to write that exact same first sentence (what a weird experience). :D :D

Another thing is escaping: when you write sed regexs in a bash script you have to use slightly different escaping sequences sometimes...


That's one of my pet peeves too. Another is that I only touch regex once every few months. Once you have a regex that parses a given kind of data, you move on to other things. After a while you forget the syntax, and when you have to parse another set of text or debug the older regex, you end up having to relearn a bit of it. But I guess it comes with the territory.


Python has a nice trick for commenting regexs with re.VERBOSE. I use it for all non-trivial regexs.

https://docs.python.org/2/library/re.html#re.VERBOSE
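
A minimal sketch of what that looks like in practice. The version-number pattern and the name `VERSION` are invented for the example, not taken from the linked docs.

```python
import re

# With re.VERBOSE, unescaped whitespace in the pattern is ignored and
# "#" starts a comment, so each piece can be documented in place.
VERSION = re.compile(
    r"""
    v?                         # optional leading "v"
    (?P<major>\d+)             # major version
    \.
    (?P<minor>\d+)             # minor version
    (?: \. (?P<patch>\d+) )?   # optional patch version
    """,
    re.VERBOSE,
)

full = VERSION.fullmatch("v1.12.3").groupdict()
short = VERSION.fullmatch("2.0").groupdict()
print(full, short)
```

A literal space or `#` inside a verbose pattern has to be escaped or placed in a character class, which is the main thing to watch for when converting an existing regex.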


For me, the best Regex resource is still http://regexr.com

It explains what each character does just by hovering over a regex. Best tool to learn or to fine tune your regular expression (with testing included).


I'll add https://regex101.com as an alternative.


This is a paid app, but the debugger on this has been my favorite

https://www.regexbuddy.com/

You can even step through the matching process and see how your matches are made


Love RegexBuddy. That feature along with the "use" tab which generates code for whatever language you select. I don't have to remember the specifics in all the languages I work in. I just select from the dropdowns, for example, "JavaScript (Chrome)" then "Use regex object to get the part of a string matched by a numbered group". Replace the placeholder variable names, and you're good to go!

It will also do things like warn you if you use named groups if your selected language doesn't support them, and the "Use" dropdown won't provide that option.

I really wish it wasn't Windows only.


Hah wow... I've had this app for over a decade and never noticed that feature... right next to the Debug button which I have used numerous times on really gnarly regexes


https://www.debuggex.com/ is a neat one for showing a syntax chart (or railroad chart) visualization.

I generally use https://regex101.com for its display of matched groups (when I'm dealing with complex groups and/or replacement backreferences), and http://regexstorm.net/tester when I specifically need to check a regex that will be running in .NET or Powershell.


I've been using The Regex Coach [0] for years. Simple, free, does the job.

[0] http://www.weitz.de/regex-coach/


For advanced users (and those who want to become regex gurus), the most helpful regex site for me is http://www.rexegg.com/. Also the O'Reilly book "Mastering Regular Expressions" is probably worth gold.


> Also the O'Reilly book "Mastering Regular Expressions" is probably worth gold.

The book is worth its price if only for chapter 1, definitely seconding this recommendation.


I'll add http://www.ultrapico.com/expresso.htm as another alternative! Writes out the regex steps in English too.


I like https://regexper.com/ as it gives you a visual flow of what's happening.


I've been teaching regular expressions for years, and offer a free online e-mail course on the subject (http://RegexpCrashCourse.com/).

This site is a very nice summary of regexp syntax and is written well -- but it's missing two crucial pieces that help people learn: Examples and exercises. Without practice, there's no way that people can remember the syntax.


It isn't missing those. Every section links to regex101.com prefilled, ready for experimentation.


I'll revise my criticism, then: There are some examples, but not enough. And the links to an interactive regexp system are very smart and nice.


Good article (didn't realize there were other kinds of lookaround), but maybe the bottom should link to well-tested standards-based regexes instead.

    URL: ^(((http|https|ftp):\/\/)?([[a-zA-Z0-9]\-\.])+(\.)([[a-zA-Z0-9]]){2,4}([[a-zA-Z0-9]\/+=%&_\.~?\-]*))*$

I recently encountered a case where a URL had an underscore at the end of a subdomain name. It seems underscores are okay anywhere else, but while my friend on Windows was able to load the website, I wasn't (on Linux) using Firefox, curl, a remote screenshot service which presumably ran Linux, etc. According to various RFCs, they should be okay anywhere within the subdomain name.

Has anyone encountered this behavior? Couldn't find anything on the internet; maybe it's just my computer?


It seems to mostly come down to differences in how things are defined. DNS itself can handle almost arbitrary data https://tools.ietf.org/html/rfc2181#section-11, while an Internet Hostname was defined to be more strict https://tools.ietf.org/html/rfc1123#section-2. The same issue also exists with dashes at the end of domain components.

I'm not enough of a history boffin to know how Microsoft came to support it differently (perhaps something from the Netbios and NT era). At this point in time though, I don't see either party changing their default validations to agree on a single definition.

Edit: If you're curious, this appears to be the first glibc commit limiting dashes at the end of URLs: https://sourceware.org/git/?p=glibc.git;a=commitdiff;h=fa0bc.... I don't know about BSD libc or Windows, however.
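
For illustration, the stricter hostname rule can be sketched as a per-label check. This is my reading of the usual letters-digits-hyphen convention, not a copy of what glibc does, and `LABEL` and `is_strict_hostname` are names made up for the example.

```python
import re

# One label: letters, digits, and hyphens, 1 to 63 characters,
# not starting or ending with a hyphen. Underscores are rejected.
LABEL = re.compile(r"(?!-)[A-Za-z0-9-]{1,63}(?<!-)")


def is_strict_hostname(name: str) -> bool:
    """Return True if every dot-separated label passes the strict rule."""
    if not name or len(name) > 253:
        return False
    return all(LABEL.fullmatch(label) for label in name.split("."))


for candidate in ("downloads.example.com", "downloads_.example.com", "downloads-.example.com"):
    print(candidate, is_strict_hostname(candidate))
```

A resolver that only cares about what DNS can carry would accept all three names, which is the mismatch described above.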


Wait a second...does this imply that if I put downloads that should only be of interest to our Windows customers on a server named something like downloads_.ourdomain.com, it might keep out all those annoying bots that ignore robots.txt and make a lot of noise in my logs? I'm guessing that most of the bots are not running on Windows.


That's a pretty bad idea, you shouldn't rely on this kind of stuff.

If there are people running OSX or Linux that want Windows downloads, or someone is behind a captive portal or proxy (like squid), they probably won't be able to reach it anymore.

If you have a real problem with bots, I'd look at what IPs they are coming from, and how often they try to connect. Something like IP blacklisting, or fail2ban might work for your use case.


Wow, how did you find that commit?


Both git and this git web view allow you to view all the commits that have modified just that file. Eg. https://sourceware.org/git/?p=glibc.git;a=history;f=resolv/r.... So it's a simple matter of looking at the diffs between commits.

Of course that's assuming you know the right file, which is often the harder problem.


Yes, I feel that the "Bonus" section (with no explanation even) is rather encouraging beginners to mis-use regular expressions in general, and - more specifically - contains errors.


I personally avoid regexes where possible, including in this situation. IMO the right way to validate a URL is to feed it to a URL parser and see if it errors out. I can see errors in this regex right away - and in many other regexes you find from Googling. People just drop them into their codebase and their eyes glaze over when you ask them whether or not it's actually correct. How many websites fail on user+whatever@gmail.com because they copied a bad regex?
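
A rough sketch of the parser-first approach in Python, using the standard library's `urllib.parse`. The function name and the allowed-scheme set are assumptions for the example. Note that `urlsplit` is quite lenient, so this only rejects inputs the parser cannot make sense of plus a couple of explicit checks.

```python
from urllib.parse import urlsplit

# Assumed policy for this example: only these schemes are acceptable.
ALLOWED_SCHEMES = {"http", "https", "ftp"}


def looks_like_url(candidate: str) -> bool:
    """Parse first, then check the parts we care about."""
    try:
        parts = urlsplit(candidate)
        # Reading .port raises ValueError for a non-numeric or out-of-range port.
        _ = parts.port
    except ValueError:
        return False
    return parts.scheme in ALLOWED_SCHEMES and bool(parts.hostname)


for candidate in ("https://example.com/path?q=1", "not a url", "http://example.com:notaport/"):
    print(candidate, looks_like_url(candidate))
```

The same idea applies to email addresses: hand the string to something that implements the grammar rather than to a regex copied from a search result.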


hmm interesting, do you still have the domain/url? You could search for it in your history using regex :)



https://regexone.com/ helped me finally learn Regex in 2-3 hours.

It's a step-by-step interactive site. One of the best educational programming sites I've been to.

Testing Regex is a lot easier when you have the fundamentals down and there's a million resources to test Regex.


I learned regex the "Ambient" way.

i.e. Encountering them here, there and everywhere. Then one day realising you have a good knowledge of the subject without ever having set out to learn it.


I spent some time reading some resource or another on how regexes work, but the vast majority of my learning has been trying things in https://regex101.com/ and seeing if they do what I want. The breakdown on the side of the page is especially helpful.


Waiting for the additional "Read someone else's Regex the easy way". I'm not holding my breath :)

Agree with others in that RegexBuddy is indispensable for a Windows dev learning this magic stuff.

Some useful and interesting regex developments coming in the next version of JavaScript. Support for international text and (bleh) emoji incoming.


And when you think you've learned regex, learn that you haven't: http://fent.github.io/randexp.js (a regex "reverser" of sorts) [1]

Seriously. Test every non-trivial regex with something like this, you'll probably be surprised at how permissive most regexes are.

Regexes are great. They're super-concise and perform amazingly well. But they're one of the biggest footguns I know of. Treat them as such and you'll probably do fine.

---

[1] for instance, the URL regex they use is incorrect, and it's super obvious when you plug it into that site:

    ^(((http|https|ftp):\/\/)?([[a-zA-Z0-9]\-\.])+(\.)([[a-zA-Z0-9]]){2,4}([[a-zA-Z0-9]\/+=%&_\.~?\-]*))*$
`[[a-zA-Z0-9]\-\.]` you can't nest character sets like that. So this matches the letters "[]-." as well as all a-z,A-Z,0-9 ranges.
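
You can also poke at that fragment directly. The sketch below shows how Python's `re` module, as far as I can tell, reads it: the class ends at the first `]`, and the remaining `\-\.]` is literal text. Other engines may parse it differently, and the variable names are just for the demo.

```python
import re
import warnings

FRAGMENT = r"[[a-zA-Z0-9]\-\.]"

with warnings.catch_warnings():
    # Python emits a FutureWarning about a "possible nested set" here;
    # it is silenced only to keep the demo output clean.
    warnings.simplefilter("ignore", FutureWarning)
    pattern = re.compile(FRAGMENT)

# Read as: one character from the set {"[", a-z, A-Z, 0-9},
# followed by the literal characters "-", ".", "]".
matches_single_letter = pattern.fullmatch("a") is not None
matches_letter_with_tail = pattern.fullmatch("a-.]") is not None
matches_bracket_with_tail = pattern.fullmatch("[-.]") is not None

print(matches_single_letter, matches_letter_with_tail, matches_bracket_with_tail)
```

Either way, the fragment does not mean "letters, digits, dash, or dot", which is presumably what was intended.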


I didn't truly understand regular expressions until I saw how they are executed. There are simple algorithms for executing them, so that might not be such a bad way to teach.
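
As an illustration of how small such an algorithm can be, here is a Python sketch modeled on the tiny backtracking matcher from Kernighan and Pike's "The Practice of Programming". It supports only literal characters, `.`, `*`, `^`, and `$`, and it is one simple approach, not how production engines work.

```python
def match(pattern: str, text: str) -> bool:
    """Search for pattern anywhere in text."""
    if pattern.startswith("^"):
        return match_here(pattern[1:], text)
    # Try every starting position, including the empty tail.
    for start in range(len(text) + 1):
        if match_here(pattern, text[start:]):
            return True
    return False


def match_here(pattern: str, text: str) -> bool:
    """Match pattern at the beginning of text."""
    if not pattern:
        return True
    if len(pattern) >= 2 and pattern[1] == "*":
        return match_star(pattern[0], pattern[2:], text)
    if pattern == "$":
        return text == ""
    if text and pattern[0] in (".", text[0]):
        return match_here(pattern[1:], text[1:])
    return False


def match_star(char: str, pattern: str, text: str) -> bool:
    """Match zero or more occurrences of char, then the rest of pattern."""
    index = 0
    while True:
        if match_here(pattern, text[index:]):
            return True
        if index < len(text) and char in (".", text[index]):
            index += 1
        else:
            return False


print(match("a*b", "aaab"), match("^ab$", "abc"), match("c.t", "a cat"))
```

Stepping through `match_star` by hand on a short input is a quick way to see where backtracking comes from.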


I learned from the book The Unix Programming Environment but I also didn't truly understand regexp until I read how they were executed, in Programming in Standard ML the first chapter shows how to implement a complete package/parser for regexp http://www.cs.cmu.edu/~rwh/isml/book.pdf


The lack of readability of regex makes me wonder if there isn't a better way. I've seen Elm's parser which introduces a few neat concepts like parser pipelines. https://github.com/elm-tools/parser


Have you looked at Perl 6 at all? Whitespace in regexen is insignificant if not quoted, so not only can you add a little space between sections, you can split a regex over several lines and add comments throughout.

It also has first-class grammars, so you're less tempted to reach for regex when something more powerful would help.


For some people Regex Golf [1] might be an interesting way to learn. You are actually building increasingly complex regex as you go, and can just look up bits of syntax you don't know as needed.

[1] https://alf.nu/RegexGolf


I really enjoy this type of tutorial format, concise and easy to follow. Thanks for taking the time to produce it.


A picture is worth a thousand words. How about a visual regex tester - http://emailregex.com/regex-visual-tester/#a%5Cbc%5Cd*


Very good post. It has all the important information, provides clear examples, doesn't try to get too fancy or showboat. Well done.


Could use some work explaining capture groups


Capture groups are where I get 99% of value from regexs. Being able to quickly transform data is where I find regexs perform best. Just matching is not something I have to do all that often.
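
A small sketch of that kind of transformation in Python. The sample names and the `LAST_FIRST` pattern are invented for the example.

```python
import re

names = ["Lovelace, Ada", "Hopper, Grace", "Turing, Alan"]

# Group 1 captures the surname, group 2 the given name.
LAST_FIRST = re.compile(r"^(\w+),\s*(\w+)$")

# Rearrange the captured pieces in the replacement string.
reordered = [LAST_FIRST.sub(r"\2 \1", name) for name in names]

# The same groups can be read out directly instead of substituted.
surnames = [LAST_FIRST.match(name).group(1) for name in names]

print(reordered, surnames)
```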


Previously: https://news.ycombinator.com/item?id=14846506

I'm afraid not much has been improved since then.

This is not a good learning source.


For an even-more-beginner's guide, see the slides to a session I teach to Econ postgrads once a year[1].

These introduce the metacharacters gradually, using a task-based approach. We start by finding street addresses, per https://xkcd.com/208/.

[1] http://mackerron.com/text/text-slides.pdf (page 19 onwards) with supporting resources at http://mackerron.com/text/


For email regular expressions, there's http://emailregex.com



