Regexper – Regular expressions visualizer (regexper.com)
510 points by xchip on Jan 30, 2018 | 102 comments



It's always neat to see where one's ideas go! AFAIK, I was the first person to create dynamic railroad diagrams for regular expressions (maybe 12 or 13 years ago). I got the idea from json.org, which I think was Douglas Crockford's brainchild.

My initial implementation was strfriend.com (in Lisp: well under 1,000 lines, including views), and I think its main claim to fame was that Jeff Atwood made fun of it on Twitter. (I was truly clueless at promoting myself back then. Not only did I not have a Twitter account, but it didn't occur to me to submit it to HN.)

Every so often, I toy around with the idea of making it into a proper local/native application -- maybe someday. In the meantime, my obsession with regular expressions lives on in my current application (see bio) where I parse and rewrite the user's entered regex syntax (ICU) into whatever regex syntax the backend requires. I don't know of any other application that does this, but I predict that in 15 years it'll be commonplace (and I'll still be poor)!


Railroad diagrams for regular expressions were common long before 2005, and generating such images on demand isn't that unusual, so your work is unlikely to have been a major influence here. Similarly, automatically translating regular expressions from one engine to another is something that people have done before (out of necessity, for compatibility).


Do you have an example? I love reading original sources. I'd never seen a regex as a railroad diagram before that, though I admit it's entirely possible I'd seen it somewhere and forgotten.

I don't know of any software that translates regular expressions, either, though I'm sure I can't be the first.


This fairly common piece of software does it.

https://www.regexbuddy.com

4th result on google for "regex conversion" for me. Seems like there are quite a few others.


I'm familiar with that program, but I see no railroad diagrams there.


They're replying to the "convert" part, which regexbuddy does [1]

[1] https://www.regexbuddy.com/convert.html


You seem to be chasing the idea that somehow your idea was formative, even though pretty much nobody will have seen it, even though you yourself don't know whether or not there were earlier instances.

Railroad diagrams are also called syntax diagrams [1] and were used in print as early as 1973 [2]. They are an obvious kind of diagram to use for regular expressions, which express a grammar/syntax.

Basically, many people will have had the idea, independently, because representing a regex as a railroad diagram is an obvious invention, as is converting them between variants. Doing these on demand is also obvious.

Trying to take some sort of credit (even as inspiration) for other people's work is not likely to make you many friends. The world is a vast place with many things going on independently. Keep on inventing.

[1] https://en.wikipedia.org/wiki/Syntax_diagram [2] Niklaus Wirth: The Programming Language Pascal. (July 1973)


I think the first place I saw such a diagram was in the Smalltalk blue book. It has railroad diagrams for the language grammar. So not regular expressions, but very similar.


According to Wikipedia's article on railroad diagrams one of the first appearances of it was in "Pascal User Manual" written by Niklaus Wirth in 1973. Hardly a new idea. It's been used in academia for ages when first teaching regular expressions and Extended Backus–Naur form to students.


Strfriend was very good, it helped me learn regex in a big way. Thanks!


Couldn't find Jeff Atwood's comment on Twitter, but he did have this to say on Stack Overflow back in 2009 [answered Jan 13 '09 at 10:15] [https://meta.stackexchange.com/a/79880]:

It's even worse: those strfriend URLs in the form of

http://strfriend.com/vis?re=(Zip%3A\s*\d\d\d\d\d)\s*(State%3...

are really painful. You'll have to encode almost everything in it. See my edit. Works now, but not worth it IMO.

str friend? more like str ENEMY.


I was taught the algorithms to do this stuff in my Computer Science class over 20 years ago (RE is equivalent to DFA). You weren't the first person to implement regular expression visualizations.


A dynamic visualization on the web 13 years ago? He may have been the first.


It was pretty hard to do viz on the web 13 years ago. In 2006 I made a web tool to turn regular expressions into NFAs and DFAs and animate their states as you typed. It took a lot of code (drawing and animating along beziers, AJAX to a server for graphviz and a regex compilation and minimization package I wrote for this). https://imgur.com/gallery/Yqqoh

These days there’s a lot more tooling and components that can snap together to make this kind of thing.


> A dynamic visualization on the web 13 years ago?

Depends on what you count as dynamic, but I think so yes. There was one in use at my Uni ~2000. Obviously there were no fancy canvas/svg options to work with client-side (though IIRC flash was very much a thing by then so that could have been used) so it produced an image server-side that was updated when you submitted a change.

Not entirely dynamic due to the manual post to the server for each update of the diagram, but it counts IMO. It could have been more automated via the JS/Dom methods available at the time I'm sure, I can think of a couple of ways, but I don't remember it being so.

> He may have been the first.

I'd say not the first. He may well have come up with the combination of ideas independently. How many times have you thought "X would be brilliant, I'm a genius" only to find when describing the idea to others that several "geniuses" preempted you and it already exists (or worse: it has been tried and proven to be a terrible idea in practice)! It has certainly happened to me a fair few times, and back then it wouldn't have been as easy to search out similar ideas/implementations.


I studied computer science, too. DFA graphs are not the same as railroad diagrams.



And it didn't even choke on the email one!

You have to reformat the answer from stackoverflow to remove the line returns. Will this link work?

Edit: no, "comment was too long" :(

Copy regex from https://stackoverflow.com/a/801239/481788 and remove the 81 line returns.


Full text of the email regex:

(I had to put it as quoted text, otherwise HN's markdown interprets the * as emphasis)

    (?:(?:\r\n)?[ \t])*(?:(?:(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|"(?:[^\"\r\\]|\\.|(?:(?:\r\n)?[ \t]))*"(?:(?:\r\n)?[ \t])*)(?:\.(?:(?:\r\n)?[ \t])*(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|"(?:[^\"\r\\]|\\.|(?:(?:\r\n)?[ \t]))*"(?:(?:\r\n)?[ \t])*))*@(?:(?:\r\n)?[ \t])*(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|\[([^\[\]\r\\]|\\.)*\](?:(?:\r\n)?[ \t])*)(?:\.(?:(?:\r\n)?[ \t])*(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|\[([^\[\]\r\\]|\\.)*\](?:(?:\r\n)?[ \t])*))*|(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|"(?:[^\"\r\\]|\\.|(?:(?:\r\n)?[ \t]))*"(?:(?:\r\n)?[ \t])*)*\<(?:(?:\r\n)?[ \t])*(?:@(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|\[([^\[\]\r\\]|\\.)*\](?:(?:\r\n)?[ \t])*)(?:\.(?:(?:\r\n)?[ \t])*(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|\[([^\[\]\r\\]|\\.)*\](?:(?:\r\n)?[ \t])*))*(?:,@(?:(?:\r\n)?[ \t])*(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|\[([^\[\]\r\\]|\\.)*\](?:(?:\r\n)?[ \t])*)(?:\.(?:(?:\r\n)?[ \t])*(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|\[([^\[\]\r\\]|\\.)*\](?:(?:\r\n)?[ \t])*))*)*:(?:(?:\r\n)?[ \t])*)?(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|"(?:[^\"\r\\]|\\.|(?:(?:\r\n)?[ \t]))*"(?:(?:\r\n)?[ \t])*)(?:\.(?:(?:\r\n)?[ \t])*(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|"(?:[^\"\r\\]|\\.|(?:(?:\r\n)?[ \t]))*"(?:(?:\r\n)?[ \t])*))*@(?:(?:\r\n)?[ \t])*(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|\[([^\[\]\r\\]|\\.)*\](?:(?:\r\n)?[ \t])*)(?:\.(?:(?:\r\n)?[ \t])*(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ 
\t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|\[([^\[\]\r\\]|\\.)*\](?:(?:\r\n)?[ \t])*))*\>(?:(?:\r\n)?[ \t])*)|(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|"(?:[^\"\r\\]|\\.|(?:(?:\r\n)?[ \t]))*"(?:(?:\r\n)?[ \t])*)*:(?:(?:\r\n)?[ \t])*(?:(?:(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|"(?:[^\"\r\\]|\\.|(?:(?:\r\n)?[ \t]))*"(?:(?:\r\n)?[ \t])*)(?:\.(?:(?:\r\n)?[ \t])*(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|"(?:[^\"\r\\]|\\.|(?:(?:\r\n)?[ \t]))*"(?:(?:\r\n)?[ \t])*))*@(?:(?:\r\n)?[ \t])*(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|\[([^\[\]\r\\]|\\.)*\](?:(?:\r\n)?[ \t])*)(?:\.(?:(?:\r\n)?[ \t])*(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|\[([^\[\]\r\\]|\\.)*\](?:(?:\r\n)?[ \t])*))*|(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|"(?:[^\"\r\\]|\\.|(?:(?:\r\n)?[ \t]))*"(?:(?:\r\n)?[ \t])*)*\<(?:(?:\r\n)?[ \t])*(?:@(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|\[([^\[\]\r\\]|\\.)*\](?:(?:\r\n)?[ \t])*)(?:\.(?:(?:\r\n)?[ \t])*(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|\[([^\[\]\r\\]|\\.)*\](?:(?:\r\n)?[ \t])*))*(?:,@(?:(?:\r\n)?[ \t])*(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|\[([^\[\]\r\\]|\\.)*\](?:(?:\r\n)?[ \t])*)(?:\.(?:(?:\r\n)?[ \t])*(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|\[([^\[\]\r\\]|\\.)*\](?:(?:\r\n)?[ \t])*))*)*:(?:(?:\r\n)?[ \t])*)?(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|"(?:[^\"\r\\]|\\.|(?:(?:\r\n)?[ \t]))*"(?:(?:\r\n)?[ \t])*)(?:\.(?:(?:\r\n)?[ \t])*(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|"(?:[^\"\r\\]|\\.|(?:(?:\r\n)?[ \t]))*"(?:(?:\r\n)?[ 
\t])*))*@(?:(?:\r\n)?[ \t])*(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|\[([^\[\]\r\\]|\\.)*\](?:(?:\r\n)?[ \t])*)(?:\.(?:(?:\r\n)?[ \t])*(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|\[([^\[\]\r\\]|\\.)*\](?:(?:\r\n)?[ \t])*))*\>(?:(?:\r\n)?[ \t])*)(?:,\s*(?:(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|"(?:[^\"\r\\]|\\.|(?:(?:\r\n)?[ \t]))*"(?:(?:\r\n)?[ \t])*)(?:\.(?:(?:\r\n)?[ \t])*(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|"(?:[^\"\r\\]|\\.|(?:(?:\r\n)?[ \t]))*"(?:(?:\r\n)?[ \t])*))*@(?:(?:\r\n)?[ \t])*(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|\[([^\[\]\r\\]|\\.)*\](?:(?:\r\n)?[ \t])*)(?:\.(?:(?:\r\n)?[ \t])*(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|\[([^\[\]\r\\]|\\.)*\](?:(?:\r\n)?[ \t])*))*|(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|"(?:[^\"\r\\]|\\.|(?:(?:\r\n)?[ \t]))*"(?:(?:\r\n)?[ \t])*)*\<(?:(?:\r\n)?[ \t])*(?:@(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|\[([^\[\]\r\\]|\\.)*\](?:(?:\r\n)?[ \t])*)(?:\.(?:(?:\r\n)?[ \t])*(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|\[([^\[\]\r\\]|\\.)*\](?:(?:\r\n)?[ \t])*))*(?:,@(?:(?:\r\n)?[ \t])*(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|\[([^\[\]\r\\]|\\.)*\](?:(?:\r\n)?[ \t])*)(?:\.(?:(?:\r\n)?[ \t])*(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|\[([^\[\]\r\\]|\\.)*\](?:(?:\r\n)?[ \t])*))*)*:(?:(?:\r\n)?[ \t])*)?(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|"(?:[^\"\r\\]|\\.|(?:(?:\r\n)?[ \t]))*"(?:(?:\r\n)?[ \t])*)(?:\.(?:(?:\r\n)?[ \t])*(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ 
\t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|"(?:[^\"\r\\]|\\.|(?:(?:\r\n)?[ \t]))*"(?:(?:\r\n)?[ \t])*))*@(?:(?:\r\n)?[ \t])*(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|\[([^\[\]\r\\]|\\.)*\](?:(?:\r\n)?[ \t])*)(?:\.(?:(?:\r\n)?[ \t])*(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|\[([^\[\]\r\\]|\\.)*\](?:(?:\r\n)?[ \t])*))*\>(?:(?:\r\n)?[ \t])*))*)?;\s*)


That first one doesn't seem very good. It seems like there are many places where states could be merged. E.g., there are 4 different "^0" states and 3 "^1" states. Or am I misreading something?


Amazing how being able to visualize the problem reveals available optimizations!


A truly regular regex doesn't need to be optimized in how it is written if the matcher is DFA-based, because the minimal DFA is unique. Regex engines, however, are more complex than that, and the structure this shows isn't going to be how it is actually recognized.

This should really only be used for understanding what a regex matches, not as an optimization tool. In that regard it should display the simplest graph possible to aid understanding.


That last link shows an error.


Add ')' at the very end (you can see it is not included in the clickable link and it's also what the error says).


Thanks, fixed!


Neat! I also suggest: https://regexr.com/


I'm fond of https://regex101.com/ as well.


I'm 100% behind you here.

Comparing [1] and [2], there's no competition.

[1] https://regexper.com/#%5E(%3F%3A(%3F%3A(%3F%3A0%3F%5B13578%5...

[2] https://regex101.com/r/LzdGpw/1

I can use 2, but 1 just looks cool.


This site is what finally made me good at regex. I found it painful to learn otherwise.


Regex101 is actually awesome. It's what I reach for whenever I've got to do stuff in regex.


Regex101 is the daddy. I really don’t know how it could be improved.


That is cool. https://www.debuggex.com/ has nice visualization.


Wow, that is really cool.

I actually started typing in random characters and tracing the matches through the graph for the Regex from StackOverflow mentioned earlier. I didn't know what the regex matches, so I played it like a game, trying to reach the finish by typing one character at a time.

When I finished, I saw that I had typed "01/31/1691" and then realised that it's a regex for dates.


Sounds like you'd love https://regexcrossword.com/


+1 for regex crossword. That's a lot of fun.


That is so fast!


I used regexr quite a lot during my daily work.

I think regexper is more suited to complex structures, whereas regexr is better for checking the meaning of simple expressions because of the hover-over function.


Agreed. I use regexr daily for building my regexes without having to execute the programs they'll be part of.


Emacs also has a visual regexp builder mode (M-x re-builder) that shows the first 200 matches in the current buffer so you can validate that the regular expression does what you want. But AFAIR there are syntax differences between regexp flavors, and Emacs uses the Elisp flavor, of course.


After 20 years of software development I've come to adopt a best practice:

Whenever I start writing a regular expression, I stop and write a "manual" domain-specific parse function instead.

Saved me a LOT of debugging time.

Since I can now use Kotlin pretty much anywhere (JVM, browser, shell scripts), this is easy because of the superb stdlib ("startsWith", "lastIndexOf", "substringBeforeLast(...)").

The time saved I invest in unit tests for the parser.
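
A minimal sketch of what such a hand-written parser can look like, in Python rather than Kotlin. The `user@host:port` format and the name `parse_endpoint` are hypothetical, picked only for illustration; `rpartition` plays a role similar to Kotlin's `substringBeforeLast`.

```python
def parse_endpoint(text):
    """Split a hypothetical 'user@host:port' string; user and port are optional."""
    # Everything before the last '@' is the user (empty if there is no '@').
    user, at, rest = text.rpartition("@")
    # Everything after the first ':' in the remainder is the port.
    host, colon, port = rest.partition(":")
    if not host:
        raise ValueError(f"missing host in {text!r}")
    if colon and not (port.isascii() and port.isdigit()):
        raise ValueError(f"bad port in {text!r}")
    return (user if at else None, host, int(port) if colon else None)


print(parse_endpoint("alice@example.org:8080"))
print(parse_endpoint("example.org"))
```

Each failure mode gets its own error message, which is the part a single regex match tends to hide.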


I can't shake the feeling that Regexp could be written just as efficiently as a fluent interface with a more human friendly syntax.

I've been telling Jr devs bucking for promotion for years to explain what they're doing in plain English, then write code that looks like that. Basically telling them to skip right over the "gee look what a clever fuck I am" stage and write good code instead of creating riddles.

The Regexp problem just screams this at me. What am I doing? I'm looking for a line that starts with a capital T, then has some quantity of alphanumeric characters greater than n (if n is not 0, 1, or infinity, this requires extra work in Regex), followed by an equals sign with or without whitespace characters around it.

Give me an API that does exactly that, instead of Regex. Something gets lost in translation every time.

I think the fact that the origin of Regex is the command line interface is pretty telling. We didn't and we don't have a convenient way to type in imperative code on a command line. So an arcane syntax was created so you could do the whole thing in a quarter line of text.

Speaking as someone who has had a Unix shell for 25 years, and routinely works on mini tools for their fellow developers, I don't think we actually type stuff into a shell that often anymore. The difference between documenting a one-liner in a README and just building a shell script that does the same thing is not that big. There's a difference in development effort but building a script can allow you access to a debugger. Personally, I'd be willing to pay that tax any day.
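
A rough sketch of the kind of fluent API described above, for the "line starts with a capital T" example. The `Pattern` class and its method names are invented for illustration and do not come from any existing library; the builder only assembles an ordinary Python `re` pattern.

```python
import re


class Pattern:
    """Tiny fluent builder that assembles a regex from named steps."""

    def __init__(self, parts=()):
        self._parts = tuple(parts)

    def _add(self, fragment):
        # Return a new builder so partial patterns can be reused safely.
        return Pattern(self._parts + (fragment,))

    def line_start(self):
        return self._add("^")

    def literal(self, text):
        return self._add(re.escape(text))

    def alphanumerics(self, more_than):
        # "More than n" characters means at least n + 1.
        return self._add(f"[A-Za-z0-9]{{{more_than + 1},}}")

    def optional_whitespace(self):
        return self._add(r"[ \t]*")

    def compile(self):
        return re.compile("".join(self._parts), re.MULTILINE)


def assignment_line(n):
    """A line starting with T, more than n alphanumerics, then '='."""
    return (
        Pattern()
        .line_start()
        .literal("T")
        .alphanumerics(more_than=n)
        .optional_whitespace()
        .literal("=")
        .optional_whitespace()
        .compile()
    )


print(bool(assignment_line(3).search("Timeout = 5")))
print(bool(assignment_line(3).search("Tab = 5")))
```

The call chain reads close to the plain-English description, while the engine underneath is still a regular expression.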


> I can't shake the feeling that Regexp could be written just as efficiently as a fluent interface with a more human friendly syntax.

You can use SRL - Simple Regex Language (https://simple-regex.com/) for making readable regex/matching rules. It is supported in C++, Java, C#, PHP, Javascript, and Python. Also, you can use the web version to generate equivalent regex if your language is one of the above.

Here is an example from the website for matching an e-mail address:

  begin with any of (digit, letter, one of "._%+-") once or more,
  literally "@",
  any of (digit, letter, one of ".-") once or more,
  literally ".",
  letter at least 2 times,
  must end, case insensitive
Regex to do the same:

  /^(?:[0-9]|[a-z]|[\._%\+-])+(?:@)(?:[0-9]|[a-z]|[\.-])+(?:\.)[a-z]{2,}$/i
The first one is readable, second one is cryptic. https://simple-regex.com/examples has more examples.
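
A middle ground, sketched here in Python as an assumption about what a reader might try: the same pattern written with the `re.VERBOSE` flag, which allows whitespace and comments inside the regex. `TERSE` is the generated regex from above; `VERBOSE` is a hand-written equivalent, and the sample addresses are made up.

```python
import re

# The regex generated by SRL, as quoted above.
TERSE = re.compile(
    r"^(?:[0-9]|[a-z]|[\._%\+-])+(?:@)(?:[0-9]|[a-z]|[\.-])+(?:\.)[a-z]{2,}$",
    re.IGNORECASE,
)

# Hand-written equivalent with layout and comments (re.VERBOSE).
VERBOSE = re.compile(
    r"""
    ^ [0-9a-z._%+-]+     # local part: digits, letters, or one of ._%+-
    @                    # literal at sign
    [0-9a-z.-]+          # domain labels
    \. [a-z]{2,} $       # final dot plus at least two letters
    """,
    re.IGNORECASE | re.VERBOSE,
)

samples = [
    "user@example.com",
    "First.Last+tag@mail.example.org",
    "no-at-sign.example.com",
    "user@localhost",
]
for sample in samples:
    print(sample, bool(TERSE.match(sample)), bool(VERBOSE.match(sample)))
```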

SRL was previously discussed here in 2016, see https://news.ycombinator.com/item?id=12384862

Also, the parse feature (DSL) of Rebol language is an excellent regex alternative:

1. Why Rebol, Red, and the Parse dialect are Cool (http://blog.hostilefork.com/why-rebol-red-parse-cool/)

2. Rebol's answer to Regex: parse and Rebol types (https://rebol-land.blogspot.in/2013/03/rebols-answer-to-rege...)


Interesting, thanks for mentioning! Looks similar to VerbalExpressions (https://github.com/VerbalExpressions/JSVerbalExpressions/wik...)


Parsing combinators are the API you want.

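    -- Assumes the Parsec library (the original comment names no library).
    import Control.Monad (replicateM)
    import Text.Parsec
    import Text.Parsec.String (Parser)

    regexReplaced :: Int -> Parser String
    regexReplaced n = do
       char 'T'
       spaces
       x <- replicateM n alphaNum   -- exactly n required alphanumerics
       y <- many alphaNum           -- any further alphanumerics
       spaces
       char '='
       return (x ++ y)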
The above is a Haskell function that does the parsing required above, returning the alphanumeric characters if the parse succeeds and returning an error if it does not. You may not speak Haskell, but this is probably still more readable than (n) => new RegExp(String.raw`T\s*(\w{${n}}\w*)\s*=`), which is the JavaScript function that does a similar thing.


But are they as efficient as regexp? I personally prefer regexp combinators.


Optimizing them is a bit more work, but they can outperform hand-rolled C code:

http://www.serpentine.com/blog/2014/05/31/attoparsec/


> "I can't shake the feeling that Regexp could be written just as efficiently as a fluent interface with a more human friendly syntax."

You may be interested in the Parse dialect of Red:

http://www.red-lang.org/2013/11/041-introducing-parse.html

Also worth noting that Red can be embedded in any program that supports a C function interface, through using LibRed.

http://www.red-lang.org/2017/03/062-libred-and-macros.html?m...


I agree. That’s why I pointed out that a necessary addition to get more productivity out of writing custom parsers is

a) easy usability in shell scripts, b) easy tooling for unit tests

I started writing shell scripts in Kotlin, so I get the unit tests for free.


Yep. I primarily use Go, so you are often forced down this way because it uses a simpler regex engine. I used to complain but in hindsight I realized it was a blessing in disguise. Particularly in Go's case, it has excellent character set library support, especially unicode, so those really tricky corner cases with unicode characters are non-existent now as well. I will be happy if I never see a regex with a unicode range again.


Which IDE do you use?


I just use a text editor with plugins.


Here is a regexp to match an IPv4 address. The diagram looks quite nice and is easy to understand compared to the raw regexp! In fact, the visualisation makes it easy to spot the mistake.

https://regexper.com/#'%5Cb((25%5B0-5%5D%7C2%5B0-4%5D%5B0-9%...

(From https://stackoverflow.com/q/5284147/164234 )


That covers the most common format accepted by the BSD and POSIX inet_* functions, but misses the less common ones.

If anyone wants to have a go at a more complete one, here are some test cases for you that it misses, using Google's well known public name server 8.8.4.4. These all work in the classic command line tools like ping on MacOS, Linux, and Windows:

  134743044
  8.525316
  8.8.1028
POSIX and BSD also allow the numbers to be written in hex:

  0x8080404
  0x8.0x8.0x404
  0x8.0x80404
  0x8.0x8.0x4.0x4
or octal:

  01002002004
  010.2002004
  010.010.02004
or mixed:

  010.8.0x404


The visualization tool shows that the regex is not correct. It allows 000.000.000.000 as an IPv4 address.


I don't see the problem. Though not a host address, that's known in the sockets networking API as INADDR_ANY, useful for binding a socket to listening on all networks, for instance.

The GNU C Library getaddrinfo accepts 000.000.000.000 with the leading zeros and all; I just tried.

It is important to support special addresses like 255.255.255.255 and 0.0.0.0 in the dot notation. For instance, in the configuration of some daemon, you may need to be able to specify that the bind address is 0.0.0.0. The value can't be rejected due to not being an address.

Also, you need to be able to specify network as opposed to host addresses, and netmasks. You know, like 10.0.0.0 and so on.


I created this instead: https://regexper.com/#((%5B0-9%5D%5C.)%7C(%5B1-9%5D%5B0-9%5D...

The repetition count seems to be displayed off-by-one though.


On mobile so can't (easily) test it, but doesn't this produce a false negative for `246.{snip}`, for example?


Yes, you are right... is the following better? https://regexper.com/#((%5B0-9%5D%5C.)%7C(%5B1-9%5D%5B0-9%5D...


Does not match: 10.1 -> 10.0.0.1


In IPv4 addresses, as far as I know, this is not done -- it is only IPv6 addresses that use the double colon to indicate a sequence of zeros.


Where are you seeing a double colon? He wrote "10.1", not "10::1".

IPv4 addresses, according to POSIX, can be written in 4 forms:

  A1.A2.A3.A4
  A1.A2.B
  A1.C
  D
where Ai is an 8 bit number, B is a 16 bit number, C is a 24 bit number, and D is a 32 bit number. "10.1" is A1.C form.

See the inet_addr man page if you are on Unix or a Unix-like system.


Great explanation, thank you! The 24-bit and 16-bit variants are highly usable, but they are often overlooked by regexps; the biggest problem is web apps. Being able to use octal and hex is even less common.


I referred to the double colon as something I saw used with IPv6 to indicate a sequence of 0's.

I wasn't aware that inet_aton supported all these forms, and the regex I provided won't parse them. It seems like inet_aton supports specifying the numbers in octal and hex too.


This could really use a bunch of "try it" links or examples to show what it can do.


Looks neat, but after a quick look I think I still like https://www.debuggex.com/ better. Going step by step through the regex for a given string is really a killer feature for me.


Yep, and not having to click "Display" button each time.

Also, partially highlighting the text you write is a pretty hard feature to implement, I did it once. Kudos to debuggex.com for working correctly even with browser zoom on.


The Email::Valid Perl distribution ships with a more than 6000 character regex to validate E-Mail addresses.

Both goo.gl and bit.ly refused to shorten it, and when I tried to paste the link here HN refused to accept my comment.

But on a machine with Email::Valid installed do:

    perl -MEmail::Valid -wE 'say $Email::Valid::RFC822PAT'
And copy the output into the Regexper form. It takes a while to render, but it'll eventually complete.


Wow. Despite the utterly insane complexity of a regex of that size, it doesn't seem to do an insane amount of branching. Maximum depth of choices seems to be about 4, which is less than a lot of other regex examples I've seen here.

That being said... That regex is just noise. Nobody can tackle it all at once or by themselves, unless they specialise in just regex. It's 6599 characters, at least on my system. So at a wild stab, it's the equivalent of 4,500 lines of obfuscated code.

You can't audit it, you just kinda have to trust it, and hope.

But with something like regexper, I can at least read it.


Here, have a 1800 line regex which parses perl: https://metacpan.org/source/DCONWAY/PPR-0.000003/lib/PPR.pm#...


I can't remember where, but there's some version of that regex somewhere that uses variables for interpolation in subsequent regexes.

Viewed like that it's really not that complex, most of it is repetition of previously used regex sequences, it's only when fully expanded that it becomes so humongous.


The pattern is Jeffrey Friedl's, from his book Mastering Regular Expressions.

And as clear as the source could be made, I feel the fact that so many people have just copied and pasted it means that any understanding is lost, and they're just praying and hoping, because a regex of that size is actually difficult for them to comprehend.


This short link seems to work:

http://1b.yt/exfqj

(regexper.com doesn't seem to cache results, so this is going to hit their site hard...)

[edit: oops, it's all client-side so no problem for the site]


>Both goo.gl and bit.ly refused to shorten it, and when I tried to paste the link here HN refused to accept my comment.

github gist?



There's also http://emailregex.com/regex-visual-tester/#(%3F%3A%5Ba-z0-9!...)

Edit: Add the missing ) to the url


This is a nifty tool, and to be honest I was unaware that regex visualization was a thing before now. I usually write comments in BNF next to my regex, so that I can make sense of them later. I'm going to keep doing that, but visualization is going to be great for debugging and for figuring out other people's less carefully commented regex.


Here's an example of a regex to validate US phone numbers:

https://regexper.com/#%2F%5E(1%5Cs%3F)%3F(%5B0-9%5D%7B3%7D%7...

From

https://www.freecodecamp.org/challenges/validate-us-telephon...

I found these 11 videos from codecourse most helpful when learning regex:

https://www.youtube.com/watch?v=GVZOJ1rEnUg&index=1&list=PLf...


Nicely done!

I use https://www.debuggex.com/ on a daily basis.

If you want to improve your regex skills, Regex Golf is the place to go! https://alf.nu/RegexGolf


Nicely done.

Next step, can you do it in reverse? Would be cool to create the regex in a graphical editor and then generate the actual expression.

Feature request: export the visualization to ascii, so I can copy paste it into a comment above the regex in my code.


Suggestion: put a link to one or two examples to quickly get an idea of what it does.


And base64 share urls.


I love the graphics. I would like the equivalent for Python and the GNU flex (lexer) RE's.

No... wait.... I know what I want! I want a sphinx extension that allows me to include an RE in a Python docstring and have it render as a railroad track graphic in the generated documentation:

  some_re_string = r'ab[0-9]*z'
  """:regex: Any string starting with 'ab', followed by
      digits, ending with 'z'.
  """
That should be able to go pick up the documented string and render a railroad track diagram along with the text in the generated documentation.


I'd love to have it the other way around: human speech to RegExp. I find it really hard to write these expressions. E.g., a colleague needed to write a RegExp for a nickname alias for a URL. It had to have only letters and numbers, or a specific number. So name34, na34me, or 23 would all be valid.


This is very well made. I don't know what I was expecting, but this really impressed me.


Does not look like it supports negative lookbehind https://regexper.com/#(%3F%3C!a)b


It does do negative lookahead though - https://regexper.com/#%2F%5Cd%2B(%3F!%5C.)%2F

Negative lookbehind only appeared in Chrome in v.62 (https://v8project.blogspot.co.uk/2017/09/v8-release-62.html), so it's not too surprising that tools haven't caught up yet.
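
For anyone unsure what the two assertions in those links do, here is a quick sketch using Python's `re` module as a stand-in, since it supports both forms. The patterns are the ones from the two regexper links; the sample strings are made up.

```python
import re

# Negative lookbehind: a 'b' that is NOT preceded by 'a'.
text = "ab cb b"
lookbehind_hits = [m.start() for m in re.finditer(r"(?<!a)b", text)]

# Negative lookahead: a run of digits NOT followed by a literal dot.
lookahead_hits = re.findall(r"\d+(?!\.)", "3.14 42")

print(lookbehind_hits)
print(lookahead_hits)
```

The first pattern skips the `b` in "ab" and reports the other two; the second skips the "3" that sits directly before the dot.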


I'm sure they'd appreciate a ticket - There's already this which is similar: https://github.com/javallone/regexper-static/issues/26


Back when https://xkcd.com/1930/ was posted, I made a regular expression to create a generator using a regex sampler (for instance http://dwickern.github.io/regex-sample/ ).

I've put the regex at https://gist.github.com/kmill/17c5ef4f99bd9ef7ad799f0b487448...

The amusing thing to me is that this regex visualizer can reproduce the comic.



Searched for some JavaScript regexes to test it with and found the regex chapter from Eloquent JavaScript: https://eloquentjavascript.net/09_regexp.html

Coincidence that the diagrams seem to be generated by regexper!


Has a bug: {2,4} is displayed as 1..3 times.


I noticed it translates repeats into a form with a required first character/group so 5{2,4} became 55{1,3} - see https://regexper.com/#5%7B2%2C4%7D
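
A quick check, sketched in Python, that the two spellings accept the same inputs, so the diagram is equivalent even if the displayed count looks off by one. `accepted_lengths` is a hypothetical helper written for this comparison.

```python
import re


def accepted_lengths(pattern, digit="5", upto=8):
    """Return the run lengths of `digit` that the pattern matches in full."""
    compiled = re.compile(pattern)
    return [n for n in range(upto + 1) if compiled.fullmatch(digit * n)]


print(accepted_lengths(r"5{2,4}"))
print(accepted_lengths(r"55{1,3}"))
```

Both patterns accept runs of two, three, or four fives: one required 5 plus one to three more is the same set as two to four.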


Should I expect to have to escape "/" as "\/"?


Complex regexes do render (must escape "/").

https://regexper.com/#%5B%5E%3C%5D%2B%7C%3C(!(--(%5B%5E-%5D-(%5B%5E-%5D%5B%5E-%5D-)-%3E%3F)%3F%7C%5C%5BCDATA%5C%5B(%5B%5E%5D%5D%5D(%5B%5E%5D%5D%2B%5D)%5D%2B(%5B%5E%5D%3E%5D%5B%5E%5D%5D%5D(%5B%5E%5D%5D%2B%5D)%5D%2B)%3E)%3F%7CDOCTYPE(%5B%20%5Cn%5Ct%5Cr%5D%2B(%5BA-Za-z_%3A%5D%7C%5B%5E%5Cx00-%5Cx7F%5D)(%5BA-Za-z0-9_%3A.-%5D%7C%5B%5E%5Cx00-%5Cx7F%5D)(%5B%20%5Cn%5Ct%5Cr%5D%2B((%5BA-Za-z_%3A%5D%7C%5B%5E%5Cx00-%5Cx7F%5D)(%5BA-Za-z0-9_%3A.-%5D%7C%5B%5E%5Cx00-%5Cx7F%5D)%7C%22%5B%5E%22%5D%22%7C'%5B%5E'%5D'))(%5B%20%5Cn%5Ct%5Cr%5D%2B)%3F(%5C%5B(%3C(!(--%5B%5E-%5D-(%5B%5E-%5D%5B%5E-%5D-)-%3E%7C%5B%5E-%5D(%5B%5E%5D%22'%3E%3C%5D%2B%7C%22%5B%5E%22%5D%22%7C'%5B%5E'%5D')%3E)%7C%5C%3F(%5BA-Za-z_%3A%5D%7C%5B%5E%5Cx00-%5Cx7F%5D)(%5BA-Za-z0-9_%3A.-%5D%7C%5B%5E%5Cx00-%5Cx7F%5D)(%5C%3F%3E%7C%5B%5Cn%5Cr%5Ct%20%5D%5B%5E%3F%5D%5C%3F%2B(%5B%5E%3E%3F%5D%5B%5E%3F%5D%5C%3F%2B)%3E))%7C%25(%5BA-Za-z_%3A%5D%7C%5B%5E%5Cx00-%5Cx7F%5D)(%5BA-Za-z0-9_%3A.-%5D%7C%5B%5E%5Cx00-%5Cx7F%5D)%3B%7C%5B%20%5Cn%5Ct%5Cr%5D%2B)%5D(%5B%20%5Cn%5Ct%5Cr%5D%2B)%3F)%3F%3E%3F)%3F)%3F%7C%5C%3F((%5BA-Za-z_%3A%5D%7C%5B%5E%5Cx00-%5Cx7F%5D)(%5BA-Za-z0-9_%3A.-%5D%7C%5B%5E%5Cx00-%5Cx7F%5D)(%5C%3F%3E%7C%5B%5Cn%5Cr%5Ct%20%5D%5B%5E%3F%5D%5C%3F%2B(%5B%5E%3E%3F%5D%5B%5E%3F%5D%5C%3F%2B)%3E)%3F)%3F%7C%5C%2F((%5BA-Za-z_%3A%5D%7C%5B%5E%5Cx00-%5Cx7F%5D)(%5BA-Za-z0-9_%3A.-%5D%7C%5B%5E%5Cx00-%5Cx7F%5D)(%5B%20%5Cn%5Ct%5Cr%5D%2B)%3F%3E%3F)%3F%7C((%5BA-Za-z_%3A%5D%7C%5B%5E%5Cx00-%5Cx7F%5D)(%5BA-Za-z0-9_%3A.-%5D%7C%5B%5E%5Cx00-%5Cx7F%5D)(%5B%20%5Cn%5Ct%5Cr%5D%2B(%5BA-Za-z_%3A%5D%7C%5B%5E%5Cx00-%5Cx7F%5D)(%5BA-Za-z0-9_%3A.-%5D%7C%5B%5E%5Cx00-%5Cx7F%5D)(%5B%20%5Cn%5Ct%5Cr%5D%2B)%3F%3D(%5B%20%5Cn%5Ct%5Cr%5D%2B)%3F(%22%5B%5E%3C%22%5D%22%7C'%5B%5E%3C'%5D'))*(%5B%20%5Cn%5Ct%5Cr%5D%2B)%3F%5C%2F%3F%3E%3F)%3F)

http://www.cs.sfu.ca/~cameron/REX.html


Horizontal scroll would be useful.


I would think it would be better to minimize the pseudo-DFA to collapse states.

Also, putting each character of an alternation or subset in a light blue box and then laying them out vertically makes them hard to read.

If the point of this is to make regexes more easily understood, you would think you would want to make them compact.


^I love it$


In Emacs:

M-x regexp-builder


I'm an rx geek and can often craft what I'm looking to get without a lot of help, but I have used this tool many times before -- it's very slick. It would be nice if it supported something other than JavaScript[0], but hey, it's on the web, it probably makes a lot of sense to be that way (and it's nice that it's all client-side and I don't have to wait for it to ship my regular expression back to a server for processing).

Regular expressions are simply awesome and I'm continually surprised at how frequently I run into developers who have next-to-no understanding of them. Case in point: I ran into some code a few months ago that spanned two methods and 20 lines to do something that a 6-character regular expression could have solved (and would have done so more performantly[1]); the best part was that the bug I was responsible for fixing ended up residing right inside one of those methods. And then there are all of the things related to "dealing with strings" that many regular expression libraries just handle, such as "\d" vs "[0-9]" in a world with Unicode strings[3]. It feels cryptic[4] when you encounter it and you're not familiar with the syntax, but to learn the "80% most useful parts", you needn't study much more than content that would fit on a single printed sheet of paper (and to get the last 20%, you'd need maybe 2 ... 3?).

All of that said, there's also the other side of the coin; if ever the saying "If all you have is a hammer, everything looks like a nail" had application, it's with regular expressions. I'm not sure how many times the question "How do I write a regular expression to parse HTML" has to be answered with "don't" before folks quit trying[2]. Regex tends to be the first thing I reach for when I need to process text, even when there are better tools; heck, all of my find/replace dialogues in every application that supports it have the "Regex" box checked by default (and it really throws me off when I hit up "Find" in the browser and need to search for something with a ( or ) in it, which I escape due to muscle memory).

[0] I have an occasional need for PCRE and .NET style; and I really miss named-groups when I have to do something complex in JavaScript.
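
To show what I mean by named groups, here's a minimal sketch in Python, which spells them `(?P<name>...)`. The date pattern and input string are made up for illustration:

```python
import re

# Each group gets a name, so the match reads like a record instead of m[1], m[2], m[3].
DATE = re.compile(r"(?P<year>\d{4})-(?P<month>\d{2})-(?P<day>\d{2})")

m = DATE.search("posted on 2018-01-30 at noon")  # hypothetical input
parts = m.groupdict()
print(parts)
```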

[1] While it's easy to accidentally end up in hell, à la https://blog.codinghorror.com/regex-performance/, poorly written string-search code can be worse when the complexity of the pattern you're searching for reaches a certain point, and that's to say nothing of the errors per x lines of code and readability (not that rx is particularly readable under complexity).

[2] And hey, I've got a shell script that downloads a few status pages on my server at home that uses awk with regular expressions to extract values from a web page. I wouldn't say it necessarily qualifies as "parsing HTML" since it's really only concerned with looking for a small string which it filters a second time to get the value -- horribly inefficient, but it's worked for 5 years through page changes without requiring adjustment.
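
The two-pass idea looks roughly like this. It's a Python sketch rather than my actual awk script, and the page content, labels, and `extract_status` helper are all invented for illustration:

```python
import re

# Hypothetical status page; the real one is whatever the server emits.
page = """<html><body>
<tr><td>Uptime</td><td>12 days</td></tr>
<tr><td>Temperature</td><td>41.5 C</td></tr>
</body></html>"""

def extract_status(html, label):
    """Find the line mentioning `label`, then pull the first number after it."""
    for line in html.splitlines():
        if label in line:  # first pass: cheap substring filter
            rest = line.split(label, 1)[1]
            m = re.search(r"<td>\s*([0-9.]+)", rest)  # second pass: the value
            return m.group(1) if m else None
    return None

print(extract_status(page, "Temperature"), extract_status(page, "Uptime"))
```

It survives layout changes only as long as the label and its value stay on the same line, which is the trade-off I'm accepting.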

[3] At least in the C# world, "\d" is about twice as slow as "[0-9]" because it handles digits "correctly" https://stackoverflow.com/questions/16621738/d-is-less-effic...
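
The same distinction shows up in Python 3, where "\d" on a str pattern matches any Unicode decimal digit unless you ask for ASCII-only behavior. A small sketch (this shows the matching difference only, not the speed difference):

```python
import re

arabic_indic_three = "\u0663"  # ARABIC-INDIC DIGIT THREE, a non-ASCII decimal digit

unicode_digit = re.fullmatch(r"\d", arabic_indic_three)             # Unicode-aware
ascii_class = re.fullmatch(r"[0-9]", arabic_indic_three)            # ASCII range only
ascii_flag = re.fullmatch(r"\d", arabic_indic_three, flags=re.ASCII)  # \d limited to 0-9

print(bool(unicode_digit), bool(ascii_class), bool(ascii_flag))
```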

[4] While it's usually written cryptically, many (most?) implementations support flags that ignore whitespace and allow inline comments. I've had a few crazy-ugly rx's that I had to use to extract data from a ticketing system's "blob field" to insert into a structured format; were it not for that feature, it would have been impossible to write and support.
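
In Python that flag is `re.VERBOSE`. Here's a rough sketch of the style; the ticket format and sample text are invented, not the real blob field:

```python
import re

# With re.VERBOSE, whitespace in the pattern is ignored and "#" starts a comment.
TICKET = re.compile(
    r"""
    ([A-Z]+)    # project key, e.g. OPS
    -           # literal separator
    (\d+)       # ticket number
    \s*:\s*     # colon with optional surrounding whitespace
    (.+)        # free-text summary up to end of line
    """,
    re.VERBOSE,
)

m = TICKET.search("blob: OPS-1234 : disk full on host7")  # hypothetical input
fields = m.groups()
print(fields)
```

The same pattern on one line would be `([A-Z]+)-(\d+)\s*:\s*(.+)`, which works but gives the next maintainer nothing to go on.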


Isn't this just debuggex.com? That's hacker but not news.


The guidelines say you can post whatever is interesting to hackers, whether it's new/news or not.


My best friend for Regex: http://www.txt2re.com/

With code examples!



