Source of the famous “Now you have two problems” quote (2006) (regex.info)
154 points by ColinWright on Dec 15, 2020 | 117 comments

Earlier posted in 2015 with 60 comments https://news.ycombinator.com/item?id=10083420

Jeffrey Friedl has recently updated his blog post [1] (mentioned in the "great top comment"), referencing the various HN discussions over the years.

It's all oddly satisfying, in a way...

[1] http://regex.info/blog/2006-09-15/247

I think JWZ may be one of the few people who can criticize Perl and I'll be like "fair enough." Most people just parrot the standard line about Perl being write-only without ever actually having used it.

> It combines all the worst aspects of C and Lisp: a billion different sublanguages in one monolithic executable.

Believe it or not, I used to be a Perl hater and a Lisp (or Scheme) weenie. But having worked with Perl, Common Lisp, Ruby, NodeJS, and Scheme, I really have to say that I'd take Perl over all of those. The worst aspect of C would be pointers and unsafe memory access, which Perl doesn't have. And the worst parts of Common Lisp would be the massive, inconsistent, designed-by-committee language, CLOS, and macros. Oh, and LOOP (I shouldn't need to dig out a reference book to remember how to do looping each time I need to. Seriously.) S-expressions have also worn out their welcome with me. I'm just tired of brace matching (yes, even with your silly fancy IDE).

Perl is by no means perfect, but I've found it does more things right than wrong. Blaming Perl for regexes being useful is like blaming a hammer for nails being ubiquitous. What? You're going to hand code a lexer each time you need to parse a phone number or a zip code? If there was a better alternative we would be using it.

At one point, I didn't know Perl and had to work on a project with it.

For a while I was cursing the arcane syntax such as $_, $\ and even $ on a variable to the left of the equal sign...

and then something weird happened. Those things became idioms in my mind and I started achieving a sort of fluency.

And when I started to think in Perl, I found it to be the highest-level language I had EVER worked in. I could express myself because there were multiple ways to do something and one always matched what I was thinking.

For example, instead of having to think IF NOT SOMETHING I could think UNLESS SOMETHING and then express it as

  unless(something) { do_something(); }
or even:

  do_something() unless something;

Now I like the idea of regular expressions, but I find the implementation leaves a lot to be desired. Or should I say implementations, plural.

To me, the main usability problem with regular expressions boils down to this: it is hard to distinguish between a literal and a regex control character.

This is exacerbated because regex control characters are usually the same characters other languages treat specially as well.

So you have regular expressions, which are somewhat straightforward, but you have to escape them for the place where you are USING the regular expression: Perl, shell script, sed, awk, grep or egrep, etc.

So I've used regular expressions for decades, and even now it's always subject to a little trial and error and requires testing.
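The literal-vs-metacharacter confusion described above is easy to demonstrate. A small Python sketch (the strings are made up for illustration) showing how `re.escape` forces a pattern to be treated literally:

```python
import re

target = "a+b"  # the literal text we want to find

# Naive: '+' is a metacharacter ("one or more"), so the pattern a+b
# also matches 'aab', 'aaab', and so on.
print(bool(re.search(target, "calc: aab = c")))             # True, an over-match

# Escaped: every metacharacter becomes a literal.
print(bool(re.search(re.escape(target), "calc: aab = c")))  # False
print(bool(re.search(re.escape(target), "calc: a+b = c")))  # True
```

The same pattern then still needs a second layer of escaping for whatever shell or host language it's embedded in, which is where much of the pain comes from.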

100% agree. Perl is highly underrated in today's software development climate.

Yep, this. I find it sad even after all these years... I put it down to 2 things: 1) the Perl 6 promise, and 2) Google, then the new hot startup in Silicon Valley, choosing Python...

Anyway... if you’re interested in Perl, especially how to write nice Perl, I encourage you to get yourself a copy of Perl Best Practices by Damian Conway for Christmas

FWIW, the Perl 6 promise was delivered in 2015. Because too many people did not want Perl 6 to be the future of Perl, it was renamed to Raku (https://raku.org using the #rakulang tag on social media). Meanwhile, Perl 7 appears to be the next promise.

Sure, but I’m talking about the era around 2005, when CPAN was way ahead of everyone

Interesting side note: the Tiobe Index, which I know isn't the most accurate measurement of language use and popularity, currently has Perl ranked 14th, up from 20th place last year, and now ranks higher than Ruby and Go.


Note: I did switch to Python. Even though it's less expressive, I found the regularity works out better for me in the long run.

I didn't. I don't write Perl anymore either, though.

When Python started gaining traction, I was so incredibly turned off by the fans of the language that I swore I would never write code in it. That was 1998. If Python was 10% as good as those people were claiming, it would be humankind's greatest achievement for the next 1000 years, and it isn't.

Please note that I'm not saying anything bad about Python here, only its advocates in the late 1990s. I still have a bad taste in my mouth because of those conversations, over 20 years later.

I suspect it's not just you - look at Guido! :)

I can't recall when I started using python, but it was probably 2010 or so, when it was more mature.

That said the 2.x -> 3.x migration was anything but mature.

There was a Perl vs Python religious war back then that generated a lot of online vitriol. Which made some sense since they were competing for mindshare in the same space. Obviously Python won that war, but it's interesting to think how things would have worked out differently without the self-destructive tragedy that was Perl 6.

That tragedy meanwhile has been renamed to Raku (https://raku.org using the #rakulang tag on social media) and is doing very well. Meanwhile, it appears at least some form of tragedy is continuing with the Perl 7 efforts. :-(

I can understand that, even just a few toxic personalities can poison a community, or even perception of it. To be fair though, there were a lot of Python haters flipping the table over significant white space back then as well.

I made an interesting transition a few years ago, and started using Unicode in my programs.

(1) I have an i18n mapping which provides native support for a lot of helpful characters. Both Mac and Linux use all the keys, so in addition to the half-dozen characters with accents, I have access to a lot more directly. Mac and Linux do remap the remaining keys differently, though.

(2) My code got a lot more readable. Δx is easier to read and more concise than delta_x.

(3) I have a lot more characters to pick from.

A lot of the issues you raise with regexp are due to limits of ASCII.

For more esoteric characters, I do Google and either enter alt codes or cut-and-paste, which might sound like , but remember that you write code once, and read it many times. And editors are increasingly good.

Beyond Unicode, one does have text formats with support for colors, bold, and whatnot, either embedded (escape codes in text) or as syntax highlighting (editor parses).

The CS-level reg exps seem like a good idea. The implementation -- where they're encoded as bizarre ASCII escape sequences -- could use a good revamp.

Oddly enough, I think this is my first HN post where I went beyond ASCII....

Edit: And HN filtered out a bunch of Unicode. It kept only the delta.

> What? You're going to hand code a lexer each time you need to parse a phone number or a zip code? If there was a better alternative we would be using it.

STOP PARSING WEB FORM INPUT. Please. For the love of God just stop.

This is why emails of the form "foo+bar@fakeaddress.myemail" rarely work. Somebody's regex didn't allow "+" and everybody copied it. If you really need to check an email address, send a verification email to it. It's the only method that actually works.
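The "+" failure mode is easy to reproduce. A Python sketch where the strict pattern is an illustrative composite of the kind of copy-pasted validator being described, not any particular library's code:

```python
import re

# An illustrative composite of the copy-pasted "validators" described
# above: the character class simply forgot about '+' (and much else).
too_strict = re.compile(r"^[A-Za-z0-9._-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}$")

# The pragmatic check: one '@' with something on both sides, then
# confirm ownership by actually sending a verification mail.
pragmatic = re.compile(r"^[^@\s]+@[^@\s]+$")

addr = "foo+bar@example.com"
print(bool(too_strict.match(addr)))  # False: plus-addressing rejected
print(bool(pragmatic.match(addr)))   # True
```

The pragmatic pattern accepts some garbage, but the verification email catches that; the strict pattern rejects real users, which nothing downstream can fix.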

Why are you parsing a phone number. The user typed in a phone number, obey it. Are you sure that you know every combination of valid phone number?

Zip code ... maybe ... if you are US only. Otherwise a postcode can't be parsed either.

And this is before we talk about the Cardinal Sin of using regexes as parsers.

> Zip code ... maybe ... if you are US only. Otherwise a postcode can't be parsed either.

Also, don't store zip code as an integer. There are valid zip codes that start with zero that will likely get truncated if you store a zip code as an integer.
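The truncation is immediate once the value goes through an integer type. A minimal Python illustration (the ZIP code is just a real example of one with a leading zero):

```python
# A ZIP code with a leading zero, stored numerically.
zip_as_int = int("07030")   # Hoboken, NJ
print(zip_as_int)           # 7030: the leading zero is gone

# Keep ZIP codes as strings; if the damage is already done, re-pad:
repaired = str(zip_as_int).zfill(5)
print(repaired)             # 07030
```

Re-padding only works for US-style fixed-width codes, which is one more reason to store postal codes as text in the first place.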

Excel being helpful that way has been the bane of my existence

Set the type of the cell to text:

1. Highlight the cell, or range/column

2. Ctrl+1, or right-click and choose Format Cells

3. Select Text from the Type combo box

> The worst aspect of C would be pointers and unsafe memory access, which Perl doesn't have.

In a practical sense, it doesn't. But it's there if you specifically go digging for it:

    $ perl -le 'print unpack("p", pack("p", "Hello, world."));'
    Hello, world.
    $ perl -le 'print unpack("p", reverse(pack("p", "Hello, world.")));'
    Segmentation fault (core dumped)

IMHO, grammars in Raku are much better than regexps in many situations, except one-liners.


There is! I do! It's Parsing Expression Grammars!

There's a reason Rakudo adopted them in its quest to out-Perl Perl (5).

Regex is fine as an adequately-powerful shorthand for command-line munging and text editors, where the terseness is useful. It turns on you as soon as you want to use it in a programming language... which is where Perl comes in.

> What? You're going to hand code a lexer each time you need to parse a phone number or a zip code? If there was a better alternative we would be using it.

Use a parser combinator library. I've yet to find a case where they weren't a better replacement for regexes.

The case where one can drop in a battle-tested, documented regex in 15 seconds and get on with your life.

In my experience even the "battle-tested, documented" regexes often have surprising edge cases.
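A concrete illustration of those surprising edge cases (my own toy example): a widely used date pattern validates shape, not meaning, so a semantic check is still needed behind it.

```python
import re
from datetime import datetime

# The classic "battle-tested" ISO-date pattern matches the shape,
# not the meaning:
date_re = re.compile(r"^\d{4}-\d{2}-\d{2}$")
print(bool(date_re.match("2020-01-15")))  # True
print(bool(date_re.match("2020-13-45")))  # True, despite month 13 and day 45

# A semantic check catches what the pattern can't:
def is_real_date(s):
    try:
        datetime.strptime(s, "%Y-%m-%d")
        return True
    except ValueError:
        return False

print(is_real_date("2020-13-45"))  # False
```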

> And the worst parts of Common Lisp would be the massive, inconsistent, designed-by-committee language, CLOS, and macros.

I've always wondered what would happen if there was a CL standardization where everything was CLOS'ified in the :common-lisp package, so the #'+, #'map(can), etc. were generic functions along with a nice set of collections (dictionary, etc.). And the more specialized/efficient functions were squirreled away in a non-default package. Well that, and a bunch of other things like a different module system. SBCL seems like such a nice compiler that doesn't get the respect it deserves for being able to compile fast binaries for such a dynamic language.

There are lots of good ideas floating around for a second CL standard. The problem is, either such a language has a CL compatibility environment, in which case it is probably implemented in CL using the features already in the language, meaning that it's really just another Quicklisp package; or it doesn't, in which case it has to fill some need that couldn't be filled the first way in order to attract a critical mass of interest. Julia is the closest thing I'm aware of to the latter.

It's hard to get people to switch from something they find good enough for their needs, if the new thing is incompatible. See: Perl 6.

In the case of Perl 6 this has resulted in a rename of the language to Raku (https://raku.org using the #rakulang tag on social media). An experience I would not recommend.

Never use LOOP. I don't! (Except in the trivial keywordless form in which it means "loop forever".)

JWZ> Fourth: I like PostScript, and it’s a safe bet that I’ve written more and hairier PostScript by hand than anyone reading this… but the syntax is as close to “write-only” as in any language I’ve ever used. Anyone who defends PostScript as being “readable” is a monster raving loony.

Pththththth.ps! PostScript is too readable. (But he's right I'm a monster raving loony.)


Here's a PostScript Pretty Plotter (among other things):



Many years ago I had some fun with PostScript: https://chris.pacejo.net/stuff/postscript/index

Hunt the Wumpus I'm most proud of, but you need to run it on a PS-enabled printer, or in GhostScript. The BrainFuck interpreter, prime number calculator, and the larger quine all produce paper output on modern systems (just tested in Preview on macOS).

Don, you are one of the few people deranged enough to write PostScript by hand and surely have written more than Jamie. Too much time spent reading the NeWS.

I've written some good PostScript over the years. It's a wonderful, clean and very simple language. Stack languages tend to be. I find it very readable because it's so simple.

Of course others like elaborate closures and syntax that seems simple but ends up hiding tons of complexity about the data structures. To each his own, I say.

Leigh Klotz has written more PostScript than Jamie too, while working at Xerox! But "KLOTZ IS A LOGO PRIMITIVE [BEEP BEEP BEEP]". He wrote a 6502 assembler in Logo!


Leigh Klotz's comment on the regex article:

>OK, I think I’ve written more PostScript by hand than Jamie, so I assume he thinks I’m not reading this. Back in the old days, I designed a system that used incredible amounts of PostScript. One thing that made it easier for us was a C-like syntax to PS compiler, done by a fellow at the Turning Institute. We licensed it and used it heavily, and I extended it a bit to be able to handle uneven stack-armed IF, and added varieties of inheritance. The project was called PdB and eventually it folded, and the author left and went to First Person Software, where he wrote a very similar language syntax for something called Oak, and it compiled to bytecodes instead of PostScript. Oak got renamed Java.

>So there.

>And yes, we did have two problems…

>— comment by Leigh L. Klotz, Jr. on June 7th, 2008 at 3:22am JST (12 years, 6 months ago) — comment permalink

Arthur van Hoff (the author of PdB and the original Java compiler written in Java) has also written more PostScript than Jamie, especially if you count the PostScript written by programs he wrote, like PdB and GoodNeWS/HyperNeWS/HyperLook.

Here's the README file (and distribution) of PdB, Arthur van Hoff's object oriented C to PostScript compiler:


Also a paper by Arthur van Hoff about "Syntactic Extensions to PdB to Support TNT Classing Mechanisms":


Some before and after examples, like menu.h menu.pdb menu.PS:

HyperNeWS 3.0 prototype source: https://www.donhopkins.com/home/archive/HyperLook/Turing/hn3...

menu.h: https://www.donhopkins.com/home/archive/HyperLook/Turing/hn3... menu.pdb: https://www.donhopkins.com/home/archive/HyperLook/Turing/hn3... menu.PS: https://www.donhopkins.com/home/archive/HyperLook/Turing/hn3...



> He wrote a 6502 assembler in Logo!

I'd love to see the turtle JMP.

Please do not disassemble the turtle, kids!!!

Arthur van Hoff (the author of PdB and the original Java compiler written in Java)

And the original AWT, if this is to be a full Airing of Sins.

Agreed, AWT was a horrible compromise in an impossible situation!

But he made up for it by creating "Bongo" at Marimba.

Bongo is to Java+HyperCard as HyperLook is to PostScript+HyperCard.


>Arthur van Hoff [...]

>Marimba Castanet and Bongo

>Eventually Arthur left Sun to found Marimba, where he developed the widely used Castanet push distribution technology, and the under-appreciated Bongo user interface editing tool: a HyperLook-like user interface editor written in Java, that solved the runtime scripting extension problem by actually calling the Java compiler to dynamically compile and link Java scripts.

>Nobody else had ever done anything remotely like Bongo before in Java. Dynamic scripting with Java was unheard of at the time, but since he had written the compiler, he knew the API and how the plumbing worked, so had no qualms about calling the Java compiler at runtime every time you hit the “Apply” button of a script editor.

>Danny Goodman’s “Official Marimba Guide to Bongo”


>Danny Goodman, the author of the definitive HyperCard book, “The Complete HyperCard Handbook”, went on to write the “Official Marimba Guide to Bongo”, a great book about Bongo, described as the “reincarnation of HyperCard on the Internet”.

>[TODO: Write about Bongo’s relationship to HyperCard, HyperLook and Java.]

>Java applets are everywhere on Web pages these days, but if you’ve made the move to Java from a contemporary programming environment you’ve probably been dismayed by its relative immaturity. The Official Marimba Guide to Bongo covers Marimba’s Bongo environment, which is designed to allow rapid development of Java user interfaces. The book shows you how to use the large library of graphics “widgets” supplied with Bongo, how to wire them together with simple scripting, and how to integrate other Java applets. It also explains how Bongo can be used to build channels for Marimba’s Castanet system. -Amazon.com Review

>Java users should be rejoicing at the promise of programming aid Bongo, which is the reincarnation of HyperCard on the Internet. It is fitting that the first major book about Bongo comes from Goodman, the author of the definitive HyperCard book of days gone by (The Complete HyperCard Handbook, Random, 1994). His background is as a journalist, not a technologist, and readers will make good use of this first-rate introduction. This book will circulate. -Library Journal Review

Unfortunately Marimba's Bongo got overshadowed by Sun's announcement of "Java Beans" which Sun was pushing with much fanfare and handwaving as an alternative to "ActiveX", but which eventually turned out to actually be just a server side data modeling technology, not a client gui framework.



Marimba developed Bongo, a Java-based gui toolkit / user interface editor / graphical environment, inspired by HyperCard (and HyperLook), which they used to develop and distribute interactive user interfaces over Castanet.


>Feel the Beat with Marimba's Bongo, By Chris Baron

>In 1996, four programmers from the original Java-development team left Sun to form Marimba and produce industrial-strength Java-development tools for user interface and application administration. Bongo, one of Marimba's two shipping products, allows developers to create either a Java-application interface or a standalone Java-based application called a "presentation." A Bongo presentation resembles a HyperCard stack -- it allows developers to quickly create an application with a sophisticated user interface, but without the tedious programming of directly coding in Java or C/C++. Bongo's nonprogramming, visual approach makes it ideal for producing simple applications that don't involve a lot of processing, such as product demonstrations, user-interface prototypes, and training applications. Bongo is fully integrated with Castanet, Marimba's other product, a technology for remotely installing and updating Java applications.

Bongo was unique at the time in that it actually let you edit and dynamically compile scripts for event handlers and "live code" at run-time (in contrast with other tools that required you to recompile and re-run the application to make changes to the user interface), which was made possible by calling back to the Java compiler (which Arthur had written before at Sun, so he knew how to integrate the compiler at runtime like a modern IDE would do). Without the ability to dynamically edit scripts at runtime (easy with an interpreted language like HyperTalk or PostScript or JavaScript, but trickier for a compiled language like Java), you can't hold a candle to HyperCard, because interactive scripting is an essential feature.

Danny Goodman, who wrote the book on HyperCard, also wrote a book about Bongo. Arthur later founded Flipboard and JauntVR, and now works at Apple.

Here's a paper I wrote comparing Bongo with IFC (Netscape's much-ballyhooed Java Internet Foundation Classes). (Notice how IFC = Internet Foundation Classes was Netscape's answer to MFC = Microsoft Foundation Classes. Never define your product's name in terms of a reaction to your widely successful competitor's name. cough SunSoft cough)

NetScape's Internet Foundation Classes and Marimba's Bongo


>In summary, I think it was too early to write a Java toolkit before JDK 1.1, so IFC has gone and done a lot of its own stuff, which will have to be drastically changed to take advantage of the new stuff. Bongo is not as far down the road of painting itself into a corner like that, and if some effort is put into it, to bring it up to date with the new facilities in Java, I think it will be a better framework than IFC. Java Beans remains a big unknown, that I don't have a lot of faith in. Arthur says Java Beans does too much, and I'm afraid it may try to push competing frameworks like IFC and Bongo out of the limelight, instead of just providing a low level substrate on top of which they can interoperate (like the TNT ClassCanvas). If Bongo can pull off ActiveX integration with style and grace, then it wins hands down, because I doubt IFC can, and I don't trust Sun to deliver on their promises to do that with Java Beans.



>Wow, a blast from the past! 1996, what a year that was. [...]

Hah yeah, I remember Bongo and IFC and did some work on one of their competitors. In retrospect, yes, it's not an approach that really had much of a chance in the long term because at the end of the day, it's building dynamic dispatch/message passing and all sorts of other scaffolding out of more-or-less strings on top of a language/runtime that's supposed to provide these.

And then Eclipse happened.


IBM named it Eclipse just to poke Sun in the eye. And it worked, it totally tweaked them! A friend of mine working at Sun at the time once lamented to me that "The name 'Eclipse' is so unfortunate. And why won't anybody use NetBeans?" ;)

Eclipse is a big bold name with an obvious meaning related to Sun, that expressed IBM's goal in one word: they intended to eclipse Sun in the Java world.

Their plausibly deniable cover story was that they meant to be competitive with Visual Studio on Windows (which was a passive-aggressive dig at Sun, because Sun had nothing to compare with Visual Studio, which was more ambitious than NetBeans to eclipse).


>Eclipse: Behind the Name

>What's in a name?

>Back in 2003, when Sun Microsystems Inc. was considering whether it might join the then soon-to-be-independent Eclipse Foundation, one of the key concerns, aside from technical issues, was the name Eclipse.

>Sun said it would not join an organization named Eclipse, and the foundation agreed to change the name. The Santa Clara, Calif., company didn't want to join an organization whose name was perceived as encouraging the demise of Sun, company executives said at the time. [...]

>"We decided to do what it would take to be competitive with Visual Studio on Windows," he [IBM's Nackman] said.

>So the target then was and now is Microsoft, not Sun, he said.

>But the name seems so perfect a knock against Sun. How could it not be? Well, according to a source, some of the early Eclipse originators had a retreat where one of the themes was the universe and many code names emerged involving celestial themes. Eclipse stuck. And while Sun was not necessarily the primary target, "these were really smart people, and I don't think the visualization and competitive implication was lost," a source said. [...]


>Eclipse Casts Shadow on Sun

>The Eclipse open-source development platform is outshining Sun's NetBeans in terms of developer and vendor support, but Sun vows to continue to innovate around NetBeans.

>Does Sun have lunar envy? [...]

>And of the tools landscape, a Microsoft source said: "The game is not over, but when we think of developer ecosystems other than Visual Studio we think Eclipse. We don't think NetBeans."

Oh, snap!

Nice, nice. Just in case people aren't reading the actual source code on the last link, here are some highlights...

  % This is *gross* code. I mean UUUUUGLY! (And it used to be
  % even more contorted, if you can believe that.) 
  % Brute force debugging hacks
  % Ignore this stuff. It was written when I was very frustrated.
  (% ifdef PISSEDOFF
  % For use when mildly irritated:
  % This is nasty evil vile implementation dependant hackery. 
  % Good god this has gotten bigger than dictbegin can handle!
  XNeWS? not { 300 currentdict extend pop } if
    % This code was designed to be rewritten!
  % XXXX: Here be the start of the trouble.
      MyProcess type /processtype eq {
        pause pause pause % maybe it will kill itsself

Also it has a definition of quicksort.

It's almost worse than sendmail.cf

Some people, when confronted with a problem, think “I know, I'll use floating point.” Now they have 1.999998 problems.

I think the only correct quote here is "Now they don't have two problems."

The Java variant of this is "I had a problem so I used Java, now I have a ProblemFactory"

And thus, Javascript was born.

I must admit, I find the idea of the origin here (invented by coders) unlikely; it seems far too generic and likely to go back considerably further.

The suggestion of a WW2 soldiers' origin (a la foo bar) in the article is more believable, though really I don't see why it shouldn't have predated that as well.

> ... comp.lang.emacs

Wasn't the quote from comp.religion.emacs?

But seriously, regular expressions are often ill-advised, as solutions based on them are frequently:

- over-matching: they match things in cases where they shouldn't, because a situation wasn't anticipated by the developer

- under-matching: not catching some input they should because a pattern was formulated too narrowly by the developer

- matching the wrong thing altogether because they are hard to stay on top of, even when it's one you wrote

- matching something, but you don't understand what and how it works, because regexes are "write only".

There is even another case:

- They work but they are the wrong instrument, because you try to model something that requires context-free generative capacity (e.g. syntax coloring)

Interestingly, at Xerox Research Centre Europe (now Naver Labs) in Grenoble, a much more elegant formalism (more powerful yet more readable, and symmetric between inputs and outputs) was developed as an alternative, mostly for writing linguistic rules.

For more info:

* http://users.itk.ppke.hu/~sikbo/nytech/gyak/05_morfo/xfst/bo...

* https://press.uchicago.edu/ucp/books/book/distributed/F/bo36...

* There is an open-source implementation called FOMA: https://fomafst.github.io/

It's in the article: alt.religion.emacs

My relationship with RegExp seems to be against the grain here. They're almost always a quick, simple, clean solution to any text parsing issue I've faced. Perhaps without tools like https://regexr.com/ I would feel different though?

Have you tried parsing an email address with a regex?


Have you tried parsing an email address (in an RFC compliant way) without a regex?




It's a fair few lines of code but most of it comes directly from the RFC. It's also longer than it could be because it covers the entire RFC822 syntax, including groups, lists, routes, comments and To: header styles.

to be fair, the email RFC is a silly document that does silly things. The user part of an email address "may" be case-sensitive.

But in practice, no one does that. Because imagine grandma on the phone with her bank not remembering if her email address was Mary123@gmail.com, MARY123@gmail.com, or mary123@gmail.com. The horror.

Some systems (used to) map the local part to local unix usernames, which are case-sensitive. Hence the sender would get a 550 No such user (or similar) if he got the case of the local part wrong.

There are a lot of really useful use-cases between not using regexp and using regexp to parse email addresses.

For example, our system needs to handle container numbers[1] as part of an official form, and we check them for correctness to avoid rejection of the form.

As far as we're concerned the container number consists of four letters and seven digits. However, the user is allowed to submit a container number with a hyphen or a space between the letters and the digits.

Writing a regex that checks for a properly formatted container number is quick, easy to understand and a lot less convoluted than most alternatives.

[1]: https://en.wikipedia.org/wiki/ISO_6346
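The check described above fits in one line of pattern. A Python sketch where the exact pattern and the sample numbers are my own illustration; the ISO 6346 check-digit calculation is deliberately omitted:

```python
import re

# Four letters, an optional space or hyphen, then seven digits
# (six-digit serial plus check digit).
container_re = re.compile(r"^[A-Z]{4}[ -]?\d{7}$")

for s in ["MSKU1234567", "MSKU 1234567", "MSKU-1234567", "MSKU12345"]:
    print(s, bool(container_re.match(s)))  # the last one is too short: False
```

This is exactly the sweet spot for regexes: a short, fixed-shape token where the pattern reads almost like the spec.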

Have you tried parsing a regex with a regex?



The language of regular expressions is not itself a regular language, though? It has parenthesis matching.
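Right, and that parenthesis matching is the classical limitation. A Python sketch (my own toy example) of why balanced parentheses need more than a regular language:

```python
import re

# Counting nesting depth takes a few lines of ordinary code...
def balanced(s):
    depth = 0
    for ch in s:
        if ch == "(":
            depth += 1
        elif ch == ")":
            depth -= 1
            if depth < 0:
                return False
    return depth == 0

print(balanced("(a(b)c)"))  # True
print(balanced("(a(b)c"))   # False

# ...while any fixed regex only checks up to a hard-coded depth.
# This one allows at most two levels of nesting:
shallow = re.compile(r"^[^()]*(\([^()]*(\([^()]*\))?[^()]*\))*[^()]*$")
print(bool(shallow.match("((x))")))    # True, within the bound
print(bool(shallow.match("(((x)))")))  # False, one level too deep
```

(Perl's recursive patterns and PCRE's `(?R)` blur this line in practice, but those are no longer regular expressions in the formal sense.)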

Email is the rare "validate, don't parse" case.

With an email address, you almost always want a valid email address, and don't much care why it's valid.

So, sure, check for an @ in there, but then just... try and email it with a link, if the page gets visited, congrats, you have a valid email address.

This will allow your program to be used as a vector to attack the things your program depends on.

It's true that those bits of eMail infrastructure are probably more robust but it's still strictly bad practice.

Even if you're "just" storing it in your database, you should sanitise it on the way in so that when someone does something "unexpected" with it, such as display it in a web browser UI, you're not going to suffer from injection attacks there either.

Relying on sanitization is bad practice. Your systems should work properly and securely even if every text field in your database is filled with Robert'); DROP TABLE Students;-- or <script>alert(document.cookie)</script> - if displaying them in web UI leads to an injection or XSS, then the web UI code is horribly broken and needs to be fixed, input sanitization is at best a temporary workaround.

You're not strictly wrong...

But what I'm trying to say is that what's in your database should be well defined. You shouldn't just put any old stuff in there. You should have a standard for exactly how everything is escaped (or not) so that your consumers have a spec to work to.

You'd do the same with, say, character encodings. One option is to convert everything to a single character set on the way in, such as UTF-8. Another option is to annotate everything with the character set it uses. You must choose one.

Relying on adhoc code spread across the codebase for the security properties of untrusted data leads to whack-a-mole security situations.

Being able to trust the data in your database is essential.

Note: that absolutely doesn't mean things like "Robert'); DROP TABLE Students;--" shouldn't appear in database fields.

It just means that if you define the type of a field to be "eMail address" then consumers of it really should be able to trust that it really is a legitimate and valid eMail address. What does "valid" mean in this context? Well, that's up to your spec. Perhaps just "legally structured". Perhaps "something that eMail can actually be delivered to". Perhaps "something that is known and assured to actually be associated with this particular user".

...but you must be explicit otherwise consumers have nothing to work from and you're building castles on the sand.


Apropos to OP, this article also quotes JWZ.

There's a certain amount of irony there where there's about a dozen different regexes that all match different languages. And none of them are right, for that matter (non-ASCII email addresses are a thing).

What, in your opinion, is the correct way to parse an email address?

Oh I have absolutely no good insight into what a better way to do it would be, if any. Just saying that doing it “properly” does not have a nice solution in the shape of a simple regexp


(This is a joke)

> (This is a joke)

...but it's closer to correct than most email regexes in the wild!

You're not wrong. I think OP is a generalized rant that is a bit dismissive of problem solving diversity. I think OP's gist is: if you don't understand a tool, don't use it. But even that is specious.

Normal regular expressions have been around since Kleene, Chomsky and Russell started framing them in the 40s/50s. They are a powerful tool for lexical analysis; Perl showed us that in the early 90s. Dismissing them is just gatekeepy, IMHO.
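Lexical analysis really is where they shine. A minimal tokenizer sketch in Python (the token set is made up for illustration), using one verbose regex with named alternatives:

```python
import re

# One verbose regex, one named group per token kind, scanned left to right.
TOKEN = re.compile(r"""
    (?P<NUMBER>\d+)
  | (?P<IDENT>[A-Za-z_]\w*)
  | (?P<OP>[+\-*/=])
  | (?P<WS>\s+)
""", re.VERBOSE)

def tokens(src):
    # Note: finditer() silently skips characters no alternative matches;
    # a production lexer would report those as errors.
    for m in TOKEN.finditer(src):
        if m.lastgroup != "WS":
            yield (m.lastgroup, m.group())

print(list(tokens("x = 40 + 2")))
# [('IDENT', 'x'), ('OP', '='), ('NUMBER', '40'), ('OP', '+'), ('NUMBER', '2')]
```

Each token kind is a regular language, so a regex alternation is the natural fit; the trouble only starts when people push the same tool into the grammar layer above it.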

It may have something to do with the complexity/inconsistency of the strings you're matching. I feel like often regexes work fine for me with a little fine-tuning, but I have certainly run into examples in code I didn't write (or even code I did, months or years ago) which took me a lot of time to really wrap my head around.

IMHO, they are very practical as long as they stay simple. But there is a threshold you shouldn't cross.

You almost certainly would feel differently.

I find that I can solve most of my jira tasks in 10 mins with string.length and string.charAt()

My estimated time to completion with regex is literally undefined

That's just unfamiliarity. Whatever works for you is good. Sometimes the hammer does work on the screw, you just have to use it right.

Or we might end up with https://xkcd.com/927/

We're all unfamiliar with assembly and electrical engineering too right, should we all invest the time to learn how to program bare metal? Only when the performance is needed, right?

No, you completely misunderstood and went off on a tangent. My comment means the exact opposite to what you understood.

I've never used tools like that and I feel the same. Regexes provide a simple, well supported solution to a huge number of problems, and I'm not sure what the second problem is supposed to be.

The quote is fun, and I've sometimes used variants of it, but I don't get why the target is regular expressions.

The second problem shows up when the regex fails because of an edge case you didn't think of. :-)

Finite State Morphology was mentioned in the comments and was new to me. It really does look much more useful than regular expressions at the expense of some verbosity.

The way that FSM is inherently a forward/backward process seems super elegant, and it's more readable (at first read) than regexes.

Anyone have insight why they aren't more common?


jwz is oddly everywhere. Netscape Navigator and Mozilla and Emacs obviously, but also owns this nightclub in SF I used to frequent, and he hates HN enough that going to his site with a `Referer` from here will give you a testicle in an egg cup.


What is remarkable is that he had a ~12 year career in professional SW development, hasn't really done much in the field since apart from XScreenSaver updates, and is still remembered 20+ years later.

One of my favorite quotes from JWZ: "In 1999 I took my leave of that whole sick, navel-gazing mess we called the software industry. Now I'm in a more honest line of work: now I sell beer."

I've greatly appreciated his writeup on how to properly thread email conversations from the MIME files. By far the best resource on the subject I was able to find.

Paraphrasing for my latest favorite pet peeve...

Some people, when confronted with a problem, think "I will create a branch." Now they have two problems.

The Codeless Code - Where Angels Fear to Thread - http://thecodelesscode.com/case/121

> The nun Satou approached Java master Banzen and said: “There is a processing bottleneck which I believe I can eliminate by means of multiple worker threads. Yet I have heard this proverb: If you are confronted with a problem and say, ‘I shall use threads’, you now have two problems.”

... and it goes from there to explore it a bit.

Some people, when confronted with a danger, think "I know, I'll call the police!"

I have a variation where I start with no problems. But someone thinks I have a problem (in reality they have the problem). So I need to create a bunch of complex debug code to test the situation. Of course the debug code has bugs. Now I have a bunch of debugging to do on my debugging code itself. To prove there's no bug.

Sounds like job security, the opposite of a problem!

I always had a hard time remembering the regex syntax and form until I learned about scanners/lexers/tokenizers and finite state machines. Then it really clicked and made a lot of sense.
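That "click" is easy to see once a regex is written out as the finite state machine it compiles to. A toy sketch (the table and state names are illustrative): the regex `a(b|c)*d` as an explicit DFA transition table.

```python
# The regex a(b|c)*d as an explicit DFA. Missing entries mean "reject".
DFA = {
    ("start", "a"): "middle",
    ("middle", "b"): "middle",
    ("middle", "c"): "middle",
    ("middle", "d"): "accept",
}

def matches(s: str) -> bool:
    """Run the DFA over the input, one character per transition."""
    state = "start"
    for ch in s:
        state = DFA.get((state, ch))
        if state is None:
            return False
    return state == "accept"

print(matches("abccbd"))  # True
print(matches("abc"))     # False: never reached the final 'd'
```

Seen this way, `*` is just a loop back to the same state and `|` is two edges out of one state, which is arguably why the syntax stops feeling arbitrary.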

James Clark's compact syntax for the RELAX NG XML schema validation language is quite tastefully designed: an equivalent but more convenient alternative to the XML syntax, for writing tree regular expressions matching XML documents. It's way more beautiful and coherent than the official "XML Schema" standard.


RELAX NG compact syntax cheat sheet



>RELAX NG compact syntax is a non-XML format inspired by extended Backus-Naur form and regular expressions, designed so that it can be unambiguously translated to its XML counterpart, and back again, with one-to-one correspondence in structure and meaning, in much the same way that Simple Outline XML (SOX) relates to XML. It shares many features with the syntax of DTDs. Here is the compact form of the above schema:

    element book {
        element page { text }+
    }
>With named patterns, this can be flattened to:

    start = element book { page+ }
    page = element page { text }
>A compact RELAX NG parser will treat these two as the same pattern.

There's a wonderful DDJ interview with James Clark called "A Triumph of Simplicity: James Clark on Markup Languages and XML" where he explains how a standard has failed if everyone just uses the reference implementation, because the point of a standard is to be crisp and simple enough that many different implementations can interoperate perfectly.

A Triumph of Simplicity: James Clark on Markup Languages and XML:


"The standard has to be sufficiently simple that it makes sense to have multiple implementations." -James Clark

Jeffrey Friedl's regular expression book (mentioned at the top of the article) is fantastic and I highly recommend it. If you want to not only learn regular expressions, but have them be clear upon first glance and know how to optimize them based on your regular expression engine's implementation, then his book is gold.

The first half is about using them, and the second half is focused on optimizing. You can skip the second half and still get a lot out of it, but it really is some fantastic stuff.

Agreed. I think his book is one of the best technical books I've ever read, it's clear, easy to read and thorough.

I love that quote and throw it around pretty often, to the great grievance of my colleagues. I never really got on the regex train. I get the use case, but I think they are abused, and most people should be writing some kind of parser instead. We even live in the age of parser combinators, so why not use that? I find that the majority of the people attracted to regexes don't understand parsing.

But by all means use them for a quick hack or script.

Regex, like SQL, is a black art I'll never understand. At best I tolerate it. Usually it's easier to split the string into an array and manipulate that instead.

Carefully read through (bottom up: main, slre_compile, etc.) this short implementation and all will be clear: http://slre.sourceforge.net/1.0/slre.c Also a great introduction to writing compilers.

It's surprisingly more logical and straightforward than you might think. Just start small and work your way up. No doubt you already use wildcards when interacting with the shell; now just add little bits of syntax as you go until you can match more complicated things.

You don't have to go as far as pattern extraction and backreferencing and all that to make some good use of it.

What's so strange about regex? Or SQL for that matter?

Basic regex is readable, at a glance, for the most part, with a little practice. Anything marginally complex and you have to stop what you’re working on and really read the regex to try to understand what it’s doing.

SQL isn’t hard, but trying to remember all the use cases for the various ways to query something always trips me up. “Wait, do I do a LEFT JOIN here? Ok, now I gotta look up the different kinds of joins so I can be sure I’m writing the most efficient query.”

Again, it’s just a matter of practice. I’d take SQL over a regex any day.
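For the LEFT JOIN question specifically, the difference is easy to see with a throwaway in-memory database. A sketch using Python's stdlib sqlite3 (table names and data are made up for illustration):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE users  (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE orders (id INTEGER PRIMARY KEY, user_id INTEGER);
    INSERT INTO users  VALUES (1, 'ada'), (2, 'bob');
    INSERT INTO orders VALUES (10, 1);   -- only ada has an order
""")

# INNER JOIN: only users that have at least one matching order.
inner = con.execute(
    "SELECT u.name FROM users u JOIN orders o ON o.user_id = u.id"
).fetchall()

# LEFT JOIN: every user, with NULL filled in where no order matches.
left = con.execute(
    "SELECT u.name, o.id FROM users u LEFT JOIN orders o ON o.user_id = u.id"
).fetchall()

print(inner)  # [('ada',)]
print(left)   # [('ada', 10), ('bob', None)]
```

Rule of thumb: LEFT JOIN when rows on the left must survive even without a match; plain (INNER) JOIN when unmatched rows should drop out.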

As for SQL, I think I just don't put the time into learn it.

At the higher levels, it's definitely its own programming language, with a radically different heritage from the JS/C# I'm used to.

They're definitely different paradigms of thinking than typical coding. I enjoy regexes, but definitely get how other people would find them bizarre.

Personally, I've never liked SQL, though. As with regexes, a little does no harm. But building anything substantial with it always turns into what feel like absurd bank shots to me. I don't just have to cast everything into the relational paradigm, I have to try to guess how a couple million lines of DB server code is going to map that back into operations I actually understand.

That said, as with most technologies that I don't personally like, people seem to build sufficiently effective systems and profitable businesses on top of them. So I think it's down to personal cognitive differences and matters of taste. Let a thousand flowers bloom, I say.

Lol, "alt.religion.emacs" ?

I'm old, and I'm finding language drift to be something that I regularly see, and frequently to be baffling. This is an example.

To me, "lol" means "laugh out loud". So I ask, and it's a genuine question, did "alt.religion.emacs" really, honestly, make you "laugh out loud"? It may have done, but I'm interested to know.

The follow-up is this.

If it did make you "laugh out loud", what is it that you found funny?

If you didn't "laugh out loud", what do you actually mean by "lol"?

I'm interested in language, its uses, and how it's changing, and this is one case where I'm definitely in doubt as to the intended meaning of the comment.


Meaning has drifted over time, and in present usage "lol" usually means something more like "heh" - an expression of mild amusement.

"lol" has morphed into an onomatopoeia, sometimes expressed out loud as an interjection in the same way that "heh" or "ha" might, but with a touch of "I found that droll".

Usenet, man. You kinda had to be there.

I know somebody who named their dog Emacs.

are they wrong?

It is impossible to parse HTML with a regular expression, even in Perl.

It depends on what you’re trying to do with the html.

Web scrapers aren’t too bad with perl regexps.
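Right: extraction and parsing are different jobs. A regex that plucks out link targets is often good enough for scraping, even though it isn't an HTML parser. A hedged Python sketch (the pattern only handles double-quoted `href` attributes, which is exactly the kind of assumption that makes it a scraper, not a parser):

```python
import re

html = '<p>See <a href="https://example.com">here</a> and <a href="/docs">docs</a>.</p>'

# Grab double-quoted href values. Brittle against single quotes,
# unquoted attributes, comments, and scripts -- hence the famous advice
# against "parsing" HTML with regexes.
hrefs = re.findall(r'<a\s+[^>]*href="([^"]*)"', html)
print(hrefs)  # ['https://example.com', '/docs']
```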

Relevant XKCD:


I get a 404.

Same here in Firefox. The website configuration is broken.

The https site returns 404 but the http site has the content. This breaks Firefox's preference for https, as I keep getting redirected from the http to the https site since the https site responds (albeit with a 404).

It might not help you, but the link is working for me.

The site's https site is broken so those of us who have enabled Firefox's preference for https are getting 404s.

It is intentionally so. Go to the site root to see.

Useful to know ... thanks.
