Hacker News new | past | comments | ask | show | jobs | submit login
The unexpected Google wide domain check bypass (xdavidhu.me)
644 points by notRobot on March 9, 2020 | hide | past | favorite | 114 comments

Many joyous moments of genuine positive hackery. Notable for me were:

- At several points, unusual things were done "for some reason", following an intuition.

- The one character that does the trick was discovered by brute-forcing the browser's URL parsing logic.

- After finding that this little trick works, other related Google apps were explored (for "some reason") to see if the trick is generally applicable, which indeed it was.

- Deep diving into Chromium source to confirm why the trick works.

- How nicely the ends are tied up at the end, seeing the result of his work have an effect deep in a Closure library commonly used at Google.


Loved this part (my emphasis):

> I was always “afraid” to dig this deep, and would never have thought about looking into the source code of a browser, but after browsing around a bit, you realise that this code is also just code, and you can understand how it works. This is super and interesting and can be really helpful in many situations.

The impact of this bug is probably very wide including server-side - not just a Google or Closure because as Someone1234 said: this buggy RegX was from the [RFC3986] standard on URL parsing: https://tools.ietf.org/html/rfc3986#appendix-B

Edit 3: tl;dr: [/?#\] in browsers (including backslash) ends the host/authority but the RFC only specifies [/?#] as ending characters, so for example https://xdavidhu.me\.corp.google.com/blah is recognised as host `xdavidhu.me` by the browser, and host `xdavidhu.me\.corp.google.com` by the RFC - so host name whitelisting or matching can be borked for security purposes.

Edit: maybe do a search of github for this Regexp, and see what significant finds there are. Certainly bad actors are already doing so...

Edit 2: presumably there will be a gold-rush to find other bug bounties for the same underlying issue in other software or services!

Also, Google’s $6000 payout seems extremely disappointing: Google should payout on the total impact of the bug that they measure (not just the one place he happened to find). Why should the reporter have to investigate the total impact of the bug on Google in order to get a fair payout?

On reflection: the RFC is not buggy (by definition) and the correct fix is to never whitelist a hostname that contains a backslash. There is already a large variety of other checks that should be made on hostnames (e.g. no control characters, Unicode, form, length, where .’s are, blah blah blah), so this issue should only affect poorly written code.

The “fix” made by Google to Closure is weird - because it now breaks the RFC to match the broken browser behaviour... Seems like asking for trouble (although a proper generic fix is hard). Throwing an exception on backslash seems dirty, but possibly safer...

> Also, Google’s $6000 payout seems extremely disappointing: Google should payout on the total impact of the bug that they measure (not just the one place he happened to find)

My understanding is that this is what Google's bug bounty normally does. If for example, this was trivially extentable to access to a production user, that would be the payout.

URL parsing is just a complete botch. I’m pretty surprised this sort of thing is so rare.

Most parsing in web standards is. There doesn't exist a way to validate an email address since the official spec is quite complex and then email servers can just ignore it and accept emails on addresses that shouldn't be valid.

This often becomes a pain in the ass when a single piece of software is strict on the rules that everything else ignores. Officially subdomains can't have underscores in them but the browser and everything else will allow it just fine except when you try to get a certificate and lets encrypt tells you that you have an invalid domain.

I think that's correct, or at least helpful. The business of a browser is to get data from the internet regardless of what stupid domains people use. The browser shouldn't stop you visiting an invalid url if that url will return some data. The business of Let's Encrypt is to certify the security of a domain, which includes checking if the domain is valid. LE should stop you signing an invalid URL. That is also correct.

The real culprit is the domain registrar and the DNS server software. Neither of those should be enabling invalid domains to be accessible.

> The real culprit is the domain registrar and the DNS server software. Neither of those should be enabling invalid domains to be accessible.

They already don't. DNS does not forbid underscores, and places hardly any restrictions on labels.

Hostnames do forbid underscores, and that muddies the water quite a bit. While disallowing A-records on labels with undersores would make sense, a CNAME on a label with an underscore can point to a domain without one, and the one it points to may or may not have an A-record.

It is really up to the client to decide that a domain name is invalid for their use case, but clients don't want to disallow something that appears to otherwise work fine. And so we end up in a situation where none of the specifications are truly followed.

This has long been known as Gresham's Law trumping Postel's Principle. For example:

* http://jdebp.uk./FGA/qmail-myths-dispelled.html#MythAboutBar...

> Also, Google’s $6000 payout seems extremely disappointing

Agree. Considering that they would probably make this money under a microsecond they could pay a generous reward. The impact is large though the bug is present in a single line of code.

$6,000 per microsecond would be $6Bn per second, big and profitable as Google is, i'm going to say it's a little bit less than that :)

>Google should payout on the total impact of the bug that they measure

And what is that total impact?

As said: only Google has the resources and motivation to measure the potential impact. David already identified a few places the bug might affect:

“I found this regex in the Google Cloud Console’s JS, Google Actions Console’s JS, in YouTube Studio, in myaccount.google.com (!) and even in some Google Android Apps. A day later I even found this line in the Google Corp Login Page (login.corp.google.com)”.

>only Google has the resources and motivation to measure the potential impact.

And apparently they have judged the impact less than folks who think $6000 is too small.

They may have decided, for example, that the majority of cases were not security related. If they were just cosmetic or something...

One way to estimate it is to ask how much that bug would've sold for on the black market.

That'd definitely be more than USD 6k.

Who would the buyer be and how much would they pay?

One of these days, I'm building an integration test suite of all the weird corner cases in HTTP and whatnot to throw at libraries.

God help the poor engineer here. Let's say he or she (reasonably!) looked up the grammar for a URI in RFC 3986[1], the standard that defines them. Well, according to that grammar, the browser is wrong: backslash isn't valid in that position, and the exploiting string isn't a valid production of the URI grammar.

But of course, the browser has its own standard — the WHATWG URL spec — which doesn't even define a grammar for a URL, but instead defines a program in pseudocode to parse them. Normally, \ isn't valid after an authority … unless the scheme is http or https or one of several other Chosen One schemes.[2]

Uuugh. We've wrought this upon ourselves.

Also included in this test suite would be the wild and wonderful things you can type into a browser's URL bar that result in malformed HTTP requests hitting the wire… AFAIK, WHATWG doesn't yet have their own copy of HTTP…

[1]: https://tools.ietf.org/html/rfc3986#section-3

[2]: https://url.spec.whatwg.org/#hostname-state

Maybe one should argue that regex should not have accepted the URI, since it's isn't valid according to the RFC.

I wrote 90% of a strict URI parser over a year ago, I know that RFC way better than I'd like to. I stopped at 90% after discovering the reason why my fuzzer was generating test failures: the ABNF is wrong. I believe it was around the query portion of URI segmentation, where two potential interpretations shared the same order of precedence and there was no rule on the order of execution. I'll finish that project one day... because I still don't know how scared I should be about that sort of ambiguity making it through more than one ABNF parser.

As someone who has also worked on various bits of HTTP and URI parsing I have also wished for such a test suite - or even that these RFCs and specs would come bundled with such a suite.

There are a lot of weird cases to consider. For example, how many tools correctly handle character encoding in HTTP headers?


I attacked it from both directions: an arguably supervised dataset of urls (hundreds of GB from various internet survey projects), and unit testing the insane markov gibberish the parser makes when you cram random data where it doesn't belong and run it in reverse. The dataset was way less helpful for a strictly conforming design, but I imagine it would be useful for a mostly correct parser.

Grammars are overrated—they regularly leave ambiguities and sometimes conflict with the text, despite both being considered normative.

I love WHATWG specs with their state machines. They’re all about defining exact behaviour based upon what has already been implemented by extant software, and define comprehensively how every possible input should be parsed. They’re easy to transcribe in code without error. These things get you compatibility, and avoid security bugs where people have implemented different behaviour in different places—as was the case in this instance. As a parser spec, the HTML spec is my favourite, because it makes implementing something that is fiendishly complex genuinely easy.

I would love something equivalent for HTTP/1 and MIME messages.

You could say pretty much the same for TCP (or even maybe lower down the network stack)

Windows again. "Let's use a backslash instead of a slash in paths" has to be up there with "NUL-terminated strings" in terms of tiny bad ideas from long ago that are still haunting us.

Of course it's not as though Google had a choice at the outset, I'm sure some non-trivial number of web sites magically didn't work (and maybe still don't work) if you insist that https://www.example.com\works\fine\on\my\pc doesn't do what the person writing it expects... I think we can blame some early Internet Explorer build for first having this work (I can't see Netscape allowing it) and then it becomes a choice whether to be a purist or sigh and move on once IE gets popular and real people do it.

The backslash as path dates back at least as far as CP/M which didn't have directories and used a forward-slash instead of a minus for command flags. Then when directories were added, a forward-slash was obviously unavailable.

[edit] Apparently CP/M took it from VMS, which used period as a path separator.

The backslash as path dates back at least as far as CP/M which didn't have directories and used a forward-slash instead of a minus for command flags.

I had a CP/M machine in the 70's, and I remember it quite differently.

My memory was that CP/M used normal slashes for directory separation, and the backslash was a Microsoft bodge so it could pretend it didn't have Seattle Computer Products copy CP/M 80 to make DOS.

I'm not at home, so I can't verify the slash situation on any of my CP/M emulators. It's also possible that I am not remembering correctly because I was also using Vaxen and PR1MEOS in those years.

Wonder where you got that information from, since CP/M is older than VMS by like 3 years. Can’t be... wonder where this keeps cropping up.

Besides CPM filesystem wasn’t hierarchical.

> Wonder where you got that information from, since CP/M is older than VMS by like 3 years. Can’t be... wonder where this keeps cropping up.

This stack-exchange post[1]

Looks like that's a corruption of the information from a MS blog[2], which does not mention CP/M but DOS 1.0 as taking it from DEC in general, including systems older than VMS.

> Besides CPM filesystem wasn’t hierarchical.

Which I mentioned in my comment.

1: https://superuser.com/questions/176388/why-does-windows-use-...

2: https://web.archive.org/web/20120304135002/http://blogs.msdn...

DOS couldn't use "/" for paths because it was already using it for command flags (where unix uses "-"). This predates both VMS and CP/M by being used in RT-11 (an older DEC OS).

NUL terminated strings are a great representation. We can take a suffix of a non-empty NUL-terminated string just by incrementing the pointer, which is great for recursion. A pointer to a NUL is itself a valid string.

E.g. tail recursive strchr:

  const char *strchr_rec(const char *str, int ch)
     if (*str == 0)
       return NULL;
     if (*str == ch)
       return str;
     else return strchr_rec(str + 1, ch);
strrchr (rightmost match):

  const char *strrchr_tail(const char *str,
                           int ch,
                           const char *seen)
     if (*str == 0)
       return seen;
     if (*str == ch)
       return strchr_tail(str + 1, str, ch);
     return strchr_tail(str + 1, seen, ch);

  const char *strchr_rec(const char *str, int ch)
     return strchr_tail(str, ch, 0);
That kind of thing makes me feel that null terminated strings are the shiznit compared to some clunky header + data or whatever.

The only problem is that NUL-terminated strings are so darn easy to program with, everyone who comes into contact with them wants their share of the action, instead of using someone else's well-tested code.

As we speak, someone is likely writing yet another loop that builds up a NUL-terminated string directly, and will forget to write the final NUL in some case, or go past the end of the buffer in another.

Want to break a string into tokens on delimiters? Just overwrite the delimiters with null bytes and you're done. (Oops, the caller didn't expect their string mutated. Worse, it was a literal ...).

If you make programmers maintain a length field followed by data, or some such thing, they will quickly go off scampering for library functions to everything.

What we should never do is compare unencapsulated NUL-terminated string manipulation with the manipulation of another representation using a robust, debugged library.

Compare open-coded to open-coded, wrapped to wrapped.

NUL-terminated strings are a poor representation, feature-poor and directly responsible for a great many entirely avoidable bugs.

I much prefer the approach used by languages like Rust, where &str and &[T] are (pointer, length) rather than just a pointer.

Usage can be identical to your classic NUL-terminated string, depending on the language. (It is in Rust, with CStr.)

There is only one potential disadvantage: pointers use twice as much memory; but in practice, this disadvantage is entirely theoretical.

But there are several serious advantages: it lets you represent zero values (U+0000 for &str, integer zero for &[u8], &c.), take slices that aren’t suffixes (reducing either code complexity or memory usage, sometimes drastically), calculate the string length in constant time (and that’s regularly a big deal), and avoid a pitfall that has been the cause of many critical security vulnerabilities (buffer overrun).

Want to break a string into tokens on delimiters? Easy, regardless of the length of the delimiter (it could be zero, two, seventeen—while with NUL-terminated strings it had to be one), without needing to worry about whether it’s safe to mutate the string.

Pointer and length representations are available in C, and we can find programs that get them wrong somewhere.

"Such and such a data structure ... in the middle of Rust" is not a discussion about string representation per se any more.

Rust must have null terminated strings at some level, because it has to integrate with platforms where null terminated strings are the API currency for text. Yet, Rust programs don't fall down because of null-terminated strings.

Likewise, if you have a pointer and length, but no language or library support for it, just a "high level assembly language", you will screw it up.

The pointer has to point to something. There will be cases where that something is a fixed buffer on the stack or in a dynamic object. Out of a hundred programmers coding their pointer and length with fixed buffers, a good many will cause overflows, and a good many will write memory-safe code that causes consequences due to silent truncation.

In C, we can cause problems like:

* returning a pointer to a stack-allocated { length, ptr } structure, which turns to garbage.

* leak dynamically-allocated { length, ptr } objects, or use them after freeing.

* making copies of a { length, ptr } structure (e.g. pass by value), and not keeping all of them in sync when ptr and/or length are updated.

* sticking a literal "string" into ptr, but then treating it as dynamic elsewhere and trying to free it

* sticking a literal "string" into ptr, and then somewhere trying to modify a character.

> You speak as though programmers scampering off for library functions to do everything is a bad thing. I consider it a wonderful thing.

No, that's not what they are saying.

They're just saying that to blame the representation for user mistakes and then present an alternative representation where user mistakes are prevented by library functions is comparing apples to oranges.

Ah yes, I did misread it there; thanks for the correction. Line deleted, the rest stands without it.

Is open-coded that much of a benefit over wrapped? With something like std::string_view you can take any substring without new allocations, not just suffixes.

I love screwing around with NULL terminated strings, but doing it correctly is a lot harder. NULL and the delimiter might not be the same length if the string is UTF-8 encoded for example.

IMO: if it’s not a hobby project just use a library for slicing and manipulating strings. Really you should think twice before doing it at all.

Oh, I still hear ads giving URLs with inexplicable backslashes in them. "Visit our website at foo.com backslash baz"

I think few people know the difference, so they say backslash when they mean forward slash.

The wrong way round always seemed more logical to me: you write either slash from top to bottom, so for a forward slash your pen moves backwards and vice versa.

It seems normal to me to speak of the angle of handwriting as slanting forwards or backwards, corresponding to slash and backslash. Most people’s handwriting is angled either upright or slightly forwards—leaning into the writing; backwards is uncommon.

The way I remember it is to think of the slash as a little person facing to the right (the direction that text flows). Is the person leaning forward or backward?

How often do you use this device?

I feel like I hardly understand the idea, because a / and \ is just as distinct as a b and a p. We don't make mnemonics for those?

Naturally, the occurrence rate of b and p in people's experience, as well as the circumstances in which they are semantically distinct, are also relevant...

Yeah, honestly not very often. Mostly they're second nature to me. I mainly just think of it when there's a topic of / vs \ like in this thread.

You now have me wondering whether I have, at any point in my lifetime, written a backslash with a pen. It's an interesting thought that possibly I never have.

Never had to write down a DOS or Windows path?

I still hear ads giving URLs with inexplicable backslashes in them.

Can confirm. Just yesterday on NPR.

The average person doesn't know the difference between a regular forward slash and a back slash. Why should they?

> Why should they?

The average person does know (or at least should know) the difference between an exclamation mark and a question mark; it's part of literacy. One could argue that knowing the difference between a forward slash and a backslash is part of digital literacy (and that it's a worthy goal for the average person to have digital literacy).

An exclamation point and question mark look distinctively different. A forward slash and backward slash may as well be identical.

The story goes that Microsoft advocated for a UNIX like path separator for DOS, and it was probably IBM that pushed for the backslash so that they wouldn’t be confused with the option switch character.

In any event, blaming Windows for this is a bit silly.

Dip your ink in Microsoft hate and blame them for something completely unrelated to them: check.

Seems like the real problem is using a regex to parse a URL instead of using a URL parser!

... this was within a url parsing library, if I understood correctly? Either way, urls need to be parsed somehow, and using a regex is a reasonable strategy (urls are a regular language) for implementing a url parsing library.

    new URL("https://user:pass@xdavidhu.me\\test.corp.google.com:8080/path/to/something?param=value#hash")

According to https://developer.mozilla.org/en-US/docs/Web/API/URL that works everywhere except on Internet Explorer, so keeping with the theme of the original comment, we can blame Windows again.

This was a really enjoyable, thorough and readable write-up. Thanks, and nice work!

I'll add that I usually have trouble following these, but it was a breeze here. Glad I read it.

This is the exact comment I would have written if @chatmasta had not already posted it.

Thanks for the write-up.

$6,000 seems low for such a severe bug across almost all Google web properties. I would expect an order of magnitude larger.

Google's bug bounty has a $7.5K max on "Vulnerabilities giving access to client or authenticated session of the logged-in victim." See here:


I guess they were awarded $6K instead of the full because they didn't demonstrate it against more substantial properties. But either way the ceiling is quite low on these classes of vulnerabilities.

That payout is indeed quite low, especially because access to an authenticated client session could lead to compromised Google servers when the vulnerability affects web services used by Google employees.

Not really. Anything prod related is behind VPN.

And it's not clear that such an exploit exists. Just because HenHouse happened to use the regex that way doesn't mean other places are, and that a similar exploit exists in the other products.

I'm curious to know, but I'd think that the author would've tried to also exploit it in those other apps, and the fact that it's not documented here may be because they didn't find any issues with the other ones? It's a pretty generic regex that has many uses, and may be used for other stuff in other places.

Why? Who else is going to outbid Google for this serverside URL parsing bug?

The "competition" that Google is bidding against is all the other ways that people capable of finding this sort of bug might choose to spend their free time.

If this were Android drive-by RCE, I'd agree with you. But they're bidding for the attention of the kinds of researchers who find URL whitelist validation bugs. $6k is a strong bounty for that kind of work.

(The work is important! But people who find web bugs like this tend to make their money on volume, not on a couple major hits a year, like the RCE researchers do.)

> But they're bidding for the attention of the kinds of researchers who find URL whitelist validation bugs.

No they’re not. Just because this happened to be a URL whitelist bug doesn’t mean that this is the primary work that the researcher who found the bug does. Did a URL whitelist validation researcher find this, or did someone researching user auth access find this bug? If Google knew exactly who they wanted to find this bug, then Google would have already found it themselves.

Yes, it does. Vulnerability research is a big field and it is specialized. There are different kinds of vulnerability researchers and, as an imperfect but generally useful rule, the ones doing RCE work aren't looking for web bugs. And the people running the Google bounty program are generally well plugged into the vulnerability research community; they are way, way, way, way, way, way more credible than the median HN comment.

It also just doesn't make sense as an objection, because it's not like the bounty rates on reliable clientside RCEs in browsers and mobile devices is capped at $6k, and everyone doing this work knows it.

As we’ve seen in the smartphone world: the competitors are people who would rather the bug stay secret so that they can exploit it.

Who else is going to outbid Google for this serverside URL parsing bug?

The black hats. $6,000 from Google, or $600,000 from some shadowy company in UTC+02. Google makes these choices too easy.

Shadowy companies in UTC+02 don't even pay $600k for Android RCE. They pay zero for this stuff.

Ey man, Africa is also in UTC+02 and we are not known to be as shady as that one UTC+02 country

Wouldn't this be considered clientside?

Technically yes, but in a practical sense the distinction is between bugs that can be fixed globally and instantaneously by pushing a server fix and those that can't.

Why not offer only a dollar then?

You joke, but lots of places that don't offer cash bounties still get good bounty submissions. Google offers bounties because they can, and to attract more submissions. Like I said below: for this kind of work, $6k is a strong bounty. Not a lot of places will pay you thousands of dollars for a URL whitelist bypass bug.


Who Google and other service providers are competing against are grayweb exploit dealers. That's the real market, and every popular site, api, and program is subject, regardless if they're on H1 or not.

Sure there's some whitehat types who'd only tell the companies directly. But the rest.... Well.

No. People who find drive-by RCE vulnerabilities are in a market with "grayweb exploit dealers". People who find URL whitelist validation bugs are not. The distinction is important and directly reflected in the bounty rates for the different kinds of bugs.

And again, it's not a status thing. You can probably make quite a lot of money on bugs like this; you simply find a lot more of them than anyone generates RCE exploits.

That was my exact thought when reading the payout amount.

Something something regex and two problems. \s

Seriously though, nice find! It's often surprising how tiny little curiosities can lead down the rabbit hole to such interesting problems. In this case, it was a fairly logical flawed assumption, but the mess of regex made it just hidden enough to be tough to spot.

This is one area where bug bounty programs can really outperform an internal security function. There are others, generally, but this deep rabbit hole exploration is one of the best.

> \s

Too soon.

I feel like Google is not the only company parsing domains to check security origins...

If a backslash can end a security origin, I am going to guess there are a lot of other vulnerable bits of code...

One more reason joining and splitting strings with security impact is a bad idea.

> One more reason joining and splitting strings with security impact is a bad idea.

Doing it in a standard/shared library is more secure than bespoke code on each page, since at the very least as we witness here it can all be updated together as needed. You can add additional abstraction, but someone eventually has to do this work.

Even web browsers themselves aren't immune from bugs in "joining and splitting strings with security impact" for URL comprehension. All you can really do is try not to re-write known secure code.

Plus this buggy RegX was from the standard on URL parsing:


> I put this regex in www.debuggex.com. This is a really cool website

What a great find.

There is also regex101.com. It is my favourite personal.

It highlights individual parts of the regex and provides human explanation, e.g. https://regex101.com/r/xKG7pE/1

This seems pretty brittle. Did anyone look into Closure to see what the motivation for the regexp based checking is? There's a browser native URL splitting function available to JS after all.

But I guess Closure dates from the IE6 era where trusting browsers for correctness was probably not a good idea..

> There's a browser native URL splitting function available to JS after all.

It's the URL object (https://developer.mozilla.org/en-US/docs/Web/API/URL), isn't it? We tried using it at work... and found out that, as always seems to be the case, it's available everywhere except on Internet Explorer.

Again, whitelist vs blacklist. Blacklist loses.

    blacklist: [^/?#]+
    whitelist: [a-z](?:[0-9a-z]|-(?!(?:$|\.)))*(?:\.[a-z](?:[0-9a-z]|-(?!(?:$|\.)))*)*

Your whitelist loses in this situation? It accepts local domains (e.g. ‘a’) but not fully qualified domain names (e.g. ‘myhost.example.com.’ nor IP addresses. Generally when checking origin, you also wish to check port (e.g. :8000) which the Google code works for; and @username:password may need thought for some mostly obsolete purposes. That said, the whitelist does avoid a bunch of other problems (control chars, Unicode, high bit, etc).

Also, your whitelist is vulnerable to redos DoS attack - according to http://redos-checker.surge.sh/

I’m just trying to point out that securing via Regexps is never simple, even for the very skilled. If someone posts a comment with a Regexp, it is often broken and one can usually find problems with it.

Edit: and what are the $ in there for? They look redundant.

This is not meant as a complete solution for all the worlds ills, it's just an illustration of the concept and how proper matching would avoid the issue in the OP. Adding \ to the exclusion list just plasters over the pile of garbage that is the original "magic" regexp.

$ are there because you can't have '-' at the end of the label. You'd need more chars there if you want to use this as a part of full URL parser. Or just use browser's facilities for parsing URLs.

Also noone cares if you can DoS your own browser tab/iframe. You probably can't anyway, because browser has other mechanisms to prevent that. But yes, you'd need to limit this too in real code, to the max length of the label part of the domain name.

It was not meant to be complete, just for the domain part. Just like the blacklist regexp was extracted from the domain part too. Local domains are ok.

I love that the fix is simply a quadruple-backslash (aka an "actual backslash, for real this time" per https://xkcd.com/1638/)

More than a couple of times have I chosen a different solution to a problem (especially shell scripts) to avoid having to double backslashes more than once.

Because fuck trying to explain how that code actually works to anybody else, or myself a year from now.

Thanks for the xkcd link, I loved it!

URL parsing is a mess. The specification is poorly defined, and different libraries and/or servers handle them differently. This is one kind of attack; many SSRFs exist which leverage that weakness.

The regex approach, btw, is even worse. Parse your URLs before doing things with them.

IMO the spec is well defined and simple. What's broken are the browsers. Permitting a backslash as a delimiter rather obviously breaks the syntax. No surprise that security issues are introduced when the browsers don't follow an otherwise simple standard, while everybody else does.

That's what makes this crazy. Generally I agree that people shouldn't abuse regular expressions for parsing security-critical data. But it just so happens that a standards-compliant URL can be trivially parsed using a naive regular expression that relies on the semantics of the gen-delimiters character set in the standard. In fact, it's really hard to break such a naive regular expression even when we move beyond the domain of standards compliance. One glaring exception is anything that introduces ambiguity between the userinfo authority section and the path, which the browsers have chosen to do. sigh

A new RFC should be issued that updates the syntax to include the semantics which browsers give to backslash. Among other things, this will help to publicize the issue.

The browsers are implementing the WhatWG URL spec, not the RFC, and it mentions \ is valid for "special" URLs (scheme is http/https, ftp, file, ws/wss): https://url.spec.whatwg.org/#hostname-state

There’s some comedy in that URL standard:

Calculating file origin: "Unfortunate as it is, this is left as an exercise to the reader.”

3.7. Host equivalence: “To determine whether a host A equals B, return true if A is B, and false otherwise.”

“The application/x-www-form-urlencoded format is in many ways an aberrant monstrosity”

Great find! True hacker news worthy.

Such a relief some really smart and persistent people are not evil.

Though you have to wonder, if there wasn't a bounty...

That regex explorer to map is pretty cool.

Good idea - I know this would work on some places. I never thought of it like this.

Its my opinion that when a regex gets this complicated it should be hand-coded as a state machine parser.

Complicated != complex ... it’s a straightforward expression.

My experience is regex seems like line noise to people. If you show them the multi line commented form they get it, and can then appreciate the one liners.

I’ve thought a neat IDE trick would be doing that expansion, into this visualization followed by multi line text you edit, and a column to paste sample text into and see matching / back-refs / captures happen. Stays one liner in code.

Standalone tools do it, slicker if inline.

State machine parsers are more difficult for most people to understand.

Thanks for the write up. Really interesting and approachable for non-security people.

And.. it's down..

It looks like regex originally came from Appendix B of RFC 3986 so this type of vulnerability may exist in other places. According to the RFC, the regex is "for breaking-down a well-formed URI reference into its components." In this case, the URL was invalid because the host component cannot contain a backslash unless percent encoded.

Well done!

Guidelines | FAQ | Lists | API | Security | Legal | Apply to YC | Contact