
The unexpected Google wide domain check bypass - notRobot
https://bugs.xdavidhu.me/google/2020/03/08/the-unexpected-google-wide-domain-check-bypass/
======
lioeters
Many joyous moments of genuine positive hackery. Notable for me were:

- At several points, unusual things were done "for some reason", following an
intuition.

- The one character that does the trick was discovered by brute-forcing the
browser's URL parsing logic.

- After finding that this little trick works, other related Google apps were
explored (for "some reason") to see if the trick is generally applicable,
which indeed it was.

- Deep diving into Chromium source to confirm why the trick works.

- How nicely everything is tied up at the end, seeing the result of his work
have an effect deep in a Closure library commonly used at Google.

\---

Loved this part (my emphasis):

> I was always “afraid” to dig this deep, and would never have thought about
> looking into the source code of a browser, but after browsing around a bit,
> you realise that _this code is also just code, and you can understand how it
> works_. This is super interesting and can be really helpful in many
> situations.

------
robocat
The impact of this bug is probably very wide, including server-side - not
just Google or Closure - because, as Someone1234 said, this buggy regex comes
from RFC 3986, the standard on URL parsing:
[https://tools.ietf.org/html/rfc3986#appendix-B](https://tools.ietf.org/html/rfc3986#appendix-B)

Edit 3: tl;dr: [/?#\] in browsers (including backslash) ends the
host/authority but the RFC only specifies [/?#] as ending characters, so for
example
[https://xdavidhu.me\.corp.google.com/blah](https://xdavidhu.me\.corp.google.com/blah)
is recognised as host `xdavidhu.me` by the browser, and host
`xdavidhu.me\.corp.google.com` by the RFC - so host name whitelisting or
matching can be borked for security purposes.
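
A minimal sketch of the divergence (the regex is the one from RFC 3986
Appendix B; the `URL` object gives the WHATWG behaviour of browsers and Node):

    // RFC 3986 Appendix B: the authority group ([^/?#]*) stops only at
    // "/", "?" or "#", so a backslash stays inside the "host".
    const rfc3986 = /^(([^:\/?#]+):)?(\/\/([^\/?#]*))?([^?#]*)(\?([^#]*))?(#(.*))?/;

    const url = "https://xdavidhu.me\\.corp.google.com/blah";
    url.match(rfc3986)?.[4];  // "xdavidhu.me\.corp.google.com"  (the RFC's view)

    // A WHATWG parser treats "\" like "/" for http(s), so the browser's view
    // of the very same string is a much shorter host:
    new URL(url).hostname;    // "xdavidhu.me"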

Edit: maybe do a search of GitHub for this regexp, and see what significant
finds there are. Certainly bad actors are already doing so...

Edit 2: presumably there will be a gold-rush to find other bug bounties for
the same underlying issue in other software or services!

Also, Google’s $6000 payout seems extremely disappointing: Google should
pay out on the total impact of the bug that they measure (not just the one
place he happened to find). Why should the reporter have to investigate the
total impact of the bug on Google in order to get a fair payout?

~~~
wglb
>Google should pay out on the total impact of the bug that they measure

And what is that total impact?

~~~
robocat
As said: only Google has the resources and motivation to measure the potential
impact. David already identified a few places the bug might affect:

“I found this regex in the Google Cloud Console’s JS, Google Actions Console’s
JS, in YouTube Studio, in myaccount.google.com (!) and even in some Google
Android Apps. A day later I even found this line in the Google Corp Login Page
(login.corp.google.com)”.

~~~
wglb
>only Google has the resources and motivation to measure the potential impact.

And apparently they have judged the impact less than folks who think $6000 is
too small.

------
deathanatos
One of these days, I'm building an integration test suite of all the weird
corner cases in HTTP and whatnot to throw at libraries.

God help the poor engineer here. Let's say he or she (reasonably!) looked up
the grammar for a URI in RFC 3986[1], the standard that defines them. Well,
according to that grammar, the browser is wrong: backslash isn't valid in that
position, and the exploiting string isn't a valid production of the URI
grammar.

But of course, the browser has its own standard — the WHATWG URL spec — which
doesn't even define a grammar for a URL, but instead defines a program in
pseudocode to parse them. Normally, \ isn't valid after an authority … unless
the scheme is http or https or one of several other Chosen One schemes.[2]

Uuugh. We've wrought this upon ourselves.
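
A small sketch of that scheme-dependence, assuming a WHATWG-conformant parser
(the `URL` object in browsers or Node; the hostnames are made up):

    // For a "special" scheme (http, https, ws, wss, ftp, file), the WHATWG
    // parser treats "\" like "/", so it terminates the host:
    new URL("https://example.com\\evil.test/x").hostname;  // "example.com"

    // For a non-special scheme, "\" is a forbidden host code point, so per
    // the spec the same shape of input should be rejected outright:
    try {
      new URL("foo://example.com\\evil.test/x");
    } catch (e) {
      console.log("rejected:", (e as Error).message);
    }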

Also included in this test suite would be the wild and wonderful things you
can type into a browser's URL bar that result in malformed HTTP requests
hitting the wire… AFAIK, WHATWG doesn't yet have their own copy of HTTP…

[1]:
[https://tools.ietf.org/html/rfc3986#section-3](https://tools.ietf.org/html/rfc3986#section-3)

[2]: [https://url.spec.whatwg.org/#hostname-state](https://url.spec.whatwg.org/#hostname-state)

Maybe one should argue that the regex should not have accepted the URI, since
it isn't valid according to the RFC.

~~~
richdougherty
As someone who has also worked on various bits of HTTP and URI parsing, I
have wished for such a test suite - or even that these RFCs and specs would
come bundled with such a suite.

There are a lot of weird cases to consider. For example, how many tools
correctly handle character encoding in HTTP headers?

[https://tools.ietf.org/html/rfc8187#section-3.2.3](https://tools.ietf.org/html/rfc8187#section-3.2.3)
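
As an illustration, here is a minimal sketch of decoding an RFC 8187 ext-value
(the `filename*=UTF-8''...` form seen in Content-Disposition headers). The
helper name is just for illustration, and there is no validation beyond the
charset check:

    // e.g. Content-Disposition: attachment; filename*=UTF-8''%e2%82%ac%20rates
    function decodeExtValue(extValue: string): string {
      // ext-value = charset "'" [ language ] "'" value-chars
      const [charset, _language, encoded] = extValue.split("'");
      if (charset.toUpperCase() !== "UTF-8") throw new Error("unsupported charset");
      return decodeURIComponent(encoded);  // percent-decoded as UTF-8
    }

    decodeExtValue("UTF-8''%e2%82%ac%20rates");  // "€ rates"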

~~~
tal8d
I attacked it from both directions: an arguably supervised dataset of URLs
(hundreds of GB from various internet survey projects), and unit testing the
insane Markov gibberish the parser makes when you cram random data where it
doesn't belong and run it in reverse. The dataset was way less helpful for a
strictly conforming design, but I imagine it would be useful for a mostly
correct parser.

------
tialaramex
Windows again. "Let's use a backslash instead of a slash in paths" has to be
up there with "NUL-terminated strings" in terms of tiny bad ideas from long
ago that are still haunting us.

Of course it's not as though Google had a choice at the outset, I'm sure some
non-trivial number of web sites magically didn't work (and maybe still don't
work) if you insist that
[https://www.example.com\works\fine\on\my\pc](https://www.example.com\works\fine\on\my\pc)
doesn't do what the person writing it expects... I think we can blame some
early Internet Explorer build for first having this work (I can't see Netscape
allowing it) and then it becomes a choice whether to be a purist or sigh and
move on once IE gets popular and real people do it.

~~~
kazinator
NUL-terminated strings are a great representation. We can take a suffix of a
non-empty NUL-terminated string just by incrementing the pointer, which is
great for recursion. A pointer to a NUL is itself a valid string.

E.g. a tail-recursive strchr:

    
    
      #include <stddef.h>  /* for NULL */

      /* Tail-recursive strchr: first occurrence of ch, or NULL if absent. */
      const char *strchr_rec(const char *str, int ch)
      {
         if (*str == 0)
           return NULL;
         if (*str == ch)
           return str;
         return strchr_rec(str + 1, ch);
      }
    

strrchr (rightmost match):

    
    
      /* Tail-recursive strrchr helper: "seen" carries the rightmost match so far. */
      const char *strrchr_tail(const char *str,
                               int ch,
                               const char *seen)
      {
         if (*str == 0)
           return seen;
         if (*str == ch)
           return strrchr_tail(str + 1, ch, str);
         return strrchr_tail(str + 1, ch, seen);
      }

      const char *strrchr_rec(const char *str, int ch)
      {
         return strrchr_tail(str, ch, NULL);
      }
    

That kind of thing makes me feel that null terminated strings are the shiznit
compared to some clunky header + data or whatever.

The only problem is that NUL-terminated strings are so darn easy to program
with, everyone who comes into contact with them wants their share of the
action, instead of using someone else's well-tested code.

As we speak, someone is likely writing yet another loop that builds up a NUL-
terminated string directly, and will forget to write the final NUL in some
case, or go past the end of the buffer in another.

Want to break a string into tokens on delimiters? Just overwrite the
delimiters with null bytes and you're done. (Oops, the caller didn't expect
their string mutated. Worse, it was a literal ...).

If you make programmers maintain a length field followed by data, or some such
thing, they will quickly go off scampering for library functions to do
everything.

What we should never do is compare unencapsulated NUL-terminated string
manipulation with the manipulation of another representation using a robust,
debugged library.

Compare open-coded to open-coded, wrapped to wrapped.

~~~
chrismorgan
NUL-terminated strings are a poor representation, feature-poor and directly
responsible for a great many entirely avoidable bugs.

I much prefer the approach used by languages like Rust, where &str and &[T]
are (pointer, length) rather than just a pointer.

Usage can be identical to your classic NUL-terminated string, depending on the
language. (It is in Rust, with CStr.)

There is only one potential disadvantage: a (pointer, length) pair uses twice
as much memory as a bare pointer; but in practice, this disadvantage is
entirely theoretical.

But there are several serious advantages: it lets you represent zero values
(U+0000 for &str, integer zero for &[u8], _&c._), take slices that aren’t
suffixes (reducing either code complexity or memory usage, sometimes
drastically), calculate the string length in constant time (and that’s
regularly a big deal), and avoid a pitfall that has been the cause of many
critical security vulnerabilities (buffer overrun).

Want to break a string into tokens on delimiters? Easy, regardless of the
length of the delimiter (it could be zero, two, seventeen—while with NUL-
terminated strings it had to be one), without needing to worry about whether
it’s safe to mutate the string.

~~~
clarry
> You speak as though programmers scampering off for library functions to do
> everything is a bad thing. I consider it a wonderful thing.

No, that's not what they are saying.

They're just saying that to blame the representation for user mistakes and
then present an alternative representation where user mistakes are prevented
by library functions is comparing apples to oranges.

~~~
chrismorgan
Ah yes, I did misread it there; thanks for the correction. Line deleted, the
rest stands without it.

------
chatmasta
This was a really enjoyable, thorough and readable write-up. Thanks, and nice
work!

~~~
hnick
I'll add that I usually have trouble following these, but it was a breeze
here. Glad I read it.

------
ipsum2
$6,000 seems low for such a severe bug across almost all Google web
properties. I would expect an order of magnitude larger.

~~~
tptacek
Why? Who else is going to outbid Google for this serverside URL parsing bug?

~~~
mjw1007
The "competition" that Google is bidding against is all the other ways that
people capable of finding this sort of bug might choose to spend their free
time.

~~~
tptacek
If this were Android drive-by RCE, I'd agree with you. But they're bidding for
the attention of the kinds of researchers who find URL whitelist validation
bugs. $6k is a strong bounty for that kind of work.

(The work is important! But people who find web bugs like this tend to make
their money on volume, not on a couple major hits a year, like the RCE
researchers do.)

~~~
mygo
> But they're bidding for the attention of the kinds of researchers who find
> URL whitelist validation bugs.

No they’re not. Just because this happened to be a URL whitelist bug doesn’t
mean that this is the primary work that the researcher who found the bug does.
Did a URL whitelist validation researcher find this, or did someone
researching user auth access find this bug? If Google knew exactly who they
wanted to find this bug, then Google would have already found it themselves.

~~~
tptacek
Yes, it does. Vulnerability research is a big field and it is specialized.
There are different kinds of vulnerability researchers and, as an imperfect
but generally useful rule, the ones doing RCE work aren't looking for web
bugs. And the people running the Google bounty program are generally well
plugged into the vulnerability research community; they are way, way, way,
way, way, way more credible than the median HN comment.

~~~
tptacek
It also just doesn't make sense as an objection, because it's not like the
bounty rates on reliable clientside RCEs in browsers and mobile devices are
capped at $6k, and everyone doing this work knows it.

------
zeta0134
Something something regex and two problems. \s

Seriously though, _nice find!_ It's often surprising how tiny little
curiosities can lead down the rabbit hole to such interesting problems. In
this case, it was a fairly logical flawed assumption, but the mess of regex
made it just hidden enough to be tough to spot.

~~~
jcims
This is one area where bug bounty programs can really outperform an internal
security function. There are others, generally, but this deep rabbit hole
exploration is one of the best.

------
londons_explore
I feel like Google is not the only company parsing domains to check security
origins...

If a backslash can end a security origin, I am going to guess there are a lot
of other vulnerable bits of code...

One more reason joining and splitting strings with security impact is a bad
idea.

~~~
Someone1234
> One more reason joining and splitting strings with security impact is a bad
> idea.

Doing it in a standard/shared library is more secure than bespoke code on each
page, since, at the very least, as we witness here, it can all be updated
together as needed. You can add additional abstraction, but someone eventually
has to do this work.

Even web browsers themselves aren't immune from bugs in "joining and splitting
strings with security impact" for URL comprehension. All you can really do is
try not to re-write known secure code.

Plus, this buggy regex was from the standard on URL parsing:

[https://tools.ietf.org/html/rfc3986#appendix-B](https://tools.ietf.org/html/rfc3986#appendix-B)

------
eternalban
> I put this regex in www.debuggex.com. This is a really cool website

What a great find.

~~~
gajus
There is also regex101.com. It is my personal favourite.

It highlights individual parts of the regex and provides a human-readable
explanation, e.g. [https://regex101.com/r/xKG7pE/1](https://regex101.com/r/xKG7pE/1)

------
fulafel
This seems pretty brittle. Did anyone look into Closure to see what the
motivation for the regexp-based checking is? There's a browser-native URL
splitting function available to JS, after all.

But I guess Closure dates from the IE6 era, when trusting browsers for
correctness was probably not a good idea...

~~~
cesarb
> There's a browser native URL splitting function available to JS after all.

It's the URL object
([https://developer.mozilla.org/en-US/docs/Web/API/URL](https://developer.mozilla.org/en-US/docs/Web/API/URL)),
isn't it? We tried using it at work... and found out that, as always seems to
be the case, it's available everywhere _except on Internet Explorer_.
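
For reference, a quick sketch of what that native splitter gives you (any
WHATWG-compliant engine, i.e. everything except IE):

    const u = new URL("https://user@example.com:8080/path?q=1#frag");
    u.protocol;  // "https:"
    u.username;  // "user"
    u.hostname;  // "example.com"
    u.port;      // "8080"
    u.pathname;  // "/path"
    u.search;    // "?q=1"
    u.hash;      // "#frag"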

------
megous
Again, whitelist vs blacklist. Blacklist loses.

    
    
        blacklist: [^/?#]+
        whitelist: [a-z](?:[0-9a-z]|-(?!(?:$|\.)))*(?:\.[a-z](?:[0-9a-z]|-(?!(?:$|\.)))*)*

~~~
robocat
Your whitelist loses in this situation? It accepts local domains (e.g. ‘a’)
but not fully qualified domain names (e.g. ‘myhost.example.com.’) nor IP
addresses. Generally when checking origin, you also wish to check the port
(e.g. :8000), which the Google code handles; and @username:password may need
thought for some mostly obsolete purposes. That said, the whitelist does avoid
a bunch of other problems (control chars, Unicode, high bit, etc.).

~~~
robocat
Also, your whitelist is vulnerable to a ReDoS attack - according to
[http://redos-checker.surge.sh/](http://redos-checker.surge.sh/)

I’m just trying to point out that securing via regexps is never simple, even
for the very skilled. If someone posts a comment with a regexp, it is often
broken and one can usually find problems with it.

Edit: and what are the $ in there for? They look redundant.

~~~
megous
This is not meant as a complete solution for all the world's ills, it's just
an illustration of the concept and of how proper matching would avoid the
issue in the OP. Adding \ to the exclusion list just plasters over the pile of
garbage that is the original "magic" regexp.

The $ are there because you can't have '-' at the end of a label. You'd need
more chars there if you want to use this as part of a full URL parser. Or just
use the browser's facilities for parsing URLs.

Also, no one cares if you can DoS your own browser tab/iframe. You probably
can't anyway, because the browser has other mechanisms to prevent that. But
yes, you'd need to limit this too in real code, to the max length of the label
part of the domain name.

------
kyle-rb
I love that the fix is simply a quadruple-backslash (aka an "actual backslash,
for real this time" per [https://xkcd.com/1638/](https://xkcd.com/1638/))

~~~
hinkley
More than a couple of times have I chosen a different solution to a problem
(especially shell scripts) to avoid having to double backslashes more than
once.

Because fuck trying to explain how that code actually works to anybody else,
or myself a year from now.

------
alanfranz
URL parsing is a mess. The specification is poorly defined, and different
libraries and/or servers handle URLs differently. This is one kind of attack;
many SSRFs exist which leverage that weakness.

The regex approach, btw, is even worse. Parse your URLs before doing things
with them.

~~~
wahern
IMO the spec is well defined and simple. What's broken are the browsers.
Permitting a backslash as a delimiter rather _obviously_ breaks the syntax. No
surprise that security issues are introduced when the browsers don't follow an
otherwise simple standard, while everybody else does.

That's what makes this crazy. Generally I agree that people shouldn't abuse
regular expressions for parsing security-critical data. But it just so happens
that a standards-compliant URL can be trivially parsed using a naive regular
expression that relies on the semantics of the gen-delimiters character set in
the standard. In fact, it's really hard to break such a naive regular
expression even when we move beyond the domain of standards compliance. One
glaring exception is anything that introduces ambiguity between the userinfo
authority section and the path, which the browsers have chosen to do. _sigh_

A new RFC should be issued that updates the syntax to include the semantics
which browsers give to backslash. Among other things, this will help to
publicize the issue.

~~~
rndgermandude
The browsers are implementing the WHATWG URL spec, not the RFC, and it
mentions \ is valid for "special" URLs (scheme is http/https, ftp, file,
ws/wss): [https://url.spec.whatwg.org/#hostname-state](https://url.spec.whatwg.org/#hostname-state)

~~~
robocat
There’s some comedy in that URL standard:

Calculating file origin: “Unfortunate as it is, this is left as an exercise to
the reader.”

3.7. Host equivalence: “To determine whether a host A equals B, return true if
A is B, and false otherwise.”

“The application/x-www-form-urlencoded format is in many ways an aberrant
monstrosity”

------
yvoschaap
Great find! True hacker news worthy.

------
ck2
Such a relief some really smart and persistent people are not evil.

Though you have to wonder, if there wasn't a bounty...

------
rootsudo
That regex explorer that maps it out is pretty cool.

Good idea - I know this would work in some places. I never thought of it like
this.

------
dana321
It's my opinion that when a regex gets this complicated it should be
hand-coded as a state machine parser.

~~~
Terretta
Complicated != complex ... it’s a straightforward expression.

My experience is that regex seems like line noise to people. If you show them
the multi-line commented form they get it, and can then appreciate the
one-liners.
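
For example, one way to approximate that multi-line commented form in
JavaScript (which has no /x "verbose" flag) is to assemble the one-liner from
annotated pieces; the names here are illustrative, and the regex is the RFC
3986 Appendix B one:

    const parts = [
      "^(([^:/?#]+):)?",  // scheme
      "(//([^/?#]*))?",   // authority (userinfo, host, port)
      "([^?#]*)",         // path
      "(\\?([^#]*))?",    // query
      "(#(.*))?",         // fragment
    ];
    const uriRe = new RegExp(parts.join(""));

    "https://example.com/a?b#c".match(uriRe)?.[4];  // "example.com"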

I’ve thought a neat IDE trick would be doing that expansion: this
visualization, followed by multi-line text you edit, and a column to paste
sample text into and see matching / back-refs / captures happen. It stays a
one-liner in the code.

Standalone tools do it; it would be slicker inline.

------
maps7
Thanks for the write up. Really interesting and approachable for non-security
people.

------
terrislinenbach
And.. it's down..

~~~
JamesCoyne
Down for me too. Mirror.
[https://web.archive.org/web/20200309211517/https://bugs.xdav...](https://web.archive.org/web/20200309211517/https://bugs.xdavidhu.me/google/2020/03/08/the-unexpected-google-wide-domain-check-bypass/)

------
justlexi93
It looks like the regex originally came from Appendix B of RFC 3986, so this
type of vulnerability may exist in other places. According to the RFC, the
regex is "for breaking-down a well-formed URI reference into its components."
In this case, the URL was invalid because the host component cannot contain a
backslash unless it is percent-encoded.

------
ultrablack
Well done!

