
Hacking GitHub's Auth with Unicode's Turkish Dotless 'I' - _jg
https://eng.getwisdom.io/hacking-github-with-unicode-dotless-i/
======
BoppreH
I love Unicode, but I'm more and more coming to the conclusion that strings
are evil and should be treated as opaque byte arrays, whose only available
operation is rendering into a bounded area. I now see any other string
operation as code smell.

It's scary how much of our infrastructure relies on strings, given how few
guarantees string operations actually give. Take file names, for example. Two
visually identical file names may map to different files (because
confusables[1]), or two _different_ names map to the _same_ file (because
normalization[2]), or the ".jpg" at the end may not actually be the extension
(because right-to-left override[3]), not to mention names with newlines or
backspaces in them, and inconsistencies between operating systems.
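A minimal Python sketch of the three file-name hazards listed above. The file names are made up for illustration, and how each name is displayed depends on the renderer.

```python
import unicodedata

# Normalization: two spellings of "café.txt" that render identically.
composed = "caf\u00e9.txt"      # é as a single code point (NFC form)
decomposed = "cafe\u0301.txt"   # e followed by COMBINING ACUTE ACCENT (NFD form)

print(composed == decomposed)                                # False
print(unicodedata.normalize("NFC", decomposed) == composed)  # True

# Confusables: Cyrillic "а" (U+0430) looks like Latin "a" (U+0061).
latin = "paypal.txt"
mixed = "p\u0430ypal.txt"
print(latin == mixed)                                        # False

# RIGHT-TO-LEFT OVERRIDE (U+202E): a bidi-aware renderer may display the
# characters after it reversed, so this name can appear to end in ".jpg".
disguised = "photo\u202egpj.exe"
print(disguised.endswith(".exe"))                            # True
```

Byte-for-byte comparison treats all of these as distinct strings, which is exactly why visually identical names can point at different files.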

I would go as far as blaming our overreliance on strings for all the injection
attacks we see (XSS, SQL, command, etc).

[1]
[https://unicode.org/cldr/utility/confusables.jsp](https://unicode.org/cldr/utility/confusables.jsp)

[2]
[https://developer.apple.com/library/archive/qa/qa1173/_index...](https://developer.apple.com/library/archive/qa/qa1173/_index.html)

[3] [https://krebsonsecurity.com/2011/09/right-to-left-override-a...](https://krebsonsecurity.com/2011/09/right-to-left-override-aids-email-attacks/)

~~~
thaumasiotes
I want to note a separate issue of defensive coding that comes up in the
writeup:

> GitHub's forgot password feature could be compromised because the system
> lowercased the provided email address and compared it to the email address
> stored in the user database. If there was a match, GitHub would send the
> reset password link to the email address provided by the attacker

The logical flow is:

1\. Get the email address from the forgot-password request.

2\. Get the email address from the database for the same account.

3\. Check whether they match.

4a. If not, we're under attack -- refuse the request.

4b. If so, all is well -- send a password reset to the email address.

Of course, we know the email address twice -- we asked the user for it during
the password reset process (step 1), but we never needed to do that because we
already had an email address on file for the account. We retrieved _that_
email address in step 2. We know that the two addresses are the same, but, if
you look at the semantics behind the variables, in step 4b we're choosing one
of these two "equivalent" options, depending on which variable we use for the
email address:

1\. Send the account password to the account owner.

2\. Send the account password to a guy who doesn't know what the password is.

And these have very different risk profiles. Choosing the first option instead
of the second would have prevented this attack without needing to worry about
unicode case-translation issues. You never want to trust information you just
received from an unknown user when you already have the same information from
a more authoritative source.
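The flow above can be sketched in Python roughly as follows. This is an illustration, not GitHub's code: `Account`, `ACCOUNTS`, and `reset_recipient` are hypothetical names, and it assumes the request identifies the account by username. It compares by upper-casing, since that is the direction in which dotless ı collides with i.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Account:
    username: str
    email: str  # address verified at signup; the authoritative copy


# Hypothetical stand-in for the user database.
ACCOUNTS = {"john": Account("john", "john@github.com")}


def reset_recipient(username: str, submitted_email: str) -> Optional[str]:
    """Return the address a reset link should go to, or None to refuse."""
    account = ACCOUNTS.get(username)          # step 2
    if account is None:
        return None
    # Step 3: a lossy, case-insensitive comparison.
    if submitted_email.upper() != account.email.upper():
        return None                           # step 4a: refuse
    # Step 4b: send to the stored address, never to submitted_email.
    return account.email


# The attacker's address passes the comparison, but the link still goes
# to the owner's stored address.
print(reset_recipient("john", "john@g\u0131thub.com"))  # john@github.com
```

Returning `submitted_email` on the last line of the function would be the second, riskier option described above.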

~~~
2T1Qka0rEiPr
I'm not sure I fully understand. What do you mean by step 2? If I entered
"myemaıl@example.com" into the reset field, are you saying that step 2 _would_
be the process of doing some normalization to try to find a matching account?
If I reset a password, don't I _only_ provide an email address by means of
doing so? Therefore, doesn't the service merely attempt to match an email to
an existing account within the DB?

I believe I understand the rest (the take-away being, _however_ you match A to
B, send the reset email to the email address stored in the DB?), just not sure
about the flow beforehand.

~~~
thaumasiotes
I reply separately to observe that the flow you describe is bugged in a more
obvious way: if you ask only for an email address, and then discover the
related account by normalizing that address before doing a database lookup,
it's a serious error to then send the reset email (which controls an account
you looked up using the _normalized_ address) to the original address. You
found the account by looking up a normalized address; the original address
isn't even known to be associated with the account.

In that case, there are three options:

1\. Send the reset email to the address you pulled from the database.
(correct)

2\. Send the reset email to the normalized attacker-provided address. (wrong
but "probably fine"; this is the bug I was talking about in the first place)

3\. Send the reset email to the original, non-normalized attacker-provided
address. (wrong and definitely a problem)
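A small Python sketch makes the three options concrete. The names `STORED_BY_KEY`, `lookup_key`, and `reset_candidates` are hypothetical, and the upper-casing normalization is deliberately naive so that it reproduces the collision.

```python
# Hypothetical index: normalized lookup key -> address stored at signup.
STORED_BY_KEY = {"JOHN@GITHUB.COM": "john@github.com"}


def lookup_key(address: str) -> str:
    """Deliberately naive normalization, chosen to reproduce the collision."""
    return address.upper()


def reset_candidates(submitted: str) -> dict:
    """Return the three possible recipients, or {} if no account matches."""
    key = lookup_key(submitted)
    stored = STORED_BY_KEY.get(key)
    if stored is None:
        return {}
    return {
        "option_1_stored": stored,       # correct
        "option_2_normalized": key,      # wrong but "probably fine"
        "option_3_original": submitted,  # wrong: attacker-controlled
    }


candidates = reset_candidates("john@g\u0131thub.com")
for name, address in candidates.items():
    print(name, address)
```

Only the third value still contains the dotless ı, so only the third option delivers the reset link to a domain the attacker can register.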

------
deathanatos
This blog sets opacity: 0 (fully invisible) on the entire content, then fails
to unset that CSS with JS, b/c the JS crashes if you block cookies.

> _because the system lowercased the provided email address and compared it to
> the email address stored in the user database._

While sending the email to the attacker-provided email, instead of the one in
the database, is bad… lowercasing emails is _also_ not valid. The lookup
should never have matched in the first place.

(It's slightly more complicated: to an extent, the case of the domain name
doesn't matter, ignoring non-ASCII characters — I have no idea what they do.
But the local part — the portion before the @ — is case sensitive. A server is
free to ignore that, and map multiple local parts to the same mailbox
internally¹, and many do, but the sender cannot make that assumption.)

¹or do other weird things, like ignore dots, or +extensions, etc.
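As a rough Python sketch of that rule: lowercase only the ASCII letters of the domain and leave the local part exactly as the user typed it. The function name `normalize_for_delivery` is made up here, and non-ASCII domains are passed through untouched because they need proper IDNA handling, which this sketch does not attempt.

```python
def normalize_for_delivery(address: str) -> str:
    """Lowercase the ASCII letters of the domain; never touch the local part."""
    local, sep, domain = address.rpartition("@")
    if not sep or not local or not domain:
        raise ValueError(f"not an email address: {address!r}")
    # Only ASCII characters are lowered; anything else is left as-is.
    lowered = "".join(c.lower() if c.isascii() else c for c in domain)
    return f"{local}@{lowered}"


print(normalize_for_delivery("John.Doe@Example.COM"))  # John.Doe@example.com
```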

~~~
iudqnolq
A few tiny nitpicks:

The local part MUST be treated as case-sensitive by senders, but servers are
discouraged from relying on that case sensitivity (Gmail, for one, ignores case):
[https://tools.ietf.org/html/rfc5321#section-2.4](https://tools.ietf.org/html/rfc5321#section-2.4)

GitHub shouldn't have normalized case, but I think it's not insane to require
lowercase in the first place, so long as you don't convert silently.

I wouldn't call +extensions and ignoring dots weird because Gmail does that
and fairly or unfairly they set the standard for user expectations of email
nowadays.

Also, I think it's unreasonable to require full compliance with the spec. For
example, if you go around trying to give out email addresses with comments in
them ("usern(ignored_comment)ame@example.com") you'll see many things break and
I don't think that's an issue.

For a delightful example of how insane email addresses can get, here's a fully
compliant regex to validate one:

(?:(?:\r\n)?[ \t]) _(?:(?:(?:[^() <>@,;:\\\".\\[\\]
\000-\031]+(?:(?:(?:\r\n)?[ \t]
)+|\Z|(?=[\\["()<>@,;:\\\".\\[\\]]))|"(?:[^\"\r\\\\]|\\\\.|(?:(?:\r\n)?[
\t]))_"(?:(?: \r\n)?[ \t]) _)(?:\\.(?:(?:\r\n)?[ \t])_
(?:[^()<>@,;:\\\".\\[\\] \000-\031]+(?:(?:( ?:\r\n)?[
\t])+|\Z|(?=[\\["()<>@,;:\\\".\\[\\]]))|"(?:[^\"\r\\\\]|\\\\.|(?:(?:\r\n)?[
\t])) _" (?:(?:\r\n)?[ \t])_)) _@(?:(?:\r\n)?[ \t])_ (?:[^()<>@,;:\\\".\\[\\]
\000-\0 31]+(?:(?:(?:\r\n)?[
\t])+|\Z|(?=[\\["()<>@,;:\\\".\\[\\]]))|\\[([^\\[\\]\r\\\\]|\\\\.) _\
](?:(?:\r\n)?[ \t])_ )(?:\\.(?:(?:\r\n)?[ \t]) _(?:[^() <>@,;:\\\".\\[\\]
\000-\031]+ (?:(?:(?:\r\n)?[
\t])+|\Z|(?=[\\["()<>@,;:\\\".\\[\\]]))|\\[([^\\[\\]\r\\\\]|\\\\.)_\\](?:
(?:\r\n)?[ \t]) _))_ |(?:[^()<>@,;:\\\".\\[\\] \000-\031]+(?:(?:(?:\r\n)?[
\t])+|\Z |(?=[\\["()<>@,;:\\\".\\[\\]]))|"(?:[^\"\r\\\\]|\\\\.|(?:(?:\r\n)?[
\t])) _" (?:(?:\r\n) ?[ \t])_) _\ <(?:(?:\r\n)?[
\t])_(?:@(?:[^()<>@,;:\\\".\\[\\] \000-\031]+(?:(?:(?:\ r\n)?[
\t])+|\Z|(?=[\\["()<>@,;:\\\".\\[\\]]))|\\[([^\\[\\]\r\\\\]|\\\\.)
_\\](?:(?:\r\n)?[ \t])_ )(?:\\.(?:(?:\r\n)?[ \t]) _(?:[^() <>@,;:\\\".\\[\\]
\000-\031]+(?:(?:(?:\r\n) ?[
\t])+|\Z|(?=[\\["()<>@,;:\\\".\\[\\]]))|\\[([^\\[\\]\r\\\\]|\\\\.)_\\](?:(?:\r\n)?[
\t] ) _))_ (?:,@(?:(?:\r\n)?[ \t]) _(?:[^() <>@,;:\\\".\\[\\]
\000-\031]+(?:(?:(?:\r\n)?[
\t])+|\Z|(?=[\\["()<>@,;:\\\".\\[\\]]))|\\[([^\\[\\]\r\\\\]|\\\\.)_\\](?:(?:\r\n)?[
\t])* )(?:\\.(?:(?:\r\n)?[ \t]) _(?:[^() <>@,;:\\\".\\[\\]
\000-\031]+(?:(?:(?:\r\n)?[ \t]
)+|\Z|(?=[\\["()<>@,;:\\\".\\[\\]]))|\\[([^\\[\\]\r\\\\]|\\\\.)_\\](?:(?:\r\n)?[
\t]) _))_ ) _:(?:(?:\r\n)?[ \t])_ )?(?:[^()<>@,;:\\\".\\[\\]
\000-\031]+(?:(?:(?:\r\n)?[ \t])+
|\Z|(?=[\\["()<>@,;:\\\".\\[\\]]))|"(?:[^\"\r\\\\]|\\\\.|(?:(?:\r\n)?[ \t]))
_" (?:(?:\r \n)?[ \t])_)(?:\\.(?:(?:\r\n)?[ \t]) _(?:[^() <>@,;:\\\".\\[\\]
\000-\031]+(?:(?:(?: \r\n)?[
\t])+|\Z|(?=[\\["()<>@,;:\\\".\\[\\]]))|"(?:[^\"\r\\\\]|\\\\.|(?:(?:\r\n)?[ \t
]))_"(?:(?:\r\n)?[ \t]) _))_ @(?:(?:\r\n)?[ \t]) _(?:[^() <>@,;:\\\".\\[\\]
\000-\031 ]+(?:(?:(?:\r\n)?[
\t])+|\Z|(?=[\\["()<>@,;:\\\".\\[\\]]))|\\[([^\\[\\]\r\\\\]|\\\\.)_\\](
?:(?:\r\n)?[ \t]) _)(?:\\.(?:(?:\r\n)?[ \t])_ (?:[^()<>@,;:\\\".\\[\\]
\000-\031]+(? :(?:(?:\r\n)?[
\t])+|\Z|(?=[\\["()<>@,;:\\\".\\[\\]]))|\\[([^\\[\\]\r\\\\]|\\\\.) _\\](?:(?
:\r\n)?[ \t])_ )) _\ >(?:(?:\r\n)?[ \t])_)|(?:[^()<>@,;:\\\".\\[\\]
\000-\031]+(?:(? :(?:\r\n)?[
\t])+|\Z|(?=[\\["()<>@,;:\\\".\\[\\]]))|"(?:[^\"\r\\\\]|\\\\.|(?:(?:\r\n)? [
\t])) _" (?:(?:\r\n)?[ \t])_) _:(?:(?:\r\n)?[ \t])_
(?:(?:(?:[^()<>@,;:\\\".\\[\\] \000-\031]+(?:(?:(?:\r\n)?[
\t])+|\Z|(?=[\\["()<>@,;:\\\".\\[\\]]))|"(?:[^\"\r\\\\]| \\\\.|(?:(?:\r\n)?[
\t])) _" (?:(?:\r\n)?[ \t])_)(?:\\.(?:(?:\r\n)?[ \t]) _(?:[^() <>
@,;:\\\".\\[\\] \000-\031]+(?:(?:(?:\r\n)?[
\t])+|\Z|(?=[\\["()<>@,;:\\\".\\[\\]]))|" (?:[^\"\r\\\\]|\\\\.|(?:(?:\r\n)?[
\t]))_"(?:(?:\r\n)?[ \t]) _))_ @(?:(?:\r\n)?[ \t] ) _(?:[^() <>@,;:\\\".\\[\\]
\000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\\["()<>@,;:\\\
".\\[\\]]))|\\[([^\\[\\]\r\\\\]|\\\\.)_\\](?:(?:\r\n)?[ \t])
_)(?:\\.(?:(?:\r\n)?[ \t])_ (? :[^()<>@,;:\\\".\\[\\]
\000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\\["()<>@,;:\\\".\\[
\\]]))|\\[([^\\[\\]\r\\\\]|\\\\.) _\\](?:(?:\r\n)?[ \t])_ )) _|(?:[^()
<>@,;:\\\".\\[\\] \000- \031]+(?:(?:(?:\r\n)?[
\t])+|\Z|(?=[\\["()<>@,;:\\\".\\[\\]]))|"(?:[^\"\r\\\\]|\\\\.|( ?:(?:\r\n)?[
\t]))_"(?:(?:\r\n)?[ \t]) _)_ \<(?:(?:\r\n)?[ \t]) _(?:@(?:[^() <>@,;
:\\\".\\[\\] \000-\031]+(?:(?:(?:\r\n)?[
\t])+|\Z|(?=[\\["()<>@,;:\\\".\\[\\]]))|\\[([
^\\[\\]\r\\\\]|\\\\.)_\\](?:(?:\r\n)?[ \t]) _)(?:\\.(?:(?:\r\n)?[ \t])_
(?:[^()<>@,;:\\\" .\\[\\] \000-\031]+(?:(?:(?:\r\n)?[
\t])+|\Z|(?=[\\["()<>@,;:\\\".\\[\\]]))|\\[([^\\[\ ]\r\\\\]|\\\\.)
_\\](?:(?:\r\n)?[ \t])_ )) _(?:,@(?:(?:\r\n)?[ \t])_ (?:[^()<>@,;:\\\".\ [\\]
\000-\031]+(?:(?:(?:\r\n)?[
\t])+|\Z|(?=[\\["()<>@,;:\\\".\\[\\]]))|\\[([^\\[\\]\ r\\\\]|\\\\.)
_\\](?:(?:\r\n)?[ \t])_ )(?:\\.(?:(?:\r\n)?[ \t]) _(?:[^() <>@,;:\\\".\\[\\]
\000-\031]+(?:(?:(?:\r\n)?[
\t])+|\Z|(?=[\\["()<>@,;:\\\".\\[\\]]))|\\[([^\\[\\]\r\\\\]
|\\\\.)_\\](?:(?:\r\n)?[ \t]) _))_ ) _:(?:(?:\r\n)?[ \t])_
)?(?:[^()<>@,;:\\\".\\[\\] \0 00-\031]+(?:(?:(?:\r\n)?[
\t])+|\Z|(?=[\\["()<>@,;:\\\".\\[\\]]))|"(?:[^\"\r\\\\]|\\\ .|(?:(?:\r\n)?[
\t])) _" (?:(?:\r\n)?[ \t])_)(?:\\.(?:(?:\r\n)?[ \t]) _(?:[^() <>@,
;:\\\".\\[\\] \000-\031]+(?:(?:(?:\r\n)?[
\t])+|\Z|(?=[\\["()<>@,;:\\\".\\[\\]]))|"(? :[^\"\r\\\\]|\\\\.|(?:(?:\r\n)?[
\t]))_"(?:(?:\r\n)?[ \t]) _))_ @(?:(?:\r\n)?[ \t])* (?:[^()<>@,;:\\\".\\[\\]
\000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\\["()<>@,;:\\\".
\\[\\]]))|\\[([^\\[\\]\r\\\\]|\\\\.) _\\](?:(?:\r\n)?[ \t])_
)(?:\\.(?:(?:\r\n)?[ \t]) _(?:[ ^() <>@,;:\\\".\\[\\]
\000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\\["()<>@,;:\\\".\\[\\]
]))|\\[([^\\[\\]\r\\\\]|\\\\.)_\\](?:(?:\r\n)?[ \t]) _))_ \>(?:(?:\r\n)?[ \t])
_)(?:,\s_ ( ?:(?:[^()<>@,;:\\\".\\[\\] \000-\031]+(?:(?:(?:\r\n)?[
\t])+|\Z|(?=[\\["()<>@,;:\\\ ".\\[\\]]))|"(?:[^\"\r\\\\]|\\\\.|(?:(?:\r\n)?[
\t])) _" (?:(?:\r\n)?[ \t])_)(?:\\.(?:( ?:\r\n)?[ \t]) _(?:[^()
<>@,;:\\\".\\[\\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[
\\["()<>@,;:\\\".\\[\\]]))|"(?:[^\"\r\\\\]|\\\\.|(?:(?:\r\n)?[
\t]))_"(?:(?:\r\n)?[ \t ]) _))_ @(?:(?:\r\n)?[ \t]) _(?:[^() <>@,;:\\\".\\[\\]
\000-\031]+(?:(?:(?:\r\n)?[ \t
])+|\Z|(?=[\\["()<>@,;:\\\".\\[\\]]))|\\[([^\\[\\]\r\\\\]|\\\\.)_\\](?:(?:\r\n)?[
\t]) _)(? :\\.(?:(?:\r\n)?[ \t])_ (?:[^()<>@,;:\\\".\\[\\]
\000-\031]+(?:(?:(?:\r\n)?[ \t])+|
\Z|(?=[\\["()<>@,;:\\\".\\[\\]]))|\\[([^\\[\\]\r\\\\]|\\\\.) _\\](?:(?:\r\n)?[
\t])_ )) _|(?: [^() <>@,;:\\\".\\[\\] \000-\031]+(?:(?:(?:\r\n)?[
\t])+|\Z|(?=[\\["()<>@,;:\\\".\\[\ ]]))|"(?:[^\"\r\\\\]|\\\\.|(?:(?:\r\n)?[
\t]))_"(?:(?:\r\n)?[ \t]) _)_ \<(?:(?:\r\n) ?[ \t]) _(?:@(?:[^()
<>@,;:\\\".\\[\\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\\["
()<>@,;:\\\".\\[\\]]))|\\[([^\\[\\]\r\\\\]|\\\\.)_\\](?:(?:\r\n)?[ \t])
_)(?:\\.(?:(?:\r\n) ?[ \t])_ (?:[^()<>@,;:\\\".\\[\\]
\000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\\["()<>
@,;:\\\".\\[\\]]))|\\[([^\\[\\]\r\\\\]|\\\\.) _\\](?:(?:\r\n)?[ \t])_ ))
_(?:,@(?:(?:\r\n)?[ \t])_ (?:[^()<>@,;:\\\".\\[\\] \000-\031]+(?:(?:(?:\r\n)?[
\t])+|\Z|(?=[\\["()<>@, ;:\\\".\\[\\]]))|\\[([^\\[\\]\r\\\\]|\\\\.)
_\\](?:(?:\r\n)?[ \t])_ )(?:\\.(?:(?:\r\n)?[ \t] ) _(?:[^() <>@,;:\\\".\\[\\]
\000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\\["()<>@,;:\\\
".\\[\\]]))|\\[([^\\[\\]\r\\\\]|\\\\.)_\\](?:(?:\r\n)?[ \t]) _))_ )
_:(?:(?:\r\n)?[ \t])_ )? (?:[^()<>@,;:\\\".\\[\\] \000-\031]+(?:(?:(?:\r\n)?[
\t])+|\Z|(?=[\\["()<>@,;:\\\". \\[\\]]))|"(?:[^\"\r\\\\]|\\\\.|(?:(?:\r\n)?[
\t])) _" (?:(?:\r\n)?[ \t])_)(?:\\.(?:(?: \r\n)?[ \t]) _(?:[^()
<>@,;:\\\".\\[\\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\\[
"()<>@,;:\\\".\\[\\]]))|"(?:[^\"\r\\\\]|\\\\.|(?:(?:\r\n)?[
\t]))_"(?:(?:\r\n)?[ \t]) _))_ @(?:(?:\r\n)?[ \t]) _(?:[^() <>@,;:\\\".\\[\\]
\000-\031]+(?:(?:(?:\r\n)?[ \t])
+|\Z|(?=[\\["()<>@,;:\\\".\\[\\]]))|\\[([^\\[\\]\r\\\\]|\\\\.)_\\](?:(?:\r\n)?[
\t]) _)(?:\ .(?:(?:\r\n)?[ \t])_ (?:[^()<>@,;:\\\".\\[\\]
\000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z
|(?=[\\["()<>@,;:\\\".\\[\\]]))|\\[([^\\[\\]\r\\\\]|\\\\.) _\\](?:(?:\r\n)?[
\t])_ )) _\ >(?:( ?:\r\n)?[ \t])_)) _)?;\s_ )

[http://www.ex-parrot.com/~pdw/Mail-RFC822-Address.html](http://www.ex-parrot.com/~pdw/Mail-RFC822-Address.html)

Edit: Apparently Perl is wrong. Thanks
[https://news.ycombinator.com/item?id=21810662](https://news.ycombinator.com/item?id=21810662)
!

~~~
jcranmer
Except that's not actually correct, because strictly following the ABNF does
not yield correct semantics for an email address.

An email address consists of a local-part, a literal @ character, and then a
domain name or an IP address literal. The local-part is either a series of
dot-separated atoms (/[a-zA-Z0-9!#$%&' _+ /=?^_`{|}~-]+/ is the syntax for an
atom) or a quoted string (/"([^\\\"\0-\031\x7f]|\\\\[^\0-\031\x7f])_"/ in
regex). If you support EAI, you need to add \u00a0-\u10ffff [i.e., all non-
ASCII non-C1 control characters].

According to RFC 822, you can insert whitespace (and comments) arbitrarily
into the mailbox production, but that is non-semantic. Seeing "From: John Doe
<foo @ example (I ((hate)) CFWS).com>" means that the email address is exactly
"foo@example.com", and you can reject any claim of the spelling without
prejudicing any email addresses.

In practice, you can drop all support for quoted strings and IP address
literals in most applications. So a correct email address (pre-EAI) regex in
that vein would be:

    
    
        [a-zA-Z0-9!#$%&'*+/=?^_`{|}~-]+(\.[a-zA-Z0-9!#$%&'*+/=?^_`{|}~-]+)*@([a-zA-Z0-9]+(-[a-zA-Z0-9]+)*\.)+[a-zA-Z]{2,}
    

If you insist on quoted-string support, then it looks like this:

    
    
        ([a-zA-Z0-9!#$%&'*+/=?^_`{|}~-]+(\.[a-zA-Z0-9!#$%&'*+/=?^_`{|}~-]+)*|"([^\\"\0-\031\x7f]|\\[\\"])*")@([a-zA-Z0-9]+(-[a-zA-Z0-9]+)*\.)+[a-zA-Z]{2,}
    

[I simplified the quoted string to only accept escapes that are semantically
necessary--the strings "a b"@example.com and "a\ b"@example.com correspond to
the same email address].

Edit: Sorry, there's quite a few asterisks in the regexes that Hacker News is
turning into italicization, and I don't know how to unbork them.

Edit 2: Someone suggested how to unbork the standalone regexes, but the
asterisks in the inline regex in the second paragraph are still missing.

~~~
chupasaurus
IDNs (Internationalized Domain Names) would blow any regexp up to infinity,
because the list of languages usable in different domains isn't unified[0]

[0][https://www.iana.org/domains/idn-tables](https://www.iana.org/domains/idn-tables)

~~~
jcranmer
By the time you hit IDNs, regexes for validation are no longer your biggest
issue. Your real check for validity at that point becomes "can I actually
contact the host" (or send an email, if validating an email address), and
there is little point in aggressively validating a purported domain name
instead of checking if it actually exists.

------
ljm
So if I understand this right, what GitHub did was something like:

    
    
        user = get_user_from_valid_email(params[:email])
        send_reset_email(params[:email])
        # instead of
        # send_reset_email(user.email)
    

?

I've seen this pattern before and the reason is usually something about using
the variable in memory as opposed to the function call. Total non-
optimisation.

~~~
microcolonel
We use three versions of the email address internally: the exact verified
address used at signup or the last valid email change, a normalized version of
that (for identity) without + mailboxes, lowercased, de-accented, stripped of
dots and other inert punctuation, and normalized in a number of other ways...
and then of course the email parameter (only used during registration).

We accomplish this with a slightly more restrictive version of the standard
ABNF provided in the RFC.

I guess I should probably document _why_ we go to this trouble, in case
somebody gets the brilliant idea to "simplify" it.

~~~
_jal
> stripped of dots and other inert punctuation

The period thing is a Gmail feature, not a standard. some.email@mydomain and
someemail@mydomain most certainly do not deliver to the same mailbox.

~~~
Symbiote
It is actually an antifeature, since it is non-standard and leads to various
attacks.

For example, a malicious user can register

    
    
      some.email@gmail.com
      so.meemail@gmail.com
      someem.ail@gmail.com
      so.mee.mail@gmail.com
      somee.mail@gmail.com
      so..mee.mail@gmail.com
      som..eem.ail@gmail.com
      so.meemail@gmail.com
      somee.mail@gmail.com
    

with a service, which will (quite reasonably) send a "Welcome" email. That
results in a flood of emails to the GMail user.
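A service that wants to notice such duplicates could compute a per-mailbox key along these lines. This is only a sketch: `gmail_mailbox_key` is a hypothetical helper, and it hard-codes the Gmail behaviour described in this thread (dots, `+suffix`, and case ignored) while making no assumption about any other provider.

```python
def gmail_mailbox_key(address: str) -> str:
    """Collapse Gmail aliases to one key; leave other providers' addresses alone."""
    local, _, domain = address.rpartition("@")
    if domain.lower() != "gmail.com":
        return address  # other providers may treat these as distinct mailboxes
    base = local.split("+", 1)[0].replace(".", "").lower()
    return f"{base}@gmail.com"


variants = [
    "some.email@gmail.com",
    "so.meemail@gmail.com",
    "someem.ail@gmail.com",
    "so.mee.mail@gmail.com",
]
print({gmail_mailbox_key(v) for v in variants})  # {'someemail@gmail.com'}
```

The key is only useful for duplicate detection; mail should still be sent to the address the user actually registered.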

~~~
morpheuskafka
I wouldn't go so far as to call it an antifeature--in fact, I wouldn't be
surprised if a common use is to allow people to maintain multiple accounts on
the same service with one email. It isn't standard, but it's not in violation
of any standard--nothing says that the server must store each distinct valid
email in a separate mailbox with its own login. Many servers implement
"catchall" emails or aliases which result in the same thing, distinct
addresses going to the same mailbox.

What is a problem, and what is non-standards-compliant, is GitHub incorrectly
assuming that all mail providers will do this when many do not. It would be no
different from assuming that "admin" and "postmaster" go to the same mailbox
because that's the way a lot of software is configured.

------
inimino
Domains are not case sensitive, but email local parts are! There is no reason
whatsoever to do case normalization on local parts of emails on any domain you
do not own, as this is strictly incorrect and could lead to a totally
different address that also exists (as happened here).

Of course, email providers are free to do whatever case folding or
normalization they want, in which case the security burden of avoiding
collisions is on the provider. If someone's email provider maps different case
variants to the same mailbox, there's still no need whatsoever to do anything
to the address, as the user will get it delivered to them regardless. If the
provider doesn't do case folding, they will have to enter their local part
case sensitively, but that's exactly the same as for any other use of their
email address.

I can only imagine how this vulnerability came to be. Unicode is not to blame
here. If security-critical password reset code was not audited carefully
enough to catch a mistake like this, one wonders what other errors might
remain.

------
qwerty456127
> 'ß'.toLowerCase() // 'ss'

Why? First of all ß is already lowercase, why should toLowerCase() change it?
It also is a normal letter having both uppercase (ẞ) and lowercase (ß) forms
so converting between the cases can be made trivial and quirk-free.
Arguably the most common word you will encounter ß in is Straße (street) where
it already is lowercase - will "Straße".toLowerCase() turn it into "strasse"?
WTF? "Straße".lower() returns "straße" in Python which seems reasonable
(nevertheless "Straße".upper() actually returns "STRASSE" ignoring the
existence of the uppercase ẞ (U+1E9E)). Why should it behave differently in
JavaScript? (Because JavaScript is different, I know, just a rhetorical
question)
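For reference, the Python behaviour described above can be checked directly. Python's `str.lower()` and `str.upper()` use Unicode's default case mappings and do not depend on the locale.

```python
print("Straße".lower())          # straße
print("Straße".upper())          # STRASSE
print("\u1e9e".lower())          # ß  (capital ẞ, U+1E9E, lowercases to ß)
print("ß".upper() == "\u1e9e")   # False: ß uppercases to "SS", not to ẞ
```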

~~~
thaumasiotes
> (nevertheless "Straße".upper() actually returns "STRASSE" ignoring the
> existence of the uppercase ẞ (U+1E9E))

Python probably predates the addition of a notional capital ß glyph in 2017.
SS is the capitalization of ß that you'd expect if you were thinking of your
data as a string rather than a collection of font elements.

------
evilotto
That's the same glyph that got a guy killed.
[https://gizmodo.com/a-cellphones-missing-dot-kills-two-peopl...](https://gizmodo.com/a-cellphones-missing-dot-kills-two-people-puts-three-m-382026)

~~~
mrmcd
My girlfriend is from Turkey, and I shared this story with her a while back (I
discovered it while also researching some Unicode collision issues.) She said
that while it's an entertaining story, there's almost certainly a bit of
sensationalism and exaggeration from the Turkish press combined with credulity
by English language journalists when re-reporting it. Basically, the supposed
texts wouldn't have made sense in terms of grammar and syntax with a
straightforward dotless/dotted I swap, and it would've been obvious to someone
fluent in Turkish what had happened. This would've been especially true if you
had had this cell phone for any amount of time, had been communicating in
Turkish, and it had been routinely swapping Is.

More likely is a bunch of young and/or not too bright people were looking for
a reason to get into a violent confrontation. Then the muckraking Turkish
press had a sensationalist murder-suicide lovers quarrel story, and as a bonus
a nationalistic "see how cell phone companies don't respect our culture" angle
as the cherry on top.

However, you should still ALWAYS be careful when converting between character
sets and be locality aware when manipulating strings. Practice string safety.
;)

------
mrighele
> // Note the Turkish dotless i
> 'John@Gıthub.com'.toLowerCase() === 'John@Github.com'.toLowerCase()

I'm not sure this example is correct. The dotless ı is already lower cased, so
the comparison above should yield false. Maybe the author was thinking about
upper case dotted "İ", which becomes regular dotted "i" when lower cased.

So what could happen is that a user enters "JOHN@GİTHUB.COM" as email, and
then the email is sent to "john@github.com" .

~~~
_jg
Should be: 'John@Gıthub.com'.toUpperCase() === 'John@Github.com'.toUpperCase()
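The same comparisons can be reproduced in Python, which also uses Unicode's default (non-Turkish) case mappings. This is a quick check rather than a statement about what GitHub's code did.

```python
dotless = "John@G\u0131thub.com"  # ı is U+0131, LATIN SMALL LETTER DOTLESS I
plain = "John@Github.com"

print(dotless.lower() == plain.lower())  # False: ı is already lowercase
print(dotless.upper() == plain.upper())  # True: ı and i both uppercase to I

# Capital dotted İ (U+0130) lowercases to "i" plus COMBINING DOT ABOVE
# (U+0307) under the default mappings, so it does not equal a plain "i".
print("\u0130".lower() == "i")           # False
print("\u0130".lower() == "i\u0307")     # True
```

So under the default mappings the collision with plain i comes from upper-casing the dotless ı, not from lower-casing either Turkish letter.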

~~~
stedaniels
The example doesn't seem to work at all for GitHub's explanation.. they say
that their outgoing email server didn't support unicode in the domain part
anyway. What am I missing? Was your actual attack on the local part?

------
kerng
What is so fascinating about security is that the same problems keep popping
up again once in a while - I remember the Turkish i issues being a big problem
in the early-to-mid 2000s, during all the security pushes that went through the
software engineering world. Then we kinda forgot about them, and now they keep
coming up again.

The best a researcher can do is go back 20+ years and look for security issues
that occurred back then. Most likely you'll find very similar things again
now.

------
michelpp
The word "delıvered" is snuck into the article as a little Easter egg.

~~~
Thorentis
So it is! It seems to even fool Chrome. If you search for "delivered" on the
page, the search box says "1/4", but pressing Enter will only take you to the 2
real ones, not the Turkish i ones, which it has presumably counted.

~~~
Avamander
The alternative is worse though. Characters with umlauts matched with
characters with no umlauts. E.g. searching for "rõõsa" will find both "rõõsa"
and "roosa", incredibly annoying.

~~~
grenoire
Mind that an umlaut is the double dots (Ö, Ü, Ä, etc.) above letters; accents
are _any of_ the ones that are attached to the 'base set.'

------
jrochkind1
The examples in the initial "quick example" are backwards, no?

It's `'ß'.toUpperCase()` that is `"SS"`, _not_ `'ß'.toLowerCase() === 'ss'`.
As the later chart makes clear. Same with Turkish ı.

~~~
gitgud
Well I just tried in the browser console and got:

    
    
        'ß'.toUpperCase() // = "SS"
        'ß'.toLowerCase() // = "ß"

~~~
dathinab
Note that Unicode did add an "uppercase-ish" ß, as it does appear in German,
but only in the context of an all-caps word, e.g. on a signboard, so
capitalization of a whole word to SS and that new all-caps ß are both correct
(not sure if Unicode changed the capitalization rules or just added that
strange all-caps ß)

~~~
thristian
For backwards compatibility reasons, the capitalization rules can't be changed
for existing characters. So normalizing by naive case-folding now requires at
least three steps:

    
    
        "ẞ".to_lower() → "ß"
        "ß".to_upper() → "SS"
        "SS".to_lower() → "ss"
    

(there's a standard for how to compare strings case-insensitively that doesn't
involve repeated case-folding, but it's much more complex)

~~~
Pahr3yah
Unicode has a dedicated case _folding_ [0][1] (as opposed to upper/lower case
mapping) algorithm to cover these cases.

[0]
[https://www.unicode.org/reports/tr21/tr21-3.html#Caseless%20...](https://www.unicode.org/reports/tr21/tr21-3.html#Caseless%20Matching)
[1] [http://userguide.icu-project.org/transforms/casemappings#TOC...](http://userguide.icu-project.org/transforms/casemappings#TOC-Case-Folding)
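Python exposes Unicode's default full case folding as `str.casefold()`, so a short check shows what folding does and does not unify.

```python
print("\u1e9e".casefold())  # ss  (capital ẞ)
print("ß".casefold())       # ss
print("STRASSE".casefold() == "Straße".casefold())      # True

# Default case folding leaves dotless ı (U+0131) alone, so it does not
# collide with plain i the way upper-casing does.
print("g\u0131thub".casefold() == "github".casefold())  # False
print("g\u0131thub".upper() == "github".upper())        # True
```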

------
Keverw
Reminds me of an issue Spotify had too with Unicode and usernames.

[https://labs.spotify.com/2013/06/18/creative-usernames/](https://labs.spotify.com/2013/06/18/creative-usernames/)

I had someone tell me that programming isn't real work before, and this is yet
another example of all the small little details going into building things
that most people don't really think about day to day.

I haven't had to work with login code in a while, but might at some point. I
know some systems only allow alphanumeric usernames, but it sounds like you can't
force people to have alphanumeric emails... Well I guess you could but might
upset someone. I know there's normalizing functions though like NFD, NFC, NFKD
or NFKC that might work for usernames but not sure what's really recommended.
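To make those normalization forms concrete, here is a small Python sketch of a username comparison key. `username_key` is a hypothetical helper and a simplification: it is not a recommendation of any particular profile, and a real system would also want to reject names whose key differs from what the user typed rather than rewrite them silently.

```python
import unicodedata


def username_key(name: str) -> str:
    """Simplified comparison key: compatibility-normalize, then case-fold."""
    return unicodedata.normalize("NFKC", name).casefold()


# Fullwidth letters and the "fi" ligature are compatibility variants that
# NFKC maps onto plain ASCII letters.
print(username_key("\uff4a\uff4f\uff48\uff4e"))  # john
print(username_key("\ufb01sh"))                  # fish
print(username_key("John"))                      # john
```

NFC and NFD would leave the fullwidth and ligature forms untouched, because they only handle canonical equivalence such as composed versus decomposed accents.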

Then there's also mitigating brute-force attempts, and other considerations to
make when building out an account system. Then, if your company is large
enough to provide phone support, I'm not sure how you'd tell the support
person which specific emoji you used in your username.

~~~
keyP
> I had someone tell me that programming isn't real work before

Haha, I don't even understand what metrics the person was using to consider
something "work". A pilot sits throughout the flight but I think it's fair to
say they're working...

~~~
Keverw
Yeah. I guess this person thinks sitting at a computer all day isn't real
work. Real work would be working in a factory all your life breaking your
back. Then I also think some people think making websites and coding is the
same as using Word or Powerpoint. I guess they just don't really understand
tech, probably a lot of people in the rust belt. Probably why they're driving
young people away.

------
dathinab
As a side note, Punycode conversion is only defined for the domain name, _not_
the local part. Using Punycode on the local part will semantically create a
different email address, and at least theoretically a mail provider might
support both the Punycode and the normal version as two different mail
addresses, so using Punycode there would potentially open up a different
vulnerability.
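A rough Python sketch of converting only the domain. `to_ascii_domain` is a hypothetical helper that applies the standard library's raw `punycode` codec label by label; it skips the mapping and validation steps that real IDNA processing performs, so treat it as an illustration only.

```python
def to_ascii_domain(address: str) -> str:
    """Punycode-encode non-ASCII domain labels; leave the local part untouched."""
    local, _, domain = address.rpartition("@")
    labels = []
    for label in domain.split("."):
        if label.isascii():
            labels.append(label)
        else:
            # "xn--" is the prefix that marks a Punycode-encoded label.
            labels.append("xn--" + label.encode("punycode").decode("ascii"))
    return f"{local}@{'.'.join(labels)}"


print(to_ascii_domain("john@gma\u0131l.com"))  # john@xn--gmal-nza.com
```

A non-ASCII local part passes through unchanged, which matches the point above that Punycode is not defined for it.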

Now that I think about it, as far as I remember the local part of an email
address is actually not defined as case-insensitive, though all(?) mail
programs treat it as such. The important part here is to always use data from
your database for any security-relevant parts.

------
MattConfluence
The linked post is titled "Hacking GitHub with Unicode's dotless 'i'.", but
this submission is in title case "Hacking GitHub's Auth with Unicode's Turkish
Dotless 'I'". I think this is a bad title change, because an uppercase I is
supposed to be dotless, whereas the lowercase i used by the author's title is
not.

------
reportgunner
I haven't finished reading the article, but isn't the problem here that they
are sending the e-mail to the address provided by the user rather than sending
it to the e-mail stored in the database ?

I fail to see any added value in sending a reset link to an e-mail entered by
the user (while that e-mail is already in the database).

Is it because the e-mails are stored hashed ?

------
angst

      >'ß'.toLowerCase() // 'ss'
      "ß"
      >'ß'.toLowerCase() === 'SS'.toLowerCase() // true
      false
      >// Note the Turkish dotless i
      >'John@Gıthub.com'.toUpperCase() === 
      'John@Github.com'.toUpperCase()
      true
    

Chrome 79.0.3945.79 (Windows, 64bit) seems to differ from the proposed results
for the first two statements. If the proposed results are what should indeed
happen by the Unicode standard, then I wonder whether Chrome is not fully
implementing it?

------
leowoo91
Just checked gmaıl IDN as xn--gmal-nza.com which is registered by: 2019-02-27

------
blipmusic
Julia returns:

    
    
        julia> c='ı'
        'ı': Unicode U+0131 (category Ll: Letter, lowercase)
        julia> uppercase(c)
        'I': ASCII/Unicode U+0049 (category Lu: Letter, uppercase)
        julia> lowercase(uppercase(c))
        'i': ASCII/Unicode U+0069 (category Ll: Letter, lowercase)
    

Is this something that needs changing in the Unicode spec itself or how
strings are handled in general by various tools/programming languages? I love
[plain] text, but it's so, so fragile. :/

~~~
ishi
I believe this specific problem exists in Unicode, and not just in the Julia
language.

------
bowmessage
Author: any word on how much GitHub paid out for discovering this
vulnerability? Also, how simple was it to create a unicode-based email address
on one of the large providers?

------
wolco
So any email containing an i can be reset. Technically those using a custom
domain name are immune but those using a general email service are at risk.

Why would you write a general function that resets an account password but
also accept an email address as a parameter? What use-case exists to change
the email address sending the message?

------
lamby
Django has announced their security releases to match. [0]

[0] [https://www.djangoproject.com/weblog/2019/dec/18/security-re...](https://www.djangoproject.com/weblog/2019/dec/18/security-releases/)

------
veganjay
There's a challenge in HackVent that involves unicode:

[http://whale.hacking-lab.com:8881/](http://whale.hacking-lab.com:8881/)

I'm not sure if the solution is unicode collisions like in the article - still
trying to solve it...

------
electrotype
I have to admit this is an attack I wasn't aware of! I'll have to update some
websites soon...

------
hashStuff
Would this not have been solved if the email addresses were stored as hashes?
Besides this, it would be an extra layer of security in the event of a breach.
Not storing addresses as hashes seems silly, especially when they are stored
in true databases that can be queried quickly.

------
gumby
Unicode actually has an uppercase ß though I don't understand why.

~~~
thewarpaint
Because it's part of the German language:
[https://en.m.wikipedia.org/wiki/Capital_%E1%BA%9E](https://en.m.wikipedia.org/wiki/Capital_%E1%BA%9E)

~~~
gumby
Barely. If you look at it, it is clearly just what its name says: a lowercase
medial s combined with a lowercase z. The uppercase version exists, as a
parallel commenter noted, basically as a typographic utility.

Fraktur had/has other such lower case ligatures (tz, ch, sch, ss ( _not_ ß) et
al) but for some reason only ß survived into Latin script as a full fledged
letter. I have a lot of old (mostly 20th century) books in Fraktur and they
all use these ligatures more consistently than the Latin ligatures are used.

~~~
80386
Fraktur ß is a ligature of s and z; the current form came into use in Latin-
script German because the Latin script already had ß, for the ligature of s
and s.

Some sort of orthographic device is needed - _s_ between two vowels means the
first vowel is long and the consonant is voiced, and _ss_ between two vowels
means the first vowel is short and the consonant is voiceless. So it's useful
to have a different case for when the first vowel is long and the consonant is
voiceless:

 _Busen_ /bu:zən/

 _Busse_ /busə/

 _Buße_ /buːsə/

Vowel length isn't predictable from spelling in cases of consonant clusters
and other digraphs: _Hand_ /hant/ vs. _Mond_ /moːnt/, _Bruch_ /brux/ vs.
_Buch_ /buːx/, etc. So _ss_ could've been used for both - or _sz_ , although
there are some words spelled with _sz_ as a sequence of _s_ and _z_ , like
_Szene_ /stseːnə/.

~~~
leipert
Not really _needed_. Swiss orthography doesn’t use _ß_

------
ptah
> Have we convinced you that Unicode is Awesome?

this is sarcasm right?

------
hashStuff
Wouldn't having the email hashed prevent this? Not sure why email addresses
are stored plain text still. Especially if they are stored in databases.

~~~
sophiebits
Emails are used for, among other things, sending email to a user. Which
requires the application know the email address.

------
fartcannon
This is an organization that not only hosts a great deal of the world's
public/secret code, but is run by one of the largest data-gathering
organizations on earth.

I don't believe Microsoft can permit things like this under its umbrella if
it wants to continue to pretend that its data collection is benign.

------
notlukesky
The likelihood of such a “hack” happening using the Turkish dotless “I” is
ZERO as all Turkish email addresses and website domains are formatted WITHOUT
using Turkish characters which include examples like: ç, ı, ü, ğ, ö, ş, İ, Ğ,
Ü, Ö, Ş, Ç

If you are at all interested in Turkish characters:

[https://en.wikipedia.org/wiki/Wikipedia:Turkish_characters](https://en.wikipedia.org/wiki/Wikipedia:Turkish_characters)

[https://www.turkcebilgi.com/türkçe_karakter](https://www.turkcebilgi.com/türkçe_karakter)

This should be called the Turkish character hack: A hack only possible in the
theoretical realm.

These characters are rendered in HTML and supported by OSes. They are just not
used for emails AND website domains.

So the password reset “hack” with an email containing Turkish characters is
not a possibility from the get-go.

The whole attack vector hinges on emails that exist with Turkish characters in
the first place.

~~~
_jg
This was a very real and demonstrated vulnerability. Perhaps I've
misunderstood your comment.

~~~
notlukesky
There are no emails with Turkish characters. The whole attack vector hinges on
emails that exist with Turkish characters in the first place.

~~~
anderskaseorg
That’s incorrect. The attack vector hinges on the ability to _create_ email
addresses with Turkish characters. There is nothing stopping an attacker from
creating addresses with Turkish characters to attack existing addresses
without Turkish characters.
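
A minimal sketch of that collision in Python, assuming (as the article
describes) that the server compares addresses after a Unicode-aware
uppercase. The addresses and the `naive_match` helper are made up for
illustration and are not GitHub's actual code:

```python
# Hypothetical addresses: the victim's is plain ASCII, the attacker's
# swaps one "i" for U+0131 (LATIN SMALL LETTER DOTLESS I).
DOTLESS_I = "\u0131"

victim = "john@github.com"
attacker = "john@g" + DOTLESS_I + "thub.com"


def naive_match(a: str, b: str) -> bool:
    """Compare two addresses after uppercasing (the risky approach)."""
    return a.upper() == b.upper()


# The two strings are different code point sequences...
print(victim == attacker)
# ...but uppercasing maps both "i" and the dotless i to "I".
print(naive_match(victim, attacker))
```

The victim's address never needs to contain a Turkish character; only the
attacker's does.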

~~~
notlukesky
[https://en.wikipedia.org/wiki/Email_address#Internationaliza...](https://en.wikipedia.org/wiki/Email_address#Internationalization)

Turkish emails are not supported in the first place.

Internationalization examples: The example addresses below would not be
handled by RFC 5322 based servers, but are permitted by RFC 6530. Servers
compliant with this will be able to handle these:

Latin alphabet with diacritics: Pelé@example.com

Greek alphabet: δοκιμή@παράδειγμα.δοκιμή

Traditional Chinese characters: 我買@屋企.香港

Japanese characters: 二ノ宮@黒川.日本

Cyrillic characters: медведь@с-балалайкой.рф

Devanagari characters: संपर्क@डाटामेल.भारत

~~~
dmurray
RFC 6530 doesn't mention those character sets explicitly. It proposes allowing
all Unicode characters, apart from some control characters.

It is true that the RFC recommends mailbox providers take normalization into
account. A mailbox provider that allows i and dotless-i addresses to be routed
to different mailboxes is careless, if not actually noncompliant. I don't know
if any popular provider does this: I'm guessing the authors created their own
to demonstrate this attack.
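
As a rough sketch of what "taking normalization into account" could look
like on the provider side: the `mailbox_key` and `can_register` helpers below
are hypothetical, not anything RFC 6530 specifies. The key applies NFKC and
then round-trips through upper/lower case, because NFKC alone leaves the
dotless i untouched:

```python
import unicodedata


def mailbox_key(local_part: str) -> str:
    """Hypothetical collision key for a mailbox name.

    NFKC handles compatibility forms; the upper/lower round trip folds
    characters such as U+0131 (dotless i) onto their ASCII look-alikes.
    """
    return unicodedata.normalize("NFKC", local_part).upper().lower()


def can_register(new_local: str, existing: set[str]) -> bool:
    """Refuse a new mailbox whose key collides with an existing one."""
    new_key = mailbox_key(new_local)
    return all(mailbox_key(name) != new_key for name in existing)


existing = {"mike"}
print(can_register("m\u0131ke", existing))  # dotless-i variant of "mike"
print(can_register("mika", existing))       # genuinely different name
```

A provider applying a check like this would refuse the dotless-i variant
instead of routing it to a separate mailbox.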

~~~
notlukesky
Turkish characters are not part of RFC 6530.

There are no email addresses with Turkish characters at all. They all use
Latin characters.

It just does not exist - yet at least.

~~~
anderskaseorg
Yes, they are part of RFC 6530, via its references to RFC 3629 (UTF-8) and the
Unicode standard.

