
Creative usernames and Spotify account hijacking - e1ven
http://labs.spotify.com/2013/06/18/creative-usernames
======
johnvschmitt
This seems odd. I mean, if their code was properly _modular_, they would
have just one place where they "fetchUserIdByName(userName)", which returns
one user ID or null if it's not used yet.

When a new user is created, it then gets assigned a unique user ID. The email
address is assigned to that user ID.

Then, if they do a password reset on user = "bigbird", it should do the exact
same lookup to find the email address.

The security bug was not really about having an improper function to do
unicode translation. It was more about having _different functions for the
same check_, simply because they were in different parts of the code.

Modular code is just so much better on all fronts, including security.
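
To make the "one place" idea concrete, here is a minimal sketch in Python. Everything in it is hypothetical: the `UserStore` class, the in-memory dicts and the plain `.lower()` key are assumptions for illustration, not Spotify's code. The point is only that registration and password reset both go through the same lookup.

```python
from typing import Dict, Optional


class UserStore:
    """Toy in-memory store (hypothetical); all name lookups share one path."""

    def __init__(self) -> None:
        self._ids_by_key: Dict[str, int] = {}
        self._email_by_id: Dict[int, str] = {}
        self._next_id = 1

    @staticmethod
    def _key(user_name: str) -> str:
        # Assumption: plain lowercasing stands in for the real canonicaliser.
        return user_name.lower()

    def fetch_user_id_by_name(self, user_name: str) -> Optional[int]:
        # The single place where a name is resolved to a user ID.
        return self._ids_by_key.get(self._key(user_name))

    def register(self, user_name: str, email: str) -> int:
        if self.fetch_user_id_by_name(user_name) is not None:
            raise ValueError("username already taken")
        user_id = self._next_id
        self._next_id += 1
        self._ids_by_key[self._key(user_name)] = user_id
        self._email_by_id[user_id] = email
        return user_id

    def reset_email_for(self, user_name: str) -> Optional[str]:
        # Password reset reuses the same lookup as registration.
        user_id = self.fetch_user_id_by_name(user_name)
        return None if user_id is None else self._email_by_id[user_id]
```

With this shape, `register("BigBird", ...)` followed by `reset_email_for("BIGBIRD")` resolves to the same account, because both calls build the key the same way.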

~~~
dan_manges
Based on their description of the bug, it sounded like the code was modular,
but they called the function twice: once when the password reset request was
generated, and again when the link in the email was clicked.

    
    
      However, when the link was used, canonical_username was once again applied
    

So after they sent the password reset link, they called "fetchUserIdByName"
again, but they passed in a username that had already been canonicalized once.
Because of this bug, I wonder if password resets worked at all for users with
unicode characters in their names.

~~~
mmahemoff
If you're saying canonicalise(canonicalise(name)) is not the same as
canonicalise(name), that's going to be seriously bug-prone. Idempotence ftw.
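
A small Python sketch of what a non-idempotent canonicaliser looks like. The function `lower_then_nfkc` is a deliberately simplified stand-in (lowercase first, then NFKC), not Twisted's nodeprep; it only reproduces the same kind of two-step behaviour using the standard `unicodedata` module.

```python
import unicodedata


def lower_then_nfkc(name: str) -> str:
    # Stand-in canonicaliser (an assumption, NOT the real nodeprep):
    # lowercase first, then apply NFKC compatibility normalization.
    return unicodedata.normalize("NFKC", name.lower())


def is_idempotent_on(func, value: str) -> bool:
    # f is idempotent on value when f(f(value)) == f(value).
    once = func(value)
    return func(once) == once


# MODIFIER LETTER CAPITAL B, I, G, B, I, R, D (U+1D2E, U+1D35, ...)
tricky = "\u1d2e\u1d35\u1d33\u1d2e\u1d35\u1d3f\u1d30"

first = lower_then_nfkc(tricky)   # lower() leaves these alone, NFKC maps to ASCII capitals
second = lower_then_nfkc(first)   # now lower() has something to do

print(first, second, is_idempotent_on(lower_then_nfkc, tricky))
```

The first pass yields `BIGBIRD` and the second yields `bigbird`, so the check reports that the function is not idempotent on this input, while ordinary ASCII names pass.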

~~~
spicyj
That's exactly what they describe as the cause of the bug. They intended for
the function to be idempotent but it wasn't because of a misunderstanding with
the Python library spec.

~~~
Xorlev
Worse, it /did/ work that way in Python 2.4, but Python 2.5 stopped throwing
an exception for invalid codepoints, which broke the Twisted library, which in
turn broke their canonicalization function.

------
NKCSS
I think they could have handled the 'reward' a bit better :P

"In this case the two users who posted to the forum where actually rewarded
with some Spotify premium months."

I'd say: Premium lifetime memberships would be better :)

~~~
robmcm
Yeah, seeing as before they had the option to log into anyone's account,
effectively the same as a lifetime membership.

------
a-nikolaev
I don't see any real reason to rely on idempotence.

They could simply store two names: One is provided by the user (verbatim), and
the second is its reduction to lowercase letters and digits (canonical). For
all internal logic, they could use only the canonical name, and use the
verbatim name in the front-end to make the user happy.

 _> Lower casing has the key property of being idempotent, i.e., that applying
it more than once has no effect: x.lower() == x.lower().lower(). So if a
username gets passed from service to service and you want to make sure it is
in canonical form you can safely apply .lower() and if it was already in
canonical form there is no harm done, and it is easy to stay safe._

Apparently, they thought that it's ok to use verbatim and canonical names
interchangeably, relying on the idempotence property of the XMPP function.

It looks like a huge design error.

~~~
readme
>They could simply store two names

Presumably you mean in the database? I don't see a reason to keep a copy of
the lower() transformation of a string when it is incredibly cheap to
transform a small string to lowercase.

What exactly is the point of that? I would just call lower() as needed,
personally.

~~~
a-nikolaev
Oh, my point was that the two kinds of names should not be used
interchangeably. Whether or not to store the .lower() result is a matter of
taste. (Personally, I would store both names, just to avoid wasting computing
time.)
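
One way to keep the two kinds of names from being mixed up is to give the canonical form its own type. The sketch below is an assumption-laden illustration: `CanonicalName`, `Account` and `canonicalise` are made-up names, and plain lowercasing stands in for the real canonicalisation.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CanonicalName:
    """Wrapper type (hypothetical) so a canonical name is never a bare str."""
    value: str


@dataclass(frozen=True)
class Account:
    display_name: str          # verbatim, shown in the front-end
    canonical: CanonicalName   # used for all internal lookups


def canonicalise(raw: str) -> CanonicalName:
    # Accepts only raw strings, so it cannot be applied to its own output
    # without an explicit unwrap. Assumption: lowercasing is the whole rule.
    return CanonicalName(raw.lower())


def make_account(raw: str) -> Account:
    return Account(display_name=raw, canonical=canonicalise(raw))
```

Because `canonicalise` takes a `str` and returns a `CanonicalName`, a type checker would flag any attempt to canonicalise an already-canonical value, so the code no longer depends on idempotence.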

~~~
readme
I think the act of storing both names is bad, because you multiply the amount
of data that could possibly become wrong by 2.

With lower(), we can expect we'll get the right transformation of string A
each time. If instead, we store string A, and then store string B as A.lower()
and copy it... A.lower() will always be A.lower(), but it's much easier for
someone to come along, screw with the database, and change B.

~~~
timv
I'm not sure how they can avoid storing both.

They need to store the verbatim username in order to know how to display the
username in the UI.

They need to store the canonical username in order to efficiently know whether
a given canonical username is in use.

~~~
brokenparser
Not necessarily, in PostgreSQL you could simply add a canonicalised index.
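
A rough sketch of the expression-index idea, using Python's built-in `sqlite3` as a stand-in for PostgreSQL so it runs without a server. The table and index names are made up, and SQLite's `lower()` only folds ASCII by default, so this illustrates the mechanism rather than a Unicode-safe setup.

```python
import sqlite3

# In-memory SQLite database as a stand-in for PostgreSQL (an assumption).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT NOT NULL)")
# Unique index on an expression: only the verbatim name is a column,
# but uniqueness is enforced on its lowercased form.
conn.execute("CREATE UNIQUE INDEX users_canonical ON users (lower(name))")


def try_register(name: str) -> bool:
    """Insert the verbatim name; return False if its lowercased form exists."""
    try:
        with conn:  # commits on success, rolls back on error
            conn.execute("INSERT INTO users (name) VALUES (?)", (name,))
        return True
    except sqlite3.IntegrityError:
        return False
```

Registering "BigBird" and then "bigbird" succeeds once and fails once, while a lookup with `WHERE lower(name) = lower(?)` still returns the verbatim "BigBird" for display.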

~~~
timv
Well, in that case you're still storing it, you're just letting the db
store it for you.

But - when the issue here is the question of the reliability of the
implementation of the canonicalisation function, having it done once in
Python, and then again by PG is going to be a huge issue.

------
jrochkind1
Very interesting post -- dealing with unicode gets tricky. Because dealing
with a full character repertoire is tricky, not because unicode does it poorly
(unicode actually does it well).

I wonder if Unicode UTR#30 would be an alternative (more reliable?) method of
idempotent normalization?
[http://www.unicode.org/reports/tr30/tr30-4.html](http://www.unicode.org/reports/tr30/tr30-4.html)

Or maybe the twisted algorithm really is UTR#30, just not labelled that?

As far as I can tell, UTR#30 did not make it to formally being part of the
unicode spec, for reasons I'm not entirely clear on -- it is nonetheless quite
useful, and this case is an example. Solr for instance still uses it.
([http://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters#...](http://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters#solr.ICUFoldingFilterFactory)).

It might be a pain to find code implementing UTR#30 in your language of choice
though (I am not sure if it's part of current ICU libraries or not).

It's also worth pointing out, that in addition to this kind of 'folding' of
different-but-look-the-same graphemes, in this sort of use case you ABSOLUTELY
need to do byte normalization as per UAX#15
[http://unicode.org/reports/tr15/](http://unicode.org/reports/tr15/) .
Probably NFKC for this sort of use case.

~~~
jrochkind1
Huh, and another alternative would be using one of the unicode collation
algorithms for normalization -- which, unlike UTR#30, did make it beyond draft
stage to be an official part of the unicode spec.

[http://www.unicode.org/reports/tr30/tr30-4.html#_Toc23](http://www.unicode.org/reports/tr30/tr30-4.html#_Toc23)

------
tantalor
_So if a username gets passed from service to service and you want to make
sure it is in canonical form you can safely apply .lower() and if it was
already in canonical form there is no harm done, and it is easy to stay safe._

Why?

Suppose you only pass the original name around instead. Then you don't
_require_ your canonicalization function to be idempotent, which might be good
in your case since it wasn't.

~~~
antocv
Because they wanted "BigBird" to be the same as "bigbird" or "BIGBIRD". Hence
if you wanted to rely on a userId, you would still need to ensure all the
variations of BiGbirD map to the same userId.

------
Esifer
Why does Unicode treat Omega and Ohm as different characters?

~~~
moskie
Wikipedia's article on Ohm actually covers this!

[http://en.wikipedia.org/wiki/Ohm#Ohm_symbol](http://en.wikipedia.org/wiki/Ohm#Ohm_symbol)

"Unicode encodes the symbol as U+2126 Ω ohm sign, distinct from Greek omega
among letterlike symbols, but it is only included for backwards compatibility
and the Greek uppercase omega character U+03A9 Ω greek capital letter omega
(HTML: &#937; &Omega;) is preferred."

And from the Unicode Standards doc that is the source for that section:

"Greek Letters as Symbols: The use of Greek letters for mathematical variables
and operators is well established. Characters from the Greek block may be used
for these symbols.

For compatibility purposes, a few Greek letters are separately encoded as
symbols in other character blocks. Examples include U+00B5 µ in the Latin-1
Supplement character block and U+2126 Ω in the Letterlike Symbols character
block. The ohm sign is canonically equivalent to the capital omega, and
normalization would remove any distinction. Its use is therefore discouraged
in favor of capital omega. The same equivalence does not exist between micro
sign and mu, and use of either character as micro sign is common; for Greek
text, only the mu should be used."
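
The difference between the two cases in that quote can be checked with Python's standard `unicodedata` module. This is just a quick sketch; the constant names are mine.

```python
import unicodedata

OHM_SIGN = "\u2126"       # OHM SIGN
CAPITAL_OMEGA = "\u03a9"  # GREEK CAPITAL LETTER OMEGA
MICRO_SIGN = "\u00b5"     # MICRO SIGN
SMALL_MU = "\u03bc"       # GREEK SMALL LETTER MU


def nfc(text: str) -> str:
    # Canonical normalization: only merges canonically equivalent forms.
    return unicodedata.normalize("NFC", text)


def nfkc(text: str) -> str:
    # Compatibility normalization: also folds compatibility variants.
    return unicodedata.normalize("NFKC", text)


print(nfc(OHM_SIGN) == CAPITAL_OMEGA)  # ohm sign is canonically equivalent
print(nfc(MICRO_SIGN) == MICRO_SIGN)   # micro sign survives NFC unchanged
print(nfkc(MICRO_SIGN) == SMALL_MU)    # but NFKC maps it to Greek mu
```

So the ohm sign disappears under any normalization form, while the micro sign only folds into mu under the compatibility forms (NFKC/NFKD).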

~~~
afandian
Is there a separate symbol for A?

------
Sami_Lehtinen
Nothing new at all; even my stockbroker had a similar kind of bug. There
should be only one function that handles usernames. When the same thing is
implemented differently all over the source, this is exactly what happens.

I have also seen much worse solutions, where just giving a username logs you
in (sets the logged-in session cookie) and then prompts for the password. When
you enter an invalid password you're logged out. But if you give a username
and then change the URL, you're in. Business as usual. When you test it, it
works: username + right password = ok, username + wrong password != ok. Tests
passed, and that's it.

------
United857
A similar issue exists with international domain names.

[https://en.wikipedia.org/wiki/IDN_homograph_attack](https://en.wikipedia.org/wiki/IDN_homograph_attack)

~~~
Groxx
it's a different problem, though to some degree they stem from the same
underlying issue (similar but different characters).

------
Jhsto
I'd like to point out that even though not every website accepts unicode
characters on registration, some software such as vBulletin lets the
administrator/user replace the username with such characters once the user
is signed up.

You may not create as much havoc as in the original post, but some level of
confusion at least.

------
dhtbccnhmdhm
Why not repeat the canonicalization until the name stops changing?

    canName = canonical_username(name);
    for (i = 0; ; i++) {
      next = canonical_username(canName);
      if (next == canName) break;  // fixed point reached
      if (i >= threshold) { stop_registration(); break; }
      canName = next;
    }
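
The same idea as a runnable Python sketch: apply the canonicaliser until the output stops changing, and give up after a few rounds. `lower_then_nfkc` is a simplified stand-in for the real function, and `max_rounds` is an arbitrary assumption.

```python
import unicodedata
from typing import Callable, Optional


def lower_then_nfkc(name: str) -> str:
    # Stand-in canonicaliser (assumption): lowercase, then NFKC.
    return unicodedata.normalize("NFKC", name.lower())


def canonicalise_to_fixed_point(
    name: str,
    step: Callable[[str], str] = lower_then_nfkc,
    max_rounds: int = 5,
) -> Optional[str]:
    """Apply step repeatedly until it stops changing the name.

    Returns None if no fixed point is reached within max_rounds, in which
    case the caller would refuse the registration.
    """
    current = step(name)
    for _ in range(max_rounds):
        following = step(current)
        if following == current:
            return current
        current = following
    return None
```

This does not make the underlying function idempotent; it only guarantees that whatever gets stored is a value the function maps to itself.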

------
paul_f
We require an email address as a user name. Does this get around the problem?

~~~
troels
Depends on how you validate an email address.

------
nodata
> it is probably good to... treat “BigBird” and “bigbird” as the same
> username.

Erm...

~~~
Groxx
How about "building" and "buiIding"? or "nodata" and "ｎodata"? You probably
don't want people running around claiming to be admins because their name
_looks_ the same.

Similarly, do you really want to do tech support when someone forgets that
their username is "nodata" instead of "Nodata", since their phone auto-
capitalized things when they signed up? It happens. And how do you know
they're "Nodata" and not "nOdata" or "nOdAtA"?
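
For what it's worth, here is a quick Python sketch of which of these cases a generic fold catches. The `fold` helper (NFKC followed by `str.casefold()`) is my own simplification and is not guaranteed to be idempotent for every input; it is only meant to show what normalization does and does not merge.

```python
import unicodedata


def fold(name: str) -> str:
    # Simplified folding (assumption): NFKC, then full case folding.
    return unicodedata.normalize("NFKC", name).casefold()


fullwidth_n_name = "\uff4eodata"  # starts with FULLWIDTH LATIN SMALL LETTER N

print(fold(fullwidth_n_name) == fold("nodata"))  # width variant is merged
print(fold("Nodata") == fold("nOdAtA"))          # case variants are merged
print(fold("buiIding") == fold("building"))      # capital I vs small l is not
```

The fullwidth and mixed-case variants collapse to the same key, but "buiIding" folds to "buiiding", so visually confusable letters need a separate check.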

------
waster
A fascinating account.

------
pilsetnieks
I'm confused about why they even had the need for canonicalising the usernames
for whatever purpose? Why couldn't they just have stored and used them just as
they are?

~~~
jessaustin
If they did as you suggest, the specific account-stealing flaw they had
wouldn't happen, but since many unicode points have very similar glyphs, there
would still be "copycat" accounts. That is, the strings "Oscar" and "Οscar"
appear very similar (if one has the proper fonts installed), and one user
could therefore pose as another.

It's true that many sites don't care about this, but I don't fault Spotify for
trying to prevent it.
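
A rough heuristic for spotting this kind of copycat name is to flag usernames that mix scripts. The sketch below guesses the script from the first word of each character's Unicode name, which is an approximation of the real Script property and will misclassify some characters; the function names are made up.

```python
import unicodedata
from typing import Set


def scripts_used(name: str) -> Set[str]:
    # Heuristic (assumption): the first word of the character name, e.g.
    # "LATIN" or "GREEK", approximates the script of a letter.
    scripts = set()
    for ch in name:
        if ch.isalpha():
            scripts.add(unicodedata.name(ch, "UNKNOWN").split()[0])
    return scripts


def looks_mixed_script(name: str) -> bool:
    return len(scripts_used(name)) > 1


latin_oscar = "Oscar"
greek_o_oscar = "\u039fscar"  # starts with GREEK CAPITAL LETTER OMICRON

print(scripts_used(latin_oscar))
print(looks_mixed_script(greek_o_oscar))
```

Normalization and case folding leave the Greek omicron as a Greek letter, so a check along these lines (or a proper confusables table) is a separate step from canonicalisation.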

~~~
pilsetnieks
It looks to me as if the prevention of this caused more problems than it
solved.

~~~
ars
Not even close.

The letter À (A grave) can be written as the UTF-8 bytestream 0xC3 0x80 (i.e.
a single "character"), or as À - i.e. a letter A, then a combining grave
character i.e. 0x41 0xCC 0x80.

The two are identical. Except they have different byte representations. If you
don't normalize your unicode you will run into major problems.
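
The two byte sequences above can be checked directly in Python with the standard `unicodedata` module (the variable names are mine):

```python
import unicodedata

precomposed = "\u00c0"  # LATIN CAPITAL LETTER A WITH GRAVE
decomposed = "A\u0300"  # LATIN CAPITAL LETTER A + COMBINING GRAVE ACCENT

# Different code points and different UTF-8 bytes ...
print(precomposed.encode("utf-8"))  # bytes C3 80
print(decomposed.encode("utf-8"))   # bytes 41 CC 80
print(precomposed == decomposed)    # a naive comparison says "different"

# ... but the same string once both sides are normalized to one form.
print(unicodedata.normalize("NFC", decomposed) == precomposed)
print(unicodedata.normalize("NFD", precomposed) == decomposed)
```

Without a normalization step, the two spellings would count as two distinct usernames that render identically.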

~~~
jrochkind1
There are actually two kinds of normalization in play here.

The one you are talking about is the one unicode actually calls normalization,
and is dealt with in UAX#15.
[http://unicode.org/reports/tr15/](http://unicode.org/reports/tr15/)

You are absolutely right that, in almost any situation taking unicode input
where you're ever going to need to compare strings (and in most where you're
ever going to need to display them), you are going to need to apply one of the
UAX#15 normalization forms. UAX#15 normalizes different byte representations
of what, in ALL circumstances, are indeed identical characters/graphemes. A lot
of people don't take account of this.

Then there's the kind of canonicalization that OP talks about, which Unicode
actually calls 'folding', and is about characters/graphemes which really ARE
different characters but which, for _some_ but not all contexts may be treated
as 'equivalent' (if not necessarily identical). The simplest example is case
insensitivity, but there are other trickier ones in the complete repertoire,
like those discussed in the OP.

This second kind of 'folding' canonicalization is a lot trickier, because it
is contextual, not absolute. Which is maybe why Unicode started out trying to
make an algorithm for it in UTR#30 but then abandoned it. Nonetheless, despite
its trickiness and contextuality, you often still really do need to do it, as
in OP.

------
drivebyacct2
So is there a need for a TRULY idempotent equivalent of XMPP's nodeprep? Or
one that handles more Unicode points? Or is it a calculated decision to
support Unicode 3.2 points only? (Sorry for the nooby questions, but this was
very interesting and I don't know a lot about Unicode)

~~~
lancestout
The decision to only support Unicode 3.2 is simply because the StringPrep
framework [1] (which XMPP's nodeprep and various other protocols use) is
forever tied to that version of Unicode.
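
Python's standard library ships the RFC 3454 tables in the `stringprep` module, so the Unicode 3.2 restriction is easy to observe. The helper below is a hypothetical sketch that lists characters falling in Table A.1 (code points unassigned in Unicode 3.2), which a strict registration path could reject outright.

```python
import stringprep
from typing import List


def unassigned_in_unicode_3_2(name: str) -> List[str]:
    # stringprep.in_table_a1 is True for code points that were unassigned
    # in Unicode 3.2 (RFC 3454, Table A.1).
    return [ch for ch in name if stringprep.in_table_a1(ch)]


def acceptable_for_registration(name: str) -> bool:
    # Hypothetical policy: refuse any name containing such code points.
    return not unassigned_in_unicode_3_2(name)


print(acceptable_for_registration("bigbird"))
# U+1D2E MODIFIER LETTER CAPITAL B was added after Unicode 3.2.
print(acceptable_for_registration("\u1d2eigbird"))
```

Rejecting these characters at registration avoids depending on how a newer Unicode database happens to map code points that the StringPrep tables know nothing about.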

Current work is on the PRECIS framework [2] which uses the metadata for
Unicode code points to determine how to handle them during canonicalization
instead of relying on a hard coded set of mapping tables. There's still a lot
of work to be done, mainly to review that the process works reliably and
doesn't introduce subtle new issues. Peter Saint-Andre (one of the authors of
PRECIS) has just started on a Python tool for testing how a given version of
Unicode is handled by PRECIS
([https://github.com/stpeter/PrecisMaker](https://github.com/stpeter/PrecisMaker)).

[1]
[https://www.ietf.org/rfc/rfc3454.txt](https://www.ietf.org/rfc/rfc3454.txt)

[2] [https://tools.ietf.org/html/draft-ietf-precis-framework-08](https://tools.ietf.org/html/draft-ietf-precis-framework-08)

~~~
drivebyacct2
Great information, even cooler to see someone's got some code working
alongside it, I might have to adapt it to Go if it's fairly reasonable to
understand given my relative lack of experience with Unicode (heh as I'd
mentioned and is probably obvious given your knowledge on hand).

------
gkoberger
Reminds me of the early GMail bug, where you could sign up with "john.smith",
"jo.hn.smi.th" or "johnsmith", but all mail was delivered to the period-less
version:

[http://arstechnica.com/uncategorized/2006/01/6022-2/](http://arstechnica.com/uncategorized/2006/01/6022-2/)

~~~
Maxious
The link agrees - it's a feature not a bug and one that considered security:

> As it turns out, even though Gmail will act as if there is no period in a
> username when delivering mail, it will not permit users to register accounts
> whose only difference to other account is a period. That is, if bob.jones is
> registered, bobjones cannot get an account (he could get bobjones2,
> obviously).

