
Analyzing the Patterns of Numbers in 10M Passwords (2015) - BeautifulData
http://minimaxir.com/2015/02/password-numbers/
======
minimaxir
Huh. Of all my old blog posts, this is the last one I expected to randomly
resurface at the top of Hacker News.

There were a lot of other articles made using this 10M Password dataset at the
time it was originally released, which the dataset author aggregated into a
subreddit
([https://www.reddit.com/r/10millionpasswords/](https://www.reddit.com/r/10millionpasswords/)).
WPEngine, for example, has a much more comprehensive writeup with ad-hoc looks
at specific passwords
([http://wpengine.com/unmasked/](http://wpengine.com/unmasked/)).

~~~
anondon
Off topic, but where do you host your website and what is your tech stack?

Would it be possible to do a blog post about traffic patterns from HN? E.g.,
hits vs. time since post, hits vs. day of post.

~~~
minimaxir
The site is static, hosted on GitHub Pages and generated via Jekyll, backed by
Cloudflare for extra HN-proofing.

As of this comment, there are 150-170 concurrent users on the site, with about
120 of them (~80%) from HN. Although I do have the data, I am hesitant to do a
write up since I would need to correlate traffic to the rank of a submission
on HN, which I do not have in retrospect. (For example, a post at #1 can get
300 concurrent users while this post at #3 only 150. Posts in #20-30 are lucky
to get 50 concurrents. For further reference, note that Reddit posts which hit
the front page of a default like /r/dataisbeautiful can get 1,000
concurrents.)

EDIT: When this post dropped to #4, traffic immediately dropped to 100-110
concurrents.

~~~
anondon
Man, you have to do a post about traffic patterns to your website with
whatever data you have, it's way too interesting. Leave out the rank
correlation part, and share whatever data you have available. Please!

------
JoeAltmaier
The distribution of 1-digit numbers is simple: when sites require a digit,
everybody appends '1' to their usual password. The exponential declining
frequency of subsequent digits is because when passwords 'expire' folks just
add 1. The short lifetime of site usage results in that decline. Just thinking
out loud.

------
markild
Looks like a few of the patterns in his analysis have a tendency towards
Benford's Law[1]

[1]:[https://en.wikipedia.org/wiki/Benford%27s_law](https://en.wikipedia.org/wiki/Benford%27s_law)

~~~
nateberkopec
Technically it describes none of these. Benford's Law only describes
collections of leading digits. The charts in the article are just exponential
distributions.
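
To make the distinction concrete, here is a minimal Python sketch, assuming
nothing beyond the standard library. Benford's Law predicts a share of
log10(1 + 1/d) for each leading digit d, so the comparison only makes sense
against leading digits. The sample values are made up for illustration and
are not taken from the 10M dataset; the function names are hypothetical.

```python
import math
from collections import Counter


def benford_expected():
    """Expected share of each leading digit 1-9 under Benford's Law."""
    return {d: math.log10(1 + 1 / d) for d in range(1, 10)}


def leading_digit_shares(numbers):
    """Observed share of each leading digit among positive integers."""
    leads = [int(str(n)[0]) for n in numbers if n > 0]
    counts = Counter(leads)
    return {d: counts.get(d, 0) / len(leads) for d in range(1, 10)}


expected = benford_expected()
sample = [1, 12, 123, 19, 2, 23, 3, 7, 1987, 69]  # hypothetical values
observed = leading_digit_shares(sample)

for d in range(1, 10):
    print(d, round(expected[d], 3), round(observed[d], 3))
```

A chart of how often each whole number appears is a different quantity and
would need a different comparison.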

~~~
markild
Yeah. Reading a bit more into it, I think you're right.

------
dfc
The problem with this type of analysis is that it treats the 10 million
passwords as if they are representative of all passwords. A more descriptive
title would be:

"Analyzing the Patterns of Numbers in 10 Million passwords that were not
randomly selected from an unknown number of accounts"

One of the first cracking rules in john is to append a "1" to a dictionary word.
"123" is one of the few multidigit strings that john appends in the default
ruleset. Furthermore the first 5 million passwords were used to generate a
character frequency database for cracking the second 5 million.
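
As a rough illustration of why that matters, here is a Python sketch of an
"append a suffix" mangling rule. This is not john's rule syntax or its
implementation; the function name, word list, and suffix list are all
hypothetical. Any password recovered this way ends in one of the suffixes the
cracker tried, so those suffixes end up over-represented in the cracked set.

```python
def append_rule_candidates(words, suffixes=("1", "123")):
    """Yield each dictionary word, then the word with each suffix appended."""
    for word in words:
        yield word
        for suffix in suffixes:
            yield word + suffix


# Hypothetical two-word dictionary, for illustration only.
candidates = list(append_rule_candidates(["password", "dragon"]))
print(candidates)
```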

~~~
minimaxir
The 10M dump was collected from a wide variety of sources to avoid sampling
bias.

~~~
dfc
How did you "avoid" sample bias? How many of the passwords come from databases
that were dumped in cleartext or cracked with 100% success? Meaning every
account on that system was included in cleartext or 100% of the passwords from
a dump were cracked.

The reason I ask is that the dataset you analyzed does not make this claim:

"Now not all of these passwords are plaintext. Many dumps include passwords in
a hashed format that requires you to crack them yourself."
[https://xato.net/a-glimpse-into-the-world-of-internet-
passwo...](https://xato.net/a-glimpse-into-the-world-of-internet-password-
dumps-5ee4609da237)

------
kijin
DataGenetics did a similar analysis with four-digit numbers in leaked
passwords and PINs. The article contains lots of cool visualizations.

[http://www.datagenetics.com/blog/september32012/](http://www.datagenetics.com/blog/september32012/)

~~~
maxerickson
I'm always struck by the uncredited similarities of stuff there to other
sources, like the pin grid, found in this paper published earlier in 2012 than
the blag there:

[https://www.cl.cam.ac.uk/~rja14/Papers/BPA12-FC-
banking_pin_...](https://www.cl.cam.ac.uk/~rja14/Papers/BPA12-FC-
banking_pin_security.pdf)

------
d--b
Notable fact: '69' makes it as '3rd most used combination of 2 numbers in
passwords'.

~~~
AznHisoka
I assume because most users were born in 1969?

~~~
wccrawford
I'm not sure if this comment is deeply sarcastic and insightful on a number of
topics, or just hopelessly naive. I'm leaning towards sarcastic and
insightful, and it's impressive.

------
TorKlingberg
I think brute force password crackers could be made much more efficient by
using machine learning or manually written rules to exploit how people choose
passwords.

Even if you force users to pick a password of at least 8 characters with upper
and lower case letter, numbers and special characters, I suspect the real
entropy is much lower than the theoretical.
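
A back-of-the-envelope Python sketch of that gap, under stated assumptions:
the "theoretical" figure treats all 8 characters as drawn uniformly from the
94 printable ASCII characters, and the "patterned" figure assumes a user picks
one capitalized word from a hypothetical 10,000-word list, then one digit,
then one of 32 symbols. The numbers are illustrative, not measured.

```python
import math


def uniform_entropy_bits(alphabet_size, length):
    """Entropy of a string whose characters are chosen uniformly at random."""
    return length * math.log2(alphabet_size)


def pattern_entropy_bits(choices_per_slot):
    """Entropy when each slot is an independent uniform choice."""
    return sum(math.log2(c) for c in choices_per_slot)


theoretical = uniform_entropy_bits(94, 8)
# Assumed pattern: word from a 10,000-word list + 1 digit + 1 of 32 symbols.
patterned = pattern_entropy_bits([10_000, 10, 32])

print(f"theoretical: {theoretical:.1f} bits, patterned: {patterned:.1f} bits")
```

Under these assumptions the patterned password has well under half the bits
of the theoretical maximum, even though both satisfy the same policy.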

~~~
e12e
There were a couple of talks about this at password^12:

Like:

[http://passwords12.at.ifi.uio.no/Kirsi_Helkala/](http://passwords12.at.ifi.uio.no/Kirsi_Helkala/)

[http://passwords12.at.ifi.uio.no/Markus_Duermuth_Password_Se...](http://passwords12.at.ifi.uio.no/Markus_Duermuth_Password_Security_and_Markov_Models/)

But it's a whole conference about passwords... so not sure if I found the
presentation I had in mind...:

[http://passwords12.at.ifi.uio.no/](http://passwords12.at.ifi.uio.no/)

And btw, registration is now open for password^16 in Germany in December:
[https://passwordscon.org/](https://passwordscon.org/)

------
myfonj
When it comes to visualising the distribution of numbers, I always recall
the Secret Lives of Numbers [0] applet by Golan Levin from 2002. Haven't seen
anything comparable ever since. So pleasant to browse through the data I'm
tempted to try to make the java applet runtime working again now. (At least we
can enjoy some screenshots [1])

[0] [http://www.flong.com/projects/slon/](http://www.flong.com/projects/slon/)
[1]
[https://www.flickr.com/photos/golanlevin/sets/72157594388612...](https://www.flickr.com/photos/golanlevin/sets/72157594388612317/)

------
Coincoin
I'm surprised 69 is third instead of first. I'm even more surprised the author
is surprised it's in the tops.

When I first looked at a password database I actually laughed out loud at how
many 69 there were. I don't know, there is something funny about 'Yaris69' or
'Puppy69', although it's probably used ironically these days.

------
lwander
The peaks at 6 and 8 digits per password are probably due to the fact that
dates can be represented as DDMMYY and DDMMYYYY respectively, rather than
evidence that humans are better at remembering an even number of
digits.
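
One way to check this hypothesis would be to count how many 6- and 8-digit
runs are valid dates. A minimal Python sketch, assuming two-digit years mean
19YY (a simplification) and using a hypothetical function name:

```python
from datetime import date


def looks_like_ddmm_date(digits):
    """Return True if a 6- or 8-digit string is a valid DDMMYY/DDMMYYYY date."""
    if not (digits.isascii() and digits.isdigit()) or len(digits) not in (6, 8):
        return False
    day, month, year = int(digits[:2]), int(digits[2:4]), int(digits[4:])
    if len(digits) == 6:
        # Assumption: two-digit years are read as 19YY for this sketch.
        year += 1900
    try:
        date(year, month, day)  # raises ValueError for impossible dates
    except ValueError:
        return False
    return True


for run in ["010190", "31121999", "123456", "1234"]:
    print(run, looks_like_ddmm_date(run))
```

If the share of valid dates among 6- and 8-digit runs is far above what random
digits would give, that would support the date explanation.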

------
grkvlt
An interesting peak in the '7XX' subset is '786' which is an important number
for Muslims. [1] I also noticed mild peaks at '258' and '852' which are
vertical sequences on a numeric keypad - in the 4-digit PIN dataset there was
a distinct peak at '2580' as well - as well as another at '951' for the
diagonal sequence.

[1] [http://islam.stackexchange.com/questions/799/what-
does-786-m...](http://islam.stackexchange.com/questions/799/what-
does-786-mean)
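
A small Python sketch for flagging keypad-line sequences like the ones above,
assuming a standard phone-style layout (1-2-3 on the top row, 0 under the 8).
The layout table and function name are assumptions for illustration.

```python
# (row, column) of each key on an assumed phone-style numeric keypad.
KEYPAD = {
    "1": (0, 0), "2": (0, 1), "3": (0, 2),
    "4": (1, 0), "5": (1, 1), "6": (1, 2),
    "7": (2, 0), "8": (2, 1), "9": (2, 2),
    "0": (3, 1),
}


def is_straight_keypad_line(digits):
    """True if 3+ digits move across the keypad in one constant step."""
    if len(digits) < 3 or any(d not in KEYPAD for d in digits):
        return False
    points = [KEYPAD[d] for d in digits]
    steps = {
        (r2 - r1, c2 - c1)
        for (r1, c1), (r2, c2) in zip(points, points[1:])
    }
    # One distinct step means a straight line; (0, 0) is a repeated key.
    return len(steps) == 1 and steps != {(0, 0)}


for seq in ["258", "852", "2580", "951", "786"]:
    print(seq, is_straight_keypad_line(seq))
```

Under this check '258', '852', '2580' and '951' count as keypad lines, while
'786' does not, which fits it having a different explanation.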

------
OJFord
There's a comment there [0] asking for more graphs including the distribution
for password managers that randomly generate passwords... erm...

[0]: [http://minimaxir.com/2015/02/password-
numbers/#comment-18765...](http://minimaxir.com/2015/02/password-
numbers/#comment-1876534596)

~~~
blakep
Looks like this guy is doing some serious campaigning for his password
manager, take a look at his previous comments:

[https://disqus.com/by/disqus_OIqfE7dCZb/](https://disqus.com/by/disqus_OIqfE7dCZb/)

------
e12e
Reminds me about the tidbit about "strong password" rules, like one each of
small letter, capital letter, digit or symbol. Like: "Password2016". Really
strong. It's even longer than 8 letters.

~~~
HeyLaughingBoy
"E12e likes 2016" is probably even stronger and easier to remember

~~~
e12e
The point is that "Password2016" will often score as "strong" (enough) while
it really isn't.
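
A minimal Python sketch of the kind of composition check being criticized.
The rule set here (minimum length, one lowercase, one uppercase, one digit) is
a hypothetical example, not any particular site's policy.

```python
def passes_composition_rules(password, min_length=8):
    """Naive policy: long enough, with a lowercase, an uppercase, and a digit."""
    return (
        len(password) >= min_length
        and any(c.islower() for c in password)
        and any(c.isupper() for c in password)
        and any(c.isdigit() for c in password)
    )


for candidate in ["Password2016", "password"]:
    print(candidate, passes_composition_rules(candidate))
```

"Password2016" passes because the check only looks at character classes and
length, not at how guessable the word-plus-year pattern is.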

------
social_quotient
Slightly off topic, what tool/lib did you use to make the charts?

~~~
minimaxir
All charts in this post were made using R/ggplot2. (The code was not open
sourced in this case because the code for this post is a mess. I have revised
my process since)

