
PassGAN: A Deep Learning Approach for Password Guessing - rbanffy
https://arxiv.org/abs/1709.00440
======
tbiehn
I'm a bit confused as to why they cite
[https://www.usenix.org/conference/usenixsecurity16/technical...](https://www.usenix.org/conference/usenixsecurity16/technical-
sessions/presentation/melicher) but then don't evaluate their performance
relative to it. The selection of JTR and HashCat rules for the evaluation is
troubling: 'Best64' isn't the way a skilled attacker uses those tools. It would
be cool to see these teams participating in a cracking event such as
[http://contest.korelogic.com/](http://contest.korelogic.com/), or tasked
against an un-recovered corpus such as [https://hashes.org/](https://hashes.org/).

Of course, maybe a more accurate view is that the paper isn't actually
seeking to advance the state of the art in password cracking, and has other
motivations.

------
matrix2596
PaGAN would have been a better name

~~~
IncRnd
Pagan is the name for a password generator.

------
dark_silicon
This makes me wonder: is it possible to iterate through the generated outputs
of a GAN rather than generate random outputs?

~~~
kdoherty
I'm assuming by iterate you mean "generate outputs of a GAN which preserve
some order that you can iterate through."

The answer is yes, but not in the way that one might think. Despite the fact
that we seed the GAN with input noise, there is no guarantee that the GAN
makes use of this at all. This is a theme with GANs: we often want to imbue
them with prior knowledge that _we_ think is important, but is easily ignored
by the GAN. In this case, we want to generate samples from p(x|z), where x is
in the space of our data (often images, in this case passwords), but provided
it gets good results according to the loss function, your GAN may learn
p(x|z)=p(x). This is fine if you don't care about enforcing some relationship
between input and generated samples, but here we do.

One solution is to use InfoGAN
([https://arxiv.org/abs/1606.03657](https://arxiv.org/abs/1606.03657)), which
adds a term to the loss function that rewards high mutual information between
a latent code and the generator output. Your latent code might be drawn
from a uniform distribution on [-1,1], and the generator output will be
conditioned on this code. This being continuous, it's questionable what
"iterate" might mean. On a computer, maybe you iterate through every possible
float (as someone mentioned), but if you want to generate N different samples,
you could also discretize this distribution to N values on the given interval,
each with probability 1/N and sample from this PMF.
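
A minimal sketch of that discretization, assuming a one-dimensional latent code on [-1, 1]. `toy_generator` is a hypothetical stand-in for a trained generator and is not part of InfoGAN or any library; a real generator would map the code (plus noise) to a password.

```python
import random


def latent_grid(n, low=-1.0, high=1.0):
    """Return n evenly spaced latent codes covering [low, high]."""
    if n < 1:
        raise ValueError("n must be at least 1")
    if n == 1:
        return [(low + high) / 2.0]
    step = (high - low) / (n - 1)
    return [low + i * step for i in range(n)]


def toy_generator(code):
    """Hypothetical placeholder for a trained generator G(code)."""
    return "sample-%03d" % round((code + 1.0) * 50)


def iterate_outputs(generator, grid):
    """Walk the grid in order, yielding one output per latent code."""
    for code in grid:
        yield generator(code)


def sample_output(generator, grid, rng=random):
    """Draw one code from the uniform PMF (probability 1/N each)."""
    return generator(rng.choice(grid))


grid = latent_grid(5)
outputs = list(iterate_outputs(toy_generator, grid))
```

Walking the grid gives a fixed, repeatable order; sampling from it gives the PMF version. Whether neighbouring codes yield distinct outputs depends entirely on the trained generator.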

------
nicklovescode
One interesting direction would be to see if you could use various parts of an
email as input features.

You could likely probabilistically guess gender or age from many addresses,
and I'm sure domain (yahoo, gmail) changes the distribution as well.

~~~
dithering
I've used email addresses as source dictionaries. Turns out people see both
choosing an email address and choosing a password as a kind of personality
test, meaning there's a lot of overlap.
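
A rough sketch of turning addresses into dictionary entries, assuming the useful part is the text before the `@` and that splitting it into letter runs and digit runs is good enough. The function names are made up for illustration.

```python
import re


def email_tokens(address):
    """Split the local part of an address into lowercase letter and digit runs."""
    local_part = address.split("@", 1)[0].lower()
    return re.findall(r"[a-z]+|[0-9]+", local_part)


def build_dictionary(addresses):
    """Collect tokens from many addresses, de-duplicated in first-seen order."""
    seen = {}
    for address in addresses:
        for token in email_tokens(address):
            seen.setdefault(token, None)
    return list(seen)
```

For example, `email_tokens("Jane.Doe_1984@example.com")` yields the tokens `jane`, `doe` and `1984`.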

------
yosito
I wonder if this type of approach could pick up on patterns in "random"
password generators.

~~~
Paul-ish
I don't think you can do better than brute force search for cryptographic
randomness, but it would be interesting to see if this does well on personal
heuristics people use. Eg the XKCD approach.

------
febin
Here is a simplified version of the same [https://medium.com/@heyfebin/how-
artificial-intelligence-can...](https://medium.com/@heyfebin/how-artificial-
intelligence-can-be-used-for-password-guessing-cf4fd4184a46)

------
dark_silicon
Is there any sort of metaheuristic a good password cracker can follow for
juggling between rounds of dictionary iteration, character substitution, brute
force suffixing, etc?

In other words, how do you most effectively combine different strategies?

~~~
dsfyu404ed
Target someone. Wget their LinkedIn and their friends' LinkedIn pages. Wget the
Wikipedia pages for their hobbies and skills on said LinkedIn pages.

Remove duplicate words and code.

Repeat with other sources of information (FB profiles, etc, etc).

Remove duplicate words.

Add it to your dictionary and then use that to seed the generated dictionary.

Password cracking is always a compromise between speed and thoroughness.
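
The "remove code, remove duplicate words" steps can be sketched roughly as follows, assuming the pages are already saved as HTML strings and that "code" means markup plus script/style contents. The helper names are hypothetical; only the standard library is used.

```python
import re
from html.parser import HTMLParser


class _TextExtractor(HTMLParser):
    """Collect text nodes, skipping <script> and <style> contents."""

    def __init__(self):
        super().__init__()
        self.chunks = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth:
            self.chunks.append(data)


def words_from_html(page):
    """Return the alphanumeric words found in the visible text of a page."""
    parser = _TextExtractor()
    parser.feed(page)
    parser.close()
    return re.findall(r"[A-Za-z0-9]+", " ".join(parser.chunks))


def merge_wordlist(pages):
    """Merge words from several pages, lowercased, duplicates removed."""
    seen = {}
    for page in pages:
        for word in words_from_html(page):
            seen.setdefault(word.lower(), None)
    return list(seen)
```

Keeping first-seen order means words from the first sources you feed in end up at the front of the list.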

~~~
zeveb
That's why I love passwords like PNe1KaC5gZGF5hlonE1k7g: they're just
completely unguessable.

I don't have a guessable password on any remote account I have. A remote
attacker simply cannot guess passwords; he'd have to use some other method,
e.g. taking over my phone number or email account.

~~~
htrp
How do you remember something like that across all of your accounts? (Or do
you just use LastPass?)

~~~
pwg
One does not try to "remember" those. One uses a password manager, it does the
remembering, and you only have to remember the one long password that unlocks
the manager.

~~~
enzanki_ars
There is a chicken and egg problem. How does one make a secure password
manager password...? There still is the need for users to learn how to
generate secure but memorizable passwords.

~~~
pwg
With one big exception. There is only one (not 30 or 50 or 100) to have to
remember (the master one) and it is one that someone will have to enter
repeatedly (until they decide to change it) so they will eventually have
enough practice to actually remember it.

------
visarga
The moral of the story is that we will have to check our passwords against a
password GAN before using them, even if they are unique.

~~~
zokier
A better moral is that humans are bad at generating randomness; use a CSPRNG.
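
For illustration, a minimal sketch using Python's standard `secrets` module, which draws from the operating system's CSPRNG. The function names and default lengths are arbitrary choices, not a recommendation from the thread.

```python
import secrets
import string


def random_token(num_bytes=16):
    """URL-safe token built from num_bytes of OS-provided randomness."""
    return secrets.token_urlsafe(num_bytes)


def random_password(length=20, alphabet=string.ascii_letters + string.digits):
    """Pick each character independently and uniformly from alphabet."""
    return "".join(secrets.choice(alphabet) for _ in range(length))
```

With the default of 16 bytes (128 bits), `random_token()` returns a 22-character string.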

------
Kequc
I suppose this is a good place to ask this question.

How much truth is there to the fairly famous XKCD comic "correct horse battery
staple" in this scenario? Isn't it possible that random word combinations, if
they were common in passwords, could be guessed relatively early in a password
hash brute-force attack?

Especially considering we are talking about apparently billions of attempts
per second.

Second question: is it true that placing restrictions on the password, such as
that it must include a capital letter, a special character, and be 8
characters long, also reduces the time it takes to crack because the algorithm
can simply dump a huge amount of possible answers?

~~~
rbanffy
> correct horse battery staple

Assuming a vocabulary of 32K words, it's 15 bits per word for a total of 60
bits for the four word password. Letters plus punctuation will give you about
60 symbols or 6 bits per position. At 60 bits, the four word password is as
good as the 10 characters you generated by smashing your elbows on the
keyboard and, hopefully, easier to remember.
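
The arithmetic can be checked with a short sketch, assuming every word or symbol is picked uniformly and independently. Note that 15 bits per word corresponds to a list of 2^15 = 32,768 words; a 16,384-word list gives 14 bits per word. The helper name is made up for illustration.

```python
import math


def uniform_entropy_bits(choices, picks):
    """Bits of entropy for `picks` independent uniform draws from `choices` options."""
    return picks * math.log2(choices)


four_words_32k = uniform_entropy_bits(2 ** 15, 4)  # 15 bits per word
four_words_16k = uniform_entropy_bits(2 ** 14, 4)  # 14 bits per word
ten_symbols = uniform_entropy_bits(60, 10)         # about 5.9 bits per symbol
```

Under these assumptions, four words from a 32,768-word list and ten characters from a 60-symbol alphabet land within about one bit of each other.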

> Also reduces the time it takes to crack because the algorithm can simply
> dump a huge amount of possible answers?

Yes. Knowing the password rules limits the space that'd need to be
bruteforced.

~~~
FabHK
zxcvbn [1], the best simple password strength estimator I'm aware of, gives
"correct horse battery staple" around 62 bits, and "Tr0ub4dour&3" around 30
bits (cracked in a day). ("ILoveTacoAndBurgersWhatever1984", suggested below,
53 bits).

> Yes. Knowing the password rules limits the space that'd need to be
> bruteforced.

Yes, but not that much, really:

1\. Giving away the length of your password doesn't help the attacker much.
For realistic scenarios, testing all passwords with length < N takes less than
2% of the time of testing all passwords with length N.

(The proportion of passwords with length < N to passwords with length N is
approximately 1/M, where M is the number of distinct symbols (here about 60).
Exactly it's (q-q^N)/(1-q), I think, where q=1/M.) So, even if you use only
numbers, telling the attacker the length of the password gives them only a 10%
edge.

2\. Knowing that a 10 letter password contains at least one number excludes
about 1/6 of passwords ((50/60)^10). So, that's less than one bit. Similarly
with special characters etc.

TL;DR: Telling an adversary the length of your password doesn't really help
them. Telling them password rules (contains a number, etc.) helps them more,
but adding just one more character to your password increases the difficulty
more than knowing the password rules decreases it.
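
A small sketch to check these numbers, assuming M equally likely symbols per position. The function names are invented for illustration; the closed form is the (q-q^N)/(1-q) expression from point 1.

```python
import math


def shorter_to_exact_ratio(symbols, length):
    """(# passwords with 1 <= len < length) / (# passwords with len == length)."""
    shorter = sum(symbols ** k for k in range(1, length))
    return shorter / symbols ** length


def closed_form_ratio(symbols, length):
    """Same ratio via the geometric series (q - q^N) / (1 - q), with q = 1/M."""
    q = 1.0 / symbols
    return (q - q ** length) / (1.0 - q)


def fraction_without_digit(symbols, digits, length):
    """Share of length-N strings that contain no digit at all."""
    return ((symbols - digits) / symbols) ** length


def bits_saved_by_rule(excluded_fraction):
    """Entropy an attacker saves when a rule excludes this share of candidates."""
    return -math.log2(1.0 - excluded_fraction)
```

With 60 symbols and length 10, the shorter passwords add under 2% to the search, and the "at least one digit" rule saves the attacker well under one bit.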

[1] [https://blogs.dropbox.com/tech/2012/04/zxcvbn-realistic-
pass...](https://blogs.dropbox.com/tech/2012/04/zxcvbn-realistic-password-
strength-estimation/)

[https://www.bennish.net/password-strength-
checker/](https://www.bennish.net/password-strength-checker/)

~~~
lucb1e
> the best simple password strength estimator

What one is really trying to estimate with a "strength estimator" is how much
entropy an attacker has to work through to crack the password when the
generation method is known (Kerckhoffs's principle, sort of). So what one
really needs to look at is the generation method, not the resulting password.

------
user5994461
There is a similar technique using Markov chains. The authors don't seem to be
aware of it.
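
For reference, a toy sketch of the Markov idea: a first-order character model that learns transition counts from known passwords and scores candidates by probability. This is an illustrative simplification with made-up function names, not the method of any particular tool or paper; real crackers use higher orders and smoothing.

```python
import math
from collections import Counter, defaultdict

# Multi-character sentinels cannot collide with single password characters.
START, END = "<s>", "</s>"


def train(passwords):
    """Count character-to-character transitions, including start and end."""
    counts = defaultdict(Counter)
    for password in passwords:
        prev = START
        for ch in password:
            counts[prev][ch] += 1
            prev = ch
        counts[prev][END] += 1
    return counts


def log_prob(counts, candidate):
    """Log-probability of candidate, or -inf if a transition was never seen."""
    total = 0.0
    prev = START
    for ch in list(candidate) + [END]:
        seen = counts.get(prev)
        if not seen or seen[ch] == 0:
            return float("-inf")
        total += math.log(seen[ch] / sum(seen.values()))
        prev = ch
    return total


def rank(counts, candidates):
    """Order candidates from most to least likely under the model."""
    return sorted(candidates, key=lambda c: log_prob(counts, c), reverse=True)
```

Trained on `["abc", "abd", "abc"]`, the model ranks `abc` ahead of `abd` and gives anything with an unseen transition a probability of zero.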

~~~
dithering
Has anyone proven that the recent spate of "apply neural networks to a text
corpus" projects produce output any "better" than a markov chain with the same
inputs would?

~~~
bitL
In speech recognition they fare an additional 10-15% better than Markov models
and are state of the art.

~~~
lucb1e
Isn't that fundamentally different, though?

With password guessing, one wants to generate as many candidates as possible,
in order from most likely to least likely. If the generation method takes more
than a microsecond per guess, it might already be faster to just go the brute
force way.

With speech recognition one is also time-constrained, but much less so. If it
takes 0.2 seconds to come up with the best match, the user is barely done
pronouncing the next word.

Neural stuff is slower than Markov chains if I'm not mistaken, but while for
speech recognition that might work great, it could be fatal for password
guessing.

------
Oras
It would be interesting to create sets from auto-generated passwords that
these tools provide (like LastPass, 1Password, KeePass) and browser-generated
passwords.

~~~
FabHK
I don't think so: as they're generated randomly, there's no structure you can
"deep learn". (You might recover password rules that are imposed by the
password manager, such as at least one number, one capital letter, etc. But
that doesn't give you that much).

The point is that humans are quite bad at generating "random" strings, and you
can extract regularities there.

~~~
Oras
Your argument is based on the context that every single user is writing
his/her own password rather than using generators. How do you know that all
users are writing their own passwords? I was suggesting creating a dataset
using these tools not reverse engineering their algorithm.

~~~
roywiggins
Outputs from those are just noise, and completely unguessable at a rate better
than chance. You can't use a GAN to find structure where there wasn't any to
begin with.

