
Generate memorable passwords using Markov Chains, Huffman trees, and Dickens - j2kun
http://www.brinckerhoff.org/molis-hai/pwgen-3.html
======
nullc
The "secure unless" list needs to include:

* Author of page is malicious.

* Operator of server is malicious.

* Anyone with access to the network between you and the server is malicious. (no HTTPS)

* Google or anyone who has compromised their infrastructure is malicious (has embedded google js on the page).

* Likewise for 'bootstrapcdn', 'jquery.com', or cloudflare.

OTOH, it uses window.crypto and doesn't silently fall back to insecure
randomness like 95% of JS crypto that I've looked at...

:)

~~~
jbclements
Um... certificate pinning?

Fortunately, my last caveat covers a bunch of these.

Also, I think you(I) forgot:

* You are in fact hallucinating at this very moment.

~~~
nullc
:)

Cert pinning does nothing when the page is plain http as it is here.

(and HTTPS's helpfulness is limited without HSTS; were it HTTPS I'd have a
couple of other assumptions, like "none of the hundreds of CAs are malicious"
and "no attacker is able to MITM between the site and any CA that does
domain validation.")

I didn't run off that list to criticize the site. The last bullet is indeed
good. :)

------
exratione
Here is one of my attempts at a memorable passphrase generator, using Moby
Dick as source material. It is interesting to see just how distinctive some
works and authors are when you are doing this sort of thing:

[https://github.com/exratione/four-word-phrase](https://github.com/exratione/four-word-phrase)

clenched bakers detestable evincing

keenly eastern expertness presto

topsails existence self-created fittings

delusion brushed lockers discipline

seaman vintages cross-bones uncouth

furnished vanity copied invited

sea-captains cross-running engendered disastrous

freshet willingly lead-lined asserted

remarking repentance insert compendious

vice-bench equally imminent quadrupeds

------
eridius
Very cute. But all of the presented passwords I got had lots of punctuation
strewn about them (most commonly the " character), and the "words" were
extreme nonsense. For example, I just generated a new set and I'm getting
passphrases like

    
    
      firs." Neven," drifty-T
    

and

    
    
      is!" Mr. Look fore," oble)
    

Perhaps the best password to date is still complete gobbledygook:

    
    
      lon tunittory lippile
    

This has the benefit of at least being pronounceable, but it bears no
resemblance to actual words.

All that said, I wonder how it would do with different source material, or
source material with all punctuation deleted.

~~~
jbclements
Things will definitely get a bit longer if you make the source more uniform. I
sympathize with you on the quotation marks, though; I think it might be nice
to toss just those. Lemme try that.

~~~
jbclements
I tried it, and ... eliminating the double-quotes looks like a win. Average
password lengths in the order-2 model don't increase measurably; it rounds to
20.0 chars for both models.

I've now deployed this.

Um ... I mean... what quotations? I don't see any quotation marks!

(Many thanks.)

~~~
eridius
It definitely looks better. I'm curious what the results would be if you
eliminated all punctuation, but removing the quotes alone is a marked
improvement.

------
bumbledraven
The [http://www.diceware.com](http://www.diceware.com) passphrase-generation
algorithm is more straightforward to implement and analyze. It uses only whole
words, so the result is easier to memorize.

~~~
coyotebush
Diceware is great, though the paper fairly points out that algorithms like
that and [https://xkcd.com/936/](https://xkcd.com/936/) necessarily often
choose very uncommon words.

I suppose the Markov model could just as well use whole words (as done in all
sorts of other scenarios), at the expense of substantially longer passwords.

~~~
jbclements
I took a look at whole words. Things got a _lot_ longer. Like, twice as long.
Here, let me generate some. These are 56 bits each:

    '(#" that looked up to Darnay: you not?\" \"Very willingly,\" pointing"
      #" that it was confirmed. \"He may not--thou wouldst rescue this is touched the child"
      #" that direction through his, and said Stryver, \"that, there! I"
      #" that criminal in Lombard-street, out at neighbouring streets that she"
      #" that he,\" said Darnay. Released yesterday. I.\" It was set, and--in a highly"
      #" that although Sydney Carton.\" This must have to finish that way"
      #" that Madame Defarge's wine-rotted fragments of his. \"Carton, idlest and its accessories"
      #" that every rapid movement, and, as the executioner showed a high grass and")

Really, I would not like to type in "that every movement, and, as the
executioner showed a high grass and" every time I opened my laptop.

------
tempestn
This looks like a fun project, although whenever something like this comes up
I have to wonder if providing ways to make passwords more memorable is in fact
a beneficial activity, since ideally people won't be remembering more than one
password anyway (the master password of their manager). Of course there are
exceptions, like login/unlock codes for PC and phone, but often in those
exception cases a long string of characters isn't practical anyway. The cases
where it is would normally be very few.

That said, I don't think I really believe my own argument. Anything that gets
people thinking about better ways to handle their passwords is probably a good
thing. Once the ball is rolling, they might even land on a password manager.
Perhaps even through this comment!

~~~
ketralnis
I use a password manager, but I still use xkcd936 passwords for things that I
have to type in often. Like the password to unlock my phone or computer, or
passwords to mobile banking apps that I check often.

Copy-pasting out of the password manager on a phone is a multi-step process,
most of which is all of iOS's new animation delays.

There's a world for both of these things, and xkcd936 passwords are very easy
to generate:

    
    
        $ cat $(which xkcd936)
        #!/bin/sh
        cat /usr/share/dict/words | unsort | head -n 40 | xargs -n4 echo

~~~
garrettr_
Note that `unsort` uses the Mersenne Twister for its PRNG, which is _not_
cryptographically secure [0]. However, cracking Mersenne does require a
significant amount of random bytes generated from the same seed, and a quick
glance at the source reveals that it initializes the seed from /dev/urandom
where available (unsort.c:169), which is good and probably obviates practical
attacks.

[0]
[https://jazzy.id.au/2010/09/22/cracking_random_number_genera...](https://jazzy.id.au/2010/09/22/cracking_random_number_generators_part_3.html)

~~~
ketralnis
What attack are you envisioning on me running that by hand and pasting one of
those passwords into an online service?

------
616c
This is very cool. And now even more Racket on HN's front page. I will look at
the code later.

For those interested, variable-order Markov chaining is different from
traditional Markov models. I studied the latter in computational linguistics
courses; they are a rite of passage in statistical models for CL/NLP work. I
was wondering how it was a good idea to model passwords, even pseudorandomly,
with Markov chains, but VO Markov chains are somewhat different.

For the uninitiated, like myself.

[https://en.wikipedia.org/wiki/Variable-order_Markov_model](https://en.wikipedia.org/wiki/Variable-order_Markov_model)

~~~
jbclements
Variable-order Markov models do sound interesting; this tool isn't using them,
though....

~~~
616c
Is that not what order-3 Markov model means? I mean, I will not lie. I skimmed
the article and searched DDG/Wikipedia for "order 3 markov" and that article
is what I got.

Granted, my statistical comp ling course was ages ago. Perhaps it is time to
crack open that book, but neither the article nor the Wikipedia article made
it clear to me whether order-3 and variable-order models are different, or
which one is implemented.

------
EGreg
The default "choose a password" interface on our platform asks the user to set
up a Pass Phrase (see
[http://blog.codinghorror.com/passwords-vs-pass-phrases/](http://blog.codinghorror.com/passwords-vs-pass-phrases/))
and also scrapes Yahoo News for three consecutive words once in a while, to
provide suggestions to inspire people. For example "and the horse" or "the car
was" are suggestions.

Do you see any obvious flaws in that? After the fact it's hard to scrape ALL
of Yahoo news.

~~~
garrettr_
Basing passphrases off of a (since you've just mentioned it on Hacker News)
well-known public source is usually a bad idea. Why not just pick random words
(using a CSPRNG like /dev/urandom) from a dictionary? This is the "Diceware"
approach, which is recommended because it is simple to implement and has
straightforward provable security guarantees.
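
A minimal sketch of that approach in Python, for illustration only. It assumes
you already have a word list in memory and uses the standard library's
`secrets` module (backed by the OS CSPRNG) in place of physical dice. The names
`random_passphrase` and `passphrase_bits` are made up for this example.

```python
import math
import secrets


def random_passphrase(words, count=4):
    """Join `count` words, each picked uniformly and independently via the OS CSPRNG."""
    return " ".join(secrets.choice(words) for _ in range(count))


def passphrase_bits(words, count=4):
    """Entropy in bits, assuming `words` has no duplicates and picks are uniform."""
    return count * math.log2(len(words))


# Toy list for demonstration only; a real list would have thousands of entries
# (the Diceware list has 6**5 = 7776 words).
demo_words = ["clenched", "bakers", "keenly", "eastern",
              "topsails", "lockers", "seaman", "vanity"]
print(random_passphrase(demo_words))
print(passphrase_bits(demo_words))  # 4 * log2(8) = 12.0
```

With the 7776-word Diceware list each word contributes log2(7776), roughly
12.9 bits, so the security argument is just multiplication.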

~~~
EGreg
We do have a mode where the software generates random phrases like "the [noun]
[verb] the [noun]" generating about a million possible combinations (100 x 100
x 100)

But the space of three consecutive words in Yahoo News is even larger, and
frankly, most passwords are chosen from a small space anyway. And finally, the
examples are just meant to inspire, so people don't select "password" as their
password.

------
ternaryoperator
Could someone explain how these passwords are secure? Given that some phrases
are all lower-case letters with a single punctuation character, aren't they
vulnerable to brute-force attacks? My understanding has been that stronger
passwords use a wide range of character options: lower-case, upper-case,
punctuation, special chars. So, #RedBum72! would be more secure than
abigredbum despite each having 10 chars.

Is it simply the length of these passwords that makes them secure?

~~~
dietrichepp
I think you have a misunderstanding of the mathematics behind password
security.

If I choose lower-case letters only, then the password can have about 4.7 bits
of entropy per letter, so a 16-character password would have about 75 bits,
maximum.

If I choose from the entire set of non-control ASCII characters, I get about
6.6 bits of entropy per letter, which means that I can get 78 bits of entropy
with only 12 characters.

The difficulty of a brute force attack is proportional to the number of
possible passwords. So, you can make brute force attacks harder by using
longer passwords, or you can use a wider range of characters. Strictly
speaking, there is no need to do both, since either option gives you the exact
same benefit.
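
The arithmetic above is easy to check. A small Python sketch, assuming every
character is chosen uniformly and independently (the helper name
`max_entropy_bits` is just for illustration; 95 is the number of printable
ASCII characters, space included):

```python
import math


def max_entropy_bits(alphabet_size, length):
    """Upper bound on entropy when each of `length` symbols is uniform over the alphabet."""
    return length * math.log2(alphabet_size)


print(round(math.log2(26), 2))              # 4.7  bits per lower-case letter
print(round(max_entropy_bits(26, 16), 1))   # 75.2 bits for 16 lower-case letters
print(round(math.log2(95), 2))              # 6.57 bits per printable ASCII character
print(round(max_entropy_bits(95, 12), 1))   # 78.8 bits for 12 printable ASCII characters
```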

The technique demonstrated in the article is significantly more complex,
because the letters in the passwords are not chosen uniformly, but they are
chosen to form words that fit the surrounding context according to a certain
kind of statistical model (a Markov chain). This makes it harder to calculate
how many bits of entropy are in the password, but the bijection shown in the
article makes it much easier. When you have a bijection, you are neither
creating nor destroying entropy, so we know that passwords selected uniformly
from the right column (such as "buill to Sound thand exp") have the exact same
entropy as passwords selected uniformly from the left column (such as
"0fe5c363d354d7").

~~~
ternaryoperator
Thank you. That's very clear. Much appreciated!

------
VanillaCafe
Okay, but for 56 bits, you can also just have 3 quadruples of a-z, such as
"envh zmds vrdq" (dropping the spaces for the actual password). I'm not sure
random, punctuated text is so much easier to commit to memory.

------
jbclements
Seriously. Has someone already done this? Next gen peer review, anyone?

~~~
espadrine
It's weird, but I actually had a very similar idea a few years back! I
struggled implementing it and left it half done. I was trying to work directly
with the letter frequencies; using a Huffman tree is much nicer. Instead, I
tweaked it to generate random Sherlock Holmes paragraphs. It gave the best
results with Markov state strings of size equal to the average word size.

It's great to see that it gives a really nice output. Also, your paper is a
breeze to read.

I ended up making this[0], which is a Diceware-like with more words, and I
just added substitutions to see what it would look like.

At entropy 75, here are five from Molis Hai:

look had can's asing his, nativa "Hemmer,'" Charactiour roopiaties more folks
and conful never succepth myriage; by! Let taughter, any mind shu goost out
uponsibing of?" "Your," rem

And five passphrases with 3 substitutions:

ov&rload grXas<proof investing Men,ts peartrees Yamages automata
adVredGstudy/ng chloroformed exh>linYFassessable gGla rumbOs draughtBmanship

It is hard to determine which is easiest to remember, however. We should
perform tests; maybe ask a large number of users to remember one passphrase,
and send them a mail three days later asking them to input it. If that sounds
like fun to you, I can help.

[0]:
[https://github.com/espadrine/passphrase](https://github.com/espadrine/passphrase)

------
dminor
I did something similar using Markov chains and a dictionary - you can see it
at [http://password.supply](http://password.supply)

~~~
jbclements
cool... can you guarantee that every password is generated with equal
probability?

~~~
dminor
I guarantee nothing, other than they look somewhat like English words :)

------
chm
I think it should read "row" not "column":)

------
ryan-c
Is it possible to get the source bits back from the password?

~~~
jbclements
Yes, absolutely. 100% invertible.
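
To make the invertibility concrete, here is a rough Python sketch of the
general idea (random bits steer a walk through per-state Huffman trees). It is
not the tool's actual Racket code. The simplifications are assumptions of this
sketch: a tiny corpus, an order-2 character model, a fixed start state, and
zero-padding when the bits run out in the middle of a choice. All function
names are hypothetical.

```python
import heapq
import secrets
from collections import Counter, defaultdict
from itertools import count

ORDER = 2  # characters of context per Markov state


def huffman_tree(freqs):
    """Return a Huffman tree: a leaf is a 1-char string, an internal node a (zero, one) pair."""
    tiebreak = count()  # unique counter so the heap never has to compare tree nodes
    heap = [(weight, next(tiebreak), ch) for ch, weight in sorted(freqs.items())]
    heapq.heapify(heap)
    while len(heap) > 1:
        w1, _, a = heapq.heappop(heap)
        w2, _, b = heapq.heappop(heap)
        heapq.heappush(heap, (w1 + w2, next(tiebreak), (a, b)))
    return heap[0][2]


def build_model(text):
    """Map each ORDER-character state to a Huffman tree over the characters that follow it."""
    wrapped = text + text[:ORDER]  # wrap around so every state has a successor
    followers = defaultdict(Counter)
    for i in range(len(text)):
        followers[wrapped[i:i + ORDER]][wrapped[i + ORDER]] += 1
    model = {state: huffman_tree(c) for state, c in followers.items()}
    if not any(isinstance(tree, tuple) for tree in model.values()):
        raise ValueError("corpus never branches, so it cannot absorb any bits")
    return model


def bits_to_text(model, bits, start):
    """Spend the bits as left/right choices in each state's tree, emitting one char per leaf."""
    out, state, i = [], start, 0
    while i < len(bits):
        node = model[state]
        while isinstance(node, tuple):
            bit = bits[i] if i < len(bits) else 0  # pad a final partial choice with zeros
            node = node[bit]
            i += 1
        out.append(node)
        state = (state + node)[-ORDER:]
    return "".join(out)


def path_to(node, ch, prefix=()):
    """Return the list of bits leading from `node` to the leaf `ch`, or None if absent."""
    if not isinstance(node, tuple):
        return list(prefix) if node == ch else None
    for bit in (0, 1):
        found = path_to(node[bit], ch, prefix + (bit,))
        if found is not None:
            return found
    return None


def text_to_bits(model, text, start):
    """Invert bits_to_text by replaying the same states and recording each leaf's path."""
    bits, state = [], start
    for ch in text:
        bits.extend(path_to(model[state], ch))
        state = (state + ch)[-ORDER:]
    return bits


CORPUS = ("it was the best of times, it was the worst of times, "
          "it was the age of wisdom, it was the age of foolishness, ")
model = build_model(CORPUS)
bits = [secrets.randbits(1) for _ in range(56)]
password = bits_to_text(model, bits, start="it")
recovered = text_to_bits(model, password, start="it")
assert recovered[:len(bits)] == bits  # anything after that is zero padding
print(password)
```

Because each state's tree is a prefix code, two different 56-bit inputs can
never produce the same text, which is what lets you count entropy on the bit
side. Frequent followers sit near the root and cost fewer bits, which is why
the output tends to resemble the source text.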

------
SixSigma
I would still rather gamble on using:

It was the best of times.

Or

Please sir can I have some more?

------
Vektorweg
_Her watches, why him, fog_

A nice description why crypto.

