
Show HN: WordPass – password generator giving over 90 bits of entropy - miga
http://hackage.haskell.org/package/wordpass
======
euank
We need to just get away from remembering passwords. Something like Mozilla
Persona would have been great, but since it didn't catch on, we at least have
KeePass (and for Mac/Linux, the KeePassX 2 alpha works quite well).

These cutesy little passwords might be fairly strong and easy to remember, but
it hardly matters because having to remember them results in reuse and an
upper bound on the possible entropy. The human mind has more important things
to do than remembering thousands of bits of entropy (spread among all your
passwords, you should have at least that much)... we made computers to
remember this sort of stuff for us.

If these passwords aren't meant to be remembered, but to be put in KeePass,
then you may as well up the entropy by randomizing at the character
granularity.

KeePass makes it much easier: you have it generate 300-bit passwords (or
whatever you feel is enough), and then you can focus all your security efforts
on that one database file. Having a "single point of failure/password" doesn't
have to be bad, because it lets you guard that single point all the more
closely.
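For the character-granularity approach, here is a minimal sketch in Python using the standard `secrets` module (the 300-bit target is the figure mentioned above; the alphabet is my own choice, not anything KeePass-specific):

```python
import math
import secrets
import string

# 94 printable ASCII characters; any alphabet works, this is just an example.
ALPHABET = string.ascii_letters + string.digits + string.punctuation

def char_password(bits):
    """Build a password with at least `bits` bits of entropy by picking
    characters uniformly at random from ALPHABET."""
    per_char = math.log2(len(ALPHABET))  # ~6.55 bits per character
    length = math.ceil(bits / per_char)
    return ''.join(secrets.choice(ALPHABET) for _ in range(length))

print(char_password(300))  # a 46-character string
```

At roughly 6.55 bits per character, 300 bits costs only 46 characters, which is trivial for a password manager to store.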

------
dasil003
I don't understand how 4 random words from a 50k dictionary gives 90 bits of
entropy.

    
    
        50k ^ 4 => 6.25e+18
        2 ^ 90  => 1.24e+27
    

Am I missing something?
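The arithmetic indeed doesn't support 90 bits for 4 words from 50k entries: entropy is just log2 of the search-space size. A quick check in Python (the wordpass figure presumably comes from using more words and/or a larger dictionary; I haven't checked its source):

```python
import math

def entropy_bits(dict_size, n_words):
    """Entropy of n_words words drawn uniformly from a dictionary."""
    return n_words * math.log2(dict_size)

print(entropy_bits(50_000, 4))  # ~62.4 bits, well short of 90
print(entropy_bits(50_000, 6))  # ~93.7 bits, clears the bar
# Hitting 90 bits with only 4 words needs a dictionary of 2**(90/4),
# i.e. about 5.9 million entries.
```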

~~~
atmosx
Can you offer a link explaining how these measurements are done and what
exactly they mean?

Reading the Wikipedia article on entropy (computing), I stumbled across
this[1]:

" _Around 2011, two of the random devices were dropped and linked into a
single source as it could produce hundreds of megabytes per second of high
quality random data on an average system._ "

What does _high quality random data_ mean in this context? I mean, 'random'
has a precise definition: something either 'is random' (e.g. the distance to
the next prime :-P, kill Riemann) or 'is not random' (the distance to the next
prime if Riemann's ζ(s) is solved).

[1]
[https://en.wikipedia.org/wiki/Entropy_(computing)](https://en.wikipedia.org/wiki/Entropy_(computing))

~~~
ef4
"Distance of next prime" is not random, it's entirely deterministic. It's a
well-defined function. Just because we don't have an analytical solution for
it doesn't make it random.

As for "high quality", some random variables are less predictable than others.
A weighted coin that comes up heads 51% of the time is still random (not
deterministic), but more predictable than a fair coin.
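This point can be made quantitative with Shannon entropy, which measures exactly this kind of predictability. A quick sketch:

```python
import math

def coin_entropy(p):
    """Shannon entropy, in bits per flip, of a coin that shows heads
    with probability p."""
    if p in (0.0, 1.0):
        return 0.0
    return -(p * math.log2(p) + (1 - p) * math.log2(1 - p))

print(coin_entropy(0.50))  # 1.0 bit: a fair coin is maximally unpredictable
print(coin_entropy(0.51))  # ~0.9997 bits: a 51% coin loses almost nothing
print(coin_entropy(0.99))  # ~0.08 bits: heavily weighted, far more predictable
```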

~~~
avoid3d
I don't like calling non-uniform distributions 'predictable'.

Think of it like this: A coin which produces heads 99/100 times is not
predictable, since you cannot predict when the next tails is coming, you just
know that they are rarer.

As for the distance-to-the-next-prime thing, you are just plain wrong, in the
sense that he is talking about a seeded PRNG, i.e. the starting prime is a
seed and you don't get to know which one it is, out of all the integers...

~~~
ef4
> in the sense that he is talking about a seeded PRNG

That little "p" makes all the difference. The post I was replying to was
asking about the precise definition of "random", not "pseudorandom".

As you allude to, any real randomness in such a scheme needs to come in the
form of a seed.

------
csirac2
My favourite tool is simply gpw(1). It tries to generate semi-pronounceable
words that aren't necessarily dictionary words:

    
    
      $ gpw
        rpreence
        rethersi
        atencend
        rostrass
        rtschers
        pocrevoy
        umblowfl
        disalsit
        rmsturnu
        tinfethe
    

I then have a perl script which can give me slightly more interesting
capitalizations and numbers:

    
    
      nessinea 258 rusness
      redoodr rInGLe 893
      orinGsT 87 iNdaPeRv
      3423 RisEv screar bullys
    

I've become quite good at remembering these. I can usually remember them even
if I go a month or two between usages. It's hard to explain, but it seems to
require a different kind of memorization effort that somehow sticks better in
my long-term memory than random dictionary words do.

(1)
[http://manpages.ubuntu.com/manpages/hardy/man1/gpw.1.html](http://manpages.ubuntu.com/manpages/hardy/man1/gpw.1.html)

~~~
miga
Yes, that is a critical constraint - whether you find it easy to remember.

What's the size of the random password space there?

~~~
csirac2
You just prompted me to read the source code :-) gpw.c has this comment at the
top:

    
    
      /* GPW - Generate pronounceable passwords
       This program uses statistics on the frequency of three-letter sequences
       in your dictionary to generate passwords.  The statistics are in trigram.h,
       generated there by the program loadtris.  Use different dictionaries
       and you'll get different statistics.
    
       This program can generate every word in the dictionary, and a lot of
       non-words.  It won't generate a bunch of other non-words, call them the
       unpronounceable ones, containing letter combinations found nowhere in
       the dictionary.  My rough estimate is that if there are 10^6 words,
       then there are about 10^9 pronounceables, out of a total population
       of 10^11 8-character strings.  I base this on running the program a lot
       and looking for real words in its output.. they are very rare, on the
       order of one in a thousand.
    
       ...
    

I'm not really in a position to properly figure this out... for a start, it's
commonly claimed that there are 1M words in English, but my local
/usr/share/dict/words has only 100K entries (and even a lot of those seem
redundant).

Assume my gpw binary was "trained" on a 1M wordlist, that my algorithm
averages three 8-char strings per passphrase, and that we take the comments at
face value, i.e. 10^9 possibilities per 8-char string generated.

I'm going to pretend I can competently use M*log2(N) to estimate that three
such strings might make up 3*log2(10^9) ≈ 89 bits of entropy. Throw in a
0-9999 number, pretend that's worth 13-ish bits, and call it 100 bits?

I haven't factored in the capitalizations yet, and I'm not sure how much the
random placement of the numbers and the random lengths of each "word" are
worth.

IANACG (I am not a crypto guy), but that seems like a huge number of bits, so
this is likely completely wrong (I'd start by going back and properly
analyzing how many combinations my gpw binary can actually produce in an
n-char string - that claim of 10^9 combinations from a 10^6 wordlist needs
testing).
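Taking the gpw comment's numbers at face value, the estimate above works out like this (a back-of-envelope sketch, not a proper analysis of gpw's actual output distribution):

```python
import math

per_string = math.log2(10 ** 9)    # ~29.9 bits per pronounceable 8-char string
three_strings = 3 * per_string     # ~89.7 bits for three of them
number_part = math.log2(10_000)    # a 0-9999 number adds ~13.3 bits
print(three_strings + number_part) # ~103 bits -- "call it 100" holds up
```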

------
arh68
This inspired me to write a rudimentary version in bash/C:

    
    
        ----pwgen.sh
        #!/usr/bin/env bash
        DICT=/usr/share/dict/words
        NL=$(wc -l < $DICT)
        PW=""
        for i in {1..4}; do
            RAND=$(./arc4random)
            LINE=$((RAND % NL + 1))   # sed line numbers start at 1
            WORD=$(sed -n "${LINE}p" < $DICT)
            PW="$WORD $PW"
        done
        echo $PW
     
        ----arc4random.c
        #include <stdio.h>
        #include <stdlib.h>
        int main() {
            printf("%u\n", arc4random());
            return 0;
        }
    

They're not always easy to remember, though, at least for me:

    
    
        > ./pwgen.sh 
        interdepartmental anesthyl intrabranchial physiolatrous

~~~
cynwoody
Your little script performs amazingly well on my Mac, coming in at about 165
ms, quite a bit faster than I expected. If that's still too slow for you, you
could consider reading the dictionary into a bash array instead of calling
sed. My /usr/share/dict/words contains 234,936 words.

As a point of comparison, here is a Python version that clocks in around 107
ms:

    
    
        #!/usr/bin/env python
        from random import SystemRandom as r
        
        with open('/usr/share/dict/words', 'rb') as f:
            lines = f.readlines()
        
        print ' '.join(r().choice(lines).rstrip() for x in xrange(4))
    

However, I don't know how well SystemRandom compares to arc4random in terms of
crypto.
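For what it's worth, `SystemRandom` draws from `os.urandom()`, i.e. the operating system's CSPRNG, which is the same class of source as `arc4random` on the BSDs and OS X, so it should be fine here. A sketch (the word list is a stand-in):

```python
import random
import secrets

words = ['correct', 'horse', 'battery', 'staple']  # stand-in wordlist

# SystemRandom bypasses the Mersenne Twister and reads os.urandom().
sr = random.SystemRandom()
print(' '.join(sr.choice(words) for _ in range(4)))

# The secrets module (Python 3.6+) wraps the same source.
print(' '.join(secrets.choice(words) for _ in range(4)))
```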

~~~
jzwinck
If you're set on using arc4random it's no big deal to use ctypes to invoke it
from Python. Bash is quick to bang out but it's funny how your parent needs to
compile a C program then invoke it N times per bash script run. Using Python
for this is an obvious choice.

~~~
arh68
Yes thanks for pointing that out, I wrote a slightly faster version below. I
took your advice and wanted just a single invoke on the C program, so I
modified it to take an argument. I'm satisfied!

    
    
        > cat arc4random-v2.c
        #include <stdio.h>
        #include <stdlib.h>
        int main(int argc, char** argv) {
          if ( ! (argc == 1 || argc == 2) ) {
            printf("usage: %s [n_output]\n", argv[0]);
          } else {
            int n_iter = (argc == 2) ? atoi(argv[1]) : 1;
            for (int i = 0; i < n_iter; i++) {
              printf("%u\n", arc4random());
            }
          }
          return 0;
        }
    
        > ./arc4random 3
        4282823399
        333796506
        762478899
        
        > cat pwgen-v2.sh
        #!/usr/bin/env bash
        dict=/usr/share/dict/words
        n=4 # of words to include in passphrase
        nl=$(wc -l < $dict)
        fmt_flag() { printf -- '-e %dp\n' $(( $1 % nl + 1 )) ; }  # sed lines are 1-based
        flags=$(./arc4random $n | while read -r i; do fmt_flag "$i"; done)
        words=$(sed -n $flags < $dict)
        echo $words
        
        > time ./pwgen.sh
        liparite evade retiringly spacious
        
        real	0m0.268s
        user	0m0.251s
        sys	        0m0.015s
        
        > time ./pwgen-v2.sh
        latherin prideling thumbscrew unraveled
        
        real	0m0.082s
        user	0m0.072s
        sys	        0m0.013s
    

EDIT: note that the single call to sed returns the words in
alphabetical/dictionary order rather than in random order... (!) oops

------
ghshephard

      import random, re

      lines = [re.sub('[-!.,;:]', ' ', x).split() for x in open('in').readlines()]

      w = set()
      for l in lines:
          w = w | set(l)

      print random.sample(w, 4)
    

Hrm, for romeo and juliet I get some interesting combinations...

    
    
      ['unseen', 'iron', 'Catling', 'baked']
      ['grove', 'press', 'Aurora', 'garish']
      ['agate', 'She', 'drybeat', 'rather']
      ['flowed', 'sails', 'wed', 'masks']
      ['spilt', 'cage', 'Remembering', 'stiff']
      ['heartsick', 'shame', 'enjoin', 'weeping']

~~~
wyager
> import random, re

You'll want to replace random with a cryptographically secure RNG.

~~~
peteretep
Could you describe a practical attack against this password generator, based
on how it actually works?

Specifically: what information you would need to be in possession of in order
to exploit it, and then a description of how you would - again, in practice -
exploit it, inside a practical timeframe. Thanks.

~~~
borplk
I'm also interested in knowing a practical attack against this.

How could a weakness in the randomness be a problem at all in this case?
(generating words to use as password)

Ok suppose I've used a not-at-all-random PRNG and got the following numbers:
1, 16, 29, 30, 18

So I grab the 1st, 16th, 29th, 30th and the 18th words from my list and those
words are my generated password.

How is this vulnerable to any attacks?

~~~
euank
To expound on the other response.

A PRNG is only as random as its seed / starting state. Let's say it gets this
state from the current time, in milliseconds.

Now, let's assume the attacker knows, based on your "Time joined" statistic
for some account, a roughly 2 minute window in which you generated your
password (it's very common for a website to expose join time at the
granularity of a second through an API, and passwords are rarely generated
more than a minute before signup if you use a generator + KeePass). 2 minutes
is 120,000 milliseconds... or in other words, they can seed the RNG with each
of those 120,000 "time-in-milliseconds" values, run the program, and record
the output. They now only have to guess 120,000 possible passwords, not, say,
1e20 (assuming five words from a dictionary of 10k words). That's a _massive_
speedup, on the order of 1e15.

I hope that all made sense... the basic idea is just if the attacker can make
any sort of estimate about the state of the PRNG, he can just test passwords
generated from states like that, not all possible passwords.

You'll notice, of course, that this vulnerability relies on the other party
being able to discern some information about the state. This doesn't have to
be from guessing the seed directly (as was the case in the example above). It
can also happen by observing multiple outputs and guessing at what seed could
produce them both, or noticing a bias.

The last one, the chance for a bias, is quite possibly what you're asking
about. "How could a weakness in the randomness be a problem at all in this
case?". A weakness in randomness means that some combinations of words will
show up more often than others, by definition. It makes the password weaker
because the attacker can discover this bias and then guess more probable
passwords first.
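This attack is easy to demonstrate concretely. A toy sketch, assuming a generator that (badly) seeds a PRNG with a millisecond timestamp; the timestamp and the 10k-word dictionary here are made up:

```python
import random

WORDS = ['word%d' % i for i in range(10_000)]  # stand-in 10k-word dictionary

def weak_generate(seed_ms):
    """Password generator seeded with a millisecond timestamp -- the flaw."""
    rng = random.Random(seed_ms)
    return ' '.join(rng.choice(WORDS) for _ in range(5))

# The victim generates a password at a time the attacker can bound
# to within a two-minute window.
window_start = 1_400_000_000_000
password = weak_generate(window_start + 53_271)

# The attacker tries at most 120,000 seeds instead of 10_000**5 passwords.
for t in range(window_start, window_start + 120_000):
    if weak_generate(t) == password:
        print('recovered seed:', t)
        break
```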

~~~
borplk
Thanks, but all of that is based on the assumption that the attacker has the
list of words I used to generate the password, right?

~~~
euank
That he has it or can get it, yes. Even if he doesn't have it, there are only
a few common dictionaries (/usr/share/dict/words on the popular distros would
be a good start), so he could try them and guess only the options probable in
each.

Furthermore, it's possible he could figure out which dictionary you were using
by other means. If a site that stored your password in plaintext were
compromised, that would give him four, five, however many words in your
dictionary, and using a similar attack on the flawed PRNG discussed above, he
could narrow down the possible positions of those words, thereby figuring out
which dictionary is most probable.

The assumption that the attacker can get your word list in this sort of
password scheme is actually implicit in how the entropy is calculated; it's
calculated assuming the attacker is stringing together words from the same
known source in the same general format.

------
tinco
I made one in Ruby a few weeks ago. Please don't do what ghshephard does
elsewhere in this thread and blindly import `random`. It's actually quite
important that you use a secure random generator, or your effective entropy
will be drastically reduced.

My solution is here, don't use it unless smart people have confirmed that it
works:

[https://github.com/d-snp/wachtwoord/blob/master/wachtwoord.r...](https://github.com/d-snp/wachtwoord/blob/master/wachtwoord.rb)

Also smart people: please let me know if I've done anything stupid here, I
actually use passwords generated by this tool.

For people who don't enjoy reading Ruby: my tool reads in a dictionary file,
drops words that are too short (they don't carry enough absolute entropy) and
words that are too long (they're harder to remember - for Dutch anyway; your
mileage may vary in your own language).

You can vary the number of words you want in your passphrase, and the tool
will print how much entropy the password has if the attacker knows you're
using this scheme and this dictionary file, but not the random values spit out
by the generator to pick the words.

It will also spit out a cheesy ballpark estimate of how long it might take an
attacker to crack the password with reasonable effort; you can tweak what
"reasonable effort" means for your context. What counts as a reasonable attack
depends heavily on the hashing scheme the site employs.
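The entropy figure the tool prints is straightforward to compute under that threat model: k words drawn from a length-filtered dictionary of n entries give k*log2(n) bits. A sketch (the length bounds and tiny demo list are illustrative, not the tool's exact values):

```python
import math

def filtered_entropy(words, n_words, min_len=5, max_len=10):
    """Entropy of n_words uniform draws from the length-filtered dictionary."""
    usable = [w for w in words if min_len <= len(w) <= max_len]
    return n_words * math.log2(len(usable))

demo = ['a', 'cat', 'horse', 'battery', 'staple', 'correct', 'interdepartmental']
print(filtered_entropy(demo, 4))  # 4 usable words -> 4 * log2(4) = 8.0 bits
```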

------
robgering
Quick and dirty Bash one-liner:

    
    
      shuf -n4 /usr/share/dict/words | sed "s/'s//g" | tr -d "\\n"; echo;

------
keithpeter
[https://www.schneier.com/essay-246.html](https://www.schneier.com/essay-246.html)

Oulipo rules. Latin squares. Nursery texts. Maps.

Anyone working on a generator based on this kind of approach?

~~~
miga
Could you elaborate and give pointers to what "Oulipo rules" and others are?

~~~
keithpeter
Oulipo was a silly 'organisation' invented by some writers in Paris after the
second world war. Its members included mathematics lecturers and writers
interested in producing texts according to routines or constraints. Georges
Perec walked the streets of Paris systematically for some months according to
a routine derived from a double Latin square, and then wrote about what was
happening at each location at a specified time.

I was alluding to the idea of easily remembered _rules_ for selecting
characters from a _text_ known to the person.

------
pjaspers
I made something similar [0] but based on Diceware [1] and a book you pass on
to the app. It generates passphrases by using words from the book.

[0] [https://github.com/pjaspers/frasier](https://github.com/pjaspers/frasier)

[1]
[http://world.std.com/~reinhold/diceware.html](http://world.std.com/~reinhold/diceware.html)

------
tomberek
A similar scheme, used to save, retrieve, and publish Elm code:
[http://tomberek.insomnia247.nl:5055/xkcd/flowerbed/too/socio...](http://tomberek.insomnia247.nl:5055/xkcd/flowerbed/too/sociologically/defiantly)

Using 'Publish' will produce
[http://tomberek.insomnia247.nl:5055/xkcd/2797191408452820889](http://tomberek.insomnia247.nl:5055/xkcd/2797191408452820889)
which is just the output page without access to an editor.

Edit: go to
[http://tomberek.insomnia247.nl:5055/try](http://tomberek.insomnia247.nl:5055/try)
to make your own

------
kolev
How do you install stuff like this? I've never touched Haskell, I tried "cabal
install wordpass", but it seems it doesn't exist.

~~~
samstokes
Assuming you mean wordpass doesn't exist: try "cabal update" first.

If you mean cabal doesn't exist, you'll need to install the Haskell Platform
first. There may be a package for your distro (or Homebrew).

~~~
kolev
Thanks, "cabal update" did it for me. I guess I have no more excuses to dive
into Haskell!

