
Word generator using recurrent neural networks - cardigan
http://burgundy.io/
======
cardigan
Ooh front page; I guess this calls for a bit of an explanation!

First off, I used this code for training the models:
[http://karpathy.github.io/2015/05/21/rnn-effectiveness/](http://karpathy.github.io/2015/05/21/rnn-effectiveness/)

Very, very easy to set up and train; highly recommend playing around with your
own training data (just a text file!)

This project's code: github.com/shariq/burgundy

Styled and deployed the website about a year ago at a hackathon; it then used
a nice wordlist with hand-picked words.
(repo/wordserver/old_burgundy_words.txt)

Few days ago: got the server to start training a bunch of models (~200), with
randomized parameters, using the original wordlist as the training data.
(repo/rnn/rnn.py:forever)
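The randomized-parameter training sweep can be imagined as something like the following sketch. The flag names match char-rnn's train.lua, but the value ranges and the `random_config`/`to_command` helpers are my own illustration, not the actual code in rnn.py:

```python
import random

# Hypothetical sketch of sampling randomized hyperparameters for char-rnn.
# The flag names follow train.lua; the value ranges are illustrative.
def random_config(seed=None):
    rng = random.Random(seed)
    return {
        "model": rng.choice(["rnn", "gru", "lstm"]),  # architecture is itself randomized
        "rnn_size": rng.choice([64, 128, 256]),
        "num_layers": rng.choice([1, 2, 3]),
        "dropout": rng.choice([0.0, 0.25, 0.5]),
        "seq_length": rng.choice([10, 25, 50]),
    }

def to_command(config, data_dir="data/words"):
    # char-rnn is driven from the command line via Torch's `th` interpreter
    flags = " ".join("-{} {}".format(k, v) for k, v in sorted(config.items()))
    return "th train.lua -data_dir {} {}".format(data_dir, flags)

configs = [random_config(seed=i) for i in range(200)]  # ~200 models, as above
```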

Yesterday: woke up at 3 AM after my sleep schedule rolled around, started
exploring the output of models trained to different numbers of epochs and run
at different temperatures. Subjectively looked at the outputs, decided some
model/epoch/temperature tuples were horrible, got rid of those. Wrote a few
different scoring functions (just using intuition for what kinds of bad
outputs seemed to be commonly occurring) to score the model/epoch/temperature
tuples. Got the top ~10 scoring tuples from each scoring function, plus added
some additional interesting ones along the way, and then used a pronunciation
scoring function (repo/rnn/pronounce.py) to select the top 5 of all of these.
Funnily enough, the top 5 tuples all used different models and a varying range
of temperatures (i.e., not the same model from different epochs, and picking
the right temperature significantly improved how well the model performed)
(repo/rnn/explore.py)
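For context on why the temperature matters so much, this is the standard sampling math (not code from the repo): dividing the logits by the temperature before the softmax sharpens the distribution at low values and flattens it at high ones. A minimal sketch, with made-up logits:

```python
import math
import random

# Minimal sketch of temperature sampling over a character distribution.
# The logits below are invented for illustration.
def sample_with_temperature(logits, temperature, rng):
    scaled = [l / temperature for l in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    probs = [e / total for e in exps]
    r = rng.random()
    cumulative = 0.0
    for i, p in enumerate(probs):
        cumulative += p
        if r < cumulative:
            return i
    return len(probs) - 1

rng = random.Random(0)
logits = [2.0, 0.5, 0.1]
# low temperature: almost always picks the most likely character
cold = [sample_with_temperature(logits, 0.1, rng) for _ in range(100)]
# high temperature: much more variety
hot = [sample_with_temperature(logits, 5.0, rng) for _ in range(100)]
```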

Since the models would still occasionally output words which were completely
unpronounceable, I put some code on top of the models which would generate a
bunch of words, then discard the bottom third as unpronounceable. A
significant portion of generated words from these models also started with a
"c" or "b" for some reason: gave those a high chance of being discarded. Short
words were also uninteresting, and extremely long words would occasionally
show up: added probabilistic filters for length. Finally, initialization time
of LuaJIT is very high, so I had the server keep a pool of words which gets
reseeded as it runs out. (repo/rnn/rnnserver.py)
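The filtering layer described above might look roughly like this sketch. The pronounceability score is a toy stand-in for repo/rnn/pronounce.py, and the discard probabilities are invented:

```python
import random

# Rough sketch of the post-processing filters described above.
def pronounceability(word):
    # toy heuristic standing in for pronounce.py: vowel-heavy words score higher
    vowels = sum(c in "aeiou" for c in word)
    return vowels / float(len(word)) if word else 0.0

def filter_batch(words, rng):
    # 1. drop the least pronounceable third
    ranked = sorted(words, key=pronounceability, reverse=True)
    kept = ranked[: 2 * len(ranked) // 3]
    out = []
    for w in kept:
        # 2. words starting with "c" or "b" were over-represented,
        #    so discard those with high probability (0.8 is invented)
        if w[:1] in ("c", "b") and rng.random() < 0.8:
            continue
        # 3. probabilistic length filters: very short and very long
        #    words are usually discarded (thresholds invented)
        if (len(w) < 4 or len(w) > 10) and rng.random() < 0.9:
            continue
        out.append(w)
    return out

batch = ["brrrt", "aluvia", "qqqq", "camora", "serine"]
survivors = filter_batch(batch, random.Random(0))
```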

If you want to train your own word generator and you need some pointers, would
love to help: @shariq

~~~
bravura
Can you make a mode where you filter down to words that have an unambiguous
spelling?

Turn the word into phonemes, and then try to reconstruct the spelling from the
sound of the word. Highly rank words that have one obvious spelling.
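As a toy illustration of that round-trip idea: map spelling to phonemes, then count how many candidate spellings map back to the same phoneme sequence. The grapheme-to-phoneme table here is a tiny made-up stand-in; a real version would use something like a CMUdict-based G2P model:

```python
# Toy grapheme-to-phoneme table (invented; not a real pronunciation model).
PHONEME = {"f": "F", "ph": "F", "k": "K", "c": "K",
           "a": "AH", "t": "T", "o": "OW", "n": "N"}

def to_phonemes(word):
    phones, i = [], 0
    while i < len(word):
        # greedily match two-letter graphemes like "ph" first
        if word[i:i + 2] in PHONEME:
            phones.append(PHONEME[word[i:i + 2]])
            i += 2
        elif word[i] in PHONEME:
            phones.append(PHONEME[word[i]])
            i += 1
        else:
            return None  # unknown letter: can't pronounce
    return tuple(phones)

def ambiguity(word, vocabulary):
    # how many spellings in the vocabulary sound identical to this word?
    target = to_phonemes(word)
    return sum(1 for w in vocabulary if to_phonemes(w) == target)
```

With this table, "fat" and "phat" collapse to the same phonemes, so "fat" is ambiguous while "cat" is not; a ranking step would prefer words whose ambiguity count is 1.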

~~~
cardigan
This is for brandable names I guess?

That sounds like a good approximation, although it's more about judging
whether the word is spelled like it sounds than whether it has just one
obvious spelling.

Similar ideas: check social media account availability, check domain
availability, see how many Google search results show up, find similar
sounding existing brand names

------
gliese1337
No "about" info? No "how it works", "how it was trained", etc.?

It _seems_ to only generate words that match English phonotactics & spelling
conventions: things that _could be_ English words. Can it be retargeted to
other languages, or to arbitrary word-shape constraints?

I am particularly interested because I've recently undertaken a survey of
word-generation software for conlangers (people who create artificial
languages, like Quenya or Klingon or Na'vi), and while they do come in widely
varying degrees of sophistication, with varying degrees of built-in linguistic
knowledge, there are none yet publicly available that are based on neural
networks.

~~~
cardigan
Yes totally can be. Nothing English language specific about this. Scoring the
models is a bit of a pain without a large corpus of English, but the training
data was only 800 words.

~~~
igravious
You said that you wrote a scoring function based on your subjective response
to outputs. Presumably, if English is your native language (it seems to be),
then you would have influenced the algorithm to home in on English phonology.

Surely?

I looked at about 20 words and they all seemed like decent English candidate
words, including one actual English word, "molest".

~~~
cardigan
The first kind of scoring function was language-independent (e.g., penalize
words showing up from the training set, penalize < 3 character words, penalize
duplicate words, penalize common words), and the second scoring function
needed a large training set of English words (but could be any other training
set).
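A minimal version of that first, language-independent scoring function could look like this. The penalties mirror the ones listed above; the normalization, weights, and sample data are invented:

```python
# Sketch of a language-independent score for a sampled batch of words.
def score_batch(words, training_set):
    if not words:
        return 0.0
    penalties = 0
    seen = set()
    for w in words:
        if w in training_set:  # penalize words copied from the training set
            penalties += 1
        if len(w) < 3:         # penalize very short words
            penalties += 1
        if w in seen:          # penalize duplicates
            penalties += 1
        seen.add(w)
    # normalize so a penalty-free batch scores 1.0
    return 1.0 - penalties / float(3 * len(words))

training = {"sonnet", "cartwheel"}
good = score_batch(["brandic", "veloria", "lumarine"], training)
bad = score_batch(["am", "sonnet", "am"], training)
```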

All this is described at the top. I guess you didn't read that.

------
DanBC
I got cacurine, which is less pleasant.

Is it really using recurrent neural networks, or is it using markov chains?

~~~
cardigan
It's using an RNN, GRU, or LSTM; I realized this after posting. One of the
parameters for the models being trained is the kind of model, and it picked
randomly between an RNN, GRU, and LSTM. I'm not sure which kinds the five
models picked at the end ended up being.

------
jgalt212
I got carantil, which is not great, but with a small tweak it's Carancil,
which is a perfectly good name for a new drug. Companies like Brand Institute
charge good money for these services.

------
pizza
These are all pretty cool. What determines how it could "improve" generated
words? Perhaps a larger, "pleasant"-words-only corpus?

~~~
cardigan
Larger corpus, deeper model, better scoring function on models, more models

------
namuol
Is there source available for
[http://burgundy.io:8080/](http://burgundy.io:8080/) ?

~~~
maxmcd
[https://github.com/shariq/burgundy](https://github.com/shariq/burgundy)

------
argonaut
The results are indistinguishable from Markov chains. There's really no need
to use RNNs for everything...

~~~
gliese1337
That depends on the order of the model you use, and the complexity of the
phonology. For an over-simplified example, consider that English allows both
'st' and 'ts' clusters; a 2nd-order Markov model might thus end up producing
something like _stststupid_, which is clearly not a valid word.

A 3rd-order model will of course do much better, but will still fail to a
greater or lesser extent depending on language. And the higher the order of
the model you use, the more it will just spit out the same stuff it was
trained on and the less it will behave creatively, so there is a tradeoff
there.
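The 'st'/'ts' failure mode is easy to reproduce with a character chain conditioned on a single previous character (a sketch, not code from the project):

```python
from collections import defaultdict
import random

# Character Markov chain conditioned only on the previous character.
# Trained on words containing both "st" and "ts", its transition table
# contains the loop s -> t -> s, so it can wander into "tstst..." runs
# that no training word contains.
def train_chain(words):
    transitions = defaultdict(list)
    for w in words:
        padded = "^" + w + "$"  # ^ marks word start, $ marks word end
        for a, b in zip(padded, padded[1:]):
            transitions[a].append(b)
    return transitions

def generate(transitions, rng, max_len=12):
    out, c = [], "^"
    while len(out) < max_len:
        c = rng.choice(transitions[c])
        if c == "$":
            break
        out.append(c)
    return "".join(out)

chain = train_chain(["stop", "cats", "stats"])
```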

I agree it might not make much difference here, but there are good reasons for
investigating more complex kinds of models for language generation, and RNNs
are an interesting choice.

------
throwaway24997
I got 'mingerrot' which isn't beautiful at all - it sounds like some kind of
unpleasant infection.

------
ogig
Train it with some Tolkien appendixes and it could be a good RPG name
generator.

Also, realworld usernames may be fun. You could make a twitter username
generator or something.

------
gregw134
Any ideas on how to generate startup names with a neural network?

~~~
cardigan
Use the same code but with a list of company names?

------
jastanton
Amamanus, any word with anus in it isn't exactly pretty :)

~~~
bluemenot
Similarly, I got `turdurine`, which doesn't sound pretty either.

~~~
devin
"molaster" was one of my unfortunate results. In any event, words are fun, and
this is an interesting project either way.

------
smcnally
vermocharen -- certainly works in some contexts. A coffee roaster, e.g.

No small feat to get even marginally euphonious words from an open, available
code base.

Next up came tintilu, picolera, fangon

------
seqizz
Thanks for my new hostname generator.

~~~
cardigan
Aww :-) you're very welcome!

------
abrkn
turdurine

