First off - used this code for training the models: http://karpathy.github.io/2015/05/21/rnn-effectiveness/
Very very easy to set up and train; highly recommend playing around with your own training data (just a text file!)
This project's code:
github.com/shariq/burgundy
Styled and deployed the website about a year ago at a hackathon; at the time it used a nice wordlist with hand-picked words.
(repo/wordserver/old_burgundy_words.txt)
A few days ago: got the server to start training a bunch of models (~200) with randomized parameters, using the original wordlist as the training data.
(repo/rnn/rnn.py:forever)
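In spirit the training loop is just this (a simplified sketch, not the actual code in repo/rnn/rnn.py; the parameter ranges and paths are made up, and the flag names are the ones I remember from the char-rnn README):

    import random
    import subprocess

    def train_one_model(run_id):
        # pick a random architecture and random hyperparameters for this run
        params = {
            'model': random.choice(['rnn', 'gru', 'lstm']),
            'rnn_size': random.choice([64, 128, 256]),
            'num_layers': random.choice([1, 2, 3]),
            'dropout': random.choice([0.0, 0.25, 0.5]),
            'seq_length': random.choice([10, 25, 50]),
        }
        cmd = ['th', 'train.lua',
               '-data_dir', 'data/burgundy',          # folder containing input.txt (the wordlist)
               '-checkpoint_dir', 'cv/run%d' % run_id,
               '-gpuid', '-1']                        # CPU is fine for a tiny dataset
        for flag, value in params.items():
            cmd += ['-%s' % flag, str(value)]
        subprocess.check_call(cmd)

    def forever():
        run_id = 0
        while True:
            train_one_model(run_id)
            run_id += 1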
Yesterday: woke up at 3 AM after my sleep schedule rolled around and started exploring the output of models trained to different numbers of epochs and run at different temperatures. Subjectively looked at the outputs, decided some model/epoch/temperature tuples were horrible, and got rid of those. Wrote a few different scoring functions (just using intuition for what kinds of bad outputs seemed to be commonly occurring) to score the model/epoch/temperature tuples. Got the top ~10 scoring tuples from each scoring function, added some additional interesting ones along the way, and then used a pronunciation scoring function (repo/rnn/pronounce.py) to select the top 5 of all of these. Funnily enough, the top 5 tuples all used different models and a varying range of temperatures (i.e., not the same model at different epochs, and picking the right temperature significantly improved how well a model performed).
(repo/rnn/explore.py)
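The exploration step boils down to something like this (again a rough sketch with made-up helper names; the real scoring lives in repo/rnn/explore.py and repo/rnn/pronounce.py):

    import glob
    import subprocess

    def sample_words(checkpoint, temperature, n_chars=2000):
        # char-rnn's sample.lua takes a checkpoint and a temperature;
        # the training data is one word per line, so the output splits on newlines
        out = subprocess.check_output(
            ['th', 'sample.lua', checkpoint,
             '-temperature', str(temperature),
             '-length', str(n_chars),
             '-verbose', '0', '-gpuid', '-1'])
        return [w.strip() for w in out.decode('utf-8', 'ignore').split('\n') if w.strip()]

    def pronounce_score(words):
        # placeholder: the real function scores how pronounceable a batch of words is
        return sum(len(set(w)) / float(len(w)) for w in words) / max(len(words), 1)

    results = []
    for checkpoint in glob.glob('cv/*/*.t7'):         # every model at every saved epoch
        for temperature in (0.5, 0.7, 0.9, 1.1):
            words = sample_words(checkpoint, temperature)
            results.append((pronounce_score(words), checkpoint, temperature))

    results.sort(reverse=True)
    print(results[:5])                                # best (score, checkpoint, temperature) tuples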
Since the models would still occasionally output words which were completely unpronounceable, I put some code on top of the models which generates a bunch of words and then discards the least pronounceable third. A significant portion of generated words from these models also started with a "c" or "b" for some reason, so those get a high chance of being discarded. Short words were also uninteresting, and extremely long words would occasionally show up, so I added probabilistic filters for length. Finally, the initialization time of LuaJIT is very high, so the server keeps a pool of words which gets reseeded as it runs out.
(repo/rnn/rnnserver.py)
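Roughly what that layer does (illustrative sketch only; the thresholds and discard probabilities below are made up, the real ones are in repo/rnn/rnnserver.py):

    import random

    def filter_batch(words, pronounce_score):
        # drop the least pronounceable third of the batch
        ranked = sorted(words, key=pronounce_score, reverse=True)
        kept = ranked[:2 * len(ranked) // 3]
        survivors = []
        for w in kept:
            if w[:1] in ('c', 'b') and random.random() < 0.8:
                continue    # the models over-produce c/b words, so usually discard them
            if len(w) < 5 and random.random() < 0.9:
                continue    # short words are uninteresting
            if len(w) > 10 and random.random() < 0.9:
                continue    # extremely long words occasionally show up
            survivors.append(w)
        return survivors

    class WordPool(object):
        # LuaJIT startup is slow, so generate words in batches and serve them from a pool
        def __init__(self, generate_batch, pronounce_score):
            self.generate_batch = generate_batch
            self.pronounce_score = pronounce_score
            self.pool = []

        def next_word(self):
            while not self.pool:    # reseed the pool whenever it runs out
                self.pool = filter_batch(self.generate_batch(), self.pronounce_score)
            return self.pool.pop()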
If you want to train your own word generator and you need some pointers, would love to help: @shariq
That sounds like a good approximation, although it's more about judging whether the word is spelled like it sounds than whether it has just one obvious spelling.
Similar ideas: check social media account availability, check domain availability, see how many Google search results show up, find similar sounding existing brand names
No "about" info? No "how it works", "how it was trained", etc.?
It seems to only generate words that match English phonotactics & spelling conventions: things that could be English words. Can it be retargeted to other languages, or to arbitrary word-shape constraints?
I am particularly interested because I've recently undertaken a survey of word-generation software for conlangers (people who create artificial languages, like Quenya or Klingon or Na'vi), and while they do come in widely varying degrees of sophistication, with varying degrees of built-in linguistic knowledge, there are none yet publicly available that are based on neural networks.
Yes, it totally can be. Nothing English-language-specific about this. Scoring the models is a bit of a pain without a large corpus of English, but the training data was only 800 words.
You said that you wrote a scoring function based on your subjective response to outputs. Presumably, if English is your native language (it seems to be), then you would have influenced the algorithm to home in on English phonology.
Surely?
I looked at about 20 words and they all seemed like decent English candidate words, including one actual English word, "molest".
The first kind of scoring function was language-independent (e.g., penalize words showing up from the training set, penalize words under 3 characters, penalize duplicate words, penalize common words), and the second kind needed a large training set of English words (but could be any other training set).
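For example, the language-independent kind looked roughly like this (illustrative only, with made-up weights):

    def language_independent_score(words, training_set, common_words):
        unique = set(words)
        score = 0
        score -= sum(1 for w in unique if w in training_set)    # penalize words memorized from the training set
        score -= sum(1 for w in unique if len(w) < 3)           # penalize very short words
        score -= len(words) - len(unique)                       # penalize duplicates
        score -= sum(1 for w in unique if w in common_words)    # penalize common words
        return score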
All this is described at the top. I guess you didn't read that.
It's using an RNN, a GRU, or an LSTM; I realized this after posting. One of the parameters for the models being trained is the kind of model, and it picked randomly between an RNN, GRU, and LSTM. Not sure what kinds the five models picked at the end ended up being.
I got carantil, which is not great, but with a small tweak it becomes Carancil, which is a perfectly good name for a new drug. Companies like Brand Institute charge good money for these services.
That depends on the order of the model you use, and the complexity of the phonology. For an over-simplified example, consider that English allows both 'st' and 'ts' clusters; a 2nd-order Markov model might thus end up producing something like stststupid, which is clearly not a valid word.
A 3rd-order model will of course do much better, but will still fail to a greater or lesser extent depending on language. And the higher the order of the model you use, the more it will just spit out the same stuff it was trained on and the less it will behave creatively, so there is a tradeoff there.
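To make that concrete, here's a toy character-level Markov generator (just an illustration, nobody's actual code) where each character depends only on the previous one. Every individual transition it makes is attested in the training words, but chaining them can produce clusters no English word allows:

    import random
    from collections import defaultdict

    def build_model(words):
        model = defaultdict(list)
        for w in words:
            w = '^' + w + '$'                    # start/end markers
            for a, b in zip(w, w[1:]):
                model[a].append(b)               # each char conditioned only on the previous char
        return model

    def generate(model):
        ch, out = '^', ''
        while True:
            ch = random.choice(model[ch])
            if ch == '$':
                return out
            out += ch

    words = ['stupid', 'cats', 'most', 'tsunami', 'burst']
    model = build_model(words)
    print([generate(model) for _ in range(10)])  # outputs like "stststupid" are possible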
I agree it might not make much difference here, but there are good reasons for investigating more complex kinds of models for language generation, and RNNs are an interesting choice.
False; try making something similar with a Markov model trained on just 800 words: the RNNs have somehow learned what it means for a word to be pronounceable, and rarely output words which are difficult to pronounce.
The network learns something about the relationship between characters, and by extension phonemes, in the training set of nice sounding words. Obviously it could not learn the image the word invokes, given the size of the training set and since there are no negative examples.
"turdurine" sounds nice to me; "amamanus" does not