
Show HN: Markov Baby Name Generator - AlexeyMK
http://alexeymk.com/2012/07/15/weekend-hack--a-markov-baby-name-generator.html
======
codeka
I did something similar recently for a game I'm working on[1]. I didn't know
it was called a Markov chain, but one thing I found is that if you take the
_two_ previous letters to generate the next one, the results are a little less
random and seem a little more natural.

The more letters you take to generate the next one, the closer to the original
source data you get, but with a big enough corpus of source data, you can
still make random names using three or four letters.

[1] <http://www.war-worlds.com/blog/2012/07/generating-names>
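For anyone curious what the two-previous-letters approach looks like in code, here's a minimal sketch in Python (the seed names are made up for illustration; this isn't the code from the linked post):

```python
import random
from collections import defaultdict

def build_chain(names, order=2):
    """Map each `order`-letter context to the letters that follow it.
    '^' pads the start of each name; '$' marks the end."""
    chain = defaultdict(list)
    for name in names:
        padded = "^" * order + name.lower() + "$"
        for i in range(len(padded) - order):
            context = padded[i:i + order]
            chain[context].append(padded[i + order])
    return chain

def generate(chain, order=2, max_len=12, rng=random):
    """Walk the chain from the start padding until we hit an
    end marker or the length cap."""
    context = "^" * order
    out = []
    while len(out) < max_len:
        nxt = rng.choice(chain[context])
        if nxt == "$":
            break
        out.append(nxt)
        context = context[1:] + nxt
    return "".join(out).capitalize()

chain = build_chain(["anna", "anne", "annie"])
print(generate(chain))
```

Bumping `order` to 3 or 4 gives the effect described above: outputs drift closer to verbatim names from the corpus unless the corpus is large.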

~~~
pserwylo
It was probably still a Markov chain. But when Markov chains are used in this
way, they are usually referred to as n-gram models
(<https://en.wikipedia.org/wiki/N-gram>) where "n" is the number of previous
items you investigate (in your case, a 2-gram or bigram model).

Ideally, you would want to go back infinitely many times. My understanding of
why Markov models are great is because they reduce the computation required,
by not looking back too far, yet they still achieve good results. The more you
look back, the more possible combinations (in this case, of letters) there are
to consider. As the number of possible combinations of letters increases, the
chance of each combination appearing in your training corpus decreases.

N-gram models are often used at the whole word level. That is, instead of the
"next letter", they are interested in the "next word". This leads to
interesting ways to perform spell checking, based on the context of the
surrounding words. For example take the following sentences:

"I did that to"

"I did that too"

The _to_ is a spelling mistake, even though the word is an English word with
correct spelling. Imagine how many times the phrase "I did that to" occurs in
a large enough corpus, compared to "I did that too".
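A toy version of that word-level check (the corpus here is fabricated purely for illustration):

```python
from collections import Counter

def ngram_counts(text, n=4):
    """Count word-level n-grams in a body of text."""
    words = text.lower().split()
    return Counter(tuple(words[i:i + n]) for i in range(len(words) - n + 1))

# A fabricated stand-in for a large corpus.
corpus = " ".join(["i did that too"] * 99 + ["i did that to him"])
counts = ngram_counts(corpus)

# The lopsided counts are the signal that "to" is the context
# error, even though both are correctly spelled English words.
print(counts[("i", "did", "that", "too")])  # 99
print(counts[("i", "did", "that", "to")])   # 1
```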

My understanding is that Google has the largest corpus and largest "N" (they
have a 5-gram model). The cool thing is that they have released it under a CC
license (<http://books.google.com/ngrams/datasets>).

~~~
waterhouse
I once made a word-based Markov chain and fed it a corpus of my instant
messages. I did that to amuse myself. I did that too crudely, though; I didn't
give it any notion of beginning or ending a sentence (although I kept capital
letters and periods as part of each token, simply because that was easy to
do). I don't think there were any other corpuses that I did that to.

------
boyter
I did something like this a few weeks ago using quotes from a loudly-spoken
team member.

Previously my workmates and I had started jotting down his sayings, and before
we knew it we had 2,000 or so entries in a database of the stuff he had said.
I ran it through a Markov chain to see what sort of nonsense it would produce.
My favorite thing to come out of it so far is the following:

"While I am changing my underwear people should check my email. Its an old
Greek saying mate."

------
johnwatson11218
I had a similar idea for generating test data for relational databases. It
seems that for test data the boundary conditions and exceptional cases are
sort of easy, but it is the common stuff that is harder to fake. My idea is to
create test data for numeric columns by estimating statistical parameters and
using those in conjunction with a random number generator to make 100-500 rows
of fake data. But for first and last names (and other textual columns) I've
been thinking about modeling the data with Markov processes, to be able to
come up with fake names and addresses that are somewhat close to the real
data. I think that once you have a good statistical model you could export it
and outsource testing more easily without compromising confidential
information. If things like average salary were considered confidential, those
could be skewed as a kind of obfuscation step.
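For the numeric-column half of that idea, a rough sketch might look like this (the column name and values are invented, and a real version would pick a distribution per column rather than assuming everything is normal):

```python
import random
import statistics

def fit_numeric(values):
    """Estimate simple statistical parameters from a real column."""
    return statistics.mean(values), statistics.stdev(values)

def fake_numeric(mean, stdev, n, rng=random):
    """Generate n synthetic values with roughly the same distribution."""
    return [rng.gauss(mean, stdev) for _ in range(n)]

# Invented example column; "salary" is just an illustrative name.
real_salaries = [52000, 61000, 58000, 49500, 70000, 55000]
mu, sigma = fit_numeric(real_salaries)

random.seed(42)
fake_rows = fake_numeric(mu, sigma, 200)
```

The skewing step described above would then amount to nudging `mu` (and maybe `sigma`) before generating, so the fake data no longer discloses the true average.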

~~~
pserwylo
That sounds cool. The term for what you want to prevent is "statistical
disclosure"
([http://scholar.google.com.au/scholar?q=statistical+disclosur...](http://scholar.google.com.au/scholar?q=statistical+disclosure)).

That is: how do I release interesting and useful data while still preventing
the disclosure of important information such as salaries, or SSNs for those of
you in the US?

------
harper
I built one of these quite a while ago to help a friend pick his baby's name.

<https://github.com/harperreed/Baby-Chains>

The names it generates are hilarious.

Markov chains are wonderful things. A good Markov chain bot will really spice
up a company IRC channel.

~~~
AlexeyMK
Heh, glad to know I'm not the only one! Seems like you have a far larger set
of initial names, very cool.

------
wingspan
Cool! I'll have to remember this when we can't pick a name for our next baby.
A little hit-or-miss, but for a geeky name it's better than little Bobby
Tables...

------
mlawlerau
Ahh, Markov chains. I pulled all my favourite quotes into a text file and used
them to generate new quotes. Some nice ones:

What you do speaks so loudly that I drink this beer.

Premature optimisation is the immemorial refuge of the most troubled mind.

Collective judgement of new ideas is so often wrong that it picks up
confidence as it appears.

How many seconds are there in a cellar on a rainy day?

The music business is a higher revelation than philosophy.

Art is a sign of intelligence.

My choice early in life is to have dinner.

------
aw3c2
Slightly off-topic: does anyone know of tree(?) graphs of Markov chains for
certain bodies of words? E.g. the probability of following characters for a
certain book, showing only the more probable choices. That might look pretty
cool.

------
sofal
Just so you know, I'm looking at this from my Android 4.1 stock browser and
the entire page is blinking on and off randomly like some kind of joke. I
can't scroll down because it seems as if it's constantly reloading itself.

~~~
AlexeyMK
Appreciate the heads-up. I think it's partly third-party stuff loading and
partly the lack of pagination. It's on my to-do list to fix; thank you.

------
icehero
What does it have to do with "Jekyll" in the title?! It reads "Weekend Hack: A
Markov Baby Name Generator in Jekyll"

~~~
AlexeyMK
What I wish it was: honeypot for mindless RT-ers.

What it actually was: I have no idea. Fixed.

------
joshu
I built a bunch of stuff around this for domain names. I should open source
it.

~~~
AlexeyMK
Cool! I was thinking of doing that. Where'd you get the startup names
database, CrunchBase?

~~~
joshu
I had a bunch of sources. None with really good results, so it didn't really
matter... I should work on it more. I have a much more expansive idea
involving big relationship graphs and whatnot, but actual startup is taking up
too much time. Someday.

------
threepipeproblm
I am Weenan, son of Mindron.

------
pan69
Why is it called a "baby" name generator and not just a "name generator"?

~~~
corin_
Because the vast majority of people who need to choose a name are doing so for
a baby, not someone older. Obviously in this case it's not really designed to
be a useful tool, but it's still based on the idea of helping parents pick a
name for their new child.

------
rorrr
Here are a few "gems" from the twitter feed:

    C would make an awesome boy's name
    Ieahaholijayson would make an awesome boy's name.
    Thinking about a boy's name? how about Chosowex?

Pretty damn horrible.

~~~
AlexeyMK
Heh, agreed - it's definitely a mixed bag; that's what makes it fun. I've
since cut it off at a minimum of three characters per name; that said, IMHO
"C" is a pretty sweet name.

Some of the names aren't so bad, though: Marin, Harker, Gacon... Once in a
while it'll generate a real name by accident. Those are my favorite.

~~~
joshu
Harker Chosowex Gacon, at your service!

