
Nantucket: an accidental limerick detector - Omni5cience
http://www.daniellesucher.com/2012/04/nantucket-an-accidental-limerick-detector/
======
shalmanese
I created haiku_robot (<http://www.reddit.com/user/haiku_robot>) on reddit
and found from experience that optimizing for accuracy wasn't worthwhile.
The cases where I got the syllable count wrong drew about the same
distribution of upvotes as the ones where I got it right, and because of
regional variations in pronunciation I was accused of being wrong more often
when I was right than when I was wrong.

~~~
Drbble
Haiku_robot would be much better if it broke lines only at phrase or clause
boundaries, not just anywhere.

~~~
bdr
Disagree! Enjambment suits the form. I tend to like the robot's output.

------
mjn
This is pretty neat. I've been puttering on one on and off, but it's horribly
broken so I haven't released it, so this one gets extra points for actually
existing. :)

In case my half-done thoughts are useful to anyone looking to build something
in this space:

My aim is/was to allow configurable matching, so you can match, e.g., "XxxXxx /
XxxXxx1 / XxxXxx / XxxXxx1", meaning four consecutive lines of six syllables,
where X is a stressed and x an unstressed syllable, and where the last
syllables of the 2nd and 4th lines share the same phoneme, denoted "1". There
are no phonemic constraints on any other syllables (this allows a crude
approach to rhyme).
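
To make the pattern notation concrete, here is a minimal Python sketch of one way it could be parsed and checked. It is not the implementation described in this comment. The names `Slot`, `parse_pattern` and `matches` are made up for illustration, and the input is assumed to be already split into syllables, each given as a hypothetical `(stressed, final_sound)` pair.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional, Tuple


@dataclass
class Slot:
    """One syllable position in a pattern line (hypothetical helper type)."""
    stressed: bool
    rhyme_group: Optional[str] = None  # e.g. "1"; None means unconstrained


def parse_pattern(pattern: str) -> List[List[Slot]]:
    """Parse e.g. "XxxXxx / XxxXxx1" into one list of slots per line."""
    lines: List[List[Slot]] = []
    for part in pattern.split("/"):
        slots: List[Slot] = []
        for ch in part.strip():
            if ch in "Xx":
                slots.append(Slot(stressed=(ch == "X")))
            elif ch.isdigit() and slots:
                # A digit tags the syllable just before it with a rhyme group.
                slots[-1].rhyme_group = ch
            else:
                raise ValueError(f"unexpected character {ch!r} in pattern")
        lines.append(slots)
    return lines


# Assumed input format: each syllable is (stressed, final_sound).
Syllable = Tuple[bool, str]


def matches(pattern: List[List[Slot]], text: List[List[Syllable]]) -> bool:
    """Check syllable counts, stresses, and that rhyme groups share a sound."""
    if len(pattern) != len(text):
        return False
    sounds: Dict[str, str] = {}  # rhyme group -> first sound seen for it
    for slots, syllables in zip(pattern, text):
        if len(slots) != len(syllables):
            return False
        for slot, (stressed, sound) in zip(slots, syllables):
            if slot.stressed != stressed:
                return False
            if slot.rhyme_group is not None:
                if sounds.setdefault(slot.rhyme_group, sound) != sound:
                    return False
    return True
```

Keeping the parsed pattern as plain data makes it easy to swap the syllable source (cmudict, espeak, or anything else) without touching the matcher.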

I'm not entirely happy with cmudict because, since it works one word at a
time, it can't really do much about stress, which can vary depending on the
surrounding words. I've been using the output of _espeak -x_ instead, which
gives a phonetic rendering of an entire sentence, including assigning both
phonemes and stress. I'm not sure if it's genuinely an improvement though. Its
poorly documented output surely isn't an improvement! And in particular it
gives a normal prosaic reading of a sentence, which might be too constraining
for poetry-finding, since poems often allow a bit of freedom on moving around
the stresses.

My plan for scanning large amounts of text is to compile the configurable
pattern into a regex that matches espeak -x output, so for example X gets
mapped to a "match any stressed syllable" regex snippet. Alas, that's
error-prone, especially since the espeak -x phoneme format is a bit quirky:
there is no fixed length per syllable and there are no syllable markers, so
you need some per-language rules to figure out which sequences of ASCII
constitute which syllable, and I haven't debugged mine.
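
As a rough illustration of the compile-to-regex idea, here is a Python sketch that sidesteps the hard part. It does not parse real espeak -x output. Instead it assumes a made-up, already-syllabified notation: syllables joined by "-", a leading "'" on stressed syllables, lowercase letters only, and lines joined by " / ". The function name `compile_pattern` is hypothetical too.

```python
import re

# Assumed notation (NOT espeak's format): "'ro-ses / 'vi-let"
SYLLABLE = r"[a-z]+"


def compile_pattern(pattern: str) -> re.Pattern:
    """Turn e.g. "Xx / Xx1 / Xx / Xx1" into a regex over the toy notation."""
    seen_groups = set()
    line_regexes = []
    for part in pattern.split("/"):
        syllable_regexes = []
        # Each token is X or x, optionally followed by one rhyme digit.
        # Anything else in the pattern is silently ignored in this sketch.
        for token in re.findall(r"[Xx]\d?", part):
            stress = "'" if token[0] == "X" else ""
            if len(token) == 2:
                name = "r" + token[1]
                if name in seen_groups:
                    body = f"(?P={name})"  # must repeat the earlier syllable
                else:
                    seen_groups.add(name)
                    body = f"(?P<{name}>{SYLLABLE})"  # remember this syllable
            else:
                body = SYLLABLE
            syllable_regexes.append(stress + body)
        line_regexes.append("-".join(syllable_regexes))
    # The lookarounds stop a match from starting or ending mid-line.
    return re.compile(r"(?<![a-z'-])" + " / ".join(line_regexes) + r"(?![a-z-])")


rx = compile_pattern("Xx / Xx1 / Xx / Xx1")
print(bool(rx.search("'ro-ses / 'vi-let / 'su-gar / 'scar-let")))
```

Comparing whole syllables through a backreference is an even cruder notion of rhyme than matching a shared phoneme, but it shows how "X", "x" and the digit tags can each map to a regex snippet.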

~~~
dsucher
That's actually why Nantucket ignores meter for now - cmudict's stress
patterns are really inaccurate and unsatisfying. espeak is an interesting
idea!

One thought I want to explore in a later version is using cmudict's stress
patterns for polysyllabic words, but ignoring any stress/meter rules for
monosyllabic words. I suspect that'll do pretty well, and it'll be interesting
to test it out.
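
A minimal Python sketch of that heuristic, under some loud assumptions: the `PRONUNCIATIONS` entries are hand-typed in cmudict-style notation (a digit on each vowel, 0 for unstressed, 1 and 2 for primary and secondary stress) and may not match the real dictionary, secondary stress is lumped in with primary, and the names `stress_template` and `fits_meter` are hypothetical. Nothing here is taken from Nantucket's code.

```python
from typing import Dict, List

# Hand-typed, cmudict-style entries for illustration only.
PRONUNCIATIONS: Dict[str, str] = {
    "there": "DH EH1 R",
    "once": "W AH1 N S",
    "was": "W AA1 Z",
    "a": "AH0",
    "man": "M AE1 N",
    "from": "F R AH1 M",
    "nantucket": "N AE0 N T AH1 K AH0 T",
}


def stress_digits(pronunciation: str) -> List[str]:
    """Return the stress digit of each vowel, i.e. one entry per syllable."""
    return [ph[-1] for ph in pronunciation.split() if ph[-1].isdigit()]


def stress_template(words: List[str], prons: Dict[str, str]) -> str:
    """Build a template: X/x for polysyllabic words, ? for monosyllables."""
    out: List[str] = []
    for word in words:
        digits = stress_digits(prons[word.lower()])  # KeyError if unknown
        if len(digits) == 1:
            out.append("?")  # monosyllable: stress is left free
        else:
            # Simplification: secondary stress (2) counts as stressed.
            out.extend("x" if d == "0" else "X" for d in digits)
    return "".join(out)


def fits_meter(template: str, meter: str) -> bool:
    """True if every fixed position agrees with the meter; ? fits anything."""
    return len(template) == len(meter) and all(
        t == "?" or t == m for t, m in zip(template, meter)
    )


line = "There once was a man from Nantucket".split()
template = stress_template(line, PRONUNCIATIONS)
print(template, fits_meter(template, "xXxxXxxXx"))
```

With this scheme only the polysyllabic words can rule a line out, so a line made up entirely of monosyllables fits any meter of the right length.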

~~~
mjn
That sounds like a good heuristic. Monosyllabic words don't seem _entirely_
free when it comes to assigning accent, but more free overall. Polysyllabic
words seem relatively stable, except for some oddities where you can take
poetic license, like putting the accent on the last syllable of "cursed"
("cursèd").

For an example of where it seems weird w/ monosyllabic words, compare, "I WENT
to the STORE to BUY some BREAD", which has a sort of poetic rhythm, with "I
went TO the STORE to BUY some BREAD" which seems weird, even in a poem. An
offhand analysis is that stressing the main verb and then running "to the"
together into one unstressed syllable is more natural than making the main
verb unstressed and stressing the preposition. Perhaps buried in the code of
some text-to-speech engine are heuristics that cover some of these cases? But
perhaps they can just be ignored at first, and patched up later in cases where
results are too strange.

Anyway, these are just miscellaneous thoughts about future enhancements; the
current Nantucket is cool to try out.

~~~
dsucher
Sure, sure. (And thanks. ^^) But it's fun brainstorming heuristics!

------
Jun8
Fantastic! This shows the possibilities of what can be created given the text
in the Gutenberg archives. Assuming all the fiction ever created is available
on your laptop (quite feasible now, except, of course, for the small matter of
copyright), what new expressions can be derived?

On a different note, I read the about section of the blog and saw that the OP,
in addition to this great stuff, is a beekeeping, hacking attorney who also
spins fire. Amazing!

------
talos
for placing every moment of

the labourer's time and that of

his family at the

disposal of the

capitalist for the purpose of

greater quantity of labour

In addition to a measure

of its extension

ie duration

labour now acquires a measure

-Karl Marx

------
chronomex
It may be interesting to adapt the TeX hyphenation methods to this problem.

------
mfringel
Great stuff! Seeing the thought processes intertwined with the implementation
is fascinating.

------
msutherl
I am (the man) from Nantucket. Any other Nantucketers on HN?

