

Words that make valid sentences - makeshifthoop

Given 5 randomly chosen words, what is the likelihood that they make a valid sentence? The list of available words does not include proper nouns. Plural and singular forms are treated as different words. It's an interesting thought experiment, and I wanted to see what other people come up with.
======
lutusp
1\. Create a list of common words, the more the better. Let's say we use a
common spelling dictionary of 80,000 words.

2\. Create an order-5 permutation of the list (a result in which the order of
the words matters). For an 80,000 word dictionary, that's 3.27 x 10^24 test
sentences, each of five words.

3\. Scan the result sentence set, using a heuristic able to distinguish valid
sentences from invalid ones. Let's say that one optimized validation test
requires one millisecond -- in that case, the test would require 1.04 x 10^14
years, 7536 times the age of the universe.

Meaning this is a much easier question to ask than answer.
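
The arithmetic in steps 2 and 3 can be checked with a short sketch. The age of the universe (about 13.8 billion years) and the 365.25-day year are assumptions added here, not figures from the comment, so the last ratio only roughly reproduces the number quoted above.

```python
import math

VOCAB_SIZE = 80_000              # dictionary size from step 1
SENTENCE_LENGTH = 5
SECONDS_PER_TEST = 1e-3          # one millisecond per validation, as in step 3
SECONDS_PER_YEAR = 365.25 * 24 * 3600
AGE_OF_UNIVERSE_YEARS = 1.38e10  # assumed value, roughly 13.8 billion years

# Ordered selections of 5 distinct words (no word reused).
candidates = math.perm(VOCAB_SIZE, SENTENCE_LENGTH)
years = candidates * SECONDS_PER_TEST / SECONDS_PER_YEAR
universe_ages = years / AGE_OF_UNIVERSE_YEARS

print(f"{candidates:.3e} candidate sentences")
print(f"{years:.3e} years of testing")
print(f"{universe_ages:.0f} times the age of the universe")
```

Allowing repeated words (80,000^5) changes the count by only about 0.01%, so the conclusion is the same either way.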

EDIT: Another approach is to make an arbitrary assumption about the structure
of a five-word sentence, like pronoun-noun-noun-verb-noun: "The calico cat ate
breakfast." It is crude and limited, and many apparently valid results will be
meaningless, but it makes the estimate easier.

We realize that random words have a probability of being pronouns, nouns or
verbs, pp, pn and pv. The probability of producing a valid sentence (pvs)
using the described template is therefore:

pvs = pp * pn * pn * pv * pn

Just generate the probability values by scanning a dictionary and identifying
the word types (easier said than done), evaluate the above equation, and you
have your answer.
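
A minimal sketch of that calculation, assuming each word is drawn independently and uniformly (so repeats are allowed, which is what the product formula implies). The ten-word tagged dictionary and the function name `template_probability` are made up for illustration; a real run would need a part-of-speech-tagged word list.

```python
from collections import Counter

# Hypothetical hand-tagged mini-dictionary, for illustration only.
TAGGED_WORDS = {
    "she": "pronoun", "they": "pronoun",
    "cat": "noun", "breakfast": "noun", "river": "noun", "idea": "noun",
    "ate": "verb", "runs": "verb",
    "quickly": "other", "green": "other",
}

def template_probability(tagged_words, template):
    """Probability that independent uniform draws match the template in order."""
    counts = Counter(tagged_words.values())
    total = len(tagged_words)
    probability = 1.0
    for word_type in template:
        probability *= counts[word_type] / total
    return probability

TEMPLATE = ("pronoun", "noun", "noun", "verb", "noun")
pvs = template_probability(TAGGED_WORDS, TEMPLATE)
print(f"pvs = {pvs:.5f}")
```

With this toy dictionary, pp = 0.2, pn = 0.4 and pv = 0.2. Covering more than one template would mean summing the probabilities of the distinct templates.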

~~~
jtmoulia
Complementing the thought on sentence structures: 5 word sentences could be
extracted from existing texts. Tagging them would give you the set of possible
sentence structures.

Hopefully there wouldn't be too much trouble with ambiguous word types, e.g.
"green" can be either an adjective or a noun.
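
A rough sketch of that idea, using a tiny hand-written lexicon and corpus in place of a real tagger and real texts (both are hypothetical stand-ins, as is the function name). To make the ambiguity visible, each word lists every tag it can take, and every consistent tag sequence is counted.

```python
from collections import Counter
from itertools import product

# Hypothetical lexicon: each word maps to every tag it can take.
LEXICON = {
    "the": ["det"], "a": ["det"],
    "green": ["adj", "noun"],  # ambiguous, as noted above
    "calico": ["adj"], "old": ["adj"],
    "cat": ["noun"], "dog": ["noun"], "breakfast": ["noun"], "bone": ["noun"],
    "ate": ["verb"], "found": ["verb"],
}

CORPUS = [
    "The calico cat ate breakfast.",
    "The old dog found a bone.",  # six words, so it is skipped
    "A green dog found breakfast.",
]

def five_word_structures(sentences, lexicon):
    """Count every tag sequence consistent with each five-word sentence."""
    structures = Counter()
    for sentence in sentences:
        words = sentence.lower().rstrip(".!?").split()
        if len(words) != 5 or any(w not in lexicon for w in words):
            continue
        for tags in product(*(lexicon[w] for w in words)):
            structures[tags] += 1
    return structures

structures = five_word_structures(CORPUS, LEXICON)
for tags, count in structures.most_common():
    print(count, "-".join(tags))
```

The ambiguous word makes one sentence contribute two candidate structures, which is the kind of overcounting a context-aware tagger would be needed to resolve.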

------
thejteam
It depends completely on whether you are looking for "valid" in a structural
sense or a semantic sense. If a structural sense, lutusp's second approach
looks very doable. There are a lot of cases to consider and it would most
likely mean pulling out an old grammar book that actually teaches these things,
but it could be approached systematically. It also would mean making some
assumptions about verb forms, i.e. I would only want to have to choose the verb
at random and not the form.

If "valid" in a semantic sense, I wouldn't have a real clue where to start.
The structural requirements give you an upper bound and a quick initial test.
Perhaps you could start with a small set of words and observe how the results
scale upwards?

------
brudgers
Randomly chosen from what? Or, to put it another way, does every word have the
same probability of selection, or is each word's probability weighted based
upon frequency?

If "crwth" is as likely as "is" then the likelihood of a meaningful sentence
is lower than if actual word frequencies come into consideration.

[http://en.wikipedia.org/wiki/Crwth](http://en.wikipedia.org/wiki/Crwth)
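
The two sampling schemes can be compared with a small sketch. The frequency counts below are invented for illustration and are not taken from any real corpus.

```python
import random

# Hypothetical frequency counts, for illustration only.
WORD_COUNTS = {"is": 10_000, "the": 20_000, "cat": 300, "sleeps": 50, "crwth": 1}

words = list(WORD_COUNTS)
weights = list(WORD_COUNTS.values())
rng = random.Random(0)  # fixed seed so runs are repeatable

# Every word equally likely.
uniform_draw = rng.choices(words, k=5)
# Each word drawn in proportion to its (made-up) frequency.
weighted_draw = rng.choices(words, weights=weights, k=5)

# Chance that a single draw is "crwth" under each scheme.
p_crwth_uniform = 1 / len(words)
p_crwth_weighted = WORD_COUNTS["crwth"] / sum(weights)

print(" ".join(uniform_draw))
print(" ".join(weighted_draw))
print(p_crwth_uniform, p_crwth_weighted)
```

Note that `random.choices` samples with replacement, so a word can repeat within one draw.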

------
chrismorgan
A simple way to try it, in Python:

    
    
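        >>> import random
        >>> # Path is system-dependent; this word list ships with many Unix systems.
        >>> words = open('/usr/share/dict/words').read().split()
        >>> # Keep all-lowercase entries, which drops proper nouns.
        >>> words = [w for w in words if w.islower()]
        >>> ' '.join(random.sample(words, 5))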

~~~
makeshifthoop
How would you know if the word combination is a valid sentence? NLTK?

~~~
chrismorgan
I just parsed them manually; I didn't come up with any syntactically valid
combinations in the first 100 attempts.
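
For a rough sense of what zero hits in 100 tries implies, here is a small sketch. It assumes the 100 samples were independent and that the manual judgments were accurate; the 95% confidence level is an arbitrary choice.

```python
trials = 100
confidence = 0.95

# Largest success probability p for which seeing zero successes in
# `trials` independent draws still has probability 1 - confidence:
# solve (1 - p) ** trials = 1 - confidence for p.
upper_bound = 1 - (1 - confidence) ** (1 / trials)

print(f"p is below about {upper_bound:.4f} at {confidence:.0%} confidence")
```

So 100 failures only show that the rate is probably under about 3%; pinning down a much smaller probability would take far more samples.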

