from collections import Counter
from math import log

corpus = """training corpus here"""
n = 1  # context size: letters of memory

def trans(w):
    # yield (context, next letter) pairs for a word or corpus
    return ((w[i:i+n], w[i+n]) for i in range(len(w) - n))

tokens = Counter(t for t, l in trans(corpus))
transitions = Counter(trans(corpus))

def score(w):
    # log-likelihood with add-one smoothing over a 26-letter alphabet
    return sum(log(transitions[t, l] + 1) - log(tokens[t] + 26**n)
               for t, l in trans(w))

for w in [" llmyw ", " domyh ", " tretz ", " qenis ", " debts "]:
    print w, ': ', score(w)
llmyw : -25.9316238289
domyh : -23.311005645
tretz : -21.4220068707
qenis : -20.7421233042
debts : -15.3287127006
The variable `n` sets the number of letters of memory the Markov chain has. If you set `n` higher, you need a bigger corpus. If you set n=2, use a larger corpus, and preprocess it to filter out anything that isn't [a-z]+, it would probably work fairly well (I just copied the article in as-is).
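A minimal sketch of that preprocessing step (the function name and the space-padding convention are my own; padding with spaces keeps word boundaries as contexts, matching the padded test words above):

```python
import re

def preprocess(text):
    # lowercase first, then keep only runs of a-z
    words = re.findall(r"[a-z]+", text.lower())
    # join and pad with spaces so word boundaries still count as transitions
    return " " + " ".join(words) + " "

print(preprocess("Hello, World! 42"))  # -> " hello world "
```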
Lli is a variant of llif - life.
Ll is a letter - [phlegm production noise].
Myw is a mutation of byw - tide.
So in Welsh it's most likely as hard to pronounce as "ktide" in English. Maybe he's using a multilingual dictionary.
I discovered that pool.com maintains a list of domains that are set to expire. I download and filter the list and then email myself a list of domains that match my requirements (.coms under a certain length, no numbers or other funny characters, maybe .coms with a specific word in them). I actually just wrote this script, it had been on my to-do list for over a year. The daily email contains hundreds of domains so I might have to filter it more.
Here's my script; it only uses PHP to get tomorrow's date, otherwise it's standard Linux utilities like wget, egrep, unzip, cut, sed...
I have it set up as a daily cron job.
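For reference, the filtering step could be sketched in Python like this. The regex, the length limit, and the optional keyword are assumptions standing in for "under a certain length, no numbers or other funny characters, maybe a specific word", not the author's actual rules:

```python
import re

def filter_domains(lines, max_len=8, must_contain=None):
    """Keep short .com domains made of plain letters only.

    max_len counts the part before .com; must_contain optionally
    requires a specific word somewhere in the name.
    """
    pattern = re.compile(r"^[a-z]{1,%d}\.com$" % max_len)
    for d in (line.strip().lower() for line in lines):
        if pattern.match(d) and (must_contain is None or must_contain in d):
            yield d

candidates = ["Example.com", "my-site.com", "toolongdomainname.com", "app.net", "cats.com"]
print(list(filter_domains(candidates)))  # -> ['example.com', 'cats.com']
```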
I'm interested in suggestions on how to snipe/reserve/etc domains as soon as they become available.
It includes member auctions etc.
I'd guess other services have similar?
I have a somewhat off-topic question. How do you get access to these domain names?
In terms of obtaining information, most whois queries can be performed via command-line utilities. To start you off, here is a good list of whois servers (http://code.google.com/p/whois-servers-list/). Finally, check out each service; some will allow queries that return true or false for whether a domain is registered, and you can generally make a lot more of these requests than complete lookups (without being IP blacklisted).
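Raw whois is just a TCP conversation on port 43: send the domain, read the reply until the server closes the connection. A minimal sketch (the default server and the "No match" heuristic are assumptions; both vary by registry):

```python
import socket

def whois(domain, server="whois.verisign-grs.com", port=43):
    # open a TCP connection to the whois server and send the query
    s = socket.create_connection((server, port), timeout=10)
    try:
        s.sendall((domain + "\r\n").encode("ascii"))
        chunks = []
        while True:
            data = s.recv(4096)
            if not data:  # server closes when the reply is complete
                break
            chunks.append(data)
        return b"".join(chunks).decode("ascii", "replace")
    finally:
        s.close()

def looks_unregistered(reply):
    # heuristic only: the .com/.net thin whois answers "No match for ..."
    return "No match for" in reply
```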
Finally, in terms of building and managing an index, I believe manual crawling is the only option available; start with dictionary terms and work outward.
Edit: Read this as well - http://www.dotweekly.com/pending-delete-domain-name-drop-lis...
import urllib2

domain_file_url = "http://www.odditysoftware.com/download/dldoms.php?domdate="

def fetch_domains():
    return urllib2.urlopen(domain_file_url + today()).read()
http://instantdomainsearch.com has this same problem when deciding if a domain is actually available or not.