

Clojure Unsupervised Part-Of-Speech Tagger Explained - aria42
http://aria42.com/blog/?p=48

======
alextp
That's a very thorough description. It's always great to see laymen-readable
expositions of research.

I find it nice that a purely functional language can handle this sort of
problem (state management) more cleanly than what I usually do with my
Python-based samplers.

------
gjm11
Very nice. It would be considerably enhanced by some examples of how the
algorithm performs on real data.

(I wonder how it does if you feed it not words from a natural language, but
tokens from a programming language. Can this sort of technique be adapted to
infer the whole grammar? I guess that would be more difficult -- you're trying
to learn a much more complicated sort of structure.)

~~~
alextp
If you feed it words and appropriate features from a programming language you
should get something close to a tokenizer. Laura Dietz
( <http://www.mpi-inf.mpg.de/~dietz/index.html#publications> ) has some work
that applies similar techniques to programming languages to find bugs.

------
jaekwon
Very cool. I haven't read it through yet, but I'm curious: what's your
motivation for this research?

~~~
alextp
It is still an open problem to do unsupervised or mostly-unsupervised
part-of-speech tagging that performs as well as (or close to) the usual
supervised models. For English, in the standard domains for which there are
annotated corpora, this is not necessary, but if you want to apply tagging to a
completely different domain or a resource-impoverished language, this sort of
technique is necessary.

This is relevant because it performs as well as (or better than)
state-of-the-art methods while being faster and simpler.

