

Syntactic: A lexical categorizer with a pretty visualization - omershapira
http://syntactic.omershapira.com/

======
bravura
It seems like your cluster quality will be sensitive to the words used to seed
each cluster.

Why not use a standard word clustering algorithm like Brown clustering?
<http://acl.ldc.upenn.edu/J/J92/J92-4003.pdf>

Percy Liang wrote a great implementation in C++ that you could plug into your
visualization: <http://cs.stanford.edu/~pliang/software/>

Also of interest is that Brown clustering is hierarchical, so you can get
coarse or fine-grained clustering.
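
[Aside: a minimal sketch of what "coarse or fine-grained" can look like in practice. Brown clustering assigns each word a bit string that encodes its path in a binary merge tree, so truncating the bit strings to a shorter prefix merges nearby clusters. The `coarsen` helper and the (bit string, word) pairs below are made up for illustration; they are not output from any real run or from the linked implementation.]

```python
from collections import defaultdict


def coarsen(paths, prefix_len):
    """Group words by the first `prefix_len` bits of their cluster bit string."""
    groups = defaultdict(list)
    for bits, word in paths:
        groups[bits[:prefix_len]].append(word)
    return dict(groups)


# Hypothetical (bit string, word) pairs, invented for this example.
paths = [
    ("0010", "husband"),
    ("0011", "wife"),
    ("0100", "first"),
    ("0101", "second"),
    ("1100", "run"),
    ("1101", "walk"),
]

fine = coarsen(paths, 4)    # full bit strings: one cluster per leaf
coarse = coarsen(paths, 2)  # 2-bit prefixes: sibling leaves are merged
```

With these toy pairs, the 4-bit grouping keeps every word in its own cluster, while the 2-bit grouping puts "husband" and "wife" together under the shared prefix "00".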

[Aside: Here are some 2-d visualizations I made of word embeddings from a
neural language model: <http://metaoptimize.com/projects/wordreprs/> ]

~~~
omershapira
I'll definitely look into the Brown clustering. I used this one with the hope
of being able to eventually run spectral clustering on this (main problem: the
KLIC defines a pre-metric, not a proper metric, so the order of operations is
crucial).
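
[Aside: a small sketch of why the order matters, assuming "KLIC" here means the Kullback-Leibler divergence between two discrete distributions. The function name and the two toy distributions are illustrative only and are not taken from the Syntactic code.]

```python
import math


def kl_divergence(p, q):
    """D(p || q) for discrete distributions given as equal-length lists.

    Assumes q[i] > 0 wherever p[i] > 0; terms with p[i] == 0 contribute 0.
    """
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


# Two toy distributions over the same two contexts.
p = [0.9, 0.1]
q = [0.5, 0.5]

d_pq = kl_divergence(p, q)  # "distance" from p to q
d_qp = kl_divergence(q, p)  # "distance" from q to p, a different number
```

Here `d_pq` comes out near 0.368 and `d_qp` near 0.511, so a symmetric affinity matrix (which spectral clustering usually expects) would need some extra symmetrization step.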

How does the Brown clustering method guarantee less sensitivity to the seeding
words?

Your visualization is interesting; I'd like to be able to navigate it in order
to overcome most of the clutter. Originally I thought of mapping the clusters in
3d "clouds", but I think the dataset is too large for making a dimension-based
visualization more than recreational - I mean, I'd probably be happier to read
a cluster as a list.

------
JunkDNA
Would love to see what clusters from PubMed would look like. Anyone planning
to run this on it?

~~~
skram
Agreed. Would love to see this too.

IIRC they have a pretty easy to use API but as far as data dumps (according to
a quick search and <http://www.nlm.nih.gov/bsd/sample_records_avail.html>) it
appears they only provide XML whereas the current code requires XML.

~~~
dbaupp
_> it appears they only provide XML whereas the current code requires XML_

Is there a typo here, or am I just reading this wrong?

~~~
skram
typo -- second instance of "XML" should be ".txt"

------
danso
First of all, great work and thanks for sharing!

I guess I know less about NLP and clustering than I thought, but what exactly
does the visualization indicate?

On Iteration 1/3, when I click "husband" on the sidebar and "first" shows
up...what does that mean? That that's the closest cluster by distance?

The visualization looks nice but the accompanying text doesn't shed much
light...

~~~
omershapira
There's a 'help' button in the top-right corner in case you get lost - but I
guess I should curb my minimalism and make it larger.

The visualization is meant to visually explain the 'distance'. If the chosen
word (top scope) falls down nicely on the target cluster (if any lit square on
the top lights up on the bottom), then the word should be close. Note that it
doesn't work the other way around (more about that in the 'The KLIC' section).

At the bottom of the screen there are two text displays - the left one shows
the closest cluster with its contents and the right one shows any cluster you
choose. Note that the scopes change according to the selection.

------
username3
need horizontal scrollbar

