
Wink-lemmatizer - petethomas
http://winkjs.org/wink-lemmatizer/
======
vecplane
Here's a link to the GitHub repo (AGPL 3.0):
[https://github.com/winkjs/wink-lemmatizer](https://github.com/winkjs/wink-lemmatizer)

Very cool! It looks like this is only for English at the moment, but it would
be cool to see this for other languages as well.

------
raldi
Link was originally [https://nlp.stanford.edu/IR-
book/html/htmledition/stemming-a...](https://nlp.stanford.edu/IR-
book/html/htmledition/stemming-and-lemmatization-1.html) ... I don't mind that
the mods changed it; I just wanted there to be a record, because otherwise
some of the comments here don't make sense.

------
sanjayaksaxena1
Hey! I am Sanjaya, the lead developer of wink-lemmatizer and the rest of the
wink family. We are trying to break down NLP and ML into atomic building
blocks. I am happy to answer any questions and gather feedback about wink,
which lives at [https://github.com/winkjs](https://github.com/winkjs) :)

------
amelius
Nice. Any plans for the following features?

\- instead of calling the API using "noun", "verb", etc, let the library
figure out the type of word.

\- let the library return information about the word, e.g. "coolest" ->
"cool"+superlative.

~~~
moarcoffee
“cool” is both a noun and a verb. How would it handle that?

~~~
amelius
Return a list of results?
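
To make the idea concrete, here is a minimal sketch of what "return a list of results" could look like. Everything in it is hypothetical: the `Analysis` record, the `TOY_LEXICON` table, and the `analyze` function are illustration only and are not part of wink-lemmatizer's API.

```python
from typing import List, NamedTuple


class Analysis(NamedTuple):
    """One possible reading of a surface form (hypothetical structure)."""
    lemma: str
    pos: str      # part of speech
    feature: str  # inflectional information, e.g. "superlative"


# Hypothetical toy lexicon: a real library would derive these readings
# from rules plus a dictionary rather than a hand-written table.
TOY_LEXICON = {
    "coolest": [Analysis("cool", "adjective", "superlative")],
    "cools": [
        Analysis("cool", "verb", "third-person singular"),
        Analysis("cool", "noun", "plural"),
    ],
    "cool": [
        Analysis("cool", "adjective", "base"),
        Analysis("cool", "verb", "base"),
        Analysis("cool", "noun", "singular"),
    ],
}


def analyze(word: str) -> List[Analysis]:
    """Return every known reading of `word`; an empty list if unknown."""
    return list(TOY_LEXICON.get(word.lower(), []))


for reading in analyze("cools"):
    print(reading.lemma, reading.pos, reading.feature)
```

The caller then decides what to do with ambiguity: take the first reading, filter by a part of speech it already knows, or pass the whole list to a tagger.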

------
raldi
From the Exercises section:

 _> The stemming for ponies and pony might seem strange._

What's that all about?

~~~
djur
The stem for 'ponies' in the earlier figure was 'poni'. That means that 'pony'
has to be stemmed as 'poni' in order to correctly retrieve 'pony' when
searching for 'ponies' and vice versa.
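
A rough sketch of why both words land on 'poni', assuming the figure uses the Porter stemmer's plural rules (SSES -> SS, IES -> I, SS -> SS, S -> nothing) followed by its "Y -> I when the stem contains a vowel" rule. This covers only those two steps, not the full algorithm, and the function names are my own.

```python
VOWELS = set("aeiou")


def has_vowel(stem: str) -> bool:
    # Simplification: the full Porter definition also counts 'y' as a
    # vowel when it follows a consonant. That is ignored here.
    return any(ch in VOWELS for ch in stem)


def step_1a(word: str) -> str:
    """Plural rules, tried in order; the first matching suffix wins."""
    if word.endswith("sses"):
        return word[:-2]   # caresses -> caress
    if word.endswith("ies"):
        return word[:-2]   # ponies -> poni
    if word.endswith("ss"):
        return word        # caress -> caress
    if word.endswith("s"):
        return word[:-1]   # cats -> cat
    return word


def step_1c(word: str) -> str:
    """Turn a final 'y' into 'i' when the rest of the word has a vowel."""
    if word.endswith("y") and has_vowel(word[:-1]):
        return word[:-1] + "i"   # pony -> poni
    return word


def partial_stem(word: str) -> str:
    return step_1c(step_1a(word.lower()))


print(partial_stem("ponies"), partial_stem("pony"))  # poni poni
```

The stem is not meant to be a real word. It only has to be the same string for every form you want to match at query time.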

~~~
raldi
The next question was,

 _Does it have a deleterious effect on retrieval? Why or why not?_

Any thoughts?

------
ccleve
Ok, so how does it work? Which stemming algorithm?

It's not helpful unless we know what it's doing under the covers.

~~~
repsilat
I disagree. For use as a tool, for some use-cases, if it "works as
advertised," all we really need to know is whether it's efficient enough for
our purposes.

One (horrible) use for something like this is in a framework like Rails, where
there's a cultural acceptance of making method and field names by pluralising
and conjugating etc.

In Rails if your `House` model has a one-to-many relationship with your
`Mouse` model, a `house` object probably automatically gets a `.mice` field.
That sort of thing can be done with extensive rule sets, or it can be done
with a black-box library.
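
For illustration, the "rule set" approach usually boils down to an exceptions table checked first, then ordered suffix rules. The sketch below is a toy with made-up rules and names; it is not Rails' actual inflector.

```python
import re

# Irregular forms are looked up before any rule is tried.
IRREGULAR_PLURALS = {
    "mouse": "mice",
    "person": "people",
    "child": "children",
}

# Ordered (pattern, replacement) pairs; the first match wins.
SUFFIX_RULES = [
    (re.compile(r"(s|x|z|ch|sh)$"), r"\1es"),   # box -> boxes
    (re.compile(r"([^aeiou])y$"), r"\1ies"),    # pony -> ponies
]


def pluralize(noun: str) -> str:
    """Toy English pluralizer: exceptions first, then rules, then '+s'."""
    noun = noun.lower()
    if noun in IRREGULAR_PLURALS:
        return IRREGULAR_PLURALS[noun]
    for pattern, replacement in SUFFIX_RULES:
        if pattern.search(noun):
            return pattern.sub(replacement, noun)
    return noun + "s"


print(pluralize("house"), pluralize("mouse"))  # houses mice
```

From the caller's side it looks the same whether `pluralize` is backed by a table like this or by a black box, which is the point about only needing to know its "dimensions".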

Of course it's a horrible use-case, and you'll probably always need to deal
with ambiguity and context, but for that sort of thing the implementation
details aren't nearly as relevant as "the dimensions of the black box": how
quickly does it run, how quickly does it start up, how much memory does it
use, how good at its job is it, and which languages does/can it be made to
support?

------
saagarjha
It would be really cool if they had a live demo to try it out on their site.

~~~
tomjakubowski
Here you go. This works for many npm packages.
[https://npm.runkit.com/wink-lemmatizer](https://npm.runkit.com/wink-lemmatizer)

------
slx26
Any plans for multi-language support?

~~~
EmilStenstrom
This is really needed in the NLP world. Most things are English-only.

~~~
afsina
Good luck with Turkish. A very large lookup table may work for 90% of cases,
but it will still fail on the rest. A basic "true" lemmatizer requires either
a complex graph with rules or an FST (finite-state transducer) generated from
it. The input is searched through the graph, and a morphological
disambiguator must then be applied to the result to pick the correct lemma.

~~~
DFHippie
Or Welsh. You want a canonical form for "wnaethpwyd"? Try "gwneud"! Some
regexes and an exceptions list aren't going to cut it.

The more a language needs a lemmatizer for NLP, the harder it is to write one.

~~~
tomjakubowski
Could someone explain what's hard in particular about canonical forms in
Welsh?

