
Our machine learning and NLP journey - prabhatjha
https://www.wootric.com/blog/our-machine-learning-journey-from-zero-to-customer-value-in-12-months/
======
Radim
A big factor in producing a good analysis is the feedback modality -- chat
transcripts are different from emails, which are different from web forms or
operator notes.

We've had several "customer feedback / intent / support case analysis"
projects in the past. Some for large customers with millions of individual
records (Autodesk), where there's the additional challenge of "What should the
categories be in the first place? What's in the data?" (discovery).

What we learned is that a model trained on one type of feedback will not
necessarily perform well on others, because the relevant signals manifest
differently across modalities: feedback length / writing style / typos,
lexical richness / repetition / boilerplate, OCR noise / how long the long
tail is… Your model may learn to pick up on cues that are orthogonal to the
sentiment or categorization problem.

This is especially true for black box models (deep learning) where
introspection is limited: Did the model learn to rely on syntax? Specific
words or character ngrams? Exclamation marks? Something else? Does an Indian-
looking name imply sentiment negativity?

Slapping a generic ML technique (Stanford NLP, Naive Bayes, bi-LSTM, whatever)
onto a bunch of tokens is a reasonable first step -- that's the low-hanging
fruit. The tricky part is defining the problem space and the QA process
correctly, and managing the devil that comes with the details.

~~~
prabhatjha
I totally agree with this. We learned pretty quickly that classification does
not generalize across domains, so we narrowed the problem space by focusing
on one domain at a time with a predefined, fixed set of categories, so that
we could measure the effectiveness of our solution as we experimented with
different algorithms and deployment pipelines.

------
diegoserranoa
I always see all these articles, services and products offering NLP for
English. I wonder how this works with other languages that have a different
structure, e.g. Japanese, Arabic, etc. It would also be interesting to see how
these algorithms behave when considering cultural aspects: one word or
expression may have a different meaning in different places. How would the
system handle something like "Your service is the sh*t!"? Is that positive?
Negative? There's probably info on this subject all over the internet already,
haha, very interesting though...

~~~
Eridrus
The main thing I've seen to be different across languages is
tokenization/lemmatization.

Something as simple as wanting to split a sentence into words can be
difficult, e.g. you may want to be able to split German compound nouns into
their component words, and to do that you need a model (or list) of nouns so
that you can identify these. Or if you're doing Chinese which doesn't have
spaces.
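
The noun-list approach can be sketched as a toy recursive longest-match
splitter (entirely my own illustration -- the tiny lexicon and example words
are stand-ins; a real system needs a large noun lexicon and handles linking
elements like "-s-"):

```python
# Toy German compound splitter: greedily try the longest dictionary word
# as a prefix, then recurse on the remainder. The lexicon is a stand-in.
NOUNS = {"kranken", "versicherung", "haus", "tuer", "schluessel"}

def split_compound(word, lexicon=NOUNS):
    """Return the component words of a compound, or None if no split exists."""
    word = word.lower()
    if word in lexicon:
        return [word]
    # try every split point, preferring the longest first component
    for i in range(len(word) - 1, 0, -1):
        head, tail = word[:i], word[i:]
        if head in lexicon:
            rest = split_compound(tail, lexicon)
            if rest is not None:
                return [head] + rest
    return None

print(split_compound("Krankenversicherung"))  # ['kranken', 'versicherung']
print(split_compound("Haustuerschluessel"))   # ['haus', 'tuer', 'schluessel']
```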

A bunch of the deep learning work starts from characters for this reason --
you get to avoid that messy step. Though in Chinese, characters may not quite
be the best representation either; maybe you want to break down characters
into their component radicals (I don't actually know this, I don't work on
Chinese NLP, and have not run this theory past any Chinese speakers).

But if you're not just throwing everything in a Char-LSTM, you may want to do
things like lemmatization so that you can generalize across different forms of
a word, or maybe you want to use lemmatization info to inform your
tokenization, so that you don't lose the form info.
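
As a toy illustration of that lemmatization step (my own sketch with a few
made-up suffix rules -- real systems use dictionary-backed lemmatizers, not
rules like these):

```python
# Crude rule-based "lemmatizer": strip common English suffixes so that
# inflected forms collapse onto one shared form for a downstream model.
SUFFIX_RULES = [("ies", "y"), ("ing", ""), ("ed", ""), ("s", "")]

def crude_lemma(token):
    for suffix, repl in SUFFIX_RULES:
        # length guard avoids mangling short words like "sing" or "bus"
        if token.endswith(suffix) and len(token) > len(suffix) + 2:
            return token[: -len(suffix)] + repl
    return token

print([crude_lemma(t) for t in ["walks", "walked", "walking", "cities"]])
# ['walk', 'walk', 'walk', 'city']
```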

But, really, one big advantage of Neural Nets is that you don't need to do
this, that you can just get a big pile of labels via MTurk/users and train on
that without really needing to understand the language you're working on very
deeply.

~~~
titanix2
> maybe you want to break down characters into their component radicals (I
> don't actually know this, I don't work on Chinese NLP, and have not run this
> theory past any chinese speakers)

No, you don't, if what you want to process is text. You're right, however,
that a big problem is the segmentation that must happen before any processing,
which cannot be done 100% correctly by software. Thus, errors compound down
the chain.

~~~
Eridrus
Not sure why you're shitting on this idea; this does actually get done:

[https://nlp.stanford.edu/courses/cs224n/2012/reports/report....](https://nlp.stanford.edu/courses/cs224n/2012/reports/report.pdf)

[http://sentic.net/radical-embeddings-for-chinese-sentiment-a...](http://sentic.net/radical-embeddings-for-chinese-sentiment-analysis.pdf)

[https://github.com/nieldlr/hanzi](https://github.com/nieldlr/hanzi)

~~~
titanix2
I'm not "shitting" on the idea. I gave you an informed opinion as someone with
Japanese & Mandarin knowledge working in NLP research.

Did you read more than the title of the paper you linked? Because the Stanford
paper states:

" _Results and Discussion_ We consistently observed a decrease in performance
(i.e. increased for perplexity) with radicals as compared to baseline, in
contrast to a significant increase in performance with part-of-speech tags.
[...] Such a robust trend indicates that radicals are likely not actually very
useful features in language modeling"

For most tasks, you won't get more information on a word by looking at its
character decomposition, in the same way that the individual letters of a
lemma won't help you with the task.

There exist use cases, however. It is useful when building dictionaries for
human beings (for search, for example -- I just put such a tool online
yesterday) and when trying to automatically guess the reading of a character.

~~~
Eridrus
Arbitrarily saying "No you don't" isn't indicative of an informed opinion.

I haven't really dug into these papers, though the Stanford paper does say
"This conclusion is consistent with results from part-of-speech tagging
experiments, where we found that radicals of previous word are not a helpful
feature, although the radical of the current word is.", whereas the quote you
pulled out has to do with language modeling.

Though I wouldn't consider a single negative result, from before the deep
learning trend took off, necessarily indicative of the value.

The more recent paper, on the other hand, sees a positive boost from their
"hierarchical radical embeddings" vs traditional word or character embeddings
for 4 classification tasks. Not that this is necessarily meaningful either.

In my mind, the usefulness of this would be, not that you would get new
information, per se, but that you could generalize some amount of knowledge to
rare/out of vocabulary words.

Since you work in the field though, do you have any pointers to good papers on
Chinese NLP?

~~~
titanix2
I don't have generic good pointers but a few interesting things I read or
downloaded:

\-
[https://aclanthology.info/pdf/I/I05/I05-7002.pdf](https://aclanthology.info/pdf/I/I05/I05-7002.pdf)
This paper makes use of the radicals to build an ontology, but it does so with
a stunning amount of depth (historical context, variants, etc.) that most
works overlook. Too bad no data is available.

\-
[http://www.persee.fr/doc/clao_0153-3320_1978_num_4_1_1047](http://www.persee.fr/doc/clao_0153-3320_1978_num_4_1_1047)
Very interesting read on the formation of Chinese-like characters by the
Vietnamese. Some techniques described were also used by the Japanese when
adopting sinograms.

\- didn't read the paper, but the references section lists a number of papers
about the segmentation of Mandarin:
[http://www.anthology.aclweb.org/F/F12/F12-3001.pdf](http://www.anthology.aclweb.org/F/F12/F12-3001.pdf)

\- didn't read it yet, but it seems to contain accurate information on the
Chinese writing system:
[http://learnlab.org/uploads/mypslc/publications/perfetti-lex...](http://learnlab.org/uploads/mypslc/publications/perfetti-lexicalconstituencymodel.pdf)

Anyway, I think that to get a fair understanding of the writing system,
learning about 600 characters in either Chinese or Japanese, plus the basics
of the chosen language, is required.

------
alexbeloi
Have you considered using this for analyzing feedback for politicians? They
have similar pain points in understanding which feedback from constituents
reflects a general problem vs. an isolated concern. Maybe through Twitter data
(as a PoC) and then actual emails from constituents.

~~~
prabhatjha
That's a great idea. Once we have training data for a new category it does not
take us a long time to create a ML model for it. Are you aware of any such
corpus? ;-)

~~~
alexbeloi
>Are you aware of any such corpus?

I am not aware of a corpus for political sentiment specifically. There's a
general twitter sentiment dataset[0], the link appears to be broken but it's
what everyone cites, not sure why it's down.

This paper[1] uses tweets and the emoticons in them as a soft label for
sentiment; there are obvious issues with that, but it's a cheap way to get
lots of noisy labels.

[0]: [http://www.sananalytics.com/lab/twitter-sentiment/](http://www.sananalytics.com/lab/twitter-sentiment/)

[1]:
[https://www.aaai.org/ocs/index.php/SSS/SSS13/paper/download/...](https://www.aaai.org/ocs/index.php/SSS/SSS13/paper/download/5702/5909)
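
The emoticon-as-soft-label trick from [1] can be sketched roughly like this
(my own minimal version -- the regexes and example tweets are illustrative,
and a real pipeline would cover far more emoticon variants):

```python
# Distant supervision: treat the emoticon as a noisy sentiment label,
# then strip it from the text so a model can't just memorize the cue.
import re

POS = re.compile(r"[:;]-?[)D]")   # :) ;) :D :-) etc.
NEG = re.compile(r":-?[(]")       # :( :-(

def weak_label(tweet):
    """Return (cleaned_text, label), or None if no emoticon is found."""
    if POS.search(tweet):
        label = "positive"
    elif NEG.search(tweet):
        label = "negative"
    else:
        return None
    cleaned = NEG.sub("", POS.sub("", tweet)).strip()
    return cleaned, label

print(weak_label("love this new phone :)"))   # ('love this new phone', 'positive')
print(weak_label("flight delayed again :("))  # ('flight delayed again', 'negative')
```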

~~~
prabhatjha
Ah -- I thought you were talking about classification of these tweets so that
politicians know what their followers are talking about. Sentiment analysis is
a very small part of what we do, and as you said, there are tons of examples
on the web that use Twitter data in their models.

~~~
alexbeloi
>I thought you were talking about classification of these tweets so that
politicians know what their followers are talking about. Sentiment analysis is
a very small part of what we do and as you said there are tons of examples on
web that use Twitter's data in their model.

I was. I was guessing that you had a general topic/clause segmentation model
+ sentiment analysis. It sounds like you're saying the topic/clause (or issue)
segmentation model is pretty domain specific, so what new datasets you'd need
to build for political issues is beyond me, but I think it'd be well worth it.
Connecting politicians and constituents is a pretty universal need.

~~~
prabhatjha
Got it. Yes, topic classification does not generalize across domains, based on
what we see in our implementation.

------
gmonfort77
Interesting article. How do you guys cope with badly written feedback or
feedback that just doesn't make sense? I guess this type of feedback could
"pollute" your algorithms if you constantly use unverified feedback as
training data?

~~~
rsmith49
Unfortunately, this is a very common occurrence in NLP applications. Our first
step to combat this is to perform a spellcheck when preprocessing all of our
data. Next, some of the algorithms we employ only look for the presence of
words in feedback, not necessarily grammatical correctness. So, if we get
something like "food good, love love love", we will still be able to recognize
that the feedback is referring positively to food quality, and our ensemble
prediction will reflect this.
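
As a rough sketch of why presence-based features survive broken grammar (my
own toy example, not Wootric's actual pipeline -- the word lists are made up):

```python
# Binary "word presence" features: a grammatically mangled comment
# produces the same features as a well-formed sentence with the same words.
import re

FOOD_TOPIC = {"food", "meal", "dish"}
POSITIVE_WORDS = {"good", "great", "love", "tasty", "delicious"}

def presence_features(text):
    tokens = set(re.findall(r"[a-z']+", text.lower()))
    return {
        "mentions_food": bool(tokens & FOOD_TOPIC),
        "positive_word": bool(tokens & POSITIVE_WORDS),
    }

print(presence_features("food good, love love love"))        # both flags True
print(presence_features("The food was really good today."))  # both flags True
```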

Despite this, we still run into some feedback that is complete gibberish, or
does not refer to anything. Fortunately, since this is a multi-label
classification problem, it is possible for us to classify such feedback as
having no tag associated with it. Including some of these samples in our
training data therefore fortifies our engine against any meaningless live data
that may come in, which we can then classify as having no tag.
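
A minimal sketch of the "no tag" idea (mine, with made-up tags, scores, and
threshold -- the real system's models will differ):

```python
# Multi-label tagging: each tag gets an independent score, and feedback
# that clears no threshold simply receives the empty tag set.
def assign_tags(scores, threshold=0.5):
    """scores: dict of tag -> probability from per-tag classifiers."""
    return {tag for tag, p in scores.items() if p >= threshold}

print(assign_tags({"food": 0.91, "service": 0.62, "price": 0.10}))
# {'food', 'service'}
print(assign_tags({"food": 0.05, "service": 0.08, "price": 0.02}))
# set() -- gibberish feedback gets no tag
```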

In our upcoming blog about our "human in the loop" machine learning system, we
also address how we can manually filter samples of data to make our training
more efficient.

------
sharkenstein
When you talk about discerning between algorithms from Google, Stanford,
etc., what are the criteria for doing that? Does it change based on the
domain? If you are just trying to classify feedback, how much does the domain
affect the algorithm?

~~~
rsmith49
Our criteria mainly depended on internal testing to see which pre-packaged
algorithms performed better on our data. While performance does vary with the
domain (or in our case, the industry of the feedback we are analyzing), we
have found it more efficient to evaluate the overall effectiveness of one
platform and deploy it for all of our analysis.

As far as the domain affecting the algorithm, it can vary: some algorithms
maintain decent performance over most industries, while others work very well
for some industries and terribly for others. Although it is all just feedback,
the topic of the feedback and even the ways people talk about the same topic
(such as the price of the product) vary across each industry.

