
Automatic categorization of text is a core tool now - awinter-py
https://abe-winter.github.io/2019/01/01/nlp-18.html
======
stuartaxelowen
What changed was pretrained word embeddings like fastText and GloVe. These
embeddings drastically reduce the complexity of text classification problems
and greatly reduce the amount of data you need to produce a high-quality text
classifier. In that two-year span, understanding of how to leverage them
spread far enough to reach applications like this.
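The basic trick really is tiny. A toy sketch of averaging pretrained vectors into document features and classifying by nearest centroid (random vectors stand in for real GloVe/fastText embeddings, and the texts and labels are made up):

```python
import numpy as np

# Toy stand-ins for pretrained vectors; in practice you'd load GloVe or
# fastText vectors keyed by word.
rng = np.random.default_rng(0)
VOCAB = ["refund", "broken", "terrible", "love", "great", "works"]
EMB = {w: rng.normal(size=50) for w in VOCAB}

def embed(text):
    """Average the vectors of known words (a simple bag-of-embeddings)."""
    vecs = [EMB[w] for w in text.lower().split() if w in EMB]
    return np.mean(vecs, axis=0) if vecs else np.zeros(50)

# Nearest-centroid classifier: one centroid per class, built from a
# handful of labeled examples.
classes = {
    "complaint": ["broken refund terrible", "terrible broken"],
    "praise": ["love great works", "great love"],
}
centroids = {label: np.mean([embed(t) for t in texts], axis=0)
             for label, texts in classes.items()}

def classify(text):
    v = embed(text)
    return min(centroids, key=lambda c: np.linalg.norm(v - centroids[c]))
```

With real pretrained vectors, unseen-but-similar words land near the right centroid, which is exactly why so little labeled data is needed.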

The world has been sleeping on leveraging NLP, given the incredible amount
corporations spend paying people to read and summarize text (answering
questions like "How many people are having a bad search experience?", etc).
The recent leaps in NLP make for many opportunities, and many products are
coming to market that allow even non-technical folks to train and utilize text
classifiers. If you're thinking about getting into NLP, now is definitely the
time to do it.

Source: founded taggit.io

------
chime
> I was fascinated and disturbed by the duplex conversation agent demo G
> posted on their blog this summer.

Duplex gave me shivers. Enough that I ended up wondering what would happen if
I logically extrapolated from that:
[http://chir.ag/201812180030](http://chir.ag/201812180030)

The thing that worries me is not strong AI or evil-AI but rather the selective
use of AI by humans who choose to create/remove barriers that were unthinkable
just a few years ago. AI doesn't need to actually be intelligent, just to
pass itself off as close enough to an average stranger.

~~~
dsfyu404ed
>Duplex gave me shivers. Enough that I ended up wondering what would happen if
I logically extrapolated from that:
[http://chir.ag/201812180030](http://chir.ag/201812180030)

The scary part is that a large subset of people here will read that as an
instruction manual and not a warning.

------
paraschopra
Just recently I used a pre-trained language model to generate philosophy in
the style of my favourite philosophers. What impressed me was that, given my
small dataset of 5k quotes, the model ended up generating some really good
quotes. Plus I trained it in just a couple of hours, not knowing much about
NLP beforehand. Here’s my tutorial
[https://towardsdatascience.com/generating-new-ideas-for-
mach...](https://towardsdatascience.com/generating-new-ideas-for-machine-
learning-projects-through-machine-learning-ce3fee50ec2)

~~~
srean
Curious if have you tried more traditional generative models on these data
sets. It seems a even a higher order Markov or a higher order hidden Markov
would perform comparably. Stochastic grammars that allow recursion would be
even better.

With a small corpus these would run into similar difficulties but would likely
perform much better than RNNs/CNNs trained on the same corpus. Regardless of
whether you use NNs or something else, your approach is the right one: start
with a model trained on a larger but similar corpus and then use the smaller
one to modify the transition probabilities.

Leaving out the "Deep" stuff will shed popular clicks on your post, probably a
lot, but would get the job done with less cumulative effort. Important if for
some reason you cannot use third-party models. The stochastic grammars I
mentioned are a lot less fiddly to train than __* to vec.
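To give a sense of how little machinery this needs, here's a toy higher-order Markov generator (the corpus and seed below are made up for illustration):

```python
import random
from collections import defaultdict

def train_markov(tokens, order=2):
    """Map each length-`order` state to the tokens observed after it."""
    model = defaultdict(list)
    for i in range(len(tokens) - order):
        model[tuple(tokens[i:i + order])].append(tokens[i + order])
    return model

def generate(model, seed, length=10, rng=None):
    """Extend `seed` by sampling successors of the trailing state."""
    rng = rng or random.Random(0)
    out = list(seed)
    for _ in range(length):
        successors = model.get(tuple(out[-len(seed):]), [])
        if not successors:
            break
        out.append(rng.choice(successors))
    return out
```

Adapting a model trained on a bigger corpus, as suggested above, would amount to mixing its transition table with one built from the small corpus rather than training from scratch.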

~~~
paraschopra
I agree that an n-gram approach coupled with beam search might also give
similar results. I did that project for fun and to learn about language
modeling.
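For anyone curious, beam search over bigram probabilities is itself only a handful of lines (hypothetical toy corpus; a real setup would want higher orders and smoothing):

```python
import math
from collections import Counter, defaultdict

def bigram_logprobs(tokens):
    """Maximum-likelihood bigram log-probabilities from a token list."""
    counts = defaultdict(Counter)
    for a, b in zip(tokens, tokens[1:]):
        counts[a][b] += 1
    return {a: {b: math.log(n / sum(c.values())) for b, n in c.items()}
            for a, c in counts.items()}

def beam_search(logprobs, start, length=4, beam_width=3):
    """Keep the `beam_width` highest-scoring continuations each step."""
    beams = [([start], 0.0)]
    for _ in range(length):
        candidates = [(seq + [nxt], score + lp)
                      for seq, score in beams
                      for nxt, lp in logprobs.get(seq[-1], {}).items()]
        if not candidates:
            break
        beams = sorted(candidates, key=lambda t: -t[1])[:beam_width]
    return beams[0][0]
```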

------
lettergram
As someone who works on synthetic data[1], NLP[2], and a boat load of other
deep learning for production systems (and have for a few years) — the tools
are available now.

Two years ago the deep learning frameworks were difficult to use. At one point
I even hacked together a CNN using a background-subtraction function in OpenCV
just to simplify my code. Several years ago I coded up a neural network in
OpenCL, and it took a couple of months to verify it worked correctly.

Now I want to write a blog series on text classification (should be out in a
couple of days), and the actual coding takes 20 minutes, runs way faster, and
is 20 lines of code.

Neural networks are approachable, research is moving faster than most can keep
up with, and model accuracy keeps improving.

As mentioned in the article, there are also pre-trained models, but IMO that’s
less important. It's the ease of access that's really the killer.
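Something in that 20-lines spirit, sketched with scikit-learn on made-up examples (not the exact code from the series):

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Made-up labeled examples; a real run would use thousands of documents.
texts = [
    "my package never arrived", "where is my refund",
    "the delivery was damaged", "love this product",
    "works exactly as described", "great value for the price",
]
labels = ["complaint", "complaint", "complaint",
          "praise", "praise", "praise"]

# TF-IDF features piped into logistic regression: the whole classifier.
model = make_pipeline(TfidfVectorizer(ngram_range=(1, 2)),
                      LogisticRegression())
model.fit(texts, labels)

print(model.predict(["refund for damaged package"])[0])
```

That ease of access is the point: the framework choices that used to take months are now three library calls.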

[1] [https://medium.com/capital-one-tech/why-you-dont-
necessarily...](https://medium.com/capital-one-tech/why-you-dont-necessarily-
need-data-for-data-science-48d7bf503074)

[2] [https://hnprofile.com](https://hnprofile.com)

~~~
wpietri
Would you mind saying a little more about the tools you like? I have a problem
that I'm hoping might be tractable, but it's a little odd, so I'm looking for
things to check out.

(The specific problem is having a lot of tuples of (lat, lon, placename) and I
want to build something that can canonicalize all of the placenames. It's
different enough from common ML problems that I'm not sure where to start.)

~~~
onefuncman
your problem sounds like [https://developers.google.com/places/place-
id](https://developers.google.com/places/place-id)

~~~
wpietri
Yes, except that I'm doing it for a class of places that isn't covered by any
open dataset I've found. What I have is a lot of data on ship movements, which
includes destination strings hand-entered by sailors (mostly merchant marine
sailors). The data is messy, and many of the strings can be obscure.

For example, consider this guide for Tokyo's ports:
[https://www.kaiho.mlit.go.jp/03kanku/h22houkaisei/sozai/guid...](https://www.kaiho.mlit.go.jp/03kanku/h22houkaisei/sozai/guide3_e.pdf)

This one happens to be well documented, but most don't seem to be, and in any
case many of the port labels in the movement data bear a somewhat hazy
relationship to official labels.

I could do it all manually, of course: look at the data, figure out where
ships stop, collect all the labels they apply, build a regex pachinko machine.
But I'm wondering if I can bootstrap my way to a good text location
classifier.

~~~
ethbro
There are two approaches I can think of, depending on your gut about your data
set.

If single or minimal-word phrases are enough (e.g. 'kws', 'kei', 'hei',
'keih', etc.), something like word2vec, except instead of training to predict
nearby words, you'd learn a joint embedding of words and ports.

If more complex modeling is required (e.g. 'tokyo tuesday kei then chiba'),
something like LSTM [1].

The lat-long data you'd want to squish into a closest-port training set via
geospatial math. I haven't done anything in this space, but I'd imagine
chewing through lat-long legs and computing closest-approach distances to
various ports (that is, to canonical port lat-longs), optimized for
computation and space, and probably compressed down into a "visit / not-visit"
binary feature. Might take a while, but the math seems straightforward for
this bit.

[1] [http://colah.github.io/posts/2015-08-Understanding-
LSTMs/](http://colah.github.io/posts/2015-08-Understanding-LSTMs/)
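The closest-approach bit might look roughly like this (port names and coordinates below are made up for illustration):

```python
import math

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance between two lat/long points, in km."""
    r = 6371.0  # mean Earth radius
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = math.radians(lat2 - lat1)
    dlmb = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2)
    return 2 * r * math.asin(math.sqrt(a))

# Hypothetical canonical port coordinates.
PORTS = {"tokyo": (35.63, 139.77), "chiba": (35.57, 140.07)}

def nearest_port(lat, lon, max_km=20.0):
    """Snap a position to the closest port, or None beyond max_km."""
    name, dist = min(((p, haversine_km(lat, lon, *c))
                      for p, c in PORTS.items()),
                     key=lambda t: t[1])
    return name if dist <= max_km else None
```

Run over every fix in a ship's track, the minimum per leg gives the closest-approach distance, which can then be thresholded into the visit / not-visit feature.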

------
evrydayhustling
Data/ML guy here, also founder of frame.ai -- we're a NYC-based startup that
helps customer-facing teams categorize and analyze their conversations.

There are a lot of factors contributing to widespread adoption of NLP --
certainly the availability of great tools like PyTorch/Keras, SpaCy, and
Gensim, and more broadly growing technical competence and awareness among
teams that traditionally handle text.

But the single biggest factor is that people are generating massively more
purpose-driven text in interactive channels. A generation that grew up texting
their friends is becoming core consumers and workers in every industry. People
drop text in team chat, in project management tickets and pull requests, when
contacting support, and more. Plus, we break our text into meaningful snippets
and add our own metadata to make them more readable (and usable) for other
people. Language and tools are co-evolving very rapidly in ways that make them
more accessible.

As a human-and-organization-augmentation nerd, I'm incredibly excited. We are
seeing so much more of how groups of people self-organize than we ever used
to, at the same time as our tools are becoming capable of understanding us in
the ways we communicate most conveniently.

------
joekrill
I hope I'm not misunderstanding the comparison, but for me the takeaway seems
to be: make the user's job easier. That has always been what we're trying to
do, hasn't it? Stop forcing them to enter _so_ _much_ _data_ by hand. It's
tiring for them. They don't want to do it. They will do what they can to
avoid it. They will make mistakes or enter random data just to get it done.
Using "ML" or whatever to classify things automatically is generally "good
enough" and makes everyone much happier. Except the execs, who always seem to
want "perfect data" to analyze. Hopefully as we get better and better tools
to do this (and it seems even simple categorization is usually sufficient,
anyway), we will stop forcing people to do this stuff.

And I hate to pick at specific points that are more cursory to the article,
but I can't help myself:

> Duplex feels like an extension of the gig economy; this is about college
> grads not wanting to waste their time negotiating with grunts.

I understand the sentiment here, but I don't think that's what's happening. If
there were a better way to do this sort of thing without "hacking" the human
interaction, I think we'd do it. I think Duplex has come about because many
companies still insist on wasting everyone's time by forcing people to make
time-consuming phone calls to perform very simple tasks. I don't think it's
about helping college grads avoid interacting with "grunts".

------
carbocation
"Does deep learning help?"

I think that Jeremy Howard's results with FastAI do strongly suggest that deep
learning helps. Specifically, it can be very quick to train a language model
with your own data... so quick that he doesn't even bother starting with pre-
trained weights.

------
Dowwie
I would like to see topic classification applied to message forums, such as
Hacker News, so that I could stop manually bookmarking and tagging. Algolia is
indexing and offering search capabilities and ought to take the lead on this.
If they don't, they may be disrupted by a provider who does.

~~~
redox_
(I'm an Algolia employee) We don't plan to add classification but would love
to partner with anyone willing to work on it, providing them with access to GH
repositories and/or API tokens. Let us know through our support channels.

------
blake8086
Are there any companies working on personalized text categorization?

------
jacquesm
That could be the core of a very nice search engine.

