
Classifying 200k articles in 7 hours using NLP - rezamoaiandin
https://salt.agency/blog/nlp-and-stuff/
======
smeeth
This article is a bit of a Frankenstein monster. There are too many possible
target audiences for ML blogs and it seems like this post made an attempt to
please everybody. This isn't a condemnation of the author; it's just an
impossible task.

1\. Experienced ML practitioners will be unimpressed with the ML task
generally (simple problem, no comparison with common models, no use of common
dataset, no lit review) and wish that there was more detail in model design.

2\. Inexperienced ML practitioners will be happy with the bird's-eye view of
NLP tasks but wish there were more implementation details.

3\. Potential clients (non-technical) will get lost in the details/lingo and
wish there were case studies or a vision of what this service can accomplish
for them/their business.

4\. Potential clients (technical) and SWEs will wish they got a better look at
the GUI, got an explanation of the stack, and wonder about APIs/integration
with whatever it is they already do.

Perhaps this might explain why literally every other comment at the time I'm
writing this is asking for additional details. Pick one or two!

~~~
aledalgrande
5\. CIOs will be ecstatic and go tell their team that they have to do NLP

------
jointpdf
For those interested in related/alternative approaches, one or more of the
following established open-source libraries might appeal to you:

\- Snorkel (training data curation, weak supervision, heuristic labeling
functions, uncertainty sampling, relation extraction):
[https://github.com/snorkel-team/snorkel](https://github.com/snorkel-team/snorkel)

\- AllenNLP (many pretrained NLP research models for tasks beyond text
classification, model training and serving, visualization/interpretability
utilities):
[https://github.com/allenai/allennlp](https://github.com/allenai/allennlp)

\- spaCy (tokenization, NER/POS + tagging visualizer, pretrained word vectors,
integration with DL models):
[https://github.com/explosion/spaCy](https://github.com/explosion/spaCy)

\- Hugging Face Transformers (latest and greatest pretrained models, e.g.
BERT):
[https://github.com/huggingface/transformers](https://github.com/huggingface/transformers)

\- ...or a barebones “from scratch” solution in less than an hour with a Colab
notebook and scikit-learn (preprocess text into tf-idf vectors, LSA/NMF to
generate “document embeddings”, visualize embeddings with t-SNE/UMAP
[facilitates weak supervision/active learning], classify with
LogReg/RF/SVM/whatever). You could also tack on pretrained gensim/TF/PyTorch
models quite easily as a next step. But this basic flow quickly gives you a
handle on your corpus.
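The "from scratch" flow in the last bullet can be sketched roughly as follows. This is a minimal illustration, not the article's method: the toy corpus and labels are made up, LSA is done with scikit-learn's `TruncatedSVD`, and the t-SNE/UMAP visualization step is left out.

```python
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Toy corpus and labels, invented purely for illustration.
docs = [
    "the striker scored a late goal in the cup final",
    "the goalkeeper saved a penalty in the match",
    "the team won the league after a tense match",
    "the central bank raised interest rates again",
    "stock markets fell as inflation rates climbed",
    "the bank reported strong quarterly profits",
]
labels = ["sport", "sport", "sport", "finance", "finance", "finance"]

# tf-idf vectors -> LSA "document embeddings" -> linear classifier.
pipeline = make_pipeline(
    TfidfVectorizer(),
    TruncatedSVD(n_components=2, random_state=0),
    LogisticRegression(),
)
pipeline.fit(docs, labels)

# The first two steps alone give low-dimensional embeddings that could be
# handed to a plotting or t-SNE/UMAP step.
embeddings = pipeline[:-1].transform(docs)
predictions = pipeline.predict(docs)
```

Swapping `LogisticRegression` for a random forest or SVM only changes the last pipeline step; on a real corpus you would also hold out a test set rather than predicting on the training documents.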

By the way, the docs for DeepDive (the predecessor of Snorkel) are some
amazingly detailed background reading:
[http://deepdive.stanford.edu/example-spouse](http://deepdive.stanford.edu/example-spouse)

~~~
Der_Einzige
Kill the LSA/NMF middle-man and use UMAP directly. It supports sparse (tf-idf)
vectors.

~~~
jointpdf
True, good point. That may be better for classification performance. But at
least for visualization and interpretability purposes using NMF is extremely
simple and versatile (e.g. you can induce sparsity in the representation,
setting the rank to be artificially low can cause high-level structure to “pop
out”). That is, it gives you a few more knobs to turn than UMAP alone.

------
sixhobbits
The title makes it sound like they talk about how they did it so efficiently.

But all the info we get about that is:

"The last step was to combine the four binary models into one multiclass
model, as explained in the previous section, and use it to classify 1M new
documents automatically. To do this, we simply went on the UI and uploaded a
new list of documents."

Great intro to NLP article, but very light on the actual implementation
details and dataset.
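For readers wondering what "combine the four binary models into one multiclass model" could look like, one common reading is one-vs-rest: score the document with each binary model and take the highest-scoring class. The sketch below is only that reading, not the article's confirmed method; the class names and the keyword-based `keyword_scorer` stand-ins for trained models are hypothetical.

```python
from typing import Callable, Dict, Set


def keyword_scorer(keywords: Set[str]) -> Callable[[str], float]:
    """Hypothetical stand-in for a trained binary model: returns a score in [0, 1]."""
    def score(doc: str) -> float:
        tokens = doc.lower().split()
        hits = sum(token in keywords for token in tokens)
        return hits / len(tokens) if tokens else 0.0
    return score


# Four binary "models", one per class (class names are made up).
binary_models: Dict[str, Callable[[str], float]] = {
    "sport": keyword_scorer({"goal", "match", "league"}),
    "finance": keyword_scorer({"bank", "rates", "profits"}),
    "politics": keyword_scorer({"election", "senate", "vote"}),
    "tech": keyword_scorer({"software", "chip", "startup"}),
}


def classify(doc: str, models: Dict[str, Callable[[str], float]]) -> str:
    """One-vs-rest combination: pick the class whose binary model scores highest."""
    scores = {label: model(doc) for label, model in models.items()}
    return max(scores, key=scores.get)


new_documents = ["the senate vote is tomorrow", "a new chip for the startup"]
predicted = [classify(doc, binary_models) for doc in new_documents]
```

With real models the scores would be predicted probabilities, and a threshold could route low-confidence documents to manual review.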

~~~
rezamoaiandin
Given that this is of interest, I'll do a follow-up on the implementation
details and dataset!

~~~
elbigbad
Same, very interested. Cool intro to NLP, but those are easy to come by. Some
implementation details and perhaps some code to reproduce would be amazing.

------
swayson
This reminded me of a great OSS tool I discovered the other day for data
labelling. It is called Label Studio
([https://labelstud.io/playground/](https://labelstud.io/playground/)) and
supports quite a variety of different task formats. Works well.

Disclaimer: No affiliation, only sharing for those who are curious

------
yunusabd
I agree with the other posters that the intro to NLP part is unnecessary. It
reads like those recipe websites where they tell you their whole life story
before the actual recipe. I get that it's good for SEO, but it's still
annoying to read.

Did you try other solutions like ULMFiT [1]? Seems like the exact use case for
that. Although it might be overkill for just 4 categories.

[1] [https://arxiv.org/abs/1801.06146](https://arxiv.org/abs/1801.06146)

~~~
rezamoaiandin
Yes! It's a great fit for this problem; we actually experimented with it:
[https://medium.com/sculpt/a-technique-for-building-nlp-
class...](https://medium.com/sculpt/a-technique-for-building-nlp-classifiers-
efficiently-with-transfer-learning-and-weak-supervision-a8e2f21ca9c8)

~~~
yunusabd
Interesting, if you end up writing a follow-up article, it would be nice to
include this kind of information, and why you (seem to have) decided against
using ULMFiT in this case.

------
pmayrgundter
Glad to see work like this being shared!

There are some well known text classification datasets, e.g. the Reuters news
dataset from David Lewis of Bell Labs:

    
    
      http://www.daviddlewis.com/resources/testcollections/reuters21578/
    

More background here:

    
    
      https://link.springer.com/content/pdf/bbm%3A978-3-642-04533-2%2F1.pdf
    

Here's a result from ReelTwo's Classification System circa 2003 (Based on a
bayesian learner; related to the U Waikato WEKA ML system) if you'd be up for
comparison:

    
    
      https://web.archive.org/web/20040606002449/http://www.reeltwo.com/datasets.html
    

10 categories, 2,535 documents, 15 s build time (~170 docs/sec; these were
short news abstracts; see pdf for example), 0.9121 F-measure.

Build Time is the time to load, model and evaluate (using Leave-One-Out
evaluation) a dataset on a WinXP/1GHz Celeron/256MB computer. F-Measure is the
micro-averaged F-Measure across all categories in the dataset.
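For anyone wanting to compare numbers, here is a small sketch of the micro-averaged F-measure: true positives, false positives, and false negatives are pooled across all categories before precision and recall are computed. It assumes one label per document (a simplification), uses made-up labels, and treats the build time above as 15 seconds, which is what the quoted ~170 docs/sec implies.

```python
def micro_f_measure(y_true, y_pred):
    """Micro-averaged F-measure: pool TP/FP/FN over all categories, then compute F1."""
    categories = set(y_true) | set(y_pred)
    tp = fp = fn = 0
    for category in categories:
        for truth, guess in zip(y_true, y_pred):
            if guess == category and truth == category:
                tp += 1
            elif guess == category and truth != category:
                fp += 1
            elif guess != category and truth == category:
                fn += 1
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Made-up labels: 4 of 5 documents are classified correctly.
y_true = ["a", "a", "b", "b", "c"]
y_pred = ["a", "b", "b", "b", "c"]
score = micro_f_measure(y_true, y_pred)

# Throughput check, assuming the build time is in seconds.
docs_per_second = 2535 / 15
```

In the single-label case the micro-averaged F-measure equals plain accuracy, so it is worth checking whether a reported figure is micro- or macro-averaged before comparing systems.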

------
ashish01
I guess it’s cool. It would be interesting to compare this with fastText as a
baseline for classification.

------
kgarten
Maybe off topic ... Is "Stanford ML expert" some type of accreditation? How do
you become a Stanford ML expert? :) By attending the (excellent) Stanford
online course on Machine Learning, or do I have to read an ML book on the
Stanford campus?

~~~
whymauri
They have master's degrees from Stanford. I agree it's a bizarre accreditation
since they have a few years of industry experience otherwise, which is, IMO,
more relevant.

~~~
kgarten
Agreed. I also value the industry experience more ... I can remember a couple
of years ago Stanford faculty complaining at a conference social that they
cannot find good candidates for academic positions as Google and Facebook will
hire them.

------
blackbear_
I wonder what classifier was used (I assume neural network-based, given the
figures), and how that compares to a simple baseline that uses bag-of-words,
such as a linear model or naive Bayes. The examples look easy enough to be
classified by matching keywords.
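The two baselines mentioned here could be sketched like this: a plain keyword matcher next to a bag-of-words naive Bayes model from scikit-learn. The keyword lists, training documents, and labels are all invented for illustration and say nothing about the article's actual data.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Baseline 1: keyword matching with hand-picked (hypothetical) keyword sets.
KEYWORDS = {
    "sport": {"goal", "match", "league"},
    "finance": {"bank", "rates", "profits"},
}


def keyword_classify(text):
    """Return the label whose keyword set overlaps the text most (ties go to the first label)."""
    tokens = set(text.lower().split())
    scores = {label: len(tokens & words) for label, words in KEYWORDS.items()}
    return max(scores, key=scores.get)


# Baseline 2: bag-of-words counts fed to multinomial naive Bayes.
train_docs = [
    "the striker scored a late goal in the cup final",
    "the team won the league after a tense match",
    "the central bank raised interest rates again",
    "the bank reported strong quarterly profits",
]
train_labels = ["sport", "sport", "finance", "finance"]

nb_model = make_pipeline(CountVectorizer(), MultinomialNB())
nb_model.fit(train_docs, train_labels)

test_docs = ["the bank cut interest rates", "a late goal decided the match"]
keyword_predictions = [keyword_classify(doc) for doc in test_docs]
nb_predictions = list(nb_model.predict(test_docs))
```

If both baselines already agree with a neural model on most documents, that is a hint the task is keyword-separable.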

~~~
rezamoaiandin
We used a shallow neural network. The main challenge at Sculpt wasn't the
modelling part, but rather the whole UX (active learning to speed up the
training process, explainability, easily get predictions, etc). So it's true
that a relatively simple model performs well, and actually having an efficient
/ shallow network also helps make the active learning pipeline fast for the
user.
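As a rough illustration of why a fast model helps active learning, here is one round of least-confidence uncertainty sampling. Logistic regression is swapped in for the shallow neural network mentioned above, and the documents are made up; Sculpt's actual pipeline is not described in the thread.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

# A few labelled documents plus an unlabelled pool (all invented).
labeled_docs = [
    "the striker scored a late goal",
    "the team won the league match",
    "the central bank raised interest rates",
    "the bank reported strong profits",
]
labeled_y = ["sport", "sport", "finance", "finance"]
unlabeled_docs = [
    "the goalkeeper saved a penalty",
    "markets fell as rates climbed",
    "the club reported strong profits after the match",
]

vectorizer = TfidfVectorizer().fit(labeled_docs + unlabeled_docs)
model = LogisticRegression().fit(vectorizer.transform(labeled_docs), labeled_y)

# Least-confidence score: 1 minus the probability of the most likely class.
proba = model.predict_proba(vectorizer.transform(unlabeled_docs))
uncertainty = 1.0 - proba.max(axis=1)

# The document the model is least sure about is the next one to show the user.
query_index = int(np.argmax(uncertainty))
next_to_label = unlabeled_docs[query_index]
```

Each new label triggers a retrain and a rescoring of the pool, so the loop is only pleasant to use when both steps take well under a second.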

~~~
blackbear_
Thanks! The dashboard looks like a nice piece of work indeed.

------
josephjrobison
Is Sculpt AI available for public use? I see sculptintel.com is down.

~~~
rezamoaiandin
SculptAI is currently available to selected customers of SALT. But happy to
discuss if you would like access. Please get in touch with us.

------
_Microft
"This article has been written by Sculpt AI [...] in collaboration with Reza
[Article’s author]", so we now use tools to write longer articles faster only
to later feed them into tl;dr-bots/summarizers to get the gist without having
to read all of it. ;)

~~~
wyldfire
This sentence was confusingly worded. It means that the corporation/entity
"Sculpt AI" wrote the article AFAICT -- specifically the humans who founded
the corporation. I do not believe the article was written by machine.

If it was hard for humans to understand that, just imagine how difficult it
would be for NLP to understand it. :)

