Flair: A simple framework for natural language processing (github.com)
361 points by kumaranvpl 3 months ago | 44 comments



The closest alternatives in this space would be allennlp [1], the recently released pytext [2], and spacy [3]. pytext's authors wrote a comparison in the accompanying paper [4] and in this GitHub issue [5].

[1] https://github.com/allenai/allennlp

[2] https://github.com/facebookresearch/pytext

[3] https://spacy.io

[4] https://arxiv.org/pdf/1812.08729.pdf

[5] https://github.com/facebookresearch/pytext/issues/110


Do you know if any of these can be used for text prediction? (I.e. guessing what the next word/token will be.)


Text prediction is usually called "language modeling" in NLP. Because it's useful as a weak supervision signal to improve performance on other tasks, most of the mentioned libraries support it. However, they might not always provide complete examples, instead assuming that you know how to express the model and train it using the primitives provided by the library.

Flair: https://github.com/zalandoresearch/flair/blob/master/flair/m...

Allen NLP: https://github.com/allenai/allennlp/blob/master/allennlp/dat...

PyText: https://github.com/facebookresearch/pytext/blob/master/pytex...

spaCy seems to focus on language analysis and I couldn't find an API that'd be directly usable for text generation.
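
For Flair specifically, its language model training tutorial boils down to something like the sketch below - a rough outline only, with placeholder paths and toy hyperparameters, so check the linked code for the real details:

    from flair.data import Dictionary
    from flair.models import LanguageModel
    from flair.trainers.language_model_trainer import LanguageModelTrainer, TextCorpus

    is_forward_lm = True

    # default character dictionary shipped with Flair
    dictionary = Dictionary.load('chars')

    # the corpus folder is expected to hold a train/ split plus valid.txt and test.txt
    corpus = TextCorpus('/path/to/your/corpus', dictionary, is_forward_lm, character_level=True)

    # a small forward character-level language model
    language_model = LanguageModel(dictionary, is_forward_lm, hidden_size=128, nlayers=1)

    trainer = LanguageModelTrainer(language_model, corpus)
    trainer.train('resources/language_model', sequence_length=250, mini_batch_size=32, max_epochs=10)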


Flair looks really promising to me!


Markov chains can be used to do type-ahead prediction. It's likely what iOS uses for its predictive keyboard.

https://en.wikipedia.org/wiki/Markov_chain
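
A toy first-order word model is only a few lines of Python - purely illustrative, not how any production keyboard actually does it:

    from collections import Counter, defaultdict

    # Count next-word frequencies: a first-order Markov model over words.
    def train(text):
        model = defaultdict(Counter)
        words = text.split()
        for current, following in zip(words, words[1:]):
            model[current][following] += 1
        return model

    # Suggest the k most likely continuations of the last typed word.
    def suggest(model, last_word, k=3):
        return [word for word, _ in model[last_word].most_common(k)]

    model = train("the cat sat on the mat and the cat ate the fish")
    print(suggest(model, "the"))  # -> ['cat', 'mat', 'fish']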


Yes, there are plenty of methods, and I have a couple implemented, but an off-the-shelf one from a cutting edge library would likely be better.


It's gonna be hard to get an "off the shelf" model for text prediction, because the upcoming text depends on the author, topic, and other context. You can probably find some decent pre-trained models to get started, but you'll need to customize them for your application to get good results.


Right, I was thinking off-the-shelf in the sense of giving it a tokenised corpus and it does the rest, or it incorporates that into its existing model. Dictation software, phone keyboards, etc. do this.


Which method would work best for classifying emails into 1 of 7 categories? The problem I've seen is that 1 or 2 key sentences within the email can classify the message, but they are usually outnumbered by generic sentences such as signatures, greetings, headers/footers, etc.


These are all frameworks, and none of them has a singular advantage over the others for the problem you describe. You should be able to figure out what works best for you based on the classification sensitivity and the training data you are working with; the problem itself can range from quite simple to extremely complex depending on those two factors.

Spacy's pre-processing tools are quite easy to use, and combined with a tool like talon they should help you clean up the email properly. After that, if your email text is pretty much to the point, any intent classification tool will work. However, if the email text is long and the intents are spread across it, you will need a hierarchical layer to model the intent hierarchy, as well as an attention layer to decide which intents to focus on and not lose track of. At that point you are quite far from a generic plug-and-play framework, and you will need to thoroughly understand the deep learning models you are working with, the dataset you have, and the classifier you are trying to build.


Thanks, this is really helpful! I am using talon and sklearn as a paragraph-by-paragraph intent classifier, and I classify the whole email from the highest individual intent probability. This seems to be working well for my minimal test data (~200 sentences) but I have yet to test it in the wild. I will research hierarchical and attention layers.
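
Roughly what I'm doing, with made-up data in place of my real categories, and with talon assumed to have already stripped quotes and signatures:

    import numpy as np
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.linear_model import LogisticRegression
    from sklearn.pipeline import make_pipeline

    # Toy stand-ins for labeled paragraphs (the real setup has 7 categories and far more data).
    train_paragraphs = [
        "please refund my order, the item arrived damaged",
        "when will my package arrive, the tracking shows no update",
        "i want to change the delivery address on my order",
    ]
    train_labels = ["refund", "shipping", "shipping"]

    clf = make_pipeline(TfidfVectorizer(ngram_range=(1, 2)), LogisticRegression(max_iter=1000))
    clf.fit(train_paragraphs, train_labels)

    def classify_email(paragraphs):
        # Score every paragraph, then label the whole email with the single
        # most confident paragraph-level prediction.
        probs = clf.predict_proba(paragraphs)              # (n_paragraphs, n_classes)
        _, best_class = np.unravel_index(probs.argmax(), probs.shape)
        return clf.classes_[best_class]

    print(classify_email(["best regards, john", "my parcel never arrived"]))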


Generally these generic sentences should be randomly distributed, and so their effect should be minimal.

You can randomly add them to your training set if you feel that real-world data has them randomly distributed but your training sample is too small to capture this.


I like Flair for three reasons:

- Easy to use

- Developers are very active

- State-of-the-art results using approaches that are easy to understand and work well for most text classification tasks


I'd be wary of being dazzled by the performance metrics here. I've been disappointed in the past using out of the box language models on data that they weren't trained on, e.g. SpaCy. I feel like people put too much emphasis on trying to get the high score on benchmark datasets, and they're overtraining to those particular domains.

For example, try using named entity models trained on CoNLL (newspaper articles) on free text (e.g. tweets, or text from application forms) and you generally get pretty bad results. When the domain is different, I've even seen them screw up basic things like times and dates, where regexes will suffice. If you're using it for newspaper articles you're sorted; if you're not, the performance metrics here are probably not all that meaningful.


Here's the thing: the authors, and most people in the NLP community (as opposed to people who can use off-the-shelf tools and nothing more) know how to get better performance on other domains. It's just that "it requires some adjustments and manual labour" doesn't add anything to the "deep learning solves the problem" narrative that is predominant in research articles and most blog posts on the topic.

On the other hand, you can bet that actual practice at Zalando (the authors are all from Zalando's research lab) involves more regexes and retraining models on proprietary datasets, and less using off-the-shelf models and hoping they stick.

No one claims that you can solve every vision problem with a model trained on ImageNet either - you'd do transfer learning, or for non-understanding problems (estimating colors and contrast, or anything else that's unrelated to objects in the image) you'd use something else that doesn't involve deep learning models at all.


Right, but my point is - once you start needing to add ad-hoc retraining, or regex hacks, it's not clear to me that shaving a point off baseline f1 scores is really all that relevant anymore.


You'd have to do the same modifications to the baseline models to adapt them to a different domain. If they managed to shave off percentage points on a large number of benchmarks, then it's likely that using their models will also help you with the task you care about.


Not convinced. Pretty much all baseline NER datasets are news corpora, which are written in well-formatted prose and tend not to have spelling mistakes, abbreviations, bad punctuation, etc. Why do you think that better performance on these kinds of datasets will translate to better performance in other domains? I wouldn't even be surprised if it's the opposite - maybe the model relies more heavily on those assumptions. The truth is there is no way to know a priori - you need a different kind of benchmark to test this.


WNUT-17 [1] is not a news corpus; it has a lot of badly formatted prose, spelling mistakes, abbreviations, bad punctuation etc. Accordingly, it's the dataset where they get their worst F1 of 50.20, but that's still better than the previous best of 45.55. In general, I'd be surprised if they hard-coded reliance on the specific regularities of news text into the model's assumptions, so if a model is able to exploit those regularities better, training it on a corpus with different regularities should also enable it to perform well on that corpus.

[1] https://noisy-text.github.io/2017/emerging-rare-entities.htm...


I don't think there is anything contentious here. If the data that a model is trained on is nothing like what the model will be applied to, the results will suck. Well, duh - isn't that obvious?

If one is trained on screwing in light bulbs, that training will not be very helpful for composing music. If there is some common structure between the train and test scenarios, there is some point in learning it from the default training set. Then you use your domain-specific training set to unlearn the things that do not apply and learn the ones that do. As long as there is something worth learning from the default training set, it will be of some use.


I have found pre-trained models for language detection and Wikipedia word embeddings useful. Everything else I have to train from scratch. I have successfully done NER on search keywords where there were many examples of misspelled words or weird punctuation. I used Spacy but I had to train the model from scratch.
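
For reference, a from-scratch spaCy NER training loop looks roughly like this with the v2 API; the label and examples below are made up:

    import random
    import spacy

    # Made-up training data in spaCy v2's (text, {"entities": [(start, end, label)]}) format.
    TRAIN_DATA = [
        ("nike air max 90 size 42", {"entities": [(0, 4, "BRAND")]}),
        ("adidas superstar white sneakers", {"entities": [(0, 6, "BRAND")]}),
    ]

    nlp = spacy.blank("en")            # blank pipeline, no pre-trained weights
    ner = nlp.create_pipe("ner")
    nlp.add_pipe(ner)
    ner.add_label("BRAND")

    optimizer = nlp.begin_training()
    for epoch in range(20):
        random.shuffle(TRAIN_DATA)
        losses = {}
        for text, annotations in TRAIN_DATA:
            nlp.update([text], [annotations], sgd=optimizer, drop=0.2, losses=losses)

    doc = nlp("nike running shoes size 44")
    print([(ent.text, ent.label_) for ent in doc.ents])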


I had the same experience when trying to do NER on customer support requests. My model performed great for research datasets but it was mediocre at best for my own dataset. Do you have any suggestions on how to achieve better results in domains where mistakes, bad punctuation, etc are common?


Label more training data.

Do more clustering.

Label more training data.

Strip out more garbage.

Label more training data.

PS you can get an idea of how much value additional training data will give you by training models on various subsets of your dataset (e.g. 10%, 20%...), evaluating them against the same test dataset, and plotting the results.
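
A quick sketch of that last point with sklearn, using a public dataset as a stand-in for your own labeled data:

    import matplotlib.pyplot as plt
    import numpy as np
    from sklearn.datasets import fetch_20newsgroups
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.linear_model import LogisticRegression
    from sklearn.metrics import f1_score
    from sklearn.model_selection import train_test_split
    from sklearn.pipeline import make_pipeline

    # Stand-in dataset; swap in your own labeled texts.
    data = fetch_20newsgroups(subset='train', categories=['sci.med', 'sci.space'])
    X_train, X_test, y_train, y_test = train_test_split(
        data.data, data.target, test_size=0.2, random_state=0)

    fractions = np.arange(0.1, 1.01, 0.1)
    scores = []
    for frac in fractions:
        n = int(len(X_train) * frac)
        clf = make_pipeline(TfidfVectorizer(), LogisticRegression(max_iter=1000))
        clf.fit(X_train[:n], y_train[:n])        # train on a growing subset
        scores.append(f1_score(y_test, clf.predict(X_test), average='macro'))

    # A flattening curve suggests diminishing returns from labeling more data.
    plt.plot(fractions, scores, marker='o')
    plt.xlabel('fraction of training data used')
    plt.ylabel('macro F1 on the fixed test set')
    plt.show()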


Is research in commercial companies becoming a driving factor in today's scientific achievements?


It always has been. The transistor was invented at Bell Labs in 1947, kicking off the entire modern world of computing.


Probably. However, you have to consider fields that are outside the scope of commercial companies. Nuclear energy (both fission and fusion) is a good example. If you are referring to NLP, then it is definitely true.


That's the free market at work, I would guess. Zalando seems to see NLP research as a valuable asset, so they pay for projects like this, plus they open source it, presumably because of employer branding. I don't see anything wrong with that.


Zalando Research is an internal team of researchers; their work is primarily shared with the research community through publications: https://research.zalando.com/welcome/mission/publications/

In the case of Flair, research led to a reference implementation, which was then matured through internal use and open sourced to further mature it and get external feedback. While employer branding is a nice benefit, it is a positive side-effect, not the motivation in itself :)


Can someone explain what the expected behaviour is with punctuation and Named Entity Recognition? Is there an assumption that punctuation is preprocessed in some form?

I'm a noob but it's not what I expect - periods change what is extracted in inconsistent ways.

e.g.

"I love Berlin." -> "Berlin."

"I love Berlin ." -> "Berlin"

"George Washington loves Berlin." -> "George Washington"

"George Washington loves Berlin ." -> ["George Washington", "Berlin"]


If you go to their first tutorial, "Tutorial 1: Basics", you will see this comment in the code: "# Make a sentence object by passing a whitespace tokenized string"

In that simple example you posted, they already did the tokenization manually, as it's pretty trivial. But yes, in many cases, you have preprocessors that do the tokenization. In some libraries, you actually have a class/object_type for tokens, but it's pretty common to just preprocess and take every space as a token separator.

In some contexts and cases, it's possible to see tokens like "social_network", where multiple words are considered a single token.

In that first tutorial, they also mention they have a tokenizer if you need it: "In some use cases, you might not have your text already tokenized. For this case, we added a simple tokenizer using the lightweight segtok library."

So for your example you would simply run the tokenizer first, then the named entity recognition.

EDIT: apparently you can do this directly: "sentence = Sentence('The grass is green.', use_tokenizer=True)"
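
Putting it together, a minimal end-to-end run would look roughly like this (the 'ner' model name and the tag output shown are the usual defaults as far as I know, so treat it as a sketch):

    from flair.data import Sentence
    from flair.models import SequenceTagger

    # Pre-trained English NER model (downloaded on first use).
    tagger = SequenceTagger.load('ner')

    # Let Flair tokenize the raw string instead of splitting on whitespace yourself.
    sentence = Sentence('George Washington loves Berlin.', use_tokenizer=True)
    tagger.predict(sentence)

    # Prints something like: George <B-PER> Washington <E-PER> loves Berlin <S-LOC> .
    print(sentence.to_tagged_string())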


Thanks. I assumed it was French style punctuation with a space before the period rather than space tokenization!


French doesn't put spaces before single-part punctuation like periods or commas.


It mentions in the tutorial that you can input space tokenized sentences directly or have it tokenize it for you, see https://github.com/zalandoresearch/flair/blob/master/resourc....


Ah yes, thanks!


I like the way Zalando is doing tech. They release (and maintain!) tons of open source stuff in several domains.


Flair 0.4 was released just 14 days ago, and contains LOTS of improvements for a point release: https://github.com/zalandoresearch/flair/releases


What are the main advantages of Flair over Spacy?

Is it easy to add a new language in Flair? In Spacy adding a language looks pretty straightforward.


Nice! Was using the Stanford POS tagger, which is both bad in quality and sooo slow in execution. Looking forward to trying this out.


How does it compare to managed NLP services like Google's Cloud Natural Language API and AWS's Comprehend?


Google and Amazon have proprietary datasets for important sub-tasks (e.g. recognizing "consumer good" entities, or more accurate sentiment recognition, or supporting other languages better) that are not available to the public.

In other words, if your problem looks like one of the benchmarking tasks in NLP research (e.g. recognizing persons and locations in fluent text) you can expect good performance out of open source tools. If you go beyond that, you have to concoct your own dataset and/or use proprietary cloud services.


Would you mind linking to the ones you are aware of? It would be super helpful.


I recently tried out Flair and deployed it myself on AWS. Works nicely. However, if you need heavy-duty batch inference, I guess you need to use a GPU instance or similar.


Very elegant API, looks promising. Has anyone tried it? How does it stand against NLTK, spaCy or gensim?


They solve different problems. Flair doesn't include parsing, spaCy doesn't support embeddings, gensim doesn't do tagging or parsing at all but contains the most practical word2vec implementation. NLTK is nice for learning, but don't use it in production unless you're ready to reimplement things in a more efficient way when parts start falling off.
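
For example, gensim's word2vec really is just a few lines (toy corpus below, purely for illustration):

    from gensim.models import Word2Vec

    # gensim expects a list of tokenized sentences.
    sentences = [
        ['flair', 'is', 'a', 'framework', 'for', 'nlp'],
        ['spacy', 'is', 'a', 'library', 'for', 'nlp'],
        ['gensim', 'has', 'a', 'practical', 'word2vec', 'implementation'],
    ]

    # size = embedding dimensionality; min_count=1 only because the corpus is tiny
    model = Word2Vec(sentences, size=50, window=3, min_count=1, workers=1)

    print(model.wv.most_similar('nlp', topn=3))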

The message is - again - learn the mental framework rather than individual tools, so you understand where each one's strengths are and what the gaps are in between them. Or choose a problem, find the best tool for that problem, and get progressively better at the tool(s) that help you with most of your problems.



