Allen NLP: https://github.com/allenai/allennlp/blob/master/allennlp/dat...
spaCy seems to focus on language analysis and I couldn't find an API that'd be directly usable for text generation.
You can randomly add them to your training set if you feel that real-world data has them randomly distributed, but your training sample is too small to capture this.
- Easy to use
- Developers are very active
- State-of-the-art results using approaches that are easy to understand and work well for most text classification tasks
For example, if you try using named entity models trained on CoNLL (newspaper articles) on free text (e.g. tweets, or text from application forms), you generally get pretty bad results. When the domain is different, I've even seen them screw up basic things like times and dates, where regexes would suffice. If you're using it for newspaper articles you're sorted; if you're not, the performance metrics here are probably not all that meaningful.
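To illustrate the point about rigid patterns: for things like dates and times, a couple of regexes can beat an out-of-domain model. This is a hypothetical sketch, not tuned for any real dataset - the patterns only cover one date and one time format:

```python
import re

# Illustrative patterns only: one numeric date format and one clock format.
DATE_RE = re.compile(r"\b\d{1,2}/\d{1,2}/\d{4}\b")     # e.g. 24/12/2018
TIME_RE = re.compile(r"\b\d{1,2}:\d{2}(?::\d{2})?\b")  # e.g. 14:30 or 14:30:05

def extract_dates_and_times(text):
    """Return all date-like and time-like substrings found in text."""
    return DATE_RE.findall(text), TIME_RE.findall(text)

dates, times = extract_dates_and_times("Submitted on 24/12/2018 at 14:30.")
print(dates, times)  # ['24/12/2018'] ['14:30']
```

In practice you'd enumerate the formats that actually occur in your data; the point is that for closed, regular patterns this is cheap and domain-robust.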
On the other hand, you can bet that actual practice at Zalando (the authors are all from Zalando's research lab) involves more regexes and retraining models on proprietary datasets, and less using off-the-shelf models and hoping they stick.
No one claims that you can solve every vision problem with a model trained on ImageNet either - you'd do transfer learning, or for non-understanding problems (estimating colors and contrast, or anything else unrelated to objects in the image) you'd use something else that doesn't involve deep learning models at all.
If one is trained on screwing in light bulbs, that training would not be very helpful in composing music. If there is some common structure between the train and test scenarios, there is some point in learning it from the default training set. Then you use your domain-specific training set to unlearn the things that do not apply and learn the other things that do. As long as there is something worth learning from the default training set, it will be of some use.
Do more clustering.
Label more training data.
Strip out more garbage.
PS you can get an idea of how much value additional training data will give you by training models on various subsets of your dataset (e.g. 10%, 20%...), evaluating them against the same test dataset, and plotting the results.
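The subset-training procedure above can be sketched as follows. The `train` and `evaluate` functions here are hypothetical stand-ins (a trivial majority-label "model" on toy data) for your real model and metric:

```python
import random

random.seed(0)

# Hypothetical stand-ins for a real training/evaluation pipeline.
def train(examples):
    """'Train' a trivial model: predict the majority label seen in training."""
    labels = [label for _, label in examples]
    return max(set(labels), key=labels.count)

def evaluate(model, test_set):
    """Accuracy of the majority-label 'model' on a held-out test set."""
    return sum(1 for _, label in test_set if label == model) / len(test_set)

# Toy labeled data; in practice this is your annotated dataset.
data = [(i, i % 3 == 0) for i in range(1000)]
random.shuffle(data)
train_pool, test_set = data[:800], data[800:]

# Train on growing fractions of the pool, always evaluating on the SAME test set.
for frac in (0.1, 0.2, 0.4, 0.8, 1.0):
    subset = train_pool[: int(len(train_pool) * frac)]
    model = train(subset)
    print(f"{frac:.0%} of data -> accuracy {evaluate(model, test_set):.2f}")
```

Plot accuracy against fraction used: if the curve is still climbing at 100% of your data, more labeled data will likely help; if it has flattened, your effort is better spent elsewhere.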
In the case of Flair, research led to a reference implementation, which was then matured through internal use and open sourced to further mature it and get external feedback. While employer branding is a nice benefit, it is a positive side-effect, not the motivation in itself :)
I'm a noob, but this isn't what I expected - periods change what is extracted in inconsistent ways.
"I love Berlin." -> "Berlin."
"I love Berlin ." -> "Berlin"
"George Washington loves Berlin." -> "George Washington"
"George Washington loves Berlin ." -> ["George Washington", "Berlin"]
In that simple example you posted, they already did the tokenization manually, as it's pretty trivial. But yes, in many cases, you have preprocessors that do the tokenization. In some libraries, you actually have a class/object_type for tokens, but it's pretty common to just preprocess and take every space as a token separator.
In some contexts and cases, it's possible to see tokens like "social_network", where multiple words are considered a single token.
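A minimal sketch of that space-as-separator approach (not Flair's actual tokenizer), which also shows why "Berlin." and "Berlin ." come out differently in the examples above:

```python
def whitespace_tokenize(text):
    """Split on runs of whitespace; every resulting piece is a token."""
    return text.split()

print(whitespace_tokenize("I love Berlin."))   # ['I', 'love', 'Berlin.']
print(whitespace_tokenize("I love Berlin ."))  # ['I', 'love', 'Berlin', '.']
```

With pure whitespace splitting, the trailing period stays glued to "Berlin." unless you insert a space first - which is exactly what a real tokenizer handles for you.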
In that first tutorial, they also mention they have a tokenizer if you need it:
"In some use cases, you might not have your text already tokenized. For this case, we added a simple tokenizer using the lightweight segtok library."
So for your example you would simply run the tokenizer first, then the named entity recognition.
EDIT: apparently you can do this directly:
"sentence = Sentence('The grass is green.', use_tokenizer=True)"
Is it easy to add a new language in Flair? In Spacy adding a language looks pretty straightforward.
In other words, if your problem looks like one of the benchmarking tasks in NLP research (e.g. recognizing persons and locations in fluent text) you can expect good performance out of open source tools. If you go beyond that, you have to concoct your own dataset and/or use proprietary cloud services.
The message is - again - learn the mental framework, not individual tools, to understand where each tool's strengths are and what the gaps are in between them. Or choose a problem, find the best tool for that problem, and get progressively better at the tool(s) that help you with most of your problems.