What you're doing with prodi.gy is also pretty interesting, with the active learning and so on.
Yeah the active learning is one reason we made it a downloadable tool. We couldn't figure out how to make that work well as a SaaS.
That said, there are definitely advantages to the SaaS design too, so I think the products take different shapes. It's good to see more tools in this space, and your free usage for open data plan should be really useful for researchers.
To provide a little context, we're currently building an annotation tool that will be accessible by clients, and we seriously considered Prodigy. I'm curious to know what pitfalls we may not be anticipating with that model.
P.S. Your work on spaCy is nothing short of awesome. Thank you.
We also had a slightly different idea about what makes annotation or data collection "hard" or painful.
If you start from a business or product need, you first have to sketch out how you're going to structure your solution before you can start annotating the data. For instance, you might need to decide whether to do sentence classification, or tag spans of text, or recognize structured relations. You need to figure out how to select which texts to annotate, or which parts. Maybe your documents are long, and it's efficient to annotate only the start of each document. Maybe the information is rare, and a lot of effort should go into getting the right filtering process before you annotate your texts.
These types of considerations are really basic -- they arise on every new thing you do. You can develop better or worse intuitions, but in the end you have to make a bunch of decisions, and making them blindly is really inefficient.
Prodigy addresses this by letting a single developer quickly iterate on trying out different ideas. You can filter the stream of examples however you like, plug together the components in different ways, and build complicated pipelines. You might start by selecting examples for entity annotation by keyword, but then decide your keywords suck, so you build a text classifier to select the examples. Then you might apply a sentence-based classifier to all examples of a given entity type, to identify the presence of some relation. This pipeline of three models is very quick to build once you know you need to build it --- and Prodigy helps you figure out whether an approach is working within a couple of hours.
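To make the "compose the pieces yourself" idea concrete, here's a minimal sketch of that kind of pipeline as plain Python generators. This is not Prodigy's actual API — all names here (`keyword_filter`, `classifier_filter`, `pipeline`, the dummy scorer) are made up for illustration; in practice the scorer would be a trained text classifier.

```python
def keyword_filter(stream, keywords):
    """First attempt: keep only texts containing one of the keywords."""
    for text in stream:
        if any(kw in text.lower() for kw in keywords):
            yield text

def classifier_filter(stream, score, threshold=0.5):
    """Second attempt: select by model score instead of keywords.
    `score` is any callable returning P(relevant) for a text."""
    for text in stream:
        if score(text) >= threshold:
            yield text

def pipeline(stream, score):
    """Chain the pieces: keyword candidates, then a model-based filter."""
    candidates = keyword_filter(stream, ["acquire", "merger"])
    return classifier_filter(candidates, score)

texts = [
    "TinyCo announced a merger with BigCo.",
    "The weather was pleasant all week.",
    "Analysts expect BigCo to acquire another startup.",
]

# Stand-in scorer for the sketch; a real one would be a trained model.
dummy_score = lambda text: 0.9 if "BigCo" in text else 0.1

selected = list(pipeline(texts, dummy_score))
```

The point is that each stage is just a function over a stream, so swapping the keyword filter for a classifier, or appending a relation-detection stage, is a one-line change rather than a new product feature.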
We could have given developers all of these pieces as REST endpoints... but that would give you a really miserable workflow. If what we're trying to give you is a bunch of functions you can compose in different ways, we should just let you program.
In short, we think the big problem with ML/data science/AI is that the technologies are so unpredictable. If you're at the point where you know exactly the inputs and outputs of all your models, and you just need to 10x your data set, you've really almost won already --- you're at the happy state of knowing you can turn $ into %. That's a great problem to have, because it's not the big problem. The big problem is the unpredictability.
The best solution to unpredictability is rapid iteration, and to do that we need to give data scientists the tools to figure out which supervised approaches seem most promising. And if developers are working with a bunch of composable pieces themselves, it may as well be a library. Making it a SaaS just makes the product worse.
CrowdFlower (https://www.crowdflower.com) is also a very large company in this space already.
Don't you think in that context the name is kinda appropriate?