
Show HN: ML data annotations and tagging made easy - mohi13
https://dataturks.com/datalicious.php
======
syllogism
What's your policy on data ownership once the data (and annotations) have been
sent to you? Do you reserve the right to keep and use the data that's sent to
you? Your privacy policy page only talks about personally identifying
information, not usage rights.

~~~
mohi13
Hi, thanks for the comment, and thanks for SpaCy; I've used it a few times and
found it really powerful.

What you are doing with prodi.gy, with active learning and so on, is also
pretty interesting.

W.r.t. the privacy policy: the data is owned by the users, and they can delete
it as they see fit. We will never access it.

~~~
syllogism
Thanks!

Yeah the active learning is one reason we made it a downloadable tool. We
couldn't figure out how to make that work well as a SaaS.

That said, there are definitely advantages to the SaaS design too, so I think
the products take different shapes. It's good to see more tools in this space,
and your free usage for open data plan should be really useful for
researchers.

~~~
brd
Could you elaborate on the reasons you didn't go SaaS? Were there other
reasons besides technical configurability and the sheer scale of data?

To provide a little context, we're currently building an annotation tool that
will be accessible by clients and seriously considered prodigy. I'm curious to
know what pitfalls we may not be anticipating with that model.

P.S. Your work on SpaCy is nothing short of awesome. Thank you.

~~~
syllogism
Well, a major consideration is that uploading data to a third party is a huge
barrier. Imagine you want to work on internal messages between your users, or
support tickets, or emails within your company. Because spaCy is open-source,
we already had a lot of users where data privacy was a 100% non-negotiable
requirement. There are already a lot of people doing cloud stuff, so it made a
lot of sense for us to think about self-hosted tools.

We also had a slightly different idea about what makes annotation or data
collection "hard" or painful.

If you start back at a business or product need, you first have to sketch out
how you're going to structure your solution before you can start annotating
the data. For instance, you might need to decide whether to do sentence
classification, or tag spans of text, or recognize structured relations. You
need to figure out how to select which texts to annotate, or which parts.
Maybe your documents are long, and it's efficient to annotate only the start
of the document. Maybe the information is rare, and a lot of effort should go
into getting the right filtering process before you annotate your text.
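
The stream-shaping decisions above (annotating only the start of long
documents, filtering for rare information) can be sketched as plain generator
filters. Everything here, including the keywords, is a made-up illustration,
not anything from either product:

```python
# Hypothetical sketch: shaping an annotation stream before labelling begins.

KEYWORDS = {"refund", "chargeback"}  # assumed keywords for a rare-information task

def truncate(stream, max_chars=500):
    """Annotate only the start of each (possibly long) document."""
    for text in stream:
        yield text[:max_chars]

def keyword_filter(stream, keywords=KEYWORDS):
    """Only surface texts likely to contain the rare information."""
    for text in stream:
        if any(kw in text.lower() for kw in keywords):
            yield text

docs = [
    "Customer asks about a refund for order #123 ..." + "x" * 1000,
    "General question about shipping times.",
]
candidates = list(keyword_filter(truncate(docs)))
```

Swapping in a different filter, or reordering the steps, is a one-line change,
which is what makes this kind of iteration cheap.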

These types of considerations are really basic -- they arise on every new
thing you do. You can develop better or worse intuitions, but in the end you
have to make a bunch of decisions, and making them blindly is really
inefficient.

Prodigy addresses this by letting a single developer quickly iterate on trying
out different ideas. You can filter the stream of examples however you like,
plug together the components in different ways, and build complicated
pipelines. You might start by selecting examples for entity annotation by
keyword, but then decide your keywords suck, so you build a text classifier to
select the examples. Then you might apply a sentence-based classifier to all
examples of a given entity type, to identify the presence of some relation.
This pipeline of three models is very quick to build once you know you need to
build it --- and Prodigy helps you figure out whether an approach is working
within a couple of hours.

We could have given developers all of these pieces as REST endpoints, but
that would give you a really miserable workflow. If what we're trying to give
you is a bunch of functions you can compose in different ways, we should just
let you program.

In short, we think the big problems with ML/data science/AI are that the
technologies are so unpredictable. If you're at a point where you know exactly
the inputs and outputs of all your models, and you just need to 10x your data
set, you've really almost won already --- you're at the happy state of knowing
you can turn $ into %. That's a great problem to have, because it's not the
_big_ problem. The big problem is the unpredictability.

The best solution to unpredictability is rapid iteration, and to do that we
need to give data scientists the tools to figure out which supervised
approaches seem most promising. And if developers are working with a bunch of
composable pieces themselves, it may as well be a library. Making it a SaaS
just makes the product worse.

------
minimaxir
This appears to be inspired by Amazon's MTurk, which data people typically use
for manual data labelling; as a result, you may want to change the name.

CrowdFlower ([https://www.crowdflower.com](https://www.crowdflower.com)) is
also a very large company in this space already.

~~~
mohi13
Yeah, we too think of it as your personal MTurk. You can do the annotation
with your team/network and not have to depend on crowdsourcing it.

Don't you think in that context the name is kinda appropriate?

~~~
minimaxir
Amazon's trademark lawyers may see that differently.

~~~
denzil_correa
"Mechanical Turk" is an 18th century fake chess machine.

[https://en.wikipedia.org/wiki/The_Turk](https://en.wikipedia.org/wiki/The_Turk)

~~~
riku_iki
Trademarks are scoped within vertical/business sector/industry.

------
aldanor
There’s also a recent Prodigy by the same folks who make SpaCy -
[https://prodi.gy](https://prodi.gy) (disclaimer: I’m not affiliated, but
participated in their beta).

~~~
mohi13
Yeah, that's true; the commenter above, _syllogism_, is the founder of spaCy,
I guess. It's a similar tool, but for a different purpose.

------
nicodjimenez
This seems like a copycat of [https://prodi.gy/](https://prodi.gy/). I would
use that product instead, because they are actual AI people who will help you
customize the product for your use case.

------
gajju3588
The biggest use case for me will be creating a golden data set to serve as a
hold-out set to iterate the algorithm on.

~~~
mohi13
That certainly is one use case. Even with auto-generated training data there
is almost always some noise, and extracting a golden set from it automatically
is rarely an option; we always need to do some manual tagging and cleaning.
Thanks for the feedback.

