Show HN: ML data annotations and tagging made easy (dataturks.com)
41 points by mohi13 10 months ago | 17 comments

What's your policy on data ownership once the data (and annotations) have been sent to you? Do you reserve the right to keep and use the data that's sent to you? Your privacy policy page only talks about personally identifying information, not usage rights.

Hi, thanks for the comment, and thanks for spaCy. I have used it a few times and found it really powerful.

What you are doing with prodi.gy is also pretty interesting, with the active learning and so on.

W.r.t. the privacy policy: the data is owned by the users, and they can delete it as they see fit. We will never access it.


Yeah the active learning is one reason we made it a downloadable tool. We couldn't figure out how to make that work well as a SaaS.

That said, there are definitely advantages to the SaaS design too, so I think the products take different shapes. It's good to see more tools in this space, and your free-usage plan for open data should be really useful for researchers.

Could you elaborate on the reasons you didn't go SaaS? Were there other reasons besides technical configurability and sheer scale of data?

To provide a little context, we're currently building an annotation tool that will be accessible to clients, and we seriously considered Prodigy. I'm curious to know what pitfalls we may not be anticipating with that model.

P.S. Your work on spaCy is nothing short of awesome. Thank you.

Well, a major consideration is that uploading data to a third party is a huge barrier. Imagine you want to work on internal messages between your users, or support tickets, or emails within your company. Because spaCy is open-source, we already had a lot of users where data privacy was a 100% non-negotiable requirement. There are already a lot of people doing cloud stuff, so it made a lot of sense for us to think about self-hosted tools.

We also had a slightly different idea about what makes annotation or data collection "hard" or painful.

If you start back at a business or product need, you first have to sketch out how you're going to structure your solution before you can start annotating the data. For instance, you might need to decide whether to do sentence classification, or tag spans of text, or recognize structured relations. You need to figure out how to select which texts to annotate, or which parts. Maybe your documents are long, and it's efficient to annotate only the start of the document. Maybe the information is rare, and a lot of effort should go into getting the right filtering process before you annotate your text.

These types of considerations are really basic -- they arise on every new thing you do. You can develop better or worse intuitions, but in the end you have to make a bunch of decisions, and making them blindly is really inefficient.

Prodigy addresses this by letting a single developer quickly iterate on trying out different ideas. You can filter the stream of examples however you like, plug together the components in different ways, and build complicated pipelines. You might start by selecting examples for entity annotation by keyword, but then decide your keywords suck, so you build a text classifier to select the examples. Then you might apply a sentence-based classifier to all examples of a given entity type, to identify the presence of some relation. This pipeline of three models is very quick to build once you know you need to build it --- and Prodigy helps you figure out whether an approach is working within a couple of hours.
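To make the "filter the stream, then swap in a classifier" idea concrete, here is a minimal sketch using plain Python generators. It is not Prodigy's API: the function names, the example dictionaries with a "text" key, and the toy scoring function standing in for a trained text classifier are all assumptions made up for illustration.

```python
from typing import Callable, Iterable, Iterator

Example = dict  # assumed shape: {"text": "..."}


def by_keyword(stream: Iterable[Example], keywords: Iterable[str]) -> Iterator[Example]:
    """Yield only examples whose text contains at least one keyword."""
    keywords = [k.lower() for k in keywords]
    for eg in stream:
        text = eg["text"].lower()
        if any(k in text for k in keywords):
            yield eg


def by_score(stream: Iterable[Example],
             score: Callable[[str], float],
             threshold: float = 0.5) -> Iterator[Example]:
    """Yield only examples that the scoring function rates at or above threshold."""
    for eg in stream:
        if score(eg["text"]) >= threshold:
            yield eg


def toy_score(text: str) -> float:
    """Hypothetical stand-in for a trained text classifier."""
    return 1.0 if "refund" in text.lower() else 0.0


examples = [
    {"text": "I want a refund for my order"},
    {"text": "Where is my order?"},
    {"text": "Great weather today"},
]

# Stage 1: cheap keyword filter. Stage 2: classifier-based filter.
# Each stage is just a function over a stream, so stages can be reordered or replaced.
selected = list(by_score(by_keyword(examples, ["order", "refund"]), toy_score))
```

Because each stage takes a stream and returns a stream, replacing the keyword filter with a better selector is a one-line change, which is roughly the kind of iteration the comment describes.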

We could have given developers all of these pieces as REST endpoints...But that would give you a really miserable workflow. If what we're trying to give you is a bunch of functions you can compose in different ways, we should just let you program.

In short, we think the big problem with ML/data science/AI is that the technologies are so unpredictable. If you're at a point where you know exactly the inputs and outputs of all your models, and you just need to 10x your data set, you've really almost won already: you're at the happy state of knowing you can turn $ into %. That's a great problem to have, because it's not the big problem. The big problem is the unpredictability.

The best solution to unpredictability is rapid iteration, and to do that we need to give data scientists the tools to figure out which supervised approaches seem most promising. And if developers are working with a bunch of composable pieces themselves, it may as well be a library. Making it a SaaS just makes the product worse.

This appears to be inspired by Amazon's MTurk, which data people typically use for manual data labelling; as a result, you may want to change the name.

CrowdFlower (https://www.crowdflower.com) is also a very large company in this space already.

Yeah, we also think of it as your personal MTurk. You can do the annotation with your team/network and not have to depend on crowdsourcing it.

Don't you think in that context the name is kinda appropriate?

Amazon's trademark lawyers may see that differently.

"Mechanical Turk" is an 18th century fake chess machine.


Trademarks are scoped to a vertical (business sector/industry).

I don't know the law, but the term "Mechanical Turk" predates Amazon's use of it.

CrowdFlower is, or at least was when I used it, powered by MTurk. It's just an extra abstraction layer on top for things like automated fraud detection & elimination.

There's also the recently released Prodigy, by the same folks who make spaCy: https://prodi.gy (disclaimer: I'm not affiliated, but I participated in their beta).

Yeah, that's true. The commenter above, syllogism, is the founder of spaCy, I guess. It's a similar tool, but for a different purpose.

This seems like a copycat of https://prodi.gy/. I would use that product, because they are actual AI people who will help you customize it for your use case.

The biggest use case for me would be creating a golden data set to serve as a hold-out set to iterate an algorithm on.

That certainly is one use case. Even auto-generated training data almost always contains some noise, and carving the golden set out of it as-is is rarely an option; we always need to do some manual tagging and cleaning. Thanks for the feedback.
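For what it's worth, the hold-out idea discussed above can be sketched as a seeded random split: set aside a fixed sample of the noisy data for manual tagging and cleaning, and keep it out of training. The function name, the 20% fraction, and the toy data below are assumptions for illustration, not anything from Dataturks or Prodigy.

```python
import random


def split_golden(examples, holdout_fraction=0.2, seed=0):
    """Return (train, golden). The golden part is a fixed random sample
    intended to be hand-labelled and never trained on."""
    rng = random.Random(seed)   # seeded so the split is reproducible
    shuffled = list(examples)   # copy, so the caller's list is untouched
    rng.shuffle(shuffled)
    n_golden = max(1, int(len(shuffled) * holdout_fraction))
    return shuffled[n_golden:], shuffled[:n_golden]


# Toy stand-in for auto-generated, noisy training data.
noisy = [f"example {i}" for i in range(10)]
train, golden = split_golden(noisy)
```

Fixing the seed keeps the same items in the golden set across runs, so scores on it stay comparable while the algorithm changes.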
