
Launch HN: Datasaur (YC W20) – data labeling interface for NLP - flyx
Hey HN community -

I’m Ivan from Datasaur ([https://datasaur.ai/](https://datasaur.ai/)) - we
build software that lets humans label data more efficiently for training
natural language processing (NLP) models.

NLP algorithms are being trained in a wide variety of industries - from
customer service to legal contracts, forum moderation to restaurant reviews.
All these algorithms benefit from recent breakthroughs in academia and a
generous open-source community. However, in order to be deployed to the real
world, they require a custom set of training data to learn and understand the
language unique to each industry. Therefore, people around the world are
meticulously labeling data samples.

Example sentence: _London is the capital and largest city of England and of
the United Kingdom._

Labels: “London” -> “capital”, “United Kingdom”

Labels: “London” -> “largest city”, “England”

In the last few years I’ve worked at companies such as Apple and Yahoo and
noticed that many organizations tend to reinvent the wheel when creating
labeling interfaces for their labelers. Some companies still do this work in
Excel. We saw an opportunity to create a "single interface to rule them all" -
to handle all sorts of text labeling tasks.

We leverage existing NLP capabilities to intelligently validate the quality of
labels in a document and complement human judgment. Furthermore, we already
understand terms like “Starbucks” and “New York” - why spend time labeling
these terms from scratch every time? We created an API so you can plug in
existing models to apply a first pass on labeling the document. We also built
many other extensions to help labelers optimize their time - a “find and
label” extension for labeling repetitive terms, a dictionary extension for
quickly looking up unfamiliar terms. We spent the past year building out the
labeling solution I wish I could have used.

We now handle named entity recognition, parts of speech, document labeling,
coreference resolution (multiple words referring to the same object/person)
and dependency parsing (drawing relationships between words). A case study
with one of our clients shows 70% improved labeling efficiency upon adopting
the Datasaur platform, and we have much more room to improve.

We have also spoken with 100+ AI teams globally and identified the best
practices in labeling. In addition to providing an enhanced interface, we can
help track labeler performance, peer disagreement scores, and detect/remove
labeler bias. By incorporating and encoding these features into our software,
we can not only help improve the labeling efficiency but also improve the
quality of the data and therefore the resulting AI model.

We believe that as AI becomes ever more prevalent and ubiquitous, labeling
will become an increasingly important task. AI is a garbage-in, garbage-out
technology, and the quantity and quality of data can often make a critical
difference in the resulting AI model. We’re really excited to open Datasaur up
to the world today and hear your feedback. Have you run into similar labeling
issues? What tips and tricks have you employed to keep up with AI’s voracious
appetite for data? We’d love to hear how you’ve tackled data labeling at your
own companies. Thanks so much in advance!

Ivan
======
zachguo
It resembles an open source annotation tool that has existed for years.
[https://brat.nlplab.org/](https://brat.nlplab.org/)

It doesn't include a ML assistant though.

We have built a semi-automated annotation tool for our internal use too. ML
models help classify documents and extract named entities by making
suggestions. Sometimes I'm thinking of spinning it off as a standalone product
but not sure how big the market would be.

~~~
agusgun
I think the semi-automated annotation tool is quite similar to Prodigy

~~~
flyx
Both can be described as semi-automated annotation, but we use different
approaches! Datasaur allows you to plug in any pre-existing model to pre-label
or validate your labels. One such model we integrate with is spaCy, so we're
certainly fans of the Prodigy/Explosion team.

~~~
mrgordon
Yes, it is more similar to Figure Eight Text Annotation, which lets you use
pre-existing models including spaCy or bring your own model

[https://www.figure-eight.com/platform/machine-learning-assis...](https://www.figure-eight.com/platform/machine-learning-assisted-text-annotation/)

------
aliml85
Looks great, Ivan. Congrats! If I understand correctly, you would use spaCy and
some pre-trained models to validate human labels. My question is: what is the
point of collecting labels for ML training if we already have a valid model for
the same task that can complement human labels?

------
aliakhtar
Cool project, what would be cooler is if you had an API to retrieve the labels
for a given word. Maybe that's in the works?

~~~
flyx
Done and shipped! :D One of our extensions allows you to plug in an API -
either use your own model, or an integration with an open-source project like
spaCy to apply labels.
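
A minimal sketch of what plugging a model into a first labeling pass could
look like. This is not Datasaur's actual API: the `Span` shape and the names
`gazetteer_model` and `prelabel` are hypothetical, and a small dictionary
lookup stands in for a real model such as spaCy.

```python
from typing import Callable, List, Tuple

# A span is (start_char, end_char, label). This shape is an assumption.
Span = Tuple[int, int, str]
Model = Callable[[str], List[Span]]


def gazetteer_model(text: str) -> List[Span]:
    """Stand-in for a real model: finds known terms from a tiny dictionary."""
    known = {"Starbucks": "ORG", "New York": "GPE"}
    spans = []
    for term, label in known.items():
        start = text.find(term)
        while start != -1:
            spans.append((start, start + len(term), label))
            start = text.find(term, start + len(term))
    return sorted(spans)


def prelabel(text: str, model: Model) -> List[dict]:
    """Run any pluggable model and mark each span as a suggestion for review."""
    return [
        {"text": text[s:e], "start": s, "end": e, "label": label,
         "status": "suggested"}
        for s, e, label in model(text)
    ]


suggestions = prelabel("I met Ana at Starbucks in New York.", gazetteer_model)
for suggestion in suggestions:
    print(suggestion)
```

Any callable that maps text to spans can be passed as `model`, so swapping the
dictionary for a trained model would not change `prelabel`. Marking spans as
"suggested" keeps the human labeler as the one who accepts or rejects them.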

------
staticautomatic
Could you please elaborate on what you mean by "intelligently validate the
quality of labels in a document and complement human judgment", and discuss
your methodology?

This seems to operate under the assumption that human labels are not actually
the ground truth. I understand that they can be dirty, but most unsupervised
approaches aren't producing a ground truth, either. So, are you saying it's
better to have multiple pretty good sources of truth instead? Because
depending on the application, that might make sense or it might be like trying
to start a farm with a dead horse and a dead cow.

~~~
flyx
Certainly. Our philosophy is to complement human wisdom with computer
precision. Humans may often be labeling for 8 hours a day and may get
fatigued. So if Starbucks has been labeled as a cafe 35x in a document and as
a person 2x, we can flag this and ask "hey, are you sure you wanted to label
this as a person?". Or if we know for a fact Canada is a country, but it's
labeled as an animal in a document, we can raise a flag as well. This won't
work for everything, but we think it can help with quality assurance.
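
As a rough illustration of the first check (the Starbucks example), here is a
sketch that counts the labels each term receives in a document and flags the
rare ones. It is only an assumption about how such a check could be written,
not Datasaur's implementation; `flag_minority_labels` and the `max_share`
threshold are hypothetical.

```python
from collections import Counter, defaultdict
from typing import Dict, Iterable, List, Tuple


def flag_minority_labels(
    annotations: Iterable[Tuple[str, str]],
    max_share: float = 0.1,
) -> List[dict]:
    """Flag labels that a term received in only a small share of its uses.

    `annotations` holds (term, label) pairs from one document.
    `max_share` is a hypothetical threshold chosen for this example.
    """
    counts: Dict[str, Counter] = defaultdict(Counter)
    for term, label in annotations:
        counts[term][label] += 1

    flags = []
    for term, label_counts in counts.items():
        total = sum(label_counts.values())
        majority, _ = label_counts.most_common(1)[0]
        for label, n in label_counts.items():
            # A non-majority label used rarely is worth a second look.
            if label != majority and n / total <= max_share:
                flags.append({"term": term, "label": label,
                              "count": n, "majority": majority})
    return flags


doc = (
    [("Starbucks", "cafe")] * 35
    + [("Starbucks", "person")] * 2
    + [("Canada", "country")] * 3
)
flags = flag_minority_labels(doc)
print(flags)
```

With 35 "cafe" and 2 "person" labels, "person" covers about 5% of the
occurrences and is flagged for review. The labeler still decides whether the
flagged label is really a mistake.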

------
IanCal
Looks interesting, signed up to try this out and see if it might deal with
some of our labelling.

A few notes -

Some help docs would be good, or better links to them. You specify a few types
of projects but don't really explain what they are - I tried searching for
"constituency" type projects but I have no idea what they are still.

You're sending error messages to the frontend. "Cannot read property
'startsWith' of undefined" is not something that should be reaching an end
user, and this is happening unreliably when I upload files.

If I upload a CSV file I cannot seem to do NER from "new project". Choosing NER
specifically shows support for TSV but not CSV. My TSV files just say "server
error", though they do load as plain txt files.

What's a question set? What's the format you need from me (it just says
"csv").

Autolabelling seems to do nothing. Do you have example text where that should
work?

I can navigate the text but the hotkeys for labelling don't do anything until
I've already clicked once.

Search & label all is interesting but doesn't seem to give me any labelling
options. Also the regex search for "someword \w" just returns all two words
next to each other which seems wrong to me.
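
For reference, this is what I would expect from that pattern under standard
regex semantics. The sketch uses Python's `re` module; which regex flavor the
tool uses is unknown to me, and the sample text is made up.

```python
import re

text = "someword alpha, otherword beta, someword gamma"

# \w matches exactly one word character, so each match is
# "someword " followed by a single letter.
single_char = re.findall(r"someword \w", text)

# \w+ extends the match to the whole following word.
whole_word = re.findall(r"someword \w+", text)

print(single_char)
print(whole_word)
```

Either way, "otherword beta" should not match, because the pattern requires
the literal "someword" first.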

Congrats on the launch!

~~~
flyx
Hi IanCal - Really appreciate you taking the time to test Datasaur out and
provide this feedback! Responses below:

- We took the risk of launching before our tutorial is in place to get
  feedback, so I do apologize for the lack of clarity for first-time users.
- Good point, we'll try to improve error messages for users.
- We've been taking import formats on a case-by-case basis, but we're looking
  to improve/expand our import/export format flexibility further.
- This is probably one of the highest priority issues for us to fix - making
  sure expected format and input is clear to the user.
- Happy to send you additional sample files and instructions on how to get
  autolabelling working. We can accept your own models/api endpoints as well!
- Hotkeys for labeling should work - this bug seems odd. What kind of project
  is this for?
- Search and label should work as soon as labels have been uploaded, sorry
  that it wasn't working for you.

Happy to discuss any of this in further detail. Thanks again for leaving such
comprehensive thoughts!

------
Shenglong
This is awesome--really excited to see this need being solved.

------
crimsalis
Congrats on the launch! I spend more than 50% of my time labeling data and
this will make life much easier.

------
sailfast
This looks awesome! Waiting for my email confirmation.

I was looking for information about where my data has to be hosted to use this
service and could not find it. Will there be some more information about how
this data is handled once I get past the login? Thanks!

~~~
flyx
We offer both a hosted service on AWS as well as an on-prem solution if
needed. We can even choose to host on the cloud provider of your choice -
happy to work with you on this!

------
andrewnc
This is very cool! I especially love the logo. Congrats on the launch and best
of luck.

~~~
flyx
Always happy to hear people complimenting the logo. I've been told it's not
professional enough, but I really wanted our site to have some personality.
Thanks!

------
gault8121
In the spreadsheet view, do users need to upload labels as a text file to then
assign them to items? I work with Quill.org, a nonprofit edtech tool that
helps students improve their writing skills, and we do a lot of labeling work
now where we may need to assign say one of 20 labels to 1,000 responses at a
time. I uploaded some sample data, but didn't understand how I could quickly
assign labels to my content. Please let me know if I'm missing something here.

~~~
flyx
Hi there - you may choose to upload labels as a text file or create your own.
I'd be curious to hear more about your use case in batch-applying labels. I'll
follow up offline (well, via email).

------
milani
Congratulations on the launch!

To understand the scope of your work a little bit, if I have Prodigy with
custom labeling needs set up for me, do I still benefit from switching to
Datasaur?

~~~
flyx
Apologies for the delay! There is some overlap with what Prodigy works on and
I'm a big fan of what they're working on. We cover some additional use cases
(like coreference parsing) and additionally help with managing teams of
labelers. We're complementary in many regards. Happy to discuss further, based
on your labeling needs.

------
narrationbox
This looks wonderful, will definitely try it out. We ran into the labeling
issue when doing NER a couple years ago on Reddit books dataset. If only this
existed then.

~~~
flyx
Thanks for the kind words! Yea we're building out what I wish we had at my
last few companies. Looking forward to your feedback.

------
dunky11
Wish you good luck, the website looks clean, the product idea is good :)
However, you request an image whose width is 3000+ pixels:
[https://s.datasaur.ai/static/media/homepage-hero.4917b8af.pn...](https://s.datasaur.ai/static/media/homepage-hero.4917b8af.png).
1200px in width should be enough. I would resize the image, since it slows
down the page.

~~~
flyx
Yikes - good point. We'll optimize.

------
comet_trail
Interesting product. Could have used this at previous companies. How is this
different from FigureEight or Scale?

~~~
veeralpatel979
Scale offers labeling as a service. Datasaur is an interface that companies
can buy for their own labeling personnel, if I understand correctly.

~~~
flyx
That's right! Scale probably has some awesome internal tools that help them
label faster. Datasaur wants to make those same optimizations available to
anyone with their own labelers.

------
hbcondo714
Any chance you could support HTML files? We've been using
[https://www.tagtog.net/](https://www.tagtog.net/) for some of our data
labeling / annotations needs but their tool for these file types is still
"experimental".

~~~
flyx
Sure can! Would love to hear more - what do you want to extract from the HTML
files?

~~~
hbcondo714
Thanks! We actually just need to be able to upload HTML files and have it
rendered as a web page (and not just display the HTML code) so our team can
data label / annotate certain sentences throughout the document.

~~~
flyx
Yea, we can 100% handle this. If you sign up for a demo, happy to discuss
further!

------
inerte
LinkedIn suggested a post from you a couple weeks ago and I remember thinking
“what’s Ivan up to?” and I saw Datasaur. Congrats on YC! I know that our time
at Yahoo was a brief overlap but I remember the swirl of ML, Knowledge Graph
and labelling our org was at 5 years ago.

Good luck with Datasaur!

- Julio Nobrega

~~~
flyx
Julio - great to hear from you, and thanks for the kind words :) In many ways,
my journey to Datasaur began with that team/project 5 years ago.

------
lerax
This name is the best part of the project (and the project itself is already
an awesome tool).

------
_prometheus
Datasaur looks awesome! Can't wait to try it out. Congrats on the launch!

Curious about data security and privacy? How do you guarantee privacy? Is
there some cryptography or secure enclaves used? Some sets of documents (and
email) are super high trust.

Guessing the on-prem version is probably safest route

~~~
flyx
Thanks - good to see so many people concerned about privacy here. I consider
privacy a top-level priority at Datasaur - all data is fully encrypted. Our
employees will never be able to see or access any customer data. We already
work with a bank and cleared their security bar :)

------
boreas
I've got a question, a lot of startup websites have a similar look to this
one. It's a look I actually really like. What technologies are they all using?
How would I build a site like this?

wappalyzer doesn't give me anything and I don't have a ton of webdev
experience.

~~~
flyx
So here's the secret. Our awesome designer actually put a lot of hard work
into putting this together. 1 week later a friend told me about
[https://www.landen.co/](https://www.landen.co/) and I wish I had used this to
save us some time (don't know the team, just a fan of the product).

------
WFHRenaissance
Very cool logo. Just signed up.

~~~
WFHRenaissance
Following up. Your confirmation email gets flagged as leading to an untrusted
site in Gmail. Might be worth figuring out.

~~~
flyx
Yikes, will look into it. Thanks for the heads up!

------
hbcondo714
On the pricing page, the Growth box shows a checkmark for "Unlimited labels"
but right below in the "Choose the right plan for you", the Growth plan says
the number of labels is 10,000,000.

~~~
flyx
Great catch! We'll correct it asap. Since you caught it, we'll give you
unlimited labels :)

------
inthewoods
Great idea - and this is an odd comment: I think you're pricing it too low
relative to the number of people in the market. Just my gut - could well be
wrong, wrong, and wrong again.

~~~
flyx
Music to my ears! I think (and hope) you're probably right - so let's consider
this an introductory launch price and assume it'll go up over time ;)

------
mroll
Hey Ivan, this looks great! What are the privacy implications for my data that
I want to label with your tool? I’m assuming I upload it to your servers?

~~~
flyx
Great question! Data privacy is a top-level priority for us. We actually offer
both a cloud-based and on-prem solution. One of our clients needed a fully
on-prem, air-gapped (no connection to the internet) option. Many are choosing
to use us _because_ they can't send their data to outsourced, external parties.

------
braindead_in
Congrats. Do you guys use AllenNLP, by any chance?

~~~
flyx
We've looked into it! So far we've chosen to integrate with spaCy. Can I ask
what you like about AllenNLP?

~~~
gault8121
Hi there, great product! I'm with a nonprofit edtech writing tool that uses
both spaCy and AllenNLP, and we've found AllenNLP's models to be more accurate
for tasks like co-reference resolution. AllenNLP's models are built on top of
spaCy and tools like neural coref. It'd be great if we could harness things
like AllenNLP's semantic role labeling.

~~~
flyx
We actually allow you to plug in any existing model (including your own). So
we should be able to support AllenNLP as well. Happy to chat further!

------
seaturtles
Awesome! Congrats, excited for this!

------
foobaw
Any plans to support image annotation (something similar to what CVAT does)?

~~~
flyx
We currently support image classification. CVAT is a great tool and we'd love
to support all forms and types of data in the future!

------
ymt
Looks awesome! I need to convince my team to use Datasaur

Congratulations on the launch!

------
chownation
Roarrsome, congrats!

~~~
flyx
roar (:

------
felixkurniawan
Congratulations Ivan for the launch! Best of luck!

