
Launch HN: AssemblyAI (YC S17) – API for customizable speech recognition - dylanbfox
Hey HN, I'm the founder of AssemblyAI (https://www.assemblyai.com). We're building an API for customizable speech recognition. Developers and companies use our API for things like transcribing phone calls and building voice-powered smart devices. Unlike current speech recognition APIs, developers can customize our API to more accurately recognize an unlimited number of industry-specific words or phrases unique to what they're building, without any training required. For example, you can recognize thousands of product or person names with our API. Or you can more accurately recognize commands/phrases common or custom to your use case.

We've developed our own deep neural network speech recognition architecture, and aren't using any open source speech frameworks like Kaldi or Sphinx (just TensorFlow). Because of this, we're able to run things more affordably and pass those savings on to developers.

I used to work on projects that had speech recognition requirements before starting AssemblyAI, and saw how limiting, expensive, and hard to work with traditional speech recognition services and APIs were. We want to help developers and companies easily build products with speech recognition.

Would love feedback from the HN community on what we're building, and if you have any questions about deep learning or deep learning in production, ask away!
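The customization workflow described above could look something like the sketch below. The endpoint shape, field names, and values here are all hypothetical illustrations of a "custom vocabulary with no training" request, not AssemblyAI's actual API schema:

```python
import json

# Hypothetical request payload for a customizable ASR API. The field names
# ("audio_url", "custom_vocabulary") are illustrative only, not AssemblyAI's
# real schema. The point is that custom terms are supplied as plain text,
# with no pronunciations, audio samples, or model training required.
payload = {
    "audio_url": "https://example.com/call-recording.wav",
    "custom_vocabulary": [
        "AssemblyAI",
        "Kaldi",
        "warp-ctc",
        "SQL query",
    ],
}

# Serialize for an HTTP POST body (the actual POST is omitted here).
body = json.dumps(payload)
```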
======
asrbash
You seem to be using a slightly tweaked CTC-based architecture built in
TensorFlow (possibly with Baidu's warp-ctc) but marketing it as some super-
secret technology you invented in-house. I don't see any performance
benchmarks or WER results we can compare with other APIs, but the pricing is
the same. Surely a character-based approach lets you add new words without
pronunciations, but that process is not as flawless as you make it seem,
especially when you lack language model data for new words. Now I'm still a
bit confused why somebody would use AssemblyAI over other APIs given the same
price. And FYI you are not using Kaldi / Sphinx because the guys behind them
did not endorse CTC and are purposefully avoiding putting it in there, though
for example Kaldi's chain models are also sequence based. There was also Eesen
that tried to implement CTC on top of Kaldi. Sorry if this came off too harsh,
but I am a little suspicious about the novelty of the approach here.
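For readers unfamiliar with the CTC approach being debated here: a CTC-trained network emits one label (or a special blank) per audio frame, and decoding collapses consecutive repeats and then drops the blanks. A minimal greedy-decoding sketch (pure Python, not anyone's production decoder):

```python
def ctc_collapse(frame_labels, blank="-"):
    """Greedy CTC decoding: merge consecutive repeated labels,
    then remove blanks. frame_labels is the per-frame argmax output."""
    out = []
    prev = None
    for label in frame_labels:
        # A label survives only if it differs from the previous frame's
        # label (repeat merging) and is not the blank symbol.
        if label != prev and label != blank:
            out.append(label)
        prev = label
    return "".join(out)

# Per-frame output "--hhe-ll-lo--" collapses to "hello"; note the blank
# between the two l-runs is what allows a genuine double letter.
decoded = ctc_collapse(list("--hhe-ll-lo--"))
```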

------
phrixus
I remember that 10 years ago Nuance used legal threats to eliminate
competition in this field, to the extent that it greatly discouraged startup
speech recognition companies.

Google was able to get around it just because they became a heavyweight...

Did this significantly change since then?

------
candiodari
> We've developed our own deep neural network speech recognition architecture,
> and aren't using any open source speech frameworks like Kaldi or Sphinx
> (just Tensorflow). Because of this, we're able to run things more affordably
> and pass those savings on to developers.

Kaldi and Sphinx are _far_ more efficient than any TensorFlow transcription
model I've ever seen.

I assume this is an oversight?

~~~
dylanbfox
We can get pretty good throughput with our setup. But the main thing about our
architecture versus others is that it makes the automatic customization we
offer possible. For example, it takes under 2 minutes to customize the API to
recognize ~10,000 custom words/phrases for whatever you're building.

------
trevyn
Your pricing page contains no pricing information.

~~~
dsacco
Don't you just hate that?

~~~
dylanbfox
Haha, totally get this. We wanted to give info about how we bill (i.e., per
second, and we don't charge for customization), but we haven't publicly
disclosed our rate yet. It's pretty low (a fraction of a cent per second).

~~~
trevyn
I'm sure you're aware of this, but for those who aren't, your YC buddy
scaleapi.com offers _human_ transcription for a fraction of a cent per second
as well.

~~~
mipmap04
With scaleAPI, you also have a 1 day delay in getting your response.

------
MycroftJones
Been wanting something like this for years. I have a bunch of old speeches and
radio shows I'd like to transcribe. They all have "terms of art", and no one at
Google would tell me how to train their API to adapt to my use case. Too bad I
missed this beta; hope you allow more people in soon.

Can you clarify: does your API allow me to run the transcriber, pause it when
I see an error, tell it what the corrected text is, then continue with that
correction taken into account?

~~~
dylanbfox
The way it works right now is that you would upload one of those old speeches
or radio shows to the API, and then you'd get back a transcript that includes
all the "terms of art" you customized the API to be able to recognize. Right
now there's no feedback loop for you to tell the API where it was wrong, but
that's something we have in mind to build! We do run QA across our entire API
usage though, and from that are able to improve recognition accuracy over
time. We also release updates to the models that power the API every few weeks
which improves recognition accuracy too.

And then we do provide confidence scores, so we can at least give you some
indication when we're not confident in the automatic transcript we're
returning to you.

If you want to try out the API, you can email beta@assemblyai.com and I will
look for your mail!

~~~
MycroftJones
Thank you, I will. I also wonder, does your API take context into account? For
instance, depending on WHO is talking, and WHAT they've recently talked about
in the recording, one transcription may be preferable to another. People often
recognize other people by the types of things they talk about, characteristic
phrases, etc.

------
empyrical
Small issue I noticed with the email links on the pricing page: they're
swapped, with "Basic" having an "Enterprise Plan" subject line and vice versa.

~~~
dylanbfox
Good catch! Thanks!

~~~
empyrical
No problem! Can't wait to try out the beta, especially the streaming audio
transcription feature!

------
braindead_in
Any WER benchmarks for TED, LibriVox, etc.?

~~~
dylanbfox
Right now we only test against a few internal test sets. On some clean speech
test sets we're at ~8% WER. That's a good idea, though: we should test against
some open-source datasets for reporting accuracy metrics!

~~~
arisAlexis
Shouldn't that be before you release MVP?

------
elipollak
Maybe a silly question but could you use this to recognize phrases or words in
a language other than English?

~~~
dylanbfox
We haven't actually tried that yet. I imagine if you customized your model
with words from another language and then pronounced them with an English
accent, the API might be able to recognize them OK. Would be a fun experiment
to try, at least!

~~~
yorwba
If I understand correctly, "customizing the model" essentially adds new words
to the vocabulary and adjusts the language model to change the probability of
some phrases, but does not require any information about pronunciation, let
alone audio samples.

But isn't having just the English text really error prone, especially when you
are dealing with terms of art and proper names, that might even have roots in
foreign languages? E.g. some people pronounce SQL as "sequel", and the English
pronunciation of French words varies between "French pronunciation with
English accent" and "French orthography interpreted as English orthography".
(I'm guessing your model would tend towards the latter?)

So what I'm interested in is whether you have encountered examples of this
during your testing, and whether you have some way to work around it (I would
try phonemic transcriptions in addition to English); or whether this is not
relevant for the use-cases you are trying to cover and the convenience of just
using English text trumps the accuracy loss due to just using English text.

~~~
dylanbfox
Hey! Great question. Our system is actually able to handle transcribing
"sequel" as "SQL" automatically if you were to "customize the model" for
phrases like "what was my latest SQL query". It can also get words like
"colonel" pronounced "kernel". In both cases, without needing the explicit
pronunciation of the word. We have some customers who've uploaded thousands of
proper names, for example, and we're able to transcribe all of them without
needing the explicit pronunciation. This is possible because our ASR
implementation is pretty different from traditional setups like Kaldi. You're
right that there are some edge cases, especially with foreign words, but we're
working hard on smoothing those out.
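One way to approximate the SQL/"sequel" behavior dylanbfox describes is a post-processing lookup over spoken forms. To be clear, this is a naive hypothetical stand-in; per the answer above, AssemblyAI reportedly handles this inside the model, not via a table like this:

```python
# Hypothetical spoken-form table: maps how a term is commonly pronounced
# back to how it should be written. Real systems would need context to
# disambiguate (e.g. "kernel" is only "colonel" in a military transcript).
SPOKEN_FORMS = {
    "sequel": "SQL",
    "colonel": "colonel",  # identity entries are harmless
}

def normalize(words, table=SPOKEN_FORMS):
    """Rewrite recognized spoken forms to their written terms of art."""
    return [table.get(w.lower(), w) for w in words]

result = normalize("what was my latest sequel query".split())
```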

~~~
yorwba
Sounds amazing! Now I'm _really_ interested how your setup can do that. Will
you publish anything about it, or is this the kind of secret sauce you'd
rather keep secret?

------
DanBC
Is your product compatible with medical privacy law? Could it be made to be
compatible with such law?

~~~
dylanbfox
This is something we are looking into!

------
garysieling
Can you separate multiple speakers in audio when you do the transcription?

~~~
dylanbfox
Thanks for the question! If you have a stereo file, from a phone call for
example, we can produce separate transcripts for each channel... but we
haven't launched any automatic speaker-separation algorithms in the API yet!
Definitely something we want to offer in the future though.

------
sbr464
Just FYI, the CTA buttons near the bottom overlap on mobile.

~~~
dylanbfox
Thanks for the heads up on this!

~~~
mrjaeger
Similarly, menu items in the navbar become unclickable on mobile (or at the
very least when emulating an iPhone 6+ in Chrome).

~~~
dylanbfox
Thanks for taking the time to post this! On it...

------
peternicky
Any plans on a JavaScript SDK?

~~~
dylanbfox
Yup! This is in the works and should be released soon!

------
arisAlexis
Your pricing seems on par with Google and IBM.

------
dayve
Great work, guys. I was excited to see AssemblyAI is free for open-source
projects. Looking forward to seeing big relevant projects hop on the train.

