
AssemblyAI: speech-to-text API - juancampa
http://assemblyai.com
======
dylanbfox
This is Dylan from AssemblyAI. We're really surprised to see ourselves on HN
tonight!

We had a big launch planned for 4-6 weeks from now, and have been working
towards getting things ready for that. As a result, we're missing a lot of
things we know we need, like benchmarks comparing ourselves to other
services.
Please bear with us!

If you do end up trying the API, we'd love your feedback. We're trying to
build a really simple speech-to-text API for developers that you can get up
and running with in just a few minutes, and that doesn't require you to
integrate <insert big tech co here> into more of your stack.

There's a lot more we offer too, like:

- customization via transfer learning for higher accuracy (language models
now, acoustic models soon)

- support for lossy audio, like low-bitrate mp3 files from phone calls

- transcription of any audio file without having to specify its metadata,
like sampling rate or encoding (see the sketch below)
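
To give a feel for that last point, here's a rough sketch of the flow we're
going for, in Python (the endpoint and field names here are illustrative,
not final docs):

    # Submit a file for transcription -- no sample rate, encoding, or
    # channel info required; the API works that out from the audio itself.
    import requests

    headers = {"authorization": "your-api-token"}    # hypothetical token

    resp = requests.post(
        "https://api.assemblyai.com/transcript",     # illustrative endpoint
        headers=headers,
        json={"audio_url": "https://example.com/phone-call.mp3"},
    )
    transcript_id = resp.json()["id"]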

We're also constantly improving our models for higher accuracy. Every few
weeks we ship accuracy improvements based on improvements to our DNN
architectures, better data, better data augmentation, etc.

In terms of our STT stack, we're using CTC-based acoustic models combined
with RNN-LMs for decoding -- all built in TensorFlow and PyTorch. Happy to
provide more info about our stack if you're interested!
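
If it helps make that concrete, here's a toy sketch of the CTC training
objective in PyTorch (made-up shapes; our real models are much bigger, and
at decode time we rescore beam-search hypotheses with an RNN-LM):

    import torch
    import torch.nn as nn

    T, N, C = 50, 4, 29    # time steps, batch size, characters (+ blank)
    ctc = nn.CTCLoss(blank=0)

    # Stand-in for acoustic model output: per-frame log-probs over chars
    log_probs = torch.randn(T, N, C, requires_grad=True).log_softmax(2)
    targets = torch.randint(1, C, (N, 10), dtype=torch.long)  # label chars
    input_lengths = torch.full((N,), T, dtype=torch.long)
    target_lengths = torch.full((N,), 10, dtype=torch.long)

    loss = ctc(log_probs, targets, input_lengths, target_lengths)
    loss.backward()    # trains the acoustic model end-to-end on characters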

For any questions off HN -- you can reach me at dylan at assemblyai dot com

Thank you!!

~~~
IanCal
Looks interesting, I also want to just say thank you for having a simple _and
clearly shown_ price.

~~~
dylanbfox
Thank you for the feedback!

------
ironfootnz
A very bold statement with no blog post explaining which techniques are
behind it -- that's a hard sell.

> The world's most advanced Speech-to-Text, customized for your application.

~~~
glup
Agreed. Excited to try it, but I want to know a teensy bit more about why I
should switch to this over an in-house DeepSpeech-based solution. The custom
models part in particular seems like it could be a step forward, but right
now there are few details on implementation. I get wanting to keep a
competitive edge, but I think you need a little more detail on what is going
on.

~~~
dylanbfox
Dylan from AssemblyAI here.

We're surprised to see ourselves on HN tonight -- but if you do try out the
API, I would love to hear what you think! Thanks for your interest.

DeepSpeech is an awesome open source project, and we absolutely support open
source speech-to-text. We're going to be open sourcing parts of our stack in
the future as well.

We're a lot more accurate on real-world data (like phone calls, podcasts,
accents, noise, etc.) than the current DeepSpeech model. We're actually less
accurate on LibriSpeech Clean than what DeepSpeech reports, but we've found
Libri Clean isn't very representative of real-world data, and as a result
isn't a great benchmark.

We're planning to put out more thorough benchmarks comparing our API to other
services (including DeepSpeech, Kaldi, and CMU Sphinx) in about 1-2 weeks. If
you want to email me at dylan[at]assemblyai.com I can send you the benchmarks
once we have them.

Another thing that's hard to do is host RNN-based models like DeepSpeech in
production at scale. It gets expensive fast, since you need to do inference
on GPUs to keep latency down. We spend a lot of time optimizing our
infrastructure to keep costs down, so there's a good chance we can host
RNN-based models more cheaply than you could on your own, since we
specialize in this.

------
devilman666
The FAQ page answers all questions:
[https://assemblyai.com/faq/](https://assemblyai.com/faq/)

~~~
rahimnathwani
When I load that page, all I get is a load of nonsense:

''' Q: How is your result compare to Google/AWS? I have hinted that I would
often jerk poor Queequeg from between the whale and the ship—where he would
occasionally fall, from the incessant rolling and swaying of both. But this
was not the only jamming jeopardy he was exposed to. Unappalled by the
massacre made upon them during the night, the sharks now freshly and more
keenly allured by the before pent blood which began to flow from the
carcass—the rabid creatures swarmed round it like bees in a beehive.

Q: What is the Developer plan? In the tumultuous business of cutting-in and
attending to a whale, there is much running backwards and forwards among the
crew. Now hands are wanted here, and then again hands are wanted there. There
is no staying in any one place; for at one and the same time everything has to
be done everywhere.

Q: How does it work? Tousled food truck polaroid, salvia bespoke small batch
Pinterest Marfa. Fingerstache authentic craft beer, food truck Banksy Carles
kale chips hoodie. Trust fund artisan master cleanse fingerstache post-ironic,
fashion axe art party Etsy direct trade retro organic. Cliche Shoreditch Odd
Future. In the tumultuous business of cutting-in and attending to a whale,
there is much running backwards and forwards among the crew. Now hands are
wanted here, and then again hands are wanted there. There is no staying in any
one place; for at one and the same time everything has to be done everywhere.

Q: I'm stucked. What do I do? I have hinted that I would often jerk poor
Support page from between the whale and the ship—where he would occasionally
fall, from the incessant rolling and swaying of both. But this was not the
only jamming jeopardy he was exposed to. Unappalled by the massacre made upon
them during the night, the sharks now freshly and more keenly allured by the
before pent blood which began to flow from the carcass—the rabid creatures
swarmed round it like bees in a beehive.

Still have questions? Contact Us here '''

~~~
smt88
It seems to be placeholder text taken from the novel Moby Dick.

Pretty sloppy to launch without double-checking the FAQ...

~~~
mdrzn
Did they launch, though? Or did someone else just post the link on HN
because they randomly found it?

"We didn’t plan to launch on Hacker News for another few weeks"

~~~
smt88
They published their website...

------
aabajian
Wow, I just spent several hours today trying to integrate Google's
speech-to-text with a simple webapp. Here's a comparison.

Google: "There is a multiloculated fluid collection adjacent to the liver."

AssemblyAI: "There is a multi located fluid collection adjacent to the liver."

Google does better out of the box, but it isn't customizable (e.g. to a
medical vocabulary). I think this is a major limitation of Google's approach,
but certainly something they might offer in the future.

I do think AssemblyAI needs to offer real-time dictation, as this is the
major need in medicine.

~~~
dylanbfox
Dylan from AssemblyAI here. Thanks for trying the API!

Time to implementation is one thing we’re focusing a lot on. We’re really
trying to make it fast to get up and running with Assembly — for example not
requiring you to specify any meta info about your audio (like sample rate,
bitrate, etc).

We do have a real-time endpoint, but it's not production-ready yet -- our
primary use case right now is phone call recording transcription.

If you want to try the real time endpoint, you can email me at dylan at
assemblyai dot com and I would love to get your feedback!

------
StudentStuff
Are there any good guides out there for setting up Sphinx behind an API like
Google's or Amazon's? I'd just rather not schlep audio data off to an
unknown party when I have plenty of excess compute on my KVM cluster, and
accuracy isn't particularly critical.

~~~
dylanbfox
There are some good open source STT solutions out there like Mozilla's
DeepSpeech
([https://github.com/mozilla/DeepSpeech](https://github.com/mozilla/DeepSpeech))
that wouldn't be too hard to put behind an API. We also plan to open source
more of our stack in the future so you can run it yourself.
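
As a rough illustration, a minimal wrapper might look something like this
(the Model() arguments vary by DeepSpeech release and the file paths are
illustrative, so check the docs for the version you install):

    import numpy as np
    from flask import Flask, request, jsonify
    from deepspeech import Model

    model = Model("deepspeech-model.pbmm")    # pretrained model file

    app = Flask(__name__)

    @app.route("/transcribe", methods=["POST"])
    def transcribe():
        # Expects raw 16-bit, 16 kHz mono PCM in the request body
        audio = np.frombuffer(request.data, dtype=np.int16)
        return jsonify({"text": model.stt(audio)})

    if __name__ == "__main__":
        app.run(port=8000)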

Privacy aside, the biggest problem with self-hosting these kinds of models,
in our opinion, is the compute required. The really accurate models are so
large that they require GPUs for inference. You could run the models on fast
CPUs if you don't care too much about latency, but the throughput would be
pretty low. Either way, GPUs and fast CPUs get expensive fast, so our hope
is that by specializing in hosting these models, we can offer you a price
point cheaper than if you were to host them yourself.

------
userbinator
Why the name? As far as I can tell, there is no mention of how to use this
from Asm.

The polling loop in the example screenshot doesn't seem like a very nice
thing to do to their servers either.
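
Something with exponential backoff would at least be gentler, e.g. (where
poll_status is a stand-in for whatever status call their API exposes):

    import time

    def wait_for_transcript(poll_status, max_wait=300):
        delay, waited = 1.0, 0.0
        while waited < max_wait:
            if poll_status() in ("completed", "error"):
                return
            time.sleep(delay)
            waited += delay
            delay = min(delay * 2, 30)    # cap the gap between polls at 30s
        raise TimeoutError("transcript not ready in time")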

~~~
chrismorgan
I assumed that the “Assembly” part of the name would mean that it was a
WebAssembly module—that seems quite a reasonable thing to assume given the
context. But there doesn’t seem to be any significance in the name at all for
Assembly code or WebAssembly.

------
braindead_in
Any benchmarks on the LibriSpeech or TED datasets?

~~~
dylanbfox
Dylan from AssemblyAI. We're working on this! We didn't plan to launch on HN
tonight, which is why the benchmarks aren't ready, but we know we definitely
need these for the community.

We've found most public benchmarks, especially Libri, are not that
representative of the real-world data we see in production. Most real-world
data we see is a lot noisier and has worse recording quality, like low
bitrates and compression from mp3 encoding.

We do worse than the state-of-the-art benchmarks on Libri Clean today, for
example (I think we were around 7% WER last time I checked), but are much
more accurate on real-world data than models reporting 3-5% WER on Libri.
This is why we want to make sure we are thorough when we report our
benchmarks on popular datasets like Libri.
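
(For reference, the WER numbers above are just word-level edit distance
divided by the number of reference words -- a standard definition, nothing
specific to our stack:)

    def wer(ref, hyp):
        r, h = ref.split(), hyp.split()
        # d[i][j] = edit distance between the first i reference words
        # and the first j hypothesis words
        d = [[0] * (len(h) + 1) for _ in range(len(r) + 1)]
        for i in range(len(r) + 1):
            d[i][0] = i
        for j in range(len(h) + 1):
            d[0][j] = j
        for i in range(1, len(r) + 1):
            for j in range(1, len(h) + 1):
                sub = d[i - 1][j - 1] + (r[i - 1] != h[j - 1])
                d[i][j] = min(sub, d[i - 1][j] + 1, d[i][j - 1] + 1)
        return d[len(r)][len(h)] / len(r)

    print(wer("the cat sat", "the cat sit"))    # 1 sub / 3 words ~= 0.33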

~~~
braindead_in
Yeah, I would agree that the academic datasets are not representative,
especially for STT. Our models do around 8.7% on LibrisSpeech Clean. But on
our internal dataset PaddlePaddle's WER is 29% whereas we do around 15%. And
we regularly see higher WER's in production, especially for accented and noisy
files. Hopefully continuous re-training will help improve the generalization.
Here's the output from our model the 6063 file.

    
    
      0:00:00.7 S1: I'd say there's a such thing as eating too much, but I just have a massively fast the table of them and so, I constantly eating so.
    

Do you do diarisation and punctuation as well?

~~~
dylanbfox
Right, we've noticed similar results. We do automatic punctuation now, and
diarization when there is more than one channel in the audio file. We're
launching diarization on single-channel audio with multiple speakers soon:
we're currently focused on improving some of our customization features, and
single-channel diarization will ship after that.

Thanks for sharing your results! We have more samples here if you want to do
more comparisons:
[https://blog.assemblyai.com/2018/08/09/cutting-edge-phone-call-transcription-with-assemblyai/](https://blog.assemblyai.com/2018/08/09/cutting-edge-phone-call-transcription-with-assemblyai/)

~~~
braindead_in
That's great. We have really struggled with diarization. None of the systems
out there actually work! We get close but still mess it up from time to time.

Here are the results on the other files.

    
    
      4333.mp3: Oh yeah, it's still pretty tight, though. It's very challenging. They actually pull everything out of your.
    
      7510.mp3: Demons on TV like that. And for people to expose themselves to being rejected on TV or humiliated by fear factor or.
    
      8036.mp3: Well, I feel like as far as... as far as cursing and language, because I feel like as long as it's not necessarily in context, but.
    
      8522.mp3: Stuff to you, so you don't have to spend any body. He.

~~~
sushanthiray
I've been working on the problem of speaker diarization in the wild for the
better part of this year. Would love to have a chat with you to see if our
diarization utility can help you guys out. Here is the link:
[https://www.deepaffects.com/apis](https://www.deepaffects.com/apis)

My email is in my profile.

------
beautybasics
Congrats! How does this stack up against other offerings? Any benchmarks?

Is there streaming API support, like IBM Watson Speech to Text?

~~~
dylanbfox
Hi there! Thanks for the support. We were very surprised to see ourselves on
HN tonight!

We've been working towards a big launch in a few weeks, which is why we don't
have the benchmarks ready just yet. But we are comparable to most of the big
guys (IBM, Google, etc.) out of the box. And if you customize the API for your
specific use case, you can get a lot better accuracy.

We do have a real-time endpoint, but it's not production-ready yet. Our
primary use case to date has been async transcription for phone call and
podcast recordings, but we are definitely working on making the real-time
endpoint production-ready.

Real-time is just a little bit trickier, because it's more expensive to run
since we deploy our models onto GPUs.

------
gbajson
What do I need to provide to get support for other languages? Will you
provide any module for training?

~~~
dylanbfox
Dylan from AssemblyAI here.

Right now we only support English, but are going to be launching more
languages very soon.

We are also getting closer to launching a transfer learning API, so you can
train your own acoustic models. This _could_ work for new languages if you
had enough data, but we haven't done much testing around this yet. It's
definitely on our to-do list!
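
For a sense of what we mean by transfer learning here, this is the general
idea in PyTorch (purely conceptual -- the layer names and file are made up,
and our actual training API will look different):

    import torch

    model = torch.load("base_acoustic_model.pt")   # hypothetical base model

    # Freeze the early layers that learn generic acoustic features,
    # then fine-tune the remaining layers on (audio, transcript) pairs.
    for name, param in model.named_parameters():
        if name.startswith("encoder.early"):       # illustrative names
            param.requires_grad = False

    optimizer = torch.optim.Adam(
        (p for p in model.parameters() if p.requires_grad), lr=1e-4
    )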

------
epr
Is my math right on this? They only offer volume discounts if you are spending
~130k annually?

------
urvader
Is this Russia wanting to get hold of our data?

From their privacy policy:

”...you may provide us with digital data including, without limitation audio
files, containing Personally Identifiable Information of the customers or
users of your applications that may be disclosed, collected, or stored in
connection with the use of our API. We use this data, which in most instances
is less than all of the data we have received, to provide our services, to
improve our services and for training purposes.”

~~~
dylanbfox
This is very important. We do not store or copy any of the audio data you send
to the API. We do not train on any audio data you send to the API. We’re a
small company and absolutely believe in security and privacy.

We were caught off guard seeing ourselves on HN tonight -- and our ToS is
something we're improving ahead of the launch we had planned in 1-2 months.

The reason we include some of this language is that right now you can send
text data to the API for us to do transfer learning, in order to create a
custom language model for your use case. And soon, you'll also be able to
send audio data plus transcripts to the API for us to do transfer learning
on our base RNN acoustic model, fine-tuning it for your audio data.
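
To make the shape of that concrete (the endpoint and field names below are
made up for illustration -- this isn't our real custom-model API):

    import requests

    # Send domain text so a custom language model can be built from it,
    # e.g. medical phrases like the radiology example elsewhere in this
    # thread.
    requests.post(
        "https://api.assemblyai.com/custom-language-model",  # hypothetical
        headers={"authorization": "your-api-token"},
        json={"corpus": ["multiloculated fluid collection",
                         "adjacent to the liver"]},
    )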

~~~
atrilumen
I sent you (Dylan) an email, but here it is for open discussion:

I'm building a product around an interactive learning system. Users train it
themselves, with our assistance as needed. But I don't want to ship a puppy
that will piss on your rug... and it would be great if it already had some
useful skills and intuitions out of the box...

So I want to carefully anonymize that data by hand, and retain it for training
models that all customers can benefit from. (I also want to train models to
recognize PII, but only to assist humans in doing the task; no amount of error
would be considered acceptable for this.)

I honestly don't see any problem, but I had a security consultant nearly spew
beer on me when I got to that part, and insist that I drop that line of
thinking.

I need more advice on this.

(The only other path I can think of is homomorphic encryption... but I would
not want to retain it in the case where the original is deleted...)

------
jlebrech
I want to make a voice-assisted keyboard.

