
Launch HN: Deepgram (YC W16) – Scalable Speech API for Businesses - stephensonsco
Hey HN,

I'm Scott Stephenson, one of the cofounders of Deepgram (https://www.deepgram.com/). Getting information from recorded phone calls and meetings is time-intensive, costly, and imprecise. Our speech recognition API lets businesses reliably turn high-value unstructured audio into accurate, parsable data.

Deepgram started when my cofounder Noah Shutty and I had just finished looking for dark matter in a particle physics lab at the University of Michigan. Noah had the idea to start recording all the audio from his life, 24/7. After gathering hundreds of hours of recordings, we wanted to search inside this fresh dataset, but realized there wasn't a good way to find specific moments. So we built a tool using the same AI techniques we had used to find dark matter particle events, and it worked pretty well. A few months later, we made a single-page demo to show off "searching through sound" and posted it to HN. Pretty soon we were in the Winter 2016 batch of YC (https://techcrunch.com/2016/09/27/launching-a-google-for-sound-deepgram-raises-1-8-million/).

I'd say we didn't know what we were getting ourselves into. Speech is a really big problem with a huge market, but it's also a tough nut to crack. For decades, companies have been unable to get real learnings from their massive amounts of recorded audio (some companies record more than 1,000,000 minutes of call center calls every single day). They have a few reasons for recording the audio: some for compliance, some for training, and some for market research. The questions they're trying to answer are usually as simple as:

  - "What is the topic of the call?"
  - "Is this call compliant?" (did I say my company name, my name, and "this call may be recorded"?)
  - "Are people getting their problems solved quickly?"
  - "Do my agents need training?"
  - "What are our customers talking about? Competitors? Our latest marketing campaign?"
It's the most intimate view you can get of your customers, but the problem is so large and difficult that companies have pushed it into the corner for the past couple of decades, only trying to stop the bleeding. Current tools transcribe with only around 50-60% accuracy on real-world, noisy, accented, industry-specific audio (don't believe the "human level accuracy" hype). When companies start solving problems with speech data, they first want transcription that's accurate. After accuracy comes scale, another big problem. Speech processing is computationally expensive and slow. Imagine trying to get into an iterative problem-solving loop when you have to wait 24 hours to get your transcripts back.

So we've set our sights on building _the_ speech company. Competition from companies like Google, Amazon, and Nuance is real, but none of them approaches speech recognition the way we do. We've rebuilt the entire speech processing stack, replacing heuristics and stats-based speech processing with fully end-to-end deep learning (we use CNNs and RNNs). Using GPUs, we train speech models to learn customers' unique vocabularies, accents, product names, and acoustic environments. This can be the difference between correctly capturing "wasn't delivered" and "was in the liver." We've focused on speed because we think it's essential for exploration and scale. Our API returns hour-long transcripts interactively, in seconds. It's a tool many businesses wish they had.

So far we've released tools that:

  - transcribe speech with timestamps
  - support real-time streaming
  - have multi-channel support
  - understand multiple languages (in beta now)
  - let you deeply search for keywords and phrases
  - transcribe to phonemes
  - get more accurate with use
Some of those are better mousetraps for things you're familiar with, and some are completely new levers to pull on your audio data. We've built the core on English, but now we're releasing the tools for all of the Americas. (Aside: you can transfer-learn speech, and it works well!)

Accuracy will continue to improve for transcription, but I think we can do more. It's such a large problem, and we really want to make a dent in "solving speech". That means truly asking: "What can a human do?"

People can, with little context, jump into a conversation and determine:

  - What are the words? When are they said? Who said what?
  - Is this person young/old? Male/female? Exhausted/energetic?
  - Where is there confusion?
  - What language are they speaking? What's the speaker's accent?
  - What's the topic of the conversation? Small talk or real? Is it going well?
Some of those things are being worked on now: additional language support, language and accent detection, sentiment analysis, auto-summarization, topic modeling, and more.

We'd love to hear your feedback and ideas.
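
To give a feel for what using the API looks like from code, here's an illustrative sketch of posting audio and reading back word-level timestamps from Python. The endpoint, parameters, and response fields below are placeholders, not our exact interface; the docs have the real thing.

    # Illustrative sketch only: endpoint, parameter names, and response fields
    # are placeholders, not the documented API.
    import requests

    API_URL = "https://api.example.com/v1/listen"   # placeholder endpoint
    API_KEY = "YOUR_KEY"

    with open("call.wav", "rb") as audio:
        resp = requests.post(
            API_URL,
            params={"timestamps": "true", "multichannel": "true"},  # placeholder flags
            headers={"Authorization": f"Token {API_KEY}",
                     "Content-Type": "audio/wav"},
            data=audio,
        )
    resp.raise_for_status()

    # Assumed response shape: a transcript plus per-word start times.
    for word in resp.json()["words"]:
        print(f'{word["start"]:7.2f}s  {word["word"]}')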
======
btown
(FYI your [https://deepgram.com/v2/docs](https://deepgram.com/v2/docs) links
are giving "error": "Not Found" JSON responses.)

I love progress in this space. Something I also think is necessary, though, is
innovation in the discoverability interfaces around speech data. Can you
search over potential transcriptions weighted by their likelihood, rather than
just doing full-text search on the most-likely transcriptions? Can you
visualize multiple potential transcriptions inline without overloading
someone's visual cortex with information? Can you one-click-to-listen to any
specific line? Can you enable people to switch conversations on the fly to an
"off-the-record" mode, with such confidence that the default can be that every
conversation is highlighted? Can you do all of this from Slack? Can you make
setup a one-click process with Twilio OAuth? Can you do all of this from a web
app that requires no coding?
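
To make the first of those questions concrete, here's a toy sketch of what searching over several weighted hypotheses per segment (rather than only the one-best transcript) could look like; the data shape is invented for illustration:

    # Toy illustration: search every hypothesis the engine considered plausible,
    # weighted by its confidence, instead of only the one-best transcription.
    segments = [
        {"start": 12.4, "hypotheses": [
            {"text": "the package wasn't delivered", "confidence": 0.61},
            {"text": "the package was in the liver", "confidence": 0.31},
        ]},
        {"start": 48.0, "hypotheses": [
            {"text": "a refund was issued", "confidence": 0.88},
        ]},
    ]

    def search(query, segments, threshold=0.2):
        hits = [(seg["start"], hyp["confidence"], hyp["text"])
                for seg in segments
                for hyp in seg["hypotheses"]
                if query in hyp["text"] and hyp["confidence"] >= threshold]
        return sorted(hits, key=lambda h: -h[1])

    print(search("delivered", segments))  # finds the 0.61-confidence hypothesis at 12.4s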

All this, I'm sure, is part of an ecosystem that will be built on tools like
yours, and that ecosystem fundamentally depends on the quality of the data -
so it makes sense for you all to focus there first. But to the extent you want
to capture the entire "stack," there's a tremendous space for someone to take
the level of "passion" for data quality and apply that same instinct to
quality-of-experience.

~~~
stephensonsco
This is a seriously fertile area where you get to "define the new interface".

It's a big problem though, since few buyers know they want those things.
Around 95% of customers come into it with "give me the transcripts" and
discover over time they want these other things too (some graphical, some
technical). They just didn't know it was available.

New GUIs and data representations are a big part of it. Getting accuracy and
scale in place is a big part. Building awareness and distribution of what's
possible now is another big part.

Re: the JSON error: we fixed the doc link error you saw (it was pointing to
the wrong place since we _just_ updated it).

The real docs link is:
[https://brain.deepgram.com/docs](https://brain.deepgram.com/docs)

~~~
btown
Tableau (and the general business analytics space) has done a good job of
reframing the problem as: "don't think about what you want as a leader at a
company; instead, democratize data access so your team can decide what it
wants, and pay for democratization, not for your own features." See for
instance: https://www.forbes.com/sites/briansolomon/2016/05/04/how-tableau-built-a-3-billion-data-empire-on-top-of-beautiful-charts/#5622a32610ea

Arguably Elastic is a success story about bridging the worlds of an API-first
technical stack with a democratized non-technical analytics framework. And
they started by just powering excellent search, and building value-add layers
over time. But they built into a then-vacuum of API offerings, whereas there
are many other (potentially inferior, but well-funded) speech-to-text APIs.
I'll be avidly following you guys as you navigate the space, and hopefully
you're able to find some good "hooks" or uniquely-easy-to-roll-out integration
stories that strike a balance between focus on technical excellence and
driving awareness in a super-linear way.

------
trevyn
> _Noah had the idea to start recording all audio from his life, 24/7_

Want this as a product. :)

~~~
stephensonsco
You find out very interesting things even randomly sampling your life in
audio.

We still come back to this for fun. The original device was an Intel Edison,
but recent variants have been based on the Raspberry Pi Zero W.

------
vitovito
Do you plan to offer something around one-shot machine transcription with
offline/on-prem search?

I have ~200k hours of legacy audio I'd love to be able to do a fuzzy
(phonetic?) search on, to pull content from and get real (human-edited)
transcriptions of important stuff to resurface it, but there's not a lot of
incentive to push it through a service for a quarter million dollars and then
also pay to store and search it, since we're currently doing without it. Doing
it at extremely low priority, delivering it over a long span of time, for an
order of magnitude cheaper, with our IT standing up some stock fuzzy search
engine, is a pretty easy sell, though.

~~~
stephensonsco
We do custom models (we train the full DNN, not just tack a new text language
model on top) using transfer learning, and it works for small numbers of examples too.

Glad to hear you ask about fuzzy search. That's something we do (it's actually
what Deepgram started on!). It's not in the docs at the moment (it tends to
confuse people who are looking for transcription; we're working on how to
present it better). You can submit audio with queries and get back confidences
and timestamps.

Often the model doesn't need any training, but training does increase accuracy,
and results can get really good if it's focused (it's a lot like wake word
detection; we don't offer WWD as a real product yet either, just saying the
challenges are similar). The best thing to do is search for phrases if you can;
that really helps the signal-to-noise ratio.
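
For a rough feel, the fuzzy-search interaction looks something like the sketch below; the endpoint and field names are placeholders for illustration, not the actual parameters.

    # Illustrative only: submit audio plus a query phrase, get back per-hit
    # confidences and timestamps. Names here are placeholders.
    import requests

    with open("legacy_tape_001.wav", "rb") as audio:
        resp = requests.post(
            "https://api.example.com/v1/search",        # placeholder endpoint
            params={"query": "shipment was damaged"},    # phrases beat single words
            headers={"Authorization": "Token YOUR_KEY",
                     "Content-Type": "audio/wav"},
            data=audio,
        )

    for hit in resp.json()["hits"]:                      # assumed response shape
        print(f'{hit["start"]:8.2f}s  confidence={hit["confidence"]:.2f}')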

------
dumbfoundded
Hi! Thanks for sharing. I have a few questions.

- How does your WER compare to other engines? https://medium.com/descript/which-automatic-transcription-service-is-the-most-accurate-2018-2e859b23ed19

- How do you gather data?

- Where do you see your long-term differentiation? Is it the features you
build on top of other engines, or is it the engine itself?

Disclaimer: I led engineering for temi.com (a competitor of yours) but am no
longer affiliated with it.

~~~
stephensonsco
WER is a hard metric to nail down because there is so much parameter space
being flattened into one number. It also doesn't address the "I care about
these five high-value words (that are made up), can you recognize them?" case,
like product names and company names.

There are ~4 types of audio:

Phone calls: close microphone, conversational, low-bandwidth audio, two-way
conversation, more industry-specific terminology.

Meetings: 2-5 people, conversational, far-away mic, better-bandwidth audio,
more industry-specific terminology.

Broadcast: usually good diction, close mic, good-bandwidth audio, more general
terminology.

Command & control (saying to your phone: "go to <this address>"): close mic, or
an array of mics far away; short audio chunks (2-10 seconds); spoken in a way
that makes it easier to recognize (learned behavior); usually a lot of widely
known named entities are said.

In that fully aggregated lineup I bet we'd be in the 22-24% WER pack. That's
mostly because we focus only on phone calls and meetings; we don't try to
improve command & control, broadcast, or podcast-type audio yet. Broadcast,
because it's perceived as lower value, so customers tend not to pay for good
recognition there (we do train models to make them better for specific
customers/verticals, usually a 20-40% reduction in errors, but the buyer has to
have a budget for it for now; there are ways to make it cheaper in the long
term). Command & control, because you have to have a fleet of devices out in
the field collecting data and driving use cases, and we don't have customers
there yet.

~~~
dumbfoundded
I guess maybe a better way to ask is which acoustic environments do you excel
in?

In terms of gathering data, I'm curious how you plan to get the 15K audio hours
it takes to train each of these models. The more you want to segment it (say,
by acoustic environment or gender), the more data you need. Do you have a cheap
way of generating high-quality data?

~~~
stephensonsco
If you're training from scratch, around 10k hours are needed to get a good
model, but with transfer learning you don't need nearly that much (100 hours
gets you a lot).

We excel in phone call and meeting settings, i.e. the typical
sales/office/support environment.
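
Mechanically, the idea is roughly as in the generic sketch below; this is not our actual architecture or training code, just an illustration of freezing pretrained lower layers and fine-tuning the rest on a small in-domain set.

    # Generic transfer-learning sketch, not Deepgram's actual stack: load a model
    # pretrained on a large general corpus, freeze the acoustic front end, and
    # fine-tune the upper layers on a small domain-specific dataset.
    import torch
    import torch.nn as nn

    class TinyASR(nn.Module):                  # stand-in for a pretrained CNN+RNN model
        def __init__(self, n_feats=80, n_tokens=32):
            super().__init__()
            self.encoder = nn.Sequential(nn.Conv1d(n_feats, 128, 3, padding=1), nn.ReLU())
            self.rnn = nn.GRU(128, 128, batch_first=True)
            self.head = nn.Linear(128, n_tokens)

        def forward(self, x):                  # x: (batch, time, features)
            h = self.encoder(x.transpose(1, 2)).transpose(1, 2)
            h, _ = self.rnn(h)
            return self.head(h)

    model = TinyASR()
    # model.load_state_dict(torch.load("general_english.pt"))  # hypothetical checkpoint
    for p in model.encoder.parameters():       # freeze low-level acoustic features
        p.requires_grad = False
    optimizer = torch.optim.Adam(
        (p for p in model.parameters() if p.requires_grad), lr=1e-4)
    # ...then train on the ~100 hours of in-domain audio mentioned above.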

~~~
metildaa
Baidu trained their DeepSpeech model with 6,000 hours of English to get a model
similarly accurate to Google's/Microsoft's; it may just be the type of quick
model you're using that needs 10k hours to achieve good results.

Mozilla's DeepSpeech is quite interesting: languages like Turkish can get a
decently usable (~20% WER) model with just 80 hours of training data (no
transfer learning, starting from a clean slate).

~~~
stephensonsco
Yep, all good points. One thing to consider is that generalization is a big
problem. It's easy to get good results on a specific dataset nowadays (a 5-10%
word error rate on academic datasets), but that same model might do 40% WER on
data in the wild.
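
For anyone following along: WER is just word-level edit distance divided by the number of words in the reference transcript, so one number hides all of the dataset choices above (and can even exceed 100%). A minimal sketch:

    # Word error rate = (substitutions + deletions + insertions) / reference length,
    # computed with a standard dynamic-programming edit distance over words.
    def wer(reference: str, hypothesis: str) -> float:
        ref, hyp = reference.split(), hypothesis.split()
        d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
        for i in range(len(ref) + 1):
            d[i][0] = i
        for j in range(len(hyp) + 1):
            d[0][j] = j
        for i in range(1, len(ref) + 1):
            for j in range(1, len(hyp) + 1):
                cost = 0 if ref[i - 1] == hyp[j - 1] else 1
                d[i][j] = min(d[i - 1][j] + 1,         # deletion
                              d[i][j - 1] + 1,         # insertion
                              d[i - 1][j - 1] + cost)  # substitution / match
        return d[len(ref)][len(hyp)] / len(ref)

    # Prints 1.0: two substitutions plus two insertions against a four-word
    # reference, even though half the words were captured correctly.
    print(wer("the package wasn't delivered", "the package was in the liver"))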

------
jaredwiener
This looks interesting. Curious what the pricing is? I don't see it on the
website.

~~~
stephensonsco
Pricing starts at $1/hr, billed in 1-second increments. Frequently we charge
less than that, since the price drops with volume, and businesses typically
have a steady amount running through (a few thousand hours). Medium usage would
be $0.25-0.75/hr (on the order of 10,000 to 100,000 hours a month). Large usage
is around 10,000+ hours per day, and the price per transcribed hour can be much
lower (like $0.15/hr).

That's the ballpark for cloud + batch mode. If it's cloud + realtime, it's a
little more. If you need it on-premise, it's a little more (we work with
integration partners to do parts of it).

Pricing for speech is interesting since there's more than just $/hr in the
equation. Usually businesses care about turnaround time, throughput,
failover/availability, and a collection of features. So we usually want to
talk about those goals and price accordingly to support 'em. I definitely wish
I had a better way to frame it than "it's complicated"!
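
As a very rough illustration of how the volume tiers above shake out (the cutoffs and the mid-tier rate here are my own simplification, not a formal price sheet):

    # Rough cost illustration using the ballpark rates quoted above; cutoffs and
    # the mid-tier rate are simplified assumptions, not a formal price sheet.
    def monthly_cost(hours_per_month: float) -> float:
        if hours_per_month >= 300_000:    # roughly the 10,000 hrs/day scale
            rate = 0.15
        elif hours_per_month >= 10_000:   # medium usage
            rate = 0.50                   # midpoint of the $0.25-0.75/hr range
        else:
            rate = 1.00                   # list price
        return hours_per_month * rate

    for hours in (2_000, 50_000, 400_000):
        print(f"{hours:>7,} hrs/mo -> ${monthly_cost(hours):>12,.2f}")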

~~~
Donald
What about custom models?

~~~
stephensonsco
There's no additional charge for training a custom model when your usage is a
minimum of 10k hrs/mo.

~~~
metildaa
So basically a $10k monthly commit is required to train a custom model? Would
it be possible to pay for the training itself if you are a lower volume user?

------
ivankirigin
With multiple speakers, can you identify who is speaking?

If you were in a conference room with multiple threads of conversation, could
you tease out all of them?

~~~
stephensonsco
Best to say "yes! but only some of the time". It's something we're working on
right now. You can be 80% accurate, by some metric, but it's still not good
enough usually to pass a human's sniff test. Good speaker labeled audio in
various settings is hard to find.

There are several ways to look at this problem too.

L1: the exact speaker is known (voiceprint) and can be picked out of all humans
with accuracy, even when others are talking.

L2: the exact speaker is known from a subset of people, even while talking in a
conversation with others.

L3: speakers 1, 2, 3, ... are identified accurately.

L4: speaker changes are identified accurately.

L1 is a really hard problem. L2 is fine if you don't care about the time domain
(knowing exactly when they spoke), but it's harder if you have to accurately
detect changes. L3 is about as hard as L2, but the big goal isn't who anymore,
it's when. And L4 is easier, kind of like putting line breaks in when
transcribing a file by hand. Not too bad. All of them need better data sources.

------
pouta
How does this compare to Trint in terms of speech recognition performance?

