Automatically transcribe an interview, meeting or video (voicedocs.com)
71 points by MajidMM on May 10, 2021 | 51 comments



Quite a few large companies intentionally do not use audio transcription services, because they don't want the liability of everything everybody has ever said being written down in a format that can easily be searched during legal discovery.


Not against the idea per se, but man, I wish people understood that a recording of a conversation is NOT a replacement for good notes or documentation.

Recording a conversation saves the expert's time in exchange for additional time the student spends looking things up in it. Good notes/docs are easier for students but more expensive for the expert, who needs to spend more time organizing the information properly.

So depending on what you're doing, there might be different tradeoffs.


Could recording and then transcribing with a good editor (with automatic speech-to-text built in) be a good solution?


My point is, recording (even if we talk about something like a chat log) is just data, but to convert it to notes, i.e. information, you need to do additional (editing) work. That is the hard problem.

Sure, searchable conversation data is better than nothing. But it is, by definition, disorganized. I worry about a future where people stop making notes/docs just because they can record everything.


Fun fact: In Germany, most state parliaments and the federal parliament still use hand-written stenography for protocols because it is still the most reliable method (it catches everything: shouts, noises from the crowd, etc.). It wasn't replaced by a typing system because to date there is no typed stenography that keeps up with the speed of hand-written stenography (in German).


I once tried to build a German service for transcribing online meeting calls, similar to what UberConference now offers, by using a cloud API for the STT.

Oh wow was I surprised to see the quality. All of the cloud providers are abysmally bad at transcribing German.

I believe the reason is that in German, you can make up word combinations on the fly and use them as valid nouns. And people do that, if it's convenient or if it enables you to be more precise.

"Dampfschiffahrtsgesellschaft" = Society (Gesellschaft) for Driving (Fahrt) of Boats (Schiff) with Steam (Dampf)
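To illustrate why on-the-fly compounds are awkward for a fixed-vocabulary recognizer, here is a minimal sketch of dictionary-based compound splitting. The function name `split_compound`, the tiny vocabulary, and the linking-"s" handling are all assumptions made for illustration; this is not how any particular cloud STT provider works.

```python
def split_compound(word, vocabulary, min_len=3):
    """Return one split of `word` into known parts, or None if none is found.

    Hypothetical helper: tries the longest known prefix first and allows an
    optional linking "s" (as in Fahrt-s-Gesellschaft) after a part.
    """
    word = word.lower()
    if not word:
        return []  # nothing left to split: success
    for end in range(len(word), min_len - 1, -1):
        head = word[:end]
        candidates = [head]
        if head.endswith("s") and len(head) > min_len:
            candidates.append(head[:-1])  # drop a possible linking "s"
        for part in candidates:
            if part in vocabulary:
                rest = split_compound(word[end:], vocabulary, min_len)
                if rest is not None:
                    return [part] + rest
    return None


# Toy vocabulary (assumption): a real system would need a full lexicon.
vocabulary = {"dampf", "schiff", "fahrt", "gesellschaft"}

print(split_compound("Dampfschifffahrtsgesellschaft", vocabulary))
# ['dampf', 'schiff', 'fahrt', 'gesellschaft']
```

Note that this naive version returns `None` for the pre-1996 spelling "Dampfschiffahrtsgesellschaft", because one "f" was dropped at the seam between "Schiff" and "Fahrt". Even this small case needs special rules.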


> Society (Gesellschaft)

In this context, Gesellschaft translates to Company. (GmbH=LLC)

The spelling also depends on whether you're talking about the historical Erste Donau-Dampfschiffahrts-Gesellschaft or any generic Dampfschifffahrts-Gesellschaft – note the ff vs. fff in the middle; the old company name retains its pre-1996 spelling.

Donaudampfschiffahrtsgesellschaft without hyphens was as far as I can tell never officially used by the company, but used informally as part of the name of the Donaudampfschiffahrtsgesellschaftskapitänstango, a 1930s song.


In my experience, they're pretty poor in English too. Especially when it's ad hoc conversation where people don't finish sentences, repeat themselves, "um" and "err", etc.


A problem most people don't think about when talking about transcription is that people don't talk like books. Not only do you get unfinished sentences and filler words, you also get garbled words, non-standard pronunciation, and so on.

In the case of pronunciation this primarily poses a problem with detecting the intended word, but in other cases "cleaning up" the output may lose contextual information (e.g. what a speaker was going to say before cutting themselves off and using a different word). This is difficult enough for a human to get right, let alone a machine.


Also, when transcription programs (most familiar example: YouTube) fail, they usually fail on the words that a human listener would also have trouble understanding / telling apart. So the transcription is useful if you are deaf or forced to watch the video without sound, but if you're using subtitles because your English is not good enough to understand the speakers without them, their usefulness is pretty limited...


Once you go beyond ~7 words (= what people would utter to their virtual assistant), the quality of all off the shelf tools (both open source and offered services) is laughable. Sentence boundary detection, punctuation, speaker segmentation, and all those features you would need for good transcription are in a really bad state.


"Really bad" is an exaggeration, I think. The auto-transcription features in both Google Meet and Zoom are more than acceptable, they're often very useful in catching missed words during a meeting.

They trip up on technical jargon but handle everyday conversations just fine, including speaker detection, punctuation, idioms, etc.

But that's also a slightly different use case, where each speaker is in their own (somewhat) quiet environment and on separate connections (and thus audio tracks).

It's much harder to do all that after the fact, like with a recorded video.

I find Trint.com, which is partially automatic, to be good for that... the AI does a first pass, and a human cleans it up afterward. YouTube has a similar assisted-auto feature for their captions, minus speaker separation.


After the simplification of the spelling system ( https://en.wikipedia.org/wiki/German_orthography_reform_of_1... ), German got a big advantage: you can write and read nearly everything, even if you don't know its meaning.

Due to its much more complicated grammar, German is much more difficult to learn than English, but at least the spelling is easy.

I wonder why more languages never try to simplify their orthographies. Children could spend years learning useful things, instead of wasting time on spelling.

Controversial opinion here: they should have removed ß (scharfes S) completely. It is still used in some relatively rare cases.


After reading the last line of your comment I put the words back in the German word (in English) and got -

SteamBoatDrivingSociety

Which actually made complete sense even in English!


Note that recording via stenography is a two-step process.

The first is to record the sounds you hear. Look at a common stenographic "alphabet" (often called "shorthand alphabet" though that practice is essentially dead) or at the keyboard of a stenographic machine.

Then the stenographer reads the output (either hand or machine generated) and writes a text using a combination of cues (from the paper) and memory.

This is quite different from trying to do straight speech-to-text.


Any SaaS service that processes spoken or written language should clearly state which languages it supports on the front page.


Very very good point


The pricing on this transcription is very high ($12/hr) for automated transcription. Compare to existing solutions like Descript, Rev, Otter.ai - what makes it so much better?


No, because of data protection.

I won't upload recordings (with possibly sensitive information) to a third party.


There is a data protection policy, of course.


> "Trusted by organizations of all sizes".

Apart from Itep Pictures, all others seem to be in Turkey. Are you based in Turkey?

If yes, allow me to place zero trust in everything-Turkey, under the current government/leadership. I strongly believe that Turkey lacks the basic/fundamental freedoms and that the rule of law is going whichever way this regime's leader wants it to go.

I would similarly hesitate to upload such data to Iran, North Korea, Syria.

If no, where are you based?


No, the company is based in Germany.


Which only confirms that it's impossible to use your service and stay in compliance with GDPR.


Can you clarify? They're a German company and state that they do not share uploaded audio recordings with third parties.

You'll need to sign a DPA with them to be compliant with the GDPR tho, and they'd need to disclose where the data will be stored and processed and how they maintain control over that data if it's a third party.


I'm curious about your service. Can you explain what's different from similar services like Happy Scribe, for instance?

As was already said, you should make clear which languages are supported.

And I think you should put prices in USD and/or euros instead of TL (Turkish lira), ideally euros for European visitors and USD for the rest of the world. Besides the free tier, if I'm serious about a service I'm less keen to test it before knowing what it costs. At first I only glanced at the price and thought it was pretty expensive, before realizing it was expressed in TL.


Sorry, there was a bug on the pricing page. It should now show the prices in USD. The difference is our own speech recognition engine, an easy document-like editor, and a separate subtitle editor.


:s/voicedocs/My own shameless plugin/g

We provide the same functionality (except for the Word export)

+ direct recording and upload from Zoom, Hangouts, etc.

+ video / audio editing & sharing by highlighting which part of the transcript you'd like to keep.

www.spoke.app :)

In 70 languages (see language list here: https://spoke-for-sumo-lings.webflow.io/)


This is a good tool to use when dealing with health insurance in the United States. You should at minimum keep an Excel spreadsheet of date, whom you talked to, which department they are from, purpose of the call, follow up actions, etc.

But, with the way insurance has been going in the US lately, you better be recording and transcribing that call. Usually, if the call line is recorded (basically all US health insurance companies do this) you can legally record the phone call without permission from the other party.

I personally have an NVIDIA Jetson AGX Xavier with AI tools for speech-to-text, person identification, and transcribing, which I use for important phone calls. I use my own AI tools and devices for privacy reasons.


Please let us know what models you use for STT!


Obvious caveat that automatic transcriptions are not a replacement for manual transcriptions. They're better than nothing but the problem with mistakes in automated transcriptions is that they can entirely change the meaning of a statement in ways that are not necessarily obvious if you don't listen to the audio at the same time.

They also struggle with domain specific jargon depending on what data they were trained on. While manual transcriptions will mark ambiguous utterances as such (or ask for additional information), automation can create a false sense of certainty while just "guessing" whatever it matches most closely. This is a hard problem and unlikely to be solved soon.
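One way an automated pipeline could avoid that false sense of certainty is to surface low-confidence words instead of hiding them. The sketch below assumes the engine reports a per-word confidence between 0 and 1; the `Word` type, the `render_with_flags` helper, and the 0.6 threshold are hypothetical and only illustrate the idea.

```python
from dataclasses import dataclass


@dataclass
class Word:
    text: str
    confidence: float  # assumed per-word score from the engine, 0.0 to 1.0


def render_with_flags(words, threshold=0.6):
    """Join words into text, marking uncertain ones as [word?]."""
    parts = []
    for w in words:
        if w.confidence < threshold:
            parts.append(f"[{w.text}?]")  # make the guess visible to the reader
        else:
            parts.append(w.text)
    return " ".join(parts)


words = [Word("take", 0.95), Word("ten", 0.41), Word("milligrams", 0.90)]
print(render_with_flags(words))
# take [ten?] milligrams
```

The threshold is a trade-off: set it too high and the transcript fills with brackets, too low and silent guesses slip through.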


I find they serve different use cases.

ML transcriptions are fast/cheap and they're fine if you mostly want to pull out some quotes or check some things in your notes. But, in general, I find they're not remotely worth my time if I'm going to publish a transcript in which case I get a human transcription. (And even that can be a bit tough with accents, technical jargon, overlapping voices, etc.)


I would agree, but given that automatic transcriptions are cheaper, many people treat them as an alternative even when manual transcriptions would be more appropriate.

Some tech conferences were pretty good about hiring actual people for live captioning, which was great, but with conferences mostly happening online via video streams at the moment, automated captions and transcriptions might seem like an obvious choice if you don't understand the limitations.


I think you’re going to have a hard time competing with the major cloud providers on transcription alone. AWS Transcribe, for example, is quite easy to use and supports batch transcription as well as streaming, custom language models, etc.

There’s still quite a bit of value-add possible on top of that, however. The ability to edit transcriptions is a great start, especially if you maintain timecodes against the media. Developing or curating domain-specific language models to improve accuracy is also a likely option. There also appears to be a lot of interest in using real time transcription to augment live events with content derived from the conversation.

Good luck!
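As a rough illustration of what "maintaining timecodes against the media" can mean in an editor: if each segment stores its start and end time separately from its text, the text can be corrected freely and subtitles can still be exported. The `Segment` type and the helper names below are assumptions for this sketch, not any vendor's data model; the output follows the common SubRip (.srt) layout.

```python
from dataclasses import dataclass


@dataclass
class Segment:
    start: float  # seconds from the start of the media
    end: float
    text: str


def srt_timestamp(seconds):
    """Format seconds as HH:MM:SS,mmm as used in .srt files."""
    ms = round(seconds * 1000)
    hours, ms = divmod(ms, 3_600_000)
    minutes, ms = divmod(ms, 60_000)
    secs, ms = divmod(ms, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d},{ms:03d}"


def to_srt(segments):
    """Render segments as numbered SubRip blocks separated by blank lines."""
    blocks = []
    for i, seg in enumerate(segments, start=1):
        times = f"{srt_timestamp(seg.start)} --> {srt_timestamp(seg.end)}"
        blocks.append(f"{i}\n{times}\n{seg.text}\n")
    return "\n".join(blocks)


segments = [Segment(0.0, 2.5, "Helo everyone"), Segment(2.5, 4.0, "Welcome")]
segments[0].text = "Hello, everyone."  # an edit changes text, not timing
print(to_srt(segments))
```

Edits that merge or split segments would need extra bookkeeping (for example, word-level timings), which this sketch leaves out.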


Thanks for the note! As you have stated, there's still a lot of work for a researcher/journalist after getting the raw transcription, so a good editing tool that syncs audio and text is valuable here.


I wouldn't feel easy uploading sensitive information for "transcription". Who is this service for? As an interviewee I also wouldn't consent to a potential employer disclosing my information in such a way.


I’ve used Sonix.ai for this type of thing before. Simple pricing and nice editor. The only thing it could benefit from is speaker identification.

Apart from your service offering an on-prem option, is there much else that's different?


Considering the AWS API is essentially open for everyone to start a transcription service, what exactly is the difference here? If you know what you're doing you can build this in about 4 hours.


And "Dropbox is just SVN mounted on top of curlftpfs" - doesn't mean there isn't a market to make technological capabilities more easily accessible for the masses.

That being said, I am skeptical about the quality and would like to see some demos. Audio recordings of meetings are especially difficult to transcribe accurately.


That's a great point, but I'm seeing an absolute explosion of transcription services which are all essentially based on AWS.

The only real innovation here is when this is combined with language learning apps to help me practice my Chinese pronunciation, but even then I know I'll have to look to hire a tutor soon.


I wonder if you could transcribe really-real time into something like Etherpad? https://etherpad.org


This is cool, but what exactly do you transcribe? What happens if I upload a Spanish or Japanese video?


Anyone know why this is interesting enough to be on Hacker News? There are lots of services which do this, many of which have significantly better functionality. This just looks like a very thin wrapper on a cloud speech-to-text service.


The company builds its own speech-to-text engine and has better accuracy than Google in German and Turkish. Independent review: https://www.abtipper.de/transkription/sprache-zu-text/


A lot of spoken text is highly inefficient to read.


Doesn't Google Recorder do this for free?


Yes, but there's still a lot of work after getting the transcript, even if it is accurate. Researchers/journalists need good editing tools for reviewing, editing, summarizing, etc.


Do you support speaker diarization?


Is the Spanish language supported?


I just noticed that even the German language footer doesn't include a link clearly labelled "Impressum". That information seems to be in the privacy policy (which I can only get in English even when switching to German?) but that is not sufficient to meet German legal requirements.

The privacy policy also doesn't provide all the information the GDPR generally requires you to provide, e.g. spelling out users' rights under the GDPR and what legal basis is given for collecting each specific piece of information.

I'm mostly pointing this out because it could get them sued, but I'd also expect a company based on a service like this to take privacy a bit more seriously, or at least present themselves as if they do so.


Thanks for the review. The privacy policy lists all collected information, how (if at all) it is shared with other parties, and the users' right to delete the information at any time. What else should be listed here? I didn't get the "legal basis for collecting each information" -- is it required? This is just basic information that the software needs to operate.


Well, first of all, you need a link clearly indicating it's the "Impressum" (usually translated as "imprint", "legal" or similar in English versions) as per §5 TMG: https://de.wikipedia.org/wiki/Impressumspflicht#Telemedienge...

You can get sued for omitting such a page (by any bored lawyer really) because it's considered anti-competitive and a misdemeanor: https://de.wikipedia.org/wiki/Impressumspflicht#Ordnungswidr...

Here's a lengthy explainer of what should go in a privacy policy to be fully compliant (in German), note that "clear and precise" language is generally understood to mean being explicit about the legal basis (i.e. parts of the GDPR) under which the data is collected and processed: https://www.datenschutz.org/datenschutzerklaerung/

In any case, your privacy policy link on the German language version of your website gives me the policy in English, which violates the GDPR's requirements for "clear language" regardless of the actual content by not being in German: https://voicedocs.com/de/legal/privacy-policy

But to be honest, you shouldn't be asking a random person on HN, you should talk to a lawyer.



