Pricing is currently somewhat high, until I figure out how much it costs to run Whisper on cloud GPUs. I'll adjust the pricing accordingly, and perhaps switch to a per-minute model.
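For anyone curious how a per-minute price could fall out of GPU cost, here is a rough back-of-the-envelope sketch. The function name and both inputs are assumptions for illustration, not real figures from this app or any cloud provider: it takes an hourly GPU price and a real-time factor (processing time divided by audio duration) and returns the raw compute cost per minute of audio.

```python
def cost_per_audio_minute(gpu_cost_per_hour: float, real_time_factor: float) -> float:
    """Estimate raw GPU cost to transcribe one minute of audio.

    gpu_cost_per_hour: hypothetical hourly price of the GPU instance.
    real_time_factor:  processing time / audio duration
                       (0.5 means a 10-minute file takes 5 minutes).
    """
    if gpu_cost_per_hour < 0 or real_time_factor < 0:
        raise ValueError("inputs must be non-negative")
    # One minute of audio occupies the GPU for `real_time_factor` minutes.
    gpu_cost_per_minute = gpu_cost_per_hour / 60.0
    return gpu_cost_per_minute * real_time_factor
```

This ignores idle GPU time, storage, and bandwidth, so an actual price would need some margin on top.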
Probably not realistic. On an M1 Pro MBP, Whisper runs far slower than real time. Think on the order of days for a 2-hour recording.
I’ve been doing transcription work for public meetings. Whisper is truly incredible in terms of error rate, even in extremely challenging circumstances (obscure acronyms, unusual terms, unusual names, poor recording quality). I was seeing only a few errors per hour; most things that look like errors are in fact accurate representations of humans saying weird things. But I have to run it on my desktop with CUDA enabled. With the medium model it is, iirc, barely faster than real time. I only have a 1070, so it may be better on more modern hardware.
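If you want to put a number on "faster than real time" for your own hardware, a minimal sketch is to time the transcription call and divide by the audio length. The helper name `real_time_factor` is made up for this example; it accepts any zero-argument callable, so it does not depend on a particular Whisper build.

```python
import time
from typing import Callable


def real_time_factor(transcribe: Callable[[], object], audio_seconds: float) -> float:
    """Return processing time divided by audio duration.

    A value below 1.0 means the transcription ran faster than real time.
    `transcribe` is any zero-argument callable that does the work.
    """
    if audio_seconds <= 0:
        raise ValueError("audio_seconds must be positive")
    start = time.perf_counter()
    transcribe()
    elapsed = time.perf_counter() - start
    return elapsed / audio_seconds


# Possible usage with the openai-whisper package (assumes it is installed;
# "meeting.mp3" and the 7200-second duration are hypothetical):
#
#   import whisper
#   model = whisper.load_model("medium", device="cuda")
#   rtf = real_time_factor(lambda: model.transcribe("meeting.mp3"), 7200)
```

Loading the model outside the timed callable keeps the one-off startup cost out of the measurement.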
Whisper does also have some slightly strange behavior with silence and very long recordings. I might do a blog post once I’ve got more experience.
There were at least two errors in the video demo, and that was just 15 seconds of audio. “I can take some notes from a meeting” was transcribed as “I can take some notes from meeting”, and “I click stop [recording]” ended up as “And click the stop”.
Me too - I recently ported the model to plain C/C++ and I am now planning to run it on device and see if the performance is any good. Will post an update when/if it works out.
Apple will keep up with anything that's SOTA, just with a bit of a lag - so expect they will be better soon, if not already.
Word of warning from someone who built an SDK that filled in a processing gap that Apple had (6DOF Monocular SLAM)[1]: Apple will eventually make your technology obsolete, and their version will be way better. See: ARKit.
We open sourced it once ARKit came out because there was no way to monetize it further.
Whisper is a game changer in terms of accuracy. It makes YouTube, Zoom, Office/Azure, Descript, and Otter.ai transcription look like jokes in comparison.
The step change in transcription accuracy here is significant enough to cross an important threshold for usefulness.
You can use the lock screen widget to start recording - allowing you to quickly capture what's on your mind.
This comment was the inspiration behind this: https://news.ycombinator.com/item?id=32928118