
This is an iOS wrapper for the OpenAI Whisper library.

You can use the lock screen widget to start recording - allowing you to quickly capture what's on your mind.

This comment was the inspiration: https://news.ycombinator.com/item?id=32928118

Pricing is currently somewhat high, until I figure out how much it costs to run Whisper on cloud GPUs. I'll adjust the pricing accordingly, and perhaps switch to a per-minute model.
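A back-of-the-envelope sketch of what per-minute pricing would need to cover. All numbers here are made up for illustration (the GPU hourly rate and speedup factor are assumptions, not the app's actual costs):

```python
# Hypothetical per-minute transcription cost, derived from a cloud GPU's
# hourly rate and how much faster than real time the model runs.
# All inputs are assumptions for illustration only.

def cost_per_audio_minute(gpu_hourly_usd: float, realtime_factor: float) -> float:
    """GPU cost to transcribe one minute of audio.

    A realtime_factor of 6 means one minute of audio takes ten seconds
    of GPU time.
    """
    gpu_cost_per_minute = gpu_hourly_usd / 60.0
    return gpu_cost_per_minute / realtime_factor

# Example: a (hypothetical) $1.10/hour GPU transcribing at 6x real time
cost = cost_per_audio_minute(1.10, 6.0)
```

On top of this raw compute cost you'd still need margin for idle GPU time, storage, and failed jobs, which is presumably why the pricing question isn't settled yet.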




This is cool! Curious if it would be possible to run the model on device?


Probably not realistic. On an M1 Pro MBP, Whisper runs far slower than real time. Think on the order of days for a 2 hour recording.

I’ve been doing transcription work for public meetings. Whisper is truly incredible in terms of error rate, even in extremely challenging circumstances (obscure acronyms, unusual terms, unusual names, poor recording quality). I was seeing only a few errors per hour; most things that look like errors are in fact accurate representations of humans saying weird things. But I have to run it on my desktop with CUDA enabled. With the medium model it is iirc barely faster than real time. I only have a 1070, so maybe it is better with more modern hardware.

Whisper does also have some slightly strange behavior with silence and very long recordings. I might do a blog post once I’ve got more experience.


On an M1 Pro, with the greedy decoder and the medium model, I can transcribe 1 hour of audio in just 10 minutes (~6x real time) [0].

[0] https://github.com/ggerganov/whisper.cpp
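For anyone comparing these reports, the speedup figure is just audio duration divided by wall-clock processing time; the 1-hour-in-10-minutes claim checks out as 6x:

```python
# Real-time factor: how many seconds of audio are processed per second
# of wall-clock time. Above 1.0 means faster than real time.

def realtime_factor(audio_seconds: float, processing_seconds: float) -> float:
    return audio_seconds / processing_seconds

rtf = realtime_factor(3600, 600)  # 1 hour of audio in 10 minutes -> 6.0
```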


I just transcribed a 32-minute recording of a speech that someone captured on their phone mic.

I used the default "import audio file" settings in the Buzz application, and it was transcribed in less than 10 minutes, producing a text file of about 24 KB.

I'm on a Windows PC with an AMD Ryzen 3.


There were at least two errors in the video demo, and that was just 15 seconds of audio. “I can take some notes from a meeting” was transcribed as “I can take some notes from meeting”, and “I click stop [recording]” ended up as “And click the stop”.
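Slips like these are what word error rate (WER) measures. A minimal sketch (plain word-level edit distance; the example sentences are the ones quoted above):

```python
# Minimal word error rate: Levenshtein edit distance over words,
# normalized by the length of the reference transcript.

def wer(reference: str, hypothesis: str) -> float:
    ref, hyp = reference.split(), hypothesis.split()
    # Standard dynamic-programming edit distance
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # substitution
    return d[len(ref)][len(hyp)] / len(ref)

# One dropped word over 8 reference words -> 0.125
error = wer("I can take some notes from a meeting",
            "I can take some notes from meeting")
```

So even "a few errors per hour" and a dropped article in a 15-second demo can coexist: WER on a short clip is noisy.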


Me too - I recently ported the model to plain C/C++, and I am now planning to run it on device to see if the performance is any good. Will post an update when/if it works out.


For PCs there is the Buzz application: https://github.com/chidiwilliams/buzz/tree/main


I suppose the built-in iOS voice recognition would be better for that.

I haven't really compared those two properly. Wonder how much better Whisper is.


Apple will keep up with anything that's SOTA, just with a bit of a lag - so expect they will be better soon, if they aren't already.

Word of warning from someone who built an SDK that filled a processing gap Apple had (6DOF monocular SLAM) [1]: Apple will eventually make your technology obsolete, and their version will be way better. See: ARKit.

We open sourced it once ARKit came out, because there was no way to monetize it further.

[1] https://github.com/Pair3D/PairSDK


Whisper is a game changer in terms of accuracy. It makes Zoom, YouTube, Office/Azure, Descript, and Otter.ai transcription look like jokes in comparison.

The step change in transcription accuracy here is significant enough to cross an important threshold for usefulness.


If you're not running it on cloud GPUs right now, and it's not on device, how is the app running the Whisper library at the moment?


How did you select your GPU provider?



