I've been using whisper to get transcripts from my local radio stations. I know it's out of scope for the original project but I hope someone can build a streaming input around it in the future. I currently pipe in and save 10-minute chunks that get sent off for processing.
The streaming I was speaking of is from a URL or network resource. I keep a directory of m3u files, each of which is just a URL inside that you can open in VLC etc. So `ffmpeg -i [url] -c copy [file-name].mp3` does the trick for now. `mpv` can also do this while saving to a file, but this is a nice point to start from as I'm already getting some ideas for how I could use this or roll my own, thanks!
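For anyone who wants to script that step, here's a minimal Python sketch of the same idea: pull the stream URL out of an .m3u file and hand it to ffmpeg. The station file name, helper name, and output path are placeholder assumptions, not part of my actual setup.

# minimal sketch: read the stream URL from an .m3u and let ffmpeg capture it;
# the station file and output name below are placeholder assumptions
import subprocess
from pathlib import Path

def capture_station(m3u_path: str, out_path: str) -> None:
    # an .m3u here is assumed to hold a single stream URL (comment lines start with '#')
    url = next(
        line.strip()
        for line in Path(m3u_path).read_text().splitlines()
        if line.strip() and not line.lstrip().startswith("#")
    )
    # same as `ffmpeg -i [url] -c copy [file-name].mp3`, just driven from a script
    subprocess.run(["ffmpeg", "-i", url, "-c", "copy", out_path], check=True)

capture_station("stations/example.m3u", "example.mp3")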
whisper.cpp provides similar functionality via the `livestream.sh` script that performs transcription of a remote stream [0]. For example, you can transcribe BBC radio in 10s chunks like this:
$ make base.en
$ ./examples/livestream.sh http://a.files.bbci.co.uk/media/live/manifesto/audio/simulcast/hls/nonuk/sbr_low/ak/bbc_world_service.m3u8 10
[+] Transcribing stream with model 'base.en', step_s 10 (press Ctrl+C to stop):
Buffering audio. Please wait...
here at the BBC in London. This is Gordon Brown, a former British Prime Minister, who since 2012 has been UN Special Envoy-
Lemboy for global education. We were speaking just after he'd issued a rallying cry on the eve of the 2022 Football World Cup. For
Governments around the world to pressure Afghanistan to let girls go to school. What human rights abuses are what are being discussed as we...
run up and start and have happened and have seen the World Cup matches begin. And it's important to draw attention to one human rights abuse.
that everyone and that includes Qatar, the UAE, the Islamic organization of countries, the Gulf Corps...
Wow! Thanks for sharing. I didn't explore the repo much beyond that link, but this looks very promising. I was going to check it out tomorrow, but I'm grabbing the src and building now! The confidence color coding, btw, is chef's kiss. Great job with this.
The codecs and narrow frequencies used on most public safety trunked radio networks are truly terrible. IMBE and AMBE+2 should never have made it past Y2K, yet Motorola and Project 25 have ensured these remain in widespread use on these networks: https://en.wikipedia.org/wiki/Project_25
If Whisper achieves 85% or higher accuracy on this audio, it would be a miracle. Garbage in, garbage out tbh. Project 25 needs to move to a modern codec, ideally not one that sees little development and is maintained by a single small company.
Being a very constrained domain with its own speech rules, transcribing ATC conversations would probably benefit a lot from fine-tuning on ATC speech data.
Yeah, the tech is there for radio, and the same goes for POTS (still 8-bit/8000 Hz at its core). I do listen on the scanner from time to time, and assuming a clear signal, traditional NFM always beats digital in terms of intelligibility.
Speaking of the telephone, they definitely could make improved audio quality part of the 4G and 5G specs, but they don't. A modern telephone network is all IP anyway, and backwards compatibility can be maintained.
I wonder how much training on degraded/radio-encoded samples would be needed to improve performance in this area. You’re probably not the only person who wants to monitor police radios in their city.
Great question. I've not encountered it too much, as I'm looking for specific stuff in the transcript based on keywords, so some loss is acceptable, and I don't always need the data immediately, so it hasn't been an issue so far. What I do to alleviate it somewhat is have "stitched passes" at set intervals.
News, betting games, and some shows HAVE to happen at exactly a certain time and are very rarely late, so you can use these known times as checkpoints with some bias. Say I run `ffmpeg -i http://someexamplesite.fm/8b0hqm93yceuv -c copy -f segment -segment_time 60 -reset_timestamps 1 zip-%03d.mp3`, which automatically chunks the stream into 1-minute files, and I know the news gets read at 1pm: I can merge everything from 10am to 1pm into, say, "Segment B" and then process that. You get the idea (rough sketch below).
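To make that concrete, here's a rough Python sketch of one such stitched pass, assuming the zip-%03d.mp3 chunks from the command above and selecting purely by file modification time; the directory name and checkpoint times are made up for the example.

# rough sketch of a "stitched pass": gather the zip-%03d.mp3 chunks recorded
# between two known checkpoint times (by file mtime) and concatenate them with
# ffmpeg's concat demuxer; directory name and times are made up for the example
import subprocess
from datetime import datetime, time
from pathlib import Path

def stitch_window(chunk_dir: str, start: time, end: time, out_path: str) -> None:
    chunks = sorted(
        p for p in Path(chunk_dir).glob("zip-*.mp3")
        if start <= datetime.fromtimestamp(p.stat().st_mtime).time() <= end
    )
    # the concat demuxer reads a list file with lines like: file '/path/chunk.mp3'
    list_file = Path(chunk_dir) / "segment.txt"
    list_file.write_text("\n".join(f"file '{p.resolve()}'" for p in chunks))
    subprocess.run(
        ["ffmpeg", "-f", "concat", "-safe", "0", "-i", str(list_file),
         "-c", "copy", out_path],
        check=True,
    )

# e.g. everything between the 10am checkpoint and the 1pm news becomes "Segment B"
stitch_window("chunks", time(10, 0), time(13, 0), "segment-b.mp3")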
I also have a step later in the pipeline that tries to summarize what it can from the transcript, so any small gaps would likely be inferred; I have not noticed anything too wonky, and the summaries and text I get so far have been pretty clean and good enough for my needs. Once I bench this some more, however, I'm sure I will have this and other interesting problems to solve. The one I'm fighting a bit right now is ads with music and fast talking.
I came up with this specifically because of whisper. This looks nice, but it doesn't suit my use case, as I'm "listening" to over a dozen stations at once and I also stitch the chunks together at the end of the day into a fat wav file.
Stupid simple setup until I streamline the process some more: just running on my PC (i7 8700K, GTX 1080), using ffmpeg to copy the stream to disk and a simple Python script that creates the chunks and transcripts.
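A minimal sketch of what that kind of loop can look like, using the openai-whisper Python package for the transcripts and ffmpeg's concat demuxer for the end-of-day wav; the model choice, paths, and file naming are placeholder assumptions rather than my exact script.

# minimal sketch: transcribe each saved chunk with openai-whisper, append to a
# daily transcript, then stitch all chunks into one end-of-day wav with ffmpeg;
# model choice, paths, and naming are placeholder assumptions
import subprocess
from pathlib import Path

import whisper  # pip install openai-whisper

model = whisper.load_model("base.en")  # uses the GPU if torch sees CUDA

def transcribe_chunks(chunk_dir: str, transcript_path: str) -> None:
    with open(transcript_path, "a") as out:
        for chunk in sorted(Path(chunk_dir).glob("*.mp3")):
            result = model.transcribe(str(chunk))
            out.write(f"--- {chunk.name} ---\n{result['text'].strip()}\n")

def stitch_day(chunk_dir: str, out_wav: str) -> None:
    # join every chunk of the day into one big wav via the concat demuxer
    chunks = sorted(Path(chunk_dir).glob("*.mp3"))
    list_file = Path(chunk_dir) / "all.txt"
    list_file.write_text("\n".join(f"file '{p.resolve()}'" for p in chunks))
    subprocess.run(
        ["ffmpeg", "-f", "concat", "-safe", "0", "-i", str(list_file), out_wav],
        check=True,
    )

transcribe_chunks("chunks", "transcript-today.txt")
stitch_day("chunks", "end-of-day.wav")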