Codec2: A Whole Podcast on a Floppy Disk (auphonic.com)
486 points by ericdanielski 28 days ago | 131 comments

Aside from the seriously impressive WaveNet-based results, I think the article doesn't do the codec itself enough justice. I mean, low-bitrate speech codecs have been around for some time (hey, vocoders are the oldest kind of audio codecs in history!), and I grew skeptical when they started to compare it with MP3 and Opus.

But looking at this page Codec2 really holds its own when compared to AMBE and especially MELP, two of the most prominent ultra low-bandwidth speech codecs used today: https://www.rowetel.com/?p=5520

Here is a fascinating video history of the vocoder. Complete with coverage of the early room-sized machines. https://video.newyorker.com/watch/object-of-interest-the-voc...

The article failed to mention the original reason why Codec2 was invented.

In digital amateur radio communication, currently the most widely used codec is AMBE. But AMBE is a proprietary codec, covered by patents, unhackable: the antithesis of amateur radio. Codec2 was born to bring freedom to digital amateur radio communication, and it is technically even better than AMBE.

The article does mention why Codec2 was invented, under "Background".

FWIW the main AMBE patent expired in December, but I was always surprised hams chose to use it.

Codec2 is also fully open source and patent-free, in contrast to virtually every other ultra-low-bitrate voice codec (which are proprietary and have expensive patent licensing attached). He has a Patreon if you want to support him in the ongoing development of Codec2 and his SDR modems to enable use of it in amateur radio: https://www.patreon.com/drowe67

Codec2 might be patent-free, but Codec2 with a WaveNet decoder isn't because WaveNet (convolutional neural networks for generating audio sequence data) is patented: https://patents.justia.com/patent/20180075343

When was it patented? When I was working with AI about 15 years ago I was experimenting with conv nn to generate audio. I wouldn't have expected this to be patented, as it is such a friggin obvious thing to do. It is like patenting 2+2=4 once you discover numbers.

> It is like patenting 2+2=4 once you discover numbers.

Welcome to software patents.

[Serious question] Does your prior art invalidate the patent?

I am not a scientist; I was just very interested in that space, and it would be a long way to create a scientific paper out of my experiments. Since patent law has been created for the privileged to reap profits, I wouldn't stand a chance contesting that.

Isn't that specifically what software patents are? Pythagoras could have become a billionaire in his time and don't get me started on Al Khwarizmi.

raises hand

Question for IP experts: now that I have heard of the existence of WaveNet and a rough idea of how it works (training a neural network to decode low-bitrate speech data with as much fidelity as possible to the original), would I be prohibited from selling a similar product built with the same technique? How about if I had never heard of WaveNet and went about doing the same thing?

Yes, independent implementations of patented works are covered by the patent.

BUT: patents are far more specific than just "a neural network to decode low-bitrate speech data with as much fidelity as possible to the original". Starting with that goal, you are unlikely to recreate WaveNet's specific structure that is patented.

In fact, WaveNet describes a more general method to efficiently work with sound signals, somewhat comparable to convolutions for images. It's also not impossible to work with sound using alternative NN structures that are not patented, and might actually perform better than WaveNet.

WaveNet is actually a bit more complicated than that. But it's still probably recreatable if you read the paper.

Is Speex also in that category?

Speex and Opus bottom out around 6000-8000 bps. Codec2 starts at 3200 bps and goes down to 700 bps. The original target use for Codec2 is real-time transmission in the HF (shortwave) and VHF/UHF amateur radio bands where those are about as much as you can transmit within the same bandwidth as analog voice modes once you factor in error correction.
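To put those bitrates next to the floppy claim in the title, here is a back-of-the-envelope sketch. The 1,474,560-byte floppy capacity is my assumption (the usual "1.44 MB" disk), and it ignores filesystem and container overhead; the bitrates are the ones quoted above.

```python
# Rough playing time of a "1.44 MB" floppy at various speech bitrates.
# Assumption: 1,474,560 usable bytes (2880 sectors x 512 bytes), with no
# filesystem or container overhead.
FLOPPY_BYTES = 2880 * 512

def floppy_minutes(bits_per_second: int) -> float:
    """Minutes of audio that fit on one floppy at a constant bitrate."""
    seconds = FLOPPY_BYTES * 8 / bits_per_second
    return seconds / 60

# Bitrates quoted in this thread: Codec2's range and the Speex/Opus floor.
for label, bps in [("Codec2", 700), ("Codec2", 3200), ("Opus/Speex", 6000)]:
    print(f"{label:>10} @ {bps:>4} bps: {floppy_minutes(bps):6.1f} min")
```

Under those assumptions a floppy holds about an hour at 3200 bps and roughly 4.7 hours at 700 bps, versus about half an hour at 6000 bps.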

Speex has actually been superseded by Opus. Both are patent free as well.

Opus is patented[0] but just royalty-free patents.

[0]: http://opus-codec.org/license/

Having grown accustomed to MP3 artifacts, it's strange to hear artifacts that are natural, but just aren't quite right. More specifically, in the male voice sample: "sold about seventy-seven", I received it as "sold about sethenty-seven".

Yes, and "certificates" sounds like "certiticates".

Reminds me of a story about a copying machine that had an image compression algorithm for scans which changed some numbers on the scanned page to make the compressed image smaller. (Can't remember where I read about that, must have been a couple years ago on HN)

It's the lossy jbig2 compression in Xerox copiers: http://www.dkriesel.com/en/blog/2013/0802_xerox-workcentres_...

And yes, I think this is a relevant comparison. As the entropy model becomes more sophisticated, errors are more likely to be plausible texts with different meaning, and less likely to be degraded in ways that human processing can intuitively detect and compensate for.

> It's the lossy jbig2 compression in Xerox copiers: http://www.dkriesel.com/en/blog/2013/0802_xerox-workcentres_...

My understanding of this fault was that it was a bug in their implementation of JBIG2, not the actual compression? Linked article seems to support this.

I think it was just overly aggressive settings of compression parameters. I don't see any evidence that the jbig2 compressor was implemented incorrectly. Source: [1]

[1]: https://www.xerox.com/assets/pdf/ScanningQAincludingAppendix...

Right. Jbig2 supports lossless compression. I'm not very familiar with the bug, but it could have been a setting somewhere in the scanner/copier that was changed to lossy compression instead. Or they had lossy compression on by default or misconfigured in some other way (probably a bad idea for text documents).

The bad thing was that it used lossy compression when copying. That was the problem.

No. The bug was when using the "Scan to PDF" function. It happened on all quality settings. Copying (scanning+printing in one step, no PDF) was not affected.

I remember differently, but I don't want to pull up the source right now.

I did check some of the sources, but was not able to find the one I remember which had statistics on it.

The Xerox FAQ on it does lead me to consider that I might be confusing this with some other incident, though, as they claim that scanning is the only thing that is affected.


I'd believe him more than any other source.

He did his presentation in English at FrOSCon 2015, can be seen here: https://www.youtube.com/watch?time_continue=95&v=c0O6UXrOZJo

No compression system in the world forces you to share parts of the image that shouldn't be shared. So that's true in a vacuous sense.

But the nature of the algorithm means that you have this danger by default. So it's fair to put some blame there.

This is a big rabbit hole of issues I'd never even considered before. Should we be striving to hide our mistakes by making our best guess, or make a guess, that if wrong, is easy to detect?

The algorithm detected similar patterns and replaced these with references. This led to characters being changed into similar-looking characters that also appeared on the page.
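A toy sketch of that failure mode, for anyone who wants to see it concretely. This is not the JBIG2 spec or Xerox's code, just the general idea of pattern matching and substitution; the 3x5 glyph bitmaps, the `encode` function and the `max_diff` threshold are all made up for illustration.

```python
# Toy illustration of lossy symbol matching (NOT the JBIG2 spec or Xerox code).
# Each glyph is a tuple of row strings; a glyph is replaced by a reference to
# an earlier dictionary entry when the pixel difference is within a threshold.

def pixel_diff(a, b):
    """Count differing pixels between two equally sized glyph bitmaps."""
    return sum(pa != pb for ra, rb in zip(a, b) for pa, pb in zip(ra, rb))

def encode(glyphs, max_diff):
    """Return (dictionary, refs): each glyph becomes an index into dictionary."""
    dictionary, refs = [], []
    for g in glyphs:
        for i, known in enumerate(dictionary):
            if pixel_diff(g, known) <= max_diff:
                refs.append(i)      # reuse an existing, merely similar symbol
                break
        else:
            dictionary.append(g)    # no close match: store a new symbol
            refs.append(len(dictionary) - 1)
    return dictionary, refs

SIX = ("###", "#..", "###", "#.#", "###")
EIGHT = ("###", "#.#", "###", "#.#", "###")

page = [SIX, EIGHT, SIX]
_, strict = encode(page, max_diff=0)   # exact matches only
_, sloppy = encode(page, max_diff=1)   # tolerate one differing pixel
print(strict)  # [0, 1, 0]
print(sloppy)  # [0, 0, 0]
```

With the tolerant threshold the 8 is stored as a reference to the 6, so the decoded page shows a crisp, perfectly plausible, wrong digit.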

Xerox copier flaw changes numbers in scanned docs: https://www.theregister.co.uk/2013/08/06/xerox_copier_flaw_m...

If we're abandoning accurate reproduction of sound and just making up anything that sounds plausible, there's already a far more efficient codec: plain text.

Assuming 150wpm and an average 2 bytes per word (with lossless compression), we get about 5 bytes per second (40bps), which makes 2400bps look much less impressive. Add some markup for prosody and it will still be much lower.

This codec also has the great advantage that you can turn off the speech synthesis and just read it, which is much more convenient than listening to a linear sound file.
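Working through that estimate as a quick sketch (the 150 wpm and 2 bytes per word are the figures assumed above, not measurements): 150 words a minute at 2 bytes each is 5 bytes per second, which is 40 bits per second.

```python
# Back-of-the-envelope bitrate of "speech as compressed text".
# Inputs are the figures assumed in the comment above, not measurements.
WORDS_PER_MINUTE = 150
BYTES_PER_WORD = 2          # after lossless text compression (assumed)

bytes_per_second = WORDS_PER_MINUTE * BYTES_PER_WORD / 60
bits_per_second = bytes_per_second * 8

print(bytes_per_second)          # 5.0
print(bits_per_second)           # 40.0
print(2400 / bits_per_second)    # 60.0
```

So under these assumptions plain text comes in around 60 times smaller than a 2400bps stream, before adding any prosody markup.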

That codec sounds great, if it exists.

If you have such a codec, it would be worth testing the word error rate on a long sample of audio. e.g. take a few hours of call centre recordings, pass them through each of {your codec, codec2}, and then have a human transcribe each of:

- the original recording

- the audio output from your proposed codec (which presumably does STT followed by TTS)

- the audio output from CODEC2 at 2048

Based on the current state of open source single-language STT models, I would imagine that CODEC2 would be much closer to the original. And if the input audio contains two or more languages, I cannot imagine the output of your codec will be useful at all.

Speech to text is certainly getting better but it makes mistakes. If the transcribed text was sent over the link and then a text to speech spoke at the other end you'd lose one of the great things about codec2 - the voice that comes out is recognisable as it sounds a bit like the person.

A few of us have a contact on Sunday mornings here in Eastern Australia and it's amazing how the ear gets used to the sound and it quickly becomes quite listenable and easy to understand.

Could you elaborate on "a contact"?

Are you using Codec2 over radio?

Yeah, the main use case for codec2 right now is over ham radio. David Rowe, along with a few others, also developed a couple of modems and a GUI program[1]. On Sunday mornings, around 10AM, they do a broadcast of something from the WIA and answer callbacks.

[1] - https://freedv.org/

What you might be able to do is use the text codec as the first pass, then augment the audio with Codec2 or so to capture the extra information (inflections, accent, etc...), for something in between 2 and 700bps.

One of the very few things I know about audio codecs is that they at least implicitly embody a "psychoacoustic model". The "psycho" is crucial because the human mind is the standard that tells us what we can afford to throw away.

So a codec that aggressively throws away data but still gets good results must somehow embody sophisticated facts about what human minds really care about. Hence "artifacts that are natural".

I thought the same thing. Compression artifacts that don't sound like compression artifacts, could lead to hard-to-detect mistakes.

I found the artifacts odd too. It sounds like the guy speaking has gotten a bad cold or allergy and has stuffed sinuses.

I'd love to hear how it sounds with a 700bps stream.

Yes, I heard the same artifacts!

In the normal codec2 decoding it sounds like "seventy" but muffled and crunchy.

In the wavenet decoding, the voice sounds clearly higher quality and crisp, but the word sounds more like "suthenty". And not because the audio quality makes it ambiguous but it sounds like it's very deliberately pronouncing "suthenty".

It's as if in trying to enhance and crisp up the sound, it corrected in the wrong direction. It sounds like the compressed data that would otherwise code for a muffled and indistinct "seventy", was interpreted by wavenet but "misheard" in a sense. When wavenet reconstructs the speech, it confidently outputs a much clearer/crisper voice, except it locks onto the wrong speech sounds.

With the standard "muffled/crunchy" decoding, a listener can sort of "hear" this uncertainty. The speech sound is "clearly" indistinct, and we're prompted to do our own correction (in our heads), but also knowing it might be wrong. When the machine learning net does this correction for us, we don't get the additional information of how its guess is uncertain.

This is exactly the sort of artifact I'd expect with this kind of system. As soon as I heard the ridiculously good and crisp audio quality of the wavenet decoder, I knew that fidelity just isn't included in the encoded bits; that's impossible. It's a great accomplishment and just impressive, but it has to "make up" some of those details, in a sense very similar to image super-resolution algorithms.

I'm just thinking we should perhaps be careful to not get into a situation like the children's "telephone" game, if for some reason the speech gets re/de/re/encoded more than once. Which is of course bad practice, but even if it happens by accident, the wavenet will decode into confident and crisp audio, so it may be hard to notice if you don't expect it.

If audio is encoded and decoded a few times, it's possible that the wavenet will in fact amplify misheard speech sounds into radically different speech sounds, syllables or even words, changing the meaning. Kind of like the "deep dreaming" networks. Sounds like a particularly bad idea for encoding audio books, because small flourishes in wording really can matter.

Edit: I just realised that repeated re/de/re-encoding can in fact happen quite easily if this codec is ever implemented and used in real world phone networks. Many networks use different codecs and re-encoding just has to be done if something is to pass through a particular network.

But the whole thing is ridiculously cool regardless :) And I wonder if they can improve on this problem.

That is very impressive! I wonder if a WaveNet decoder could be built for phone calls, as those still sound awful. If it's possible to do this only on the decoder side you don't have to wait for your network to start supporting HD voice or VoLTE to get better quality audio!

I'm curious as to how fast the WaveNet decoder is? The last time I saw an article on it, it took multiple minutes to generate a second of audio.

The original WaveNet repeated a lot of computations; with caching/dynamic programming, it became a lot faster. Other optimizations were also doable. In any case, that was eventually made moot by using model distillation to train a wide flat (not deep) NN, which is 20x realtime: https://deepmind.com/blog/high-fidelity-speech-synthesis-wav... (This was necessary to make it cost-effective to deploy onto Google Assistant.)

Actually if you're lucky and make a phone call with HDVoice, or whatever they're calling it, the quality is excellent. It makes a huge difference. Unfortunately the place where you really want good quality is call centres - it's often hard to hear people and half of the reason is the shitty POTS quality - and call centres will probably get HDVoice in about 40-50 years. Maybe.

Edit: nm should have read all of your comment before replying!

>Actually if you're lucky and make a phone call with HDVoice, or whatever they're calling it, the quality is excellent

Can confirm. I spend a lot of time in fringe reception areas, but every now and then I get a good, strong signal and the HD Voice kicks in between my iPhone and my wife's and it sounds like she's standing right next to me. It really is something to experience, especially if the previous phone call was over regular tech.

Back when AT&T was running the "You get what you pay for" ads to combat SPRINT and MCI, it had a service you could sign up for that would give your landline phone calls amazing quality.

Sadly, a majority of people would rather pay less for crap than more for quality; even back then.

Also why no one really appreciated ISDN over here in central Europe. Yes, there are ways to do better _now_, and it would have been trivial to support channels with better codecs by negotiating something different than u-law 8kHz PCM, but back then that resulted in rather good quality. The issue was that few people got ISDN phones, which resulted in them using analog outputs on an adapter device, which later got incorporated into the internet router, which at one point switched from ISDN to VoIP. And people plug a phone via an analog jack into the router, instead of using a VoIP-capable phone or even anything digital. While many do use DECT cordless phones, those rarely use the DECT hardware inside the router, and instead use the one in the charging dock, which itself connects via an analog, POTS-bandpass-filtered phone jack to the VoIP router.

Oh well, we will probably never get that kind of quality, which is only possible with QoS on the whole path, if there is any congestion. That is the one thing something like rocket.chat and discord can't provide.

Edit: the way to do this is to force quality upon people, wherever you won't drive them away with the cost this incurs. That way people will associate your brand as a whole with the quality, i.e., in that case, people will associate AT&T with quality, not AT&T premium. Normal people do not even know what kind of plan they are on, except for about one hour before and after they sign the contract.

>Also why no one really appreciated ISDN over here in central Europe.

Except Germany. Which had probably the best telephone system in the world when they deployed ISDN nationwide.

I am speaking as a German. Sure, larger businesses used proper ISDN, but your uncle/mom didn't. The best you could hope for there was only DECT compression, aka ADPCM 4bit/8kHz.

Sure they did. Adoption rate was about 30%, possibly even higher at its peak.

DECT codec is useless if it's transmitted through analog lines, as it gets converted to standard 3.4 kHz quality anyway. Except for in-house calls of course.

The problem with land line voice quality is that people expect a land line phone / VoIP adapter to cost like 20 USD or Euro. At this price point you can't have fancy codecs and audio hardware that delivers a decent signal. Bad audio hardware with a good codec can actually decrease audio quality (the mic's noise no longer gets filtered as it is with 8kHz PCM).

I think it was AT&T that had a test number you could call to hear a higher quality phone call (I was pretty young at the time, so my memory is fuzzy). I remember it sounding very good, but that test number was the only time I remember hearing that quality over the POTS. The VoLTE and HD voice I occasionally get on my iPhone reminds me of that system.

What do you mean by "HDVoice"? On landline connections this usually means G722. G711u/a is definitely not "HD".

I don't know what technology it is specifically, but it's a brand name they used for actual high quality calls. Think 128 kbps MP3, rather than the standard cups-and-string quality.

It only seems to work on mobile.

I know the difference, used G722. On mobile it's G722.2, a totally different codec, but with the same ~7 kHz range.

But there were some companies that advertised a lower frequency range as "HD".

He probably means Adaptive Multi-Rate Wideband (AMR-WB) [1], AKA G.722.2. There is a common misconception that it is VoLTE only, but actually it works pretty well on 3G too. It is night and day compared to legacy codecs.

1. https://en.wikipedia.org/wiki/Adaptive_Multi-Rate_Wideband

I assume the usual wideband codecs used with VoLTE.

Chaining codecs arbitrarily tends to create really bad artifacting. Current cell and VOIP systems already utilise multiple compression algos.

Everything spoken in a whole life could fit on a 128GB pendrive (assuming 5% talk time). Astounding.
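A quick sanity check of that claim, as a sketch. The 80-year lifespan and 365.25-day year are my assumptions; the 5% talk time is from the comment above, and the bitrates are Codec2 rates mentioned elsewhere in this thread.

```python
# How much storage would a lifetime of speech need at Codec2 bitrates?
# Assumptions (for illustration only): 80-year lifespan, 365.25-day years.
# The 5% talk time is the figure from the comment above.
LIFESPAN_YEARS = 80
TALK_FRACTION = 0.05
SECONDS_PER_YEAR = 365.25 * 24 * 3600

def lifetime_gigabytes(bits_per_second: int) -> float:
    """Decimal gigabytes needed to store all speech at a constant bitrate."""
    talk_seconds = LIFESPAN_YEARS * SECONDS_PER_YEAR * TALK_FRACTION
    return talk_seconds * bits_per_second / 8 / 1e9

for bps in (700, 2400, 3200):
    print(f"{bps:>5} bps: {lifetime_gigabytes(bps):5.1f} GB")
```

Even at 3200 bps this comes to roughly 50 GB, so the 128GB figure holds with room to spare under these assumptions.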

Black Mirror is now technically possible.

Make sure you get to the end and listen to the WaveNet samples, amazing stuff.

Let's say we have Codec2 with WaveNet, and its 3.2 kbps now performs similarly to maybe 16 kbps EVS. (EVS being the codec used in VoLTE, which is slightly better than even Opus for speech.)

What "value" / "uses" does this bring us?

It can't be used for podcasts because, as shown, it isn't very good with music. And many podcasts have music in them.

While Codec2 with WaveNet can have a 2-4x reduction in bitrate, I can't think of an application that benefits from this immediately.

The other thing I keep having in my mind is convolutional neural networks for codecs in general: music, movies, etc. What sort of benefits would that bring us?

> What "value" / "uses" does this bring us?

Maybe not too much for "us" with LTE and 128GB storage on our phones, but in cases of low bandwidth (think digital police radio), or when you have low storage availability, that's really awesome.

If you were recording a huge number of phone calls, such a size reduction might bring significant savings.

Seriously impressive and game-changing results, especially when you take Wavenet into account. I'm curious to see how Wavenet would perform w/Opus.

I've become almost entranced with the concept of comparing things to the size of a Floppy Disk. I'm actually planning to get a tattoo of one on my right forearm. I've been working on a large business management platform for the last couple of years and noticed that after investing $500k (salaries/etc) and building a huge amount of functionality, the frontend and backend codebases are still under 1.5mb. Pretty amazing.

I actually got a floppy disk tattoo on my foot in a moment of spontaneity (bottomless mimosas). https://imgur.com/a/slCG519

nice haha

When we had a 486 running Windows 95, I used to convert CDs to WAV for fun. The GSM 6.10 codec in Sound Recorder (22050 Hz, Mono, 4 KB/s) could fit about 1 song onto a floppy.

With Real Audio, you could fit one side of an LP onto a floppy diskette.


Of course sound quality is lacking, but it was really cool at the time, even though the amount of time and resources needed was insane for the time.

Would be a fun experiment to use something like 3 or even 1 sine to get unintelligible speech, but then pair it with subtitles where each syllable of the text is animated synchronized with the speech. (Like the "follow the bouncing ball" song lyric animations.)

By pairing the audio with the text, you would almost certainly convince the listener that they can understand it.

Edit: typo


Sine-Wave Speech Demonstration https://youtu.be/EWzt1bI8AZ0?t=74

> Sine-wave speech is an intelligible synthetic acoustic signal composed of three or four time-varying sinusoids. Together, these few sinusoids replicate the estimated frequency and amplitude pattern of the resonance peaks of a natural utterance (Remez et al., 1981). The intelligibility of sine-wave speech, stripped of the acoustic constituents of natural speech, cannot depend on simple recognition of familiar momentary acoustic correlates of phonemes. In consequence, proof of the intelligibility of such signals refutes many descriptions of speech perception that feature canonical acoustic cues to phonemes. The perception of the linguistic properties of sine-wave speech is said to depend instead on sensitivity to acoustic modulation independent of the elements composing the signal and their specific auditory effects.

~ http://www.scholarpedia.org/article/Sine-wave_speech
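For the curious, a minimal sketch of what "three time-varying sinusoids" means in code. The formant tracks here are made-up numbers, not estimated from any real utterance, and `sine_wave_speech` is a hypothetical helper; real sine-wave speech derives the tracks from a recording.

```python
import math

SAMPLE_RATE = 8000  # Hz

def sine_wave_speech(tracks, frame_seconds=0.01):
    """Sum one sinusoid per formant track.

    tracks: list of per-formant lists of (frequency_hz, amplitude) frames.
    Phase is accumulated sample by sample so frequency changes stay continuous.
    """
    samples_per_frame = round(SAMPLE_RATE * frame_seconds)
    n_frames = len(tracks[0])
    phases = [0.0] * len(tracks)
    out = []
    for frame in range(n_frames):
        for _ in range(samples_per_frame):
            value = 0.0
            for k, track in enumerate(tracks):
                freq, amp = track[frame]
                phases[k] += 2 * math.pi * freq / SAMPLE_RATE
                value += amp * math.sin(phases[k])
            out.append(value)
    return out

# Made-up formant tracks gliding between two vowel-like shapes.
frames = 50
f1 = [(700 - 8 * i, 1.0) for i in range(frames)]
f2 = [(1200 + 20 * i, 0.5) for i in range(frames)]
f3 = [(2500 + 5 * i, 0.25) for i in range(frames)]
audio = sine_wave_speech([f1, f2, f3])
print(len(audio))  # 4000 samples = 0.5 s at 8 kHz
```

The only data driving the synthesis is three (frequency, amplitude) pairs per 10 ms frame, which is why the representation is so compact.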

To anyone who listens to this, I recommend rewinding to the segment starting at 1:23 a few times and not letting it reach the spoilers. After a few rounds, my brain adjusted to the distortion and I could make it out perfectly, without ever hearing the original.

Wow this is amazing, after listening to this a couple of times the voice became super clear

Or what if you scrunched the audio down to a bandwidth beyond what was still intelligible, but still captured some semblance of the speaker's voice. Use the original audio to compute subtitles and store them alongside the audio. That's your file.

Then the player uses both as inputs to ai (some hand waving), which now has enough to put the pieces together and produce something intelligible again, in the speaker's voice.

Without the "intelligible" part, this makes me think of what the game Celeste does to give its characters voices without voice acting.

They make voice-like synth sounds, different for each character, that are about the length of the text they're saying. It adds prosody and intonation to the text-based dialogue of the game.


This is how I imagine an intelligent car would sound if it figured out how to produce speech through the antigravity engine.

Edit: oh sweet there's intonation too. Were these all made manually?

My guess is they're mostly procedurally generated with manual tweaks for particularly significant lines.

Basically turn the speaker’s voice into a “font”, and then render text with it. Pretty sure it’s been done. Large initial delay while you get the whole font download, then basically just the text to be rendered and the occasional hint to the renderer

The WaveNet demos are indeed impressive. But I wonder if the WaveNet decoder needed to be trained for those specific voices.

On a related note, I wish more (any!) podcasts were distributed in opus.

As far as I know, enough podcast apps require MP3 (and not even VBR!) that you have to use MP3, and you can't have multiple <enclosure>s, so how would you do this? A separate RSS feed for Opus, linked only on the website and not submitted to aggregators?

> As far as I know, enough podcast apps require MP3 (and not even VBR!) that you have to use MP3…

Nope! Podcast episodes can be encoded using AAC (which is as ubiquitous as MP3) without issue.

That won't realistically be possible with Opus until Opus hardware decoding has been available in mobile devices for 5-10 years.

I highly doubt there are any devices that are capable of accessing the modern web, with all its JavaScript bloat, yet cannot decode a simple audio codec. Even when Apple was installing AAC hardware decoders, they were already almost obsolete by modern embedded CPU development (especially the rise of medium-power ARM SoCs). I highly doubt any devices released in the past 5 years have any sort of fixed-function audio decoder. Maybe an encoder, possibly some general-purpose DSPs, but not a format-specific decoder.

Yeah, the last time hardware audio decoders were relevant was like... back in the Nokia N-Gage days.

The N-Gage QD removed the MP3 decoder that was present in the original model. And you could install a software player, and it would struggle with bitrates above 128kbps :D

Modern phones can decode video in software (sucks for battery life, and framerate/resolution are more limited than with hardware, but it's possible). Audio is nothing for them.

> Yeah, the last time hardware audio decoders were relevant was like... back in the Nokia N-Gage days.

I guess it's irrelevant if you feel overwhelmed by how long your phone can go on a charge. Plus, low-power/low-CPU requirements are an order of magnitude more critical in devices like smartwatches.

I use Opus on my phone all the time. I was in a place where internet was really expensive, so I'd download and convert things on a Linux server.

In conjunction with youtube-dl I could listen to pretty much anything I wanted, using almost no data.

These days I use it mostly for audiobooks, if storage is limited.

Opus is awesome for audiobooks at 24kbps (probably one could go lower than this even) and music at 96kbps. I don't hear any difference in quality. It makes a big difference for my mobile, which is limited to 128GB.

Man that's a big collection! Though to be fair the iPod 160GB came out like ten years ago, and I thought we'd have advanced a bit more in that department by now. (Like imagine an iPod but instead of the spinny hard drive it's all microSD! There's just no market for it I guess.)

Distribute podcast as HTML file with WASM based decoder, whole file, self contained with either a byte stream out or play/pause

Ignoring other issues, this will have rather poor power usage, which is especially relevant given how many people listen to podcasts on mobile devices.

Or more like never unless you get Apple to front it...

Presumably a separate RSS feed. There are podcasts that have separate Ogg RSS feeds.

All the 2k+ podcasts hosted on Podigee (a podcast hosting company mainly known in German-speaking countries) are distributed in opus. But it is, and probably will always be, a rather niche distribution format. AAC had its moment, but MP3 is alive and kicking. Even Apple acknowledges its importance by adding support for chapter markers in iOS 12.

> AAC had its moment…

That moment is 15 years in with no signs of losing steam[1]. AAC effectively replaced MP3 for most online audio use cases, with podcasting as a notable exception[2]. And of course, AAC is the audio format for basically all online video distribution.

[1] Apple kicked off the transition in 2003 with the introduction of AAC-based digital music sales.

[2] Because podcasting is a decentralized medium, and the vast majority of podcasters don't know much (if anything) about media encoding.

Perceptions are probably influenced heavily by your own usage and places for consumption. I'm also in the camp of "AAC is very rare, if at all"...

Considering also that YouTube uses WebM, which very explicitly is only Vorbis or Opus for audio, "basically all online video distribution" must exclude the web's most popular video distribution site...

Every YouTube video has had AAC audio from the very beginning. Same goes for every Vimeo video, every Netflix video, every Hulu video, etc. Streaming audio services like Pandora use AAC too.

That's because AAC is the only format you can count on to work on all devices, and to be hardware decoded on all devices where battery life matters.

There are a lot of playback issues - VBR MP3 and older OS releases of both iOS and Android, never mind the car players and similar all contribute to the problem.

The post-show of this podcast talks about these and other issues in detail - Marco is on both sides of the issue as a podcast producer and podcast app developer: http://atp.fm/episodes/182

If you're a podcaster, another benefit of AAC over MP3 is that VBR is not an issue.

The Wavenet stuff sounds great, but I'm curious how big the model is. The audio files may be tiny, but you may need a huge neural network to decode them.

"The man behind it, David Rowe, is an electronic engineer currently living in South Australia. He started the project in September 2009, with the main aim of improving low-cost radio communication for people living in remote areas of the world. With this in mind, he set out to develop a codec that would significantly reduce file sizes and the bandwidth required when streaming."

What do you know, it's sort of like Pied Piper without the magical compression or cloud handwaving.

I've been reading David Rowe's blog [0] since 2008; there are some other really interesting projects and products on it. One of my favorites back then was his home-built electric car.

[0] https://www.rowetel.com/

To anyone else confused, I think the statement refers to http://www.piedpiper.com/ and not the folk legend.

I noticed that when you listen to compressed audio, at first you hear the unnaturalness of the voice and clicks (probably when one frame's ending doesn't match the next frame's start). But in a few seconds you adapt to it and the voice sounds pretty clear.

It is impressive how far one can compress speech.

I read, and listen, to this, and am impressed.

Then I think of the possible negative applications.

The calls of a nation of 100m people, each talking an hour per day on the phone or another audio channel, could be stored in 100m * 365 * 1.5 MB of storage annually: 54 PB.

In raw storage, that's less than $2 million. Far below national actor budgets.
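Re-deriving those numbers as a rough sketch. The 1.5 MB per person-hour and the $2 million are the comment's own figures; treating an hour of speech as 3200 bps Codec2 and using decimal units are my assumptions.

```python
# Re-deriving the storage estimate above.
# 1.5 MB per person-hour and the $2M budget come from the comment; the
# 3200 bps cross-check and decimal units (1 PB = 1e9 MB) are assumptions.
PEOPLE = 100_000_000
DAYS = 365
MB_PER_HOUR = 1.5
BUDGET_USD = 2_000_000

total_mb = PEOPLE * DAYS * MB_PER_HOUR
total_pb = total_mb / 1e9
hour_at_3200_mb = 3600 * 3200 / 8 / 1e6
implied_usd_per_tb = BUDGET_USD / (total_pb * 1000)

print(f"{total_pb:.2f} PB per year")           # 54.75 PB per year
print(f"{hour_at_3200_mb:.2f} MB per hour")    # 1.44 MB per hour
print(f"${implied_usd_per_tb:.2f} per TB")     # $36.53 per TB
```

So the estimate is self-consistent if raw storage costs somewhere under about $37 per terabyte, and an hour at 3200 bps is indeed close to the 1.5 MB used above.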

> However, where it starts to get more interesting is the work done by W. Bastiaan Kleijn from Cornell University Library.

The authors are not from Cornell. I think the author made this mistake because the paper is posted on arXiv, and that's what it says at the top of every page?

This is amazing! With this codec and enough processing power, you could do this bidirectionally and have enough bandwidth to stream a two way realtime voice chat using 2400bps modems over a standard analog phone line!!! ... Oh... Wait a minute...

The plain Codec2 decoder sounds like a TI-99/4A (and works on somewhat similar principles). If I hook a TI-99/4A to the WaveNet decoder, will it sound natural?

Buy this guy a beer. What a feat!

Side note: I'm still waiting for an open source, cheap way to do FreeDV/Codec2 on VHF either with a dongle that goes between a raspi/SBC or a laptop and a cheap ass radio like a baofeng, or an inexpensive radio with Codec2 support.

I think 2400B support is coming to the FreeDV GUI soon. I've seen some work done on that. That'll let you use a cheap FM radio and a laptop to get on the air with something codec2 based. I'm slowly chipping away at a TDMA mode for SDRs, but that's still probably a ways off.

Would be interesting to combine this Codec2 with LoRa modulation. Of course the latter is patented, but it combines both chirped and direct sequence spread spectrum to yield some very resilient modulation.

"Enhance" - said every movie guy ever.

None of the audio samples play for me (In neither Chrome nor Edge... Other sites play just fine.)

Makes it very hard to evaluate claims of codec quality, which seems like the primary purpose of the blog post. :(

Even works in the embedded browser of the "Materialistic" HN reader app, on Android 5.1.

Works for me on my iPad.

Works on iOS 11 for me but I had to press play and then wait a couple of seconds, press pause and then play again and wait another couple of seconds. Try that.

I confirm, doesn't work for me either. Neither in Safari 56 nor Chrome 67 (macOS 10.13).

Fine by me on Firefox/Linux.

There is an option to download the audio for offline listening.

Working on Chrome 67 on Win 10.

I have no problems in Chrome.

Working on Firefox Android

confirm, works fine on latest Firefox stable Android

Works in Chrome for macOS
