Lyra V2 – a better, faster, and more versatile speech codec (googleblog.com)
229 points by HieronymusBosch on Sept 30, 2022 | hide | past | favorite | 68 comments

Also check out Codec2, which is open source as well and offers really good quality down to 700 bit/s. It has been ported to small MCUs such as the ESP32 and STM32, and is also supported by Arduino libraries.

> Check out also Codec2, Open Source as well, which offers really good quality down to 700 bit/s

Yes, but it tops out at 3200 bit/s:

* https://en.wikipedia.org/wiki/Codec_2

Lyra V2 seems to start there and then goes up to teens of kb/s, at which point Opus can perhaps take over the job.

Yes, that is done on purpose. The project is aimed at HF/VHF voice communications, so it makes sense. That low bandwidth usage allows it to be employed on everything from ordinary ham gear down to cheap LoRa modules, a feature that opens huge possibilities: point-to-point or multipoint encrypted communications with portable devices not tied to cell towers, which would be so useful these days in some areas of the world.

>employed from ordinary HAM gear down to cheap LoRa modules, a feature that opens huge possibilities like building point to point or multipoint encrypted communications

I hope you aren't suggesting that encryption should be used on amateur bands.

Of course not; I'm aware that encryption is illegal on ham bands. I was referring to other uses, in emergency situations.

It's not a big deal, there's rarely any enforcement of this, no-one cares except for the usual angry hams.

That’s how the commons gets tragedied…

Is that not a good idea?

It's not legal. Most countries prohibit encryption of Amateur Radio transmissions in most cases. Some countries have exceptions such as emergency communications or satellite control. [0]

[0] https://ham.stackexchange.com/questions/72/encrypted-traffic...

It's a can-of-worms topic...

You mean like the Helium network where they are trying to get enough users with LoRa boxes to replace the telcos ?

[1] https://www.helium.com/

Helium's LoRa network has a vanishingly small number of paying users despite its size.

Great move for a pump-and-dump scheme though; now they are moving on to CBRS LTE with a whole new token separate from HNT.

It's astounding to me that you can get intelligible speech at 700 bps.

700 bits is less than the ASCII of this comment.

Have fun reading that comment out loud in 1 second.

Is there an open codec that concentrates on low CPU usage? I'm fine with it not being very bandwidth efficient.

Opus is a very good codec, but it's not amazing CPU-wise. I work on a VR world, and audio encoding is usually our most limiting factor when running on a VPS. We have the capability to negotiate codecs, so the high-CPU/low-bandwidth use case is already covered.

What I'm looking for specifically:

* Low CPU usage

* Support for high bitrate, suitable for music and sounds other than voice

* Low latency

It sounds like you’re asking for uncompressed audio? That meets all of your listed requirements. 48kHz * 16bit, single channel = 768kbit/s
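The arithmetic behind that figure, spelled out:

```python
# Uncompressed mono 16-bit PCM at 48 kHz:
sample_rate_hz = 48_000
bits_per_sample = 16
channels = 1

bitrate = sample_rate_hz * bits_per_sample * channels
print(bitrate)  # 768000 bit/s = 768 kbit/s
```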

We support that already, yup. But it never hurts to see if there's something better than that out there.

You can bootleg your own fast lossless codec by doing delta-encoding on the raw PCM to get a lot of zeros and then feeding it through an off-the-shelf fast compressor like snappy/lz4/zstandard/etc. It won't get remotely close to the dedicated audio algorithms, but I wouldn't be surprised if you cut your data size by a factor of 2-4 at essentially no CPU cost compared to raw uncompressed audio.

You’ve not done this before, have you?

I haven't, but now I have. I took https://opus-codec.org/static/examples/samples/music_orig.wa... from https://opus-codec.org/examples/. Then I wrote the following snippet of Python code:

    from scipy.io import wavfile
    import numpy as np
    import zstd

    sampling_rate, samples = wavfile.read(r'data/bootleg-compress/music_orig.wav')
    orig = samples.tobytes()

    naive_compressed = zstd.ZSTD_compress(orig)
    deltas = np.diff(samples, prepend=samples.dtype.type(0), axis=0)  # Per-channel deltas.
    compressed_deltas = zstd.ZSTD_compress(deltas.ravel().tobytes())  # Interleave channels and compress.

    decompressed_deltas = np.frombuffer(zstd.ZSTD_uncompress(compressed_deltas), dtype=samples.dtype)
    decompressed = np.cumsum(decompressed_deltas.reshape(deltas.shape), axis=0, dtype=samples.dtype)
    assert np.array_equal(samples, decompressed)

    print(len(orig) / len(naive_compressed), len(orig) / len(compressed_deltas))


Looks like my initial estimate of 2-4 was way off (when FLAC achieves ~2, that should've been a red flag), but you do get a ~1.36x reduction in space at basically memory-read speed.

Using an encoding of second-order differences that stores -127 <= d <= 127 in 1 byte and everything else in 3 bytes (for 16-bit input), I got a ratio of ~1.50 for something that can still operate entirely at RAM speed:

    orig = samples.tobytes()
    deltas = np.diff(samples, prepend=samples.dtype.type(0), axis=0)      # Per-channel deltas.
    delta_deltas = np.diff(deltas, prepend=samples.dtype.type(0), axis=0) # Per-channel second-order differences.

    # Many small differences: encode those using 1 byte each, and larger
    # ones as a 255 marker followed by 2 payload bytes (3 bytes total).
    # Interleave channels and encode.
    flat = delta_deltas.ravel()
    small = np.sum(np.abs(flat.astype(np.int32)) <= 127)
    bootleg = np.zeros(small + (len(flat) - small) * 3, dtype=np.uint8)
    i = 0
    for dda in flat:
        if -127 <= dda <= 127:
            bootleg[i] = dda + 127
            i += 1
        else:
            val = int(dda) + 2**15
            bootleg[i] = 255
            bootleg[i + 1] = val % 256
            bootleg[i + 2] = val // 256
            i += 3

    compressed_bootleg = zstd.ZSTD_compress(bootleg.tobytes())
    print(len(compressed_bootleg))

    decompressed_bootleg = zstd.ZSTD_uncompress(compressed_bootleg)
    result = []

    i = 0
    while i < len(decompressed_bootleg):
        if decompressed_bootleg[i] < 255:
            result.append(decompressed_bootleg[i] - 127)
            i += 1
        else:
            lo = decompressed_bootleg[i + 1]
            hi = decompressed_bootleg[i + 2]
            result.append(256 * hi + lo - 2**15)
            i += 3

    decompressed_delta_deltas = np.array(result, dtype=samples.dtype).reshape(delta_deltas.shape)
    decompressed_deltas = np.cumsum(decompressed_delta_deltas, axis=0, dtype=samples.dtype)
    decompressed = np.cumsum(decompressed_deltas, axis=0, dtype=samples.dtype)
    assert np.array_equal(samples, decompressed)

Prints 11593846.

While I too want a low-computation codec that can save space, the historical use cases unfortunately assumed a lot of spare CPU power to compensate for very little bandwidth, so there's little research in this area. There's also no real incentive to make an audio equivalent of ProRes or DNxHD: if you are editing audio, SSD speeds are so fast that you'll run into CPU problems first.

Either that or G.711.

G.711 is neither high-bitrate nor usable for music.

Then use G.722, it works fine for music.

No, g722 is still a wideband speech codec. Its available frequency goes up to 7 kHz. The uncompressed audio this thread began with goes up to 22 kHz. With g722 you're losing most overtones, or even all overtones from the top of a piano. Please don't use g722 for music apart from on-hold muzak.

How is audio encoding the most limiting factor in a VR project? :o AFAIK the Opus encoder eats something like 30-50 MHz of one CPU core.

It sounds plausible that it's the most expensive thing on the server side, if you have cheap simulation/behaviour and many concurrent users.

But unless it's a non commercial project, the cost shouldn't be a big deal, so it's still a bit strange.

We work on a community-led fork of the dead commercial High Fidelity project. The server requirements are indeed very light except for audio.

Physics are actually farmed out to the clients themselves, it's a bit of a quirky idea, but it actually works if one isn't concerned with accuracy.

I did a prototype of a 3D low-latency server-side mixing system, based on a hypothetical 4k clients at 48 kHz, each being mixed with the 64 loudest clients, using Opus forced to CELT-only mode and running 256-sample stereo frames at 128 kbps. It worked well, using only 6 cores for that workload. The mixing was trivial, and the decode and encode of 4k streams was entirely doable; the issue at that rate was 1.5M network packets a second. If I were to revisit it, I'd look at using a simple MDCT-based codec with a simple psychoacoustic model based on MPC (minus CVD), modified for shorter frames and MDCT behaviour versus PQMF behaviour, without any Huffman or entropy coding, and put that codec on the GPU. Small tests I did using a 1080 Ti indicated ~1M clients could be decoded, mixed and encoded (same specs as above); the problem is then how to handle ~370M network packets a second :)
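A rough numpy sketch of that "mix only the 64 loudest clients" idea; the function name and the per-frame RMS loudness proxy are my own assumptions, not details from the prototype:

```python
import numpy as np

def mix_loudest(frames: np.ndarray, k: int = 64) -> np.ndarray:
    """Mix only the k loudest of the decoded client frames.

    frames: (n_clients, n_samples) float PCM in [-1, 1].
    Loudness is approximated here by per-frame RMS energy.
    """
    rms = np.sqrt(np.mean(frames.astype(np.float64) ** 2, axis=1))
    loudest = np.argsort(rms)[-k:]        # indices of the k loudest clients
    mixed = frames[loudest].sum(axis=0)
    return np.clip(mixed, -1.0, 1.0)      # crude clipping in place of a real limiter

# e.g. 4000 clients with 256-sample frames -> one mixed 256-sample frame
frames = np.random.default_rng(0).uniform(-0.01, 0.01, (4000, 256))
out = mix_loudest(frames)
```

A real per-listener mix would also exclude the listener's own stream and apply spatialization per pair; this only shows the top-k selection step.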

Edit: Had high hopes for High Fidelity, and came very close to asking for a job there ;) Shame it’s kaput, didn’t know that :(

Those are interesting ideas, thanks! I'll have to try and play with that.

High Fidelity the company is still around, but they pivoted multiple times radically. Initially their plan was social VR of sorts. Then they tried to make a corporate product for meetings and such, and gave up on that right before COVID19 hit!

And after that they ripped out all the 3D and VR and scaled down to a 2D, overhead spatial audio web thing. Think something like Zoom, only you have an icon that you can move around to get closer or further to other people.

The original code still lives on; we picked it up and are working on improvements. Feel free to visit our Discord (see my profile).

Apparently the RP1 team handles bigger crowd loads through muxing on the server, but I'm not sure exactly how that works out for spatial audio. There is a Kent Bye Voices of VR podcast discussing how they got 4k users in the same shard.

Ventrilo/TeamSpeak servers run great on shared hosting. https://www.myteamspeak.com/addons/9ddfa0b2-25c2-4302-8a43-0... gives you positional audio support on a TeamSpeak server.

Why is your VPS server encoding rather than clients? Are you combining talkers together into one source for doing crowds and avoiding N^2 or something and need to reencode after combining?

Correct, server does spatial audio.

It's a community-led continuation of High Fidelity, a dead commercial project. They made their own proprietary codec with excellent performance we can't use and managed to have a couple thousand people in the same server.

I would like to know the answer to this question from dale_glass too.

Replied to the parent

The Bluetooth codecs are all designed to be very cheap on CPU and low latency - e.g. LC3, AptX or SBC.

I'm skeptical, those are almost always going to be implemented in hardware, so the complexity of a software encoder isn't a design concern.

There is some correlation between the cost of a hardware implementation and complexity of a software implementation. SBC is a very simple codec, but AptX and LC3 might not be much better than Opus.

I couldn't find data on CPU requirements for encode/decode versus, say, Opus, but Apple uses AAC-LD for similar scenarios.

If you are OK with moderately high bitrates, you might prefer something simpler like an ADPCM scheme. ADPCM is pretty damn easy to implement, certainly a lot less math-heavy than MDCT-based schemes, and achieves good quality at a somewhat higher bitrate (I have no data, but I'd guess ~200-250%).
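To illustrate how little math ADPCM needs, here's a toy version: predict each sample as the previous reconstruction, quantize the prediction error to 4-bit codes, and adapt the step size. The constants are made up; real schemes like IMA ADPCM use fixed step-size tables instead.

```python
import numpy as np

def adpcm_encode(samples, bits=4):
    """Toy ADPCM encoder: quantize prediction error with an adaptive step
    (grow the step on large codes, shrink it on small ones)."""
    levels = 1 << (bits - 1)
    step, pred, codes = 16.0, 0.0, []
    for s in samples:
        code = int(np.clip(round((float(s) - pred) / step), -levels, levels - 1))
        codes.append(code)
        pred += code * step  # mirror what the decoder will reconstruct
        step = min(max(step * (1.5 if abs(code) >= levels // 2 else 0.9), 1.0), 4096.0)
    return codes

def adpcm_decode(codes, bits=4):
    levels = 1 << (bits - 1)
    step, pred, out = 16.0, 0.0, []
    for code in codes:
        pred += code * step
        out.append(pred)
        step = min(max(step * (1.5 if abs(code) >= levels // 2 else 0.9), 1.0), 4096.0)
    return np.array(out)

# 4-bit codes on 16-bit samples -> fixed 4x compression, no entropy coding.
t = np.arange(4800)
sig = (8000 * np.sin(2 * np.pi * 440 * t / 48000)).astype(np.int16)
rec = adpcm_decode(adpcm_encode(sig))
```

At 48 kHz mono that's 192 kbit/s, which lands in the "moderately high bitrate" territory mentioned above.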

I believe codec2 is pretty easy computationally. The M17 project uses it IIRC, and implements it on an STM32.

That'd be a good choice except for the requirement to support non-speech audio.

LC3Plus or AAC-LD. Although they likely don’t fit the definition of Open Codec.

Vorbis might be a good choice there

That sample at 3200 bits per second is fantastic for such a low bitrate. I also love how that works out at 1.44MB/hr.. one floppy disk per hour!
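The floppy figure checks out:

```python
# 3200 bit/s Codec2 -> bytes per hour vs. a 1.44 "MB" floppy.
bits_per_second = 3200
bytes_per_hour = bits_per_second // 8 * 3600
print(bytes_per_hour)  # 1440000 bytes/hour
# A "1.44 MB" floppy actually holds 1440 KiB = 1474560 bytes,
# so one disk fits just over an hour.
```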

Imagine a portable podcast player for fans of vintage tech.

(Won't work with music, of course.)

Books-on-floppy would work too. Like a digital version of books on tape, with about as many disk changes.

I really wish something like that would come out. Same for modern takes on cassette and MiniDisc: something new, but with that kind of hardware. I love physical media!

A Diskman, if you will..

I still remember the times when I regularly visited my friend with 2 floppies just to get one ~4 minutes MP3 song back home.

Subjectively beating Opus on quality-to-bit rate is quite impressive, but I noticed the samples had some interesting audible artifacts. I wonder where these come from, and if they're related in any way to this codec using machine learning techniques.

Very impressive.

It'd be interesting to see what the lift would be to get encoding & decoding running in WebAssembly/wasm. Further, it'd be really neat to take something like the tflite_model_wrapper[1] and get it backed by something like tfjs-tflite[2], perhaps running atop, for example, tfjs-backend-webgpu[3].

Longer run, the WebNN[4] spec should hopefully simplify and bake some of these libraries into the web platform, making running inference much easier. But there's still an interesting challenge & question that I'm not sure how to tackle: how to take native code, compile it to wasm, but have some of the implementation provided elsewhere.

At the moment, Lyra V2 can already use XNNPACK[5], which does have a pretty good wasm implementation. But being able to swap out implementations, so that for example we could use the GPU or other accelerators, could still have some good benefits on various platforms.

[1] https://github.com/google/lyra/pull/89/files#diff-ed2f131a63...

[2] https://www.npmjs.com/package/@tensorflow/tfjs-tflite

[3] https://www.npmjs.com/package/@tensorflow/tfjs-backend-webgp...

[4] https://www.w3.org/TR/webnn/

[5] https://github.com/google/XNNPACK

Why would you want to run codecs in WASM? Makes no sense to me.

Forwards and backwards compatibility and not having to rely on every vendor to ship support for the codec you want to use in your software.

Cause the web is awesome and anyone could use something built for it with zero friction.

Things missing that I think would have to be added before this could become a widely used standard:

* Variable bitrate. For many uses, the goal isn't 'fill this channel' but 'transmit this audio stream at the best quality possible'. That means filling the channel sometimes, but other times transmitting less data (i.e. when there is silence, or when the entropy of the speech being sent is low, for example because the user is saying something very predictable).

* Handling all types of audio. Even something designed for phone calls will occasionally be asked by users to transmit music, sound effects, etc. The codec should do an acceptable job at those other tasks.
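A toy sketch of what such a variable-bitrate/DTX policy could look like; the thresholds, rates, and the idea of using frame energy alone are all invented for illustration (real VBR also looks at spectral content):

```python
import numpy as np

def frame_bitrate(frame: np.ndarray, rms_floor: float = 1e-3,
                  low: int = 3200, high: int = 9200) -> int:
    """Pick a per-frame bitrate: 0 (DTX/comfort-noise frame) for silence,
    a low rate for quiet audio, the full rate otherwise."""
    rms = np.sqrt(np.mean(frame.astype(np.float64) ** 2))
    if rms < rms_floor:
        return 0
    return low if rms < 10 * rms_floor else high
```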

Alas, the sources require Bazel to build. That's going to limit adoption because Bazel is difficult to deal with and most projects don't use it.

1. Install numpy (pip3 install numpy)

2. Download a bazel binary (https://github.com/bazelbuild/bazel/releases or use package manager)

3. bazel build -c opt :encoder_main

4. bazel-bin/encoder_main --input_path=testdata/sample1_16kHz.wav --output_dir=$HOME/temp --bitrate=3200


This is an amazing contribution by Google. I wonder if there is a simple WebRTC demo app available with this codec plugged in?

What's the difference between googleblog.com and blog.google ?

blog.google.com = blogger.l.google.com =

googleblog.com =

It would be nice to compare it to a higher-quality sample; currently the samples sound like they were recorded through a telephone (4 kHz).

It's meant to be a phone codec.

"HD" phone codecs seem higher quality than the example given.

That's true, but in this era of high quality voice and video calls, it's not that uncommon for someone to want to play a song or even a live instrument, so some capability for handling that intelligently seems important.

The mics built into modern smartphones aren't limited to the fidelity level of ancient telephones, so it seems reasonable to hope for more.

I believe the limitation was actually not the mics. You can fit a lot more phone calls into a given bandwidth if you heavily restrict the bandwidth of each phone call.

As an ML person I totally get how the NN works to accomplish this, and it's very cool. What's really cool is how they get this to work in real time with little to no latency.
