In my experience, the weakness of current conferencing systems isn't in the software - it's a hardware problem.
Specifically, their ability to capture and transmit usable speech is really poor. Between echo and poor noise gating, I'm not exaggerating when I say that I can make out fewer than 1/3 of the words in my team's conference calls.
I don't care how efficient the video codec is, if I can't clearly understand what the other people are saying, it's all useless.
Are you sure the problem isn’t the phone system? We use Polycoms with Skype and everything is crystal clear. Our AV guy says Polycom is the best you can buy at that price point, and he’s trying to sell me $15,000 systems.
We only ever get reports about problems when people are actually dialing into something over POTS.
It's likely a network problem causing large jitter buffers (see my other comment). Buying Polycom won't solve this problem, unless the jitter buffer is in adaptive mode and it's wildly overestimating the jitter (a problem I've seen on Polycoms and Grandstreams before).
I think the latency is the killer for our remote conference calls. So much extra delay and effort goes into not talking over each other that isn't an issue in a live meeting.
Sounds like your conferencing system has large jitter buffers. If your internet is stable, you should be able to reduce the buffer.
To test your network, go to https://www.dslreports.com/speedtest and see if you have significant jitter or bufferbloat. If your rating is an A, then you're good to go; otherwise it's time to clean up your local network :P
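If you'd rather measure from a script than trust a web page, something like this gets you a ballpark figure. The host, port, and sample count are placeholders, and this is not the methodology dslreports uses, just the same basic idea:

    # Rough sketch: estimate RTT jitter by timing repeated TCP connects.
    # Host/port/sample count are placeholders, not from the thread above.
    import socket, statistics, time

    HOST, PORT, SAMPLES = "8.8.8.8", 53, 20

    rtts = []
    for _ in range(SAMPLES):
        start = time.perf_counter()
        with socket.create_connection((HOST, PORT), timeout=2):
            pass
        rtts.append((time.perf_counter() - start) * 1000)  # ms
        time.sleep(0.2)

    # RFC 3550-style jitter: mean absolute difference of consecutive RTTs
    jitter = statistics.mean(abs(a - b) for a, b in zip(rtts, rtts[1:]))
    print(f"mean RTT {statistics.mean(rtts):.1f} ms, jitter {jitter:.1f} ms")

Anything consistently in the tens of milliseconds of jitter is going to make a small, fixed jitter buffer fall over.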
Echo cancellation, voice activity detection (VAD, which is what they use for gating), AGC, etc. are all done in software in modern VoIP (or firmware? It's on a dedicated processor). And it's quite good if you spend the money on it.
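If you're curious what the software side looks like, here's a minimal sketch of just the VAD piece using the py-webrtcvad bindings to WebRTC's detector (the file name and aggressiveness level are examples, not anything from this thread):

    # Minimal VAD sketch using py-webrtcvad (pip install webrtcvad).
    # Input must be 16-bit mono PCM at 8/16/32/48 kHz; the file name
    # and aggressiveness level below are placeholders.
    import wave
    import webrtcvad

    vad = webrtcvad.Vad(2)  # aggressiveness: 0 (least) .. 3 (most)
    with wave.open("mono_16khz.wav", "rb") as wf:
        rate = wf.getframerate()
        frame_len = int(rate * 0.03)        # 30 ms frames
        while True:
            frame = wf.readframes(frame_len)
            if len(frame) < frame_len * 2:  # 2 bytes per 16-bit sample
                break
            print("speech" if vad.is_speech(frame, rate) else "silence")

A gate built on top of that decision is what keeps keyboard clatter out of the call, and it's tuning, not exotic hardware.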
Those cheap Cisco conference phones do sound terrible, and I don't think people replace them very often.
Money isn't a cure-all for AEC; frankly, Google's WebRTC AEC is the best commercially available AEC. You can hear it in action on a call over Signal Private Messenger: it knocks out background noise (e.g. loud car noises if you're walking down the street) like a champ.
Like as a standalone product or integrated into conferencing systems? Because there are plenty of solutions made by audio companies with more institutional knowledge and time invested than Google.
Google, Mozilla and all the organizations that have collaborated on WebRTC have spent 7+ years tuning this component. Much like Opus versus AMR-WB, I doubt there is anything notably better in the proprietary software/hardware world.
The orgs that develop the WebRTC standard have spent years implementing and experimenting with nearly every algorithm that is available for each of the various components (resulting in Opus, AV1, WebRTC AEC, etc), combing research papers daily and often working with researchers directly.
Personally, I've heard (and been a test subject for) proprietary AEC, and it is way better than anything using WebRTC. WebRTC's artifacting can be really bad, enough so that it's worth disabling (the noise reduction is quite good; it's the echo cancellation that kinda sucks and harms voice quality, giving you the classic garbled sound).
Stuff like WebRTC/Opus is limited by the fact that it runs in a browser or on consumer hardware. It's not bad, it's just a different problem domain than commercial audio, with different constraints. High-quality conferencing equipment runs on a dedicated unit that your A/V integrator installs either in the room or in a server closet, connects over Dante to the analog front end in the room, and is usually a package deal with good microphones (which could be wireless, or the hot thing now is beamforming).
Coincidentally, I'm trying out rav1e today (the open-source command-line AV1 encoder). At 1080p, for a 2m40s movie trailer on a current-model 6-core Mac mini, it's running at 0.153 fps. That means it's roughly 157x slower than real time. Kudos to these Cisco chaps for being roughly 157x faster than rav1e!
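(Checking my own math, assuming the trailer is 24 fps, which I didn't state above:)

    # 157x figure, assuming a 24 fps trailer (my assumption)
    trailer_fps, encode_fps = 24, 0.153
    print(trailer_fps / encode_fps)  # ~156.9, i.e. roughly 157x slower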
You simply turn off all the tools that are slow, and you end up with an AV1 encode that is very fast, but quality won't be anywhere close to your normal AV1 encode.
I don't doubt it will be of higher quality than AVC. After all, work on AVC started before 2000, and it was published in 2003.
It is easy to dismiss what Cisco demonstrated like this, but if you listen to the talk, they argue that having more tools available gives you more room to optimize for quality per bit per cycle, in the same way that having more choices allows you to optimize better for just quality per bit (or any other objective, really). So it's not just a matter of turning things off, but of making good choices.
It is also not just beating AVC. They claim that it is higher quality than their own HEVC work, which they invested in heavily before it became clear that the licensing situation was not going to just "work itself out". They once told us that even they were surprised by how many of the HEVC tools you could wind up using in real time, if you used them judiciously (basically: all of them).
I have worked with some of the team responsible for this project before, and we collaborated with them on some AV1 tools (most notably the CDEF loop filter, but also entropy coding and a few other things). They know their stuff, and really do seem to be leaps and bounds ahead of everyone else on the RTC side of things.
I encoded about 100 1080p-sized images to AVIF yesterday. I was a little disappointed that each of them took about 30 seconds, fully loading an i7-2600 CPU. Real-time AV1 encoding is good news; maybe those optimizations can reach the still-image format?
Real-time is such an unfortunate term when talking about video collaboration. A codec that can achieve real-time encoding speed is still useless if it has a 10 frame latency to output the first data, and arguably not real-time.
~1/3 second is perfectly acceptable to me to be within the range of "real time", and is quick enough that I'd be willing to bet most people won't notice even in a two way conversation.
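(And the two numbers in this subthread actually line up. A sketch of the arithmetic, assuming a 30 fps stream, which neither comment states:)

    # 10-frame encoder delay at 30 fps (frame rate is my assumption)
    frames, fps = 10, 30
    print(f"{frames / fps * 1000:.0f} ms")  # 333 ms, i.e. ~1/3 second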
For live streaming and chat that's fairly acceptable. However, don't forget there are other potential use cases, particularly since cloud gaming is supposedly becoming a thing again.
I don't think Cisco are using dedicated hardware logic like Allegro DVT does. For phones etc., dedicated hardware will be very useful, but it takes time to get into products, which is why a real-time CPU encoder is very interesting.
Looking at the video (https://vimeo.com/344366650) of the Cisco talk, at 20m18s they say a "real time low latency HD software encoder" [emphasis added], which confirms it.
Fair enough, but those laptops are meant to be used tethered to a power source. Set top box chips are generally far more power hungry than their mobile counterparts, just like desktop chips vs laptop chips. The issue here is that I'm not sure there's much of a market for phones that don't work on battery for more than 15 minutes.
Then again, there's a market for water-cooling blocks for phones so I guess anything is possible.
Those laptops do come with a battery, though, so someone out there is enjoying a whole 22 minutes of battery capacity.
In all seriousness, though, integration into a set-top-box chip is a sign of commoditization: a low-power chip with a limited application, where it is implemented not to power a large marketing department, but simply out of potential necessity.
This is to me a far bigger indicator of "things coming soon" than, say, integration into a high-end GPU.
So how far away is it until my phone can record 4K @ 60fps like it can with HEVC? I'm guessing we'll need hardware support built on these optimizations?
My guess: 5 years. And the Cisco "optimizations" may work well for a conference call with a perfectly still camera and low spatial and temporal information density. Applying the same techniques to handheld action video would produce very poor results.
I haven't used Webex or any Cisco conferencing offerings recently but I have been very impressed by Zoom. It works really well when using computer audio with AirPods. I've used this setup for meetings across continents without thought to latency. This may be a result of Internet connections improving in general. I found video conferencing to be distracting in the past but surprisingly good now.
Since that time HEVC encoding has only become faster and less expensive.
This real-time AV1 encoder achieves 1080p30, which is, simply speaking, 8 times fewer pixels per second than that 2014 demo. And it compares against H.264, which says it all.
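(The 8x figure is just the pixel-rate ratio, assuming that 2014 demo was 4K60, which is what the ratio implies:)

    # pixel rate of a 4K60 demo vs this 1080p30 encoder
    rate_4k60 = 3840 * 2160 * 60
    rate_1080p30 = 1920 * 1080 * 30
    print(rate_4k60 / rate_1080p30)  # 8.0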
They compared against HEVC as well. To quote the article: "This means that we can substantially raise quality, while saving bits, all with a very usable CPU footprint. We have found that the real-world compression/speed trade-offs for AV1 are in fact excellent, and better than HEVC."
And the business problem with HEVC is: "HEVC (aka H.265) comes with unacceptable patent cost, risk and uncertainty."
What do you believe HEVC is winning? The major online video platforms (YouTube, Vimeo, Netflix, Twitch, etc.) are all going to AV1.
HEVC had its shot but the terrible licensing stunted its growth and AV1 will replace it. Leonardo Chiariglione says the MPEG business model is broken and I think he's right:
Leonardo Chiariglione also said "AOM will certainly give much needed stability to the video codec market but this will come at the cost of reduced if not entirely halted technical progress."
which is a very real concern. (I'm a bit tired of arguing with people defending Google's contraption vs. a codec with innovation from dozens of companies.)
AV1 is not just Google's child, but rather a collaboration among Amazon, Cisco, Google, Intel, Microsoft, Mozilla, and Netflix to create a standard that isn't beholden to MPEG-LA: https://en.m.wikipedia.org/wiki/AV1
It isn't. Innovation in video coding is driven by the need to achieve the same image quality in fewer bits for easier and cheaper distribution. All video platforms, providers, and users have a stake in achieving that improvement.
Video coding is not finished yet. There's still more to do and there will be an AV2 eventually.
I did, that's why I was a little confused by your summary. It was a lot more nuanced than I was expecting.
I enjoyed the bit when he flashed up the logos of Netflix, Amazon, Vimeo, Microsoft and other AOM members as showing strong support for HEVC from streaming platforms. And the bit when he said that support by AOM member Apple was really key to HEVC success in America. Or that AOM member Samsung was putting HEVC in TVs.
The guy seems smart and informed, but the video had a lot of dramatic irony, since many of the corporate big guns he uses to quantify HEVC's success seem to be more and more vocal about their dissatisfaction with HEVC and their plans to ditch it for AV1 as soon as possible, which makes it a Pyrrhic kind of victory.
> many of the corporate big guns he uses to quantify HEVC success seem to be more and more vocal about their dissatisfaction with HEVC and plans to ditch it for AV1 as soon as possible
Version 1 of HEVC was approved in 2013. Version 1 of AV1 was approved in 2018. So if realtime encoding for HEVC appeared in 2014 and we are in 2019, then it seems like the technical progress is on par.
That's a Xilinx-based FPGA encoder, which is a different thing from a software encoder using one core.
One of the other talks at the event where this was announced was by NGCodec, who were showing off their Xilinx-based AV1 encoder, though I believe it's intra-only at the moment.