How to stream media using WebRTC and FFmpeg, and why it's a bad idea (maxwellgale.com)
127 points by dimes on Jan 30, 2021 | 39 comments



>And finally, we encounter a large issue without a good solution. In encoded videos, a key frame is a frame in the video that contains all the visual information needed to render itself without any additional metadata. These are much larger than normal frames, and contribute greatly to the bitrate. Ideally, there would be as few keyframes as possible. However, when a new user starts consuming a stream, they need at least one keyframe to view the video. WebRTC solves this problem using the RTP Control Protocol (RTCP). When a new user consumes a stream, they send a Full Intra Request (FIR) to the producer. When a producer receives this request, they insert a keyframe into the stream. This keeps the bitrate low while ensuring all the users can view the stream. FFmpeg does not support RTCP. This means that the default FFmpeg settings will produce output that won’t be viewable if consumed mid-stream, at least until a key frame is received. Therefore, the parameter -force_key_frames expr:gte(t,n_forced*4) is needed, which produces a key frame every 4 seconds.

in case someone was wondering why it was a bad idea
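For anyone who wants to see that flag in context, a minimal sketch (not the article's exact command; the input file and the RTP destination are just placeholders):

    ffmpeg -re -i input.mp4 -an \
        -c:v libx264 -preset veryfast -tune zerolatency \
        -force_key_frames "expr:gte(t,n_forced*4)" \
        -f rtp rtp://127.0.0.1:5004

The expression fires whenever the timestamp t passes the next 4-second boundary; n_forced counts how many key frames have been forced so far.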


H.264 and most other modern codecs support “intra refresh” to avoid this problem, at the cost of a marginally higher bitrate overall. Think of this as a “rolling keyframe slice” which marches across the screen every few seconds.

http://www.chaneru.com/Roku/HLS/X264_Settings.htm#intra-refr...
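If you want to try it from FFmpeg, a rough sketch with libx264 (intra-refresh and keyint are the x264 options documented at the page above; the rest of the flags are just placeholders for a real pipeline):

    ffmpeg -re -i input.mp4 -an \
        -c:v libx264 -preset veryfast -tune zerolatency \
        -x264-params "intra-refresh=1:keyint=120" \
        -f rtp rtp://127.0.0.1:5004

With intra-refresh=1, x264 spreads intra-coded slices across successive frames instead of emitting periodic IDR key frames.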


I would say intra refresh solves a different problem. You still have to wait for the intra refresh to cover the frame before you can start watching properly. That takes just as long as waiting for a keyframe, and needs slightly more bytes.

The benefit of intra refresh is that you avoid having any particularly large frames. If you're using a sub-second buffer, then intra refresh makes your maximum frame size much smaller without sacrificing quality. It's a godsend for getting latency down to tiny amounts. But if you have 1 or 2 seconds of buffer then it's no big deal if a keyframe is an order of magnitude bigger than other frames, and intra refresh is pointless.

Also it's not really a codec thing, it's a clever encoder trick that you can do on basically anything.


You could do it on most things, but it has to be part of the codec to actually work.

A decoder that isn’t aware of this scheme can only wait for the next key frame. Compared with IDR key frames, there is a lot more bookkeeping to do to figure out when every single needed pixel has been accounted for.

And although it is part of the standard, I’ve encountered players that don’t support it even though they are based on ffmpeg - probably because they do look for key frames to seek to or something.


> And although it is part of the standard, I’ve encountered players that don’t support it even though they are based on ffmpeg - probably because they do look for key frames to seek to or something.

I think that supports my point. It doesn't matter if the technique is explicitly listed in the codec or not. You need clients that will tolerate infinite P-frames, and they will work equally well whether you're using intra refresh techniques on h.264 or h.262

The dumbest possible renderer would have full support; it takes extra logic to get in the way and break it.


Yes, this is a good point. Intra refresh does reduce the variability of the bitrate, but the bitrate is still higher than it would need to be if RTCP were supported.


I was not aware of this at the time of writing, but it solves a large problem we've been having. Thank you so much for pointing that out.

Edit: I've just tried using intra refresh, and it works pretty well, but the key frame interval is still required.


Thanks for the easy summary.

One thing to consider: for some IRL performances, it's not uncommon that if you arrive late, you might be seated at the timing discretion of an usher. I understand digital experiences may carry different expectations, but I could see building an experience around this, perhaps starting with audio-only and maybe even a countdown to a next keyframe event (every minute?) while a "please wait to be seated" is shown.


Incredible -- "Our digital usher is finding you a seat in the cloud" slapped on a screen might just save us months of planning and millions on infrastructure.


If you might be hiring for any product roles with the millions saved, feel free to reach out. :)


(The quoted paragraph is no longer in the article. But I'm still curious about it.)

> Therefore, the parameter -force_key_frames expr:gte(t,n_forced*4) is needed, which produces a key frame every 4 seconds.

How often would you like it to be producing key frames? My video experience is mostly with security cameras, and the ones I've used produce an I-frame every 2 seconds by default. Their encoders don't seem to be real high-quality; sometimes there's visible pulsing where the image will get worse until the next I-frame, so I wouldn't want to increase the interval much beyond that.

> WebRTC solves this problem using the RTP Control Protocol (RTCP). When a new user consumes a stream, they send a Full Intra Request (FIR) to the producer. When a producer receives this request, they insert a keyframe into the stream. This keeps the bitrate low while ensuring all the users can view the stream.

I'm writing an NVR that does live streaming with HTML5 Media Source Extensions, sending one-frame .mp4 fragments over a WebSocket connection. My approach to this problem is different: when a new client connects, I send them everything since the last key frame. IIRC there's more data in the I-frame than in the (up to 2 seconds of) P-frames since then, so this seems to work pretty well. If there were only an I-frame (say) every minute, I'd probably look at that insert-a-keyframe approach... there is an ONVIF command to insert a key frame, IIRC.
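For anyone curious what producing MSE-friendly fragments can look like on the FFmpeg side, one common recipe is fragmented MP4 via movflags (a sketch; the RTSP URL is a placeholder, and this variant fragments at key frames rather than one frame per fragment):

    ffmpeg -rtsp_transport tcp -i rtsp://camera/stream \
        -c copy -f mp4 \
        -movflags frag_keyframe+empty_moov+default_base_moof \
        pipe:1

The output on stdout is an init segment followed by self-contained moof/mdat fragments that can be appended to an MSE SourceBuffer.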


A few random videos I pulled from YouTube had a key frame on average every 4.5 seconds, ranging from 0.1 to 5.5 seconds apart. That seems pretty consistent regardless of the type of video, at least for the few I tried.

The worst I've seen in production was a poorly configured encoder that inserted a keyframe exactly every 30 seconds, along with segments every 10 seconds, which, surprisingly, caused some players to crash while trying to find a keyframe.
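If you want to check the key frame spacing of a file yourself, one way (assuming a reasonably recent ffprobe) is to list only the key frame timestamps and eyeball the gaps:

    ffprobe -v error -select_streams v:0 -skip_frame nokey \
        -show_frames -show_entries frame=pts_time -of csv=p=0 input.mp4

Each line is the timestamp of a key frame; consecutive differences give the interval.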


As a reference, I think Google Chrome sends a key frame every 90 seconds by default


Still sounds better than FIR. Consider a big streamer with thousands of users: users are constantly arriving and leaving, so the keyframe requests would be so constant that I can see keyframes being generated much more often than every 4 seconds (assuming I understand it all correctly).


Usually you’ll have an SFU between the users and the streamer that can limit the number of requests to one every X seconds.


I guess it comes down to latency requirements?

I would expect where latency isn’t a huge concern, the best user experience would be to start the new receiver back at the last keyframe and fill the buffer up to “present” so they can start watching instantly, and keep a few seconds in the buffer for stability.

In more latency critical streams where you still want the perception of instant video startup I suppose you would have to start at the last keyframe and then as soon as the next key frame came through you could just jump ahead.


"In order for users to watch the video, they must be able to download it in real time, so the maximum bitrate has to be lower than the slowest connection among your users."

Why not just (progressive) download, watch/save, delete? IOW, playback from saved file.

Better for a variety of conditions, e.g., when the connection might be slow.


Streaming really is just downloading without the saving part.


I'm sure this could be implemented if someone were to sit down and implement it.


Ok, so only a problem in live streams?

(And I suppose also when seeking inside a stream)


It's a problem when playing from a non-start, non-keyframe point (which in practice means any arbitrary point). I'm guessing that's what you meant.


It seems to me that you can't seek with a webrtc stream, as it is at least.


webrtc is just a stream but you can absolutely tell the sending side to seek to a certain point.

If webrtc is your TV then the sending side is your VHS. You can tell the VHS to rewind or forward, but telling your TV to do the same is impossible. It just shows you what it gets.


I meant it's not part of webrtc, but you indeed can implement a lot of things around it.


To get it into the browser check out rtp-to-webrtc[0]

Another big piece missing here is congestion control. It isn’t just about keeping the bitrate low, but about figuring out how much bandwidth you can actually use. It is a really interesting topic: measuring RTT/loss to figure out what is available. You don’t get that in ffmpeg or GStreamer yet. The best intro to this is the BBR IETF doc, IMO [1].

[0] https://github.com/pion/webrtc/tree/master/examples/rtp-to-w...

[1] https://tools.ietf.org/html/draft-cardwell-iccrg-bbr-congest...
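For reference, the kind of FFmpeg invocation the rtp-to-webrtc example expects looks roughly like the sketch below; the codec (VP8), port 5004 and packet size are assumptions based on my reading of the example, so check the repo's README for the exact values:

    ffmpeg -re -i input.mp4 -an \
        -c:v libvpx -deadline realtime -cpu-used 5 -b:v 1M \
        -f rtp 'rtp://127.0.0.1:5004?pkt_size=1200'

pkt_size keeps the RTP packets small enough to be forwarded to WebRTC peers without repacketization.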


Note that this post doesn't really cover how to stream media using WebRTC. First and foremost, because WebRTC mandates the use of DTLS to encrypt the RTP flow, a plain RTP stream won't work. A more apt title would be "How to use FFmpeg to generate an encoded stream that happens to match the requirements for WebRTC".

Still, thanks for the article; it is always interesting to see specific applications of the FFmpeg command line, because in my opinion, having read them top to bottom, the FFmpeg docs are very lacking when it comes to explaining the whys.

Random example: you read the docs for genpts and it's something along the lines of "Enables generation of PTS timestamps". Well, thank you (/s). But really, when should I use it? What does it actually change between using it or not? What scenarios would benefit from using it? Etc., etc.
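To give the kind of answer I wish the docs gave (my own understanding, not an official one): genpts mainly matters when stream-copying from a container whose packets are missing PTS values, e.g. a broken MPEG-TS capture; it synthesizes PTS from DTS so the output muxer doesn't choke:

    ffmpeg -fflags +genpts -i broken.ts -c copy fixed.mp4

The file names are placeholders; the point is just when the flag is worth reaching for.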


Technically, it's SRTP with keys derived from the handshake of a DTLS connection though. That DTLS connection can be used for SCTP, the underlying protocol for WebRTC data channels.

So yeah, that won't work to stream to a WebRTC endpoint as you said!


This is one of my pet peeves with documentation. "When to" and "when not to" are essentially the wisdom to match the intelligence of a how-to.


A lot of open source media docs are like this. One of the worst offenders is GStreamer. Geez, is that stuff uninformative.

But at least it's awesome software for free, so who am I to complain?


What about the WebRTC part?

The post ends at RTP out from FFmpeg. Maybe I’m supposed to know how to consume that with WebRTC, but in my investigation it’s not at all straightforward: the WebRTC consumer needs to become aware of the stream through a whole complicated signaling and negotiation process. How is that handled after the FFmpeg RTP stream is produced?


The WebRTC part would indeed be convoluted.

First, you would need to encrypt the RTP packets with DTLS.

Then, you would need an SDP message generator, where you would include all sorts of info:

* Codec and tunings of video and audio streams.

* RTCP ports where you'll be listening for RTCP Receiver feedback, if any.

* The fingerprint of the DTLS certificate used to derive the encryption keys.

* Some fake ICE candidates that the other party can use to reach you.

Then provide this as an SDP Offer to the WebRTC API of the other side (i.e. the RTCPeerConnection if we're talking about a web browser), and receive in response an SDP Answer. You should then be able to parse this Answer because the other participant might have rejected some of the parameters you gave it in the Offer (e.g. it could be ready only for audio and reject your video). Or just ignore the Answer and hope that you know the other party so well that they won't reject any of the parameters you provided in the Offer.

Finally, you would need to receive ICE candidates from the other party, and parse them in order to know where (what IP and port) to send your RTP packets (and RTCP Sender Reports, if any).


I use MediaSoup to bridge between ffmpeg and WebRTC. It works pretty well, and I like that it’s all node based.


For one-to-many live streaming, you would probably want to use HLS.

Twitch uses its own transcoding system. Here is an interesting read from their engineering blog [0]

[0] https://blog.twitch.tv/en/2017/10/10/live-video-transmuxing-...
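For completeness, a bare-bones sketch of a live HLS output from FFmpeg (the segment length and playlist size are arbitrary; a real one-to-many setup would sit behind a CDN and usually add several renditions):

    ffmpeg -re -i input.mp4 \
        -c:v libx264 -preset veryfast -c:a aac \
        -f hls -hls_time 4 -hls_list_size 6 -hls_flags delete_segments \
        stream.m3u8

Latency ends up being several segment durations, which is where the low-latency HLS efforts come in.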


If you want to achieve something approaching the latency advantages of WebRTC with HLS, it's well worth checking out the low-latency HLS work by Apple and the wider video-dev community.

https://developer.apple.com/documentation/http_live_streamin...

https://tools.ietf.org/html/draft-pantos-hls-rfc8216bis-08


> -bsv:v h264_metadata=level=3.1

This should be `-bsf:v` and it's not required since this command encodes and the encoder has been informed via `-level`.
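In other words (my paraphrase, with placeholder file names): when encoding, set the level on the encoder; the bitstream filter is only useful when stream-copying and the encoder can't be touched:

    # When encoding, the encoder writes the level itself:
    ffmpeg -i input.mp4 -c:v libx264 -profile:v baseline -level 3.1 -c:a copy out_encoded.mp4

    # When stream-copying existing H.264, a bitstream filter can rewrite the metadata:
    ffmpeg -i input.mp4 -c:v copy -c:a copy -bsf:v h264_metadata=level=3.1 out_copy.mp4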


Thanks for the feedback. I've removed it.


Isn't Opus the only codec WebRTC supports? If so, I think it's another main parameter to note.


It's not the only codec, but it's the only high quality codec mandated by the spec and supported by all the browsers.

G711 is also mandated by the spec but it's a low quality codec intended for speech with a fixed 8kHz sampling rate. There are a few other codecs supported by Chrome and Safari but not Firefox.


H264 is also supported though it's limited to the Constrained Baseline profile [0]. That said, I have been able to use an H264 stream encoded with a Main profile that still worked in Chrome, so it could just be a strong recommendation.

https://developer.mozilla.org/en-US/docs/Web/Media/Formats/W...
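For what it's worth, the encoding flags that tend to keep all three browsers happy look roughly like this sketch (not a guarantee; x264's baseline profile is constrained baseline in practice, Opus in WebRTC runs at 48 kHz, and the two RTP ports are placeholders for whatever does the DTLS/SRTP side):

    ffmpeg -re -i input.mp4 \
        -an -c:v libx264 -profile:v baseline -level 3.1 -pix_fmt yuv420p \
        -f rtp rtp://127.0.0.1:5004 \
        -vn -c:a libopus -ar 48000 -ac 2 \
        -f rtp rtp://127.0.0.1:5006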



