This article is incorrect about WebRTC. I don’t know about other protocols and what they offer.
* Clock Recovery
I have had no problems with measuring NTP drift. As the clocks change I would measure.
* Common Clock for Audio and Video
Sender Reports contain a mapping of Sequence Number to RTP Sequence Numbers. This is respected by every player I have used. My guess is the author put their media in different MediaStreams. If you want all your tracks to be synced you need to mark them as one MediaStream.
>I have had no problems with measuring NTP drift. As the clocks change I would measure.
Did you read the article? NTP is not the same as the video/audio clock which is what you need to care about. I have to now take a drink even though it's 5am here in Singapore.
> Common Clock for Audio and Video
No idea what sequence numbers have to do with clocks here. Maybe you mean a mapping of absolute time to relative time in RTCP? If the RTCP SR value for absolute time is using NTP (or any other wrong clock as it's not mandated to match audio/video clock), then it's by definition impossible to know how to sync audio and video after several wraparounds of each RTP timestamp.
> WebRTC provides playoutDelay.
This is not the same as a defined, video frame or audio sample accurate delay (ten milliseconds as a unit...) to allow for variations in frame size to maximise quality. It also appears to mix up network delays vs VBV delays. They are separate delays and are handled at different layers of the stack.
> You can transport anything you want via RTP or DataChannels
None of this is standardised and therefore requires control of both ends. Also high end applications need Uncompressed Audio and for the above RTP timestamp reasons this can't be precisely synced with video.
Each RTP packet has a 32-bit timestamp and a 32-bit SSRC. Each "sender" in an RTP session must use the same SSRC; this is how synchronisation between audio and video streams from the same sender (lip-sync) is achieved.
The timestamps have a resolution defined by the clock rate communicated externally through a signalling channel.
For example, audio at 48k and video at 90k. The problem is that they wrap around at different times: video wraps first, then audio wraps later, and this can happen several times over.
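To put rough numbers on the wraparound above, here is a small sketch that assumes nothing beyond a 32-bit counter ticking at the stated clock rate. The function name is made up for illustration.

```python
# Time until a 32-bit RTP timestamp wraps, for the two clock rates mentioned above.
RTP_TS_MODULUS = 2 ** 32


def wrap_period_hours(clock_rate_hz: int) -> float:
    """Hours until a 32-bit counter ticking at clock_rate_hz wraps around."""
    return RTP_TS_MODULUS / clock_rate_hz / 3600


video_wrap_h = wrap_period_hours(90_000)  # video clock
audio_wrap_h = wrap_period_hours(48_000)  # audio clock
print(f"video wraps after {video_wrap_h:.2f} h, audio after {audio_wrap_h:.2f} h")
```

That works out to roughly 13 hours for 90 kHz video and roughly 25 hours for 48 kHz audio, so the two streams do wrap at different moments.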
> I have had no problems with measuring NTP drift.
Yeah, their claim is just weird. RTP does not impose an accuracy requirement on its timestamps (despite the name "NTP timestamp" in the Sender Reports, they are not actually expected to be synchronized with an NTP source), but I am skeptical such requirements would be met in practice if they did exist. The author only talks about video, but audio is a much bigger problem: if you do not send the right number of samples to the sound card at the right rate, you are going to get annoying clicks and pops, and "dropping or duplicating" audio frames will only cause more problems. You do not need to send media for a day to have these issues. Oscillators are bad enough they show up in minutes, and WebRTC obviously has strategies for dealing with them.
> If you want all your tracks to be synced you need to mark them as one MediaStream.
More specifically, the underlying mechanism of giving tracks that need to be synchronized the same CNAME has existed in RTP since the original RFC 1889 from the year 1996. The timestamp wrapping does require some care to get right, but is basically a non-issue.
That said, a lot of WebRTC applications will not ask for synchronization because it necessarily introduces latency, and for interactive use cases you are often better served by sacrificing exact sync for lower latency (as long as latency is low enough, sync is never going to get too bad, anyway). But that is very different from saying the standards do not support it if you want it.
The RFC does not mandate that the RTCP timestamp (which you need to handle wraparound if you join a stream halfway through) needs to be the same as the video/audio clock.
RTCP SRs are sent quite rarely (defaulting to 1s for video, 5s for audio), so they are quite poor for the precise clock recovery required in professional applications.
Probably practical implementations just use buffer fullness to drive their resampler.
> In practice this clock is generated via the PC clock so it isn't the same clock at all...
Yes, that is the point. WebRTC has to work with cheap / commodity hardware running a general purpose OS with clocks that are not synchronized. You get a bunch of clocks in different units running at different rates, and periodically are told the mappings between them. It is your job not to "run out of memory, or advance too quickly and run out of data to process," and indeed, WebRTC implementations have methods of solving those problems.
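As a rough illustration of "being told the mappings between clocks", here is a sketch that estimates how fast a remote media clock runs relative to the local clock, using two observations of the remote RTP timestamp and the local time at which each was seen. The function name and the sample values are hypothetical, and this ignores network jitter entirely, which a real implementation would have to filter out.

```python
def drift_ppm(rtp_first: int, rtp_last: int, clock_rate_hz: int,
              local_first_s: float, local_last_s: float) -> float:
    """Relative rate of the remote media clock versus the local clock, in ppm.

    Assumes fewer than one 32-bit wrap between the two observations and that
    the local timestamps come from a monotonic clock.
    """
    ticks = (rtp_last - rtp_first) & 0xFFFFFFFF
    remote_elapsed_s = ticks / clock_rate_hz
    local_elapsed_s = local_last_s - local_first_s
    if local_elapsed_s <= 0:
        raise ValueError("observations must be in increasing local time")
    return (remote_elapsed_s / local_elapsed_s - 1.0) * 1e6


# Hypothetical example: over 600 local seconds, a 48 kHz audio clock
# advanced by 600.03 seconds' worth of ticks.
example_ppm = drift_ppm(0, 48_000 * 600 + 1_440, 48_000, 0.0, 600.0)
print(f"remote clock runs {example_ppm:.1f} ppm fast")
```

A positive result means the sender produces samples faster than the local clock consumes them, so buffers would slowly fill unless something compensates.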
> Probably practical implementations just use buffer fullness to drive their resampler.
You are correct that in practice this gets driven by the jitter buffer in libwebrtc, but there is no resampler at all. Small changes in sampling rate are hard / computationally expensive to do well (as you certainly know). That is an implementation detail, though. You can use a different strategy if you have different requirements.
> ...precise clock recovery required in professional applications.
What are your actual precision requirements?
I also do not understand your concerns with wrap-around. Even SRs once every 5 seconds are more than enough to resolve ambiguity. You would have to go over half a day without seeing one before you could actually get confused, by which point you would have bigger issues.
Keep in mind that even at the start of the stream, the media timestamps are not absolute: each chooses an initial offset randomly.
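Here is a minimal sketch of how a Sender Report resolves both the wraparound and the random initial offsets: each SR pairs an RTP timestamp with a wallclock time, and packets are placed relative to that pair using a signed 32-bit difference. The function names and numbers are mine, and the SR wallclock is passed as plain float seconds rather than the 64-bit NTP format to keep it short.

```python
def signed_ts_delta(ts: int, reference: int) -> int:
    """Signed difference ts - reference on a 32-bit circle.

    Unambiguous as long as the two values are within half a wrap period.
    """
    delta = (ts - reference) & 0xFFFFFFFF
    return delta - 0x100000000 if delta >= 0x80000000 else delta


def rtp_to_wallclock(rtp_ts: int, sr_rtp_ts: int, sr_wallclock_s: float,
                     clock_rate_hz: int) -> float:
    """Map an RTP timestamp to sender wallclock using the latest SR pair."""
    return sr_wallclock_s + signed_ts_delta(rtp_ts, sr_rtp_ts) / clock_rate_hz


# Hypothetical values: the video timestamp has wrapped since its SR,
# the audio timestamp has not. Both SRs report wallclock 1000.0 s.
video_time = rtp_to_wallclock(0x00000100, 0xFFFFFF00, 1000.0, 90_000)
audio_time = rtp_to_wallclock(123_456 + 480, 123_456, 1000.0, 48_000)
av_offset_ms = (video_time - audio_time) * 1000
print(f"video is {av_offset_ms:.3f} ms relative to audio")
```

Once both streams are expressed on the same wallclock axis, the offset between them is just a subtraction, regardless of how many times either counter has wrapped.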
Don't get myopic here - the timing constraints are critical for systems like DVB-T/C/S (and whatever the US equivalent is), not as much web. The things you're talking about might be dismissable when you're sending things from your blog app, but TS is primarily used in broadcasting.
The DVB machines I've worked with were very sensitive to any kind of jitter and clock skew.
Regarding clock, ideally you'd want to be able to genlock them.
The goal is to ensure that all input sources send their video frames at the exact same point in time, and that each audio device also samples at the exact same points in time.
In the past that was even more important, as you'd want to make sure the scanline of CRTs in the studio and of the camera were perfectly synced.
MPEG-TS still exists because it’s still the best media container/transport system in existence.
It was designed and has evolved over a long time based on solid foundations by very smart people in the industry, across multiple institutions and professions.
As opposed to the “format du jour” that was quickly thrown together by some Eastern Bloc script kiddie who saw the basic outline of an AVI file and figured they could do better…
Case in point: MPEG-TS has scaled from DVD across Blu-ray (and HD DVD) and is the backbone of both satellite and terrestrial digital broadcasting.
> based on solid foundations by very smart people in the industry, across multiple institutions and professions.
You really do get the sense of that when reading ISO 13818-1, too. It provides solid technical information but plenty of theory and extrapolation on these ideas to ensure that you truly understand their purpose. After reading it I had a much deeper appreciation of what it took to define it.
Although many DVD video recorders did use MPEG-TS (I don't even know if you can buy a new DVD recorder anymore, I assume HD recorders destroyed their market over a decade ago)
> able to recover the clock of the source in order to know when to drop or duplicate video frames
So this is why I can never watch a whole movie and not have a single dropped frame!
Back in the old TV days, the TV camera set the 60 fps frequency, and if it was slightly slow at just 59.9 fps, then your TV would also run slightly slow. The whole system was synced perfectly.
Whereas now every device seems to think it can set its own version of "60 Hz". Your GPU's 60 Hz is inevitably a bit different from your display's idea of "60 Hz", which in turn is a bit different from your web browser's idea of "60 Hz", which is again a bit different from the "60 Hz" of the livestream you're watching. At every stage where one clock is slightly faster, frames are dropped, and at other stages duplicate frames are inserted. End result: you can't watch a film without random occasional jitters.
Please, everyone in the AV industry: Design your shit better. We should take away your quartz crystals and force you to have only inaccurate timing devices just so you are forced to properly sync everything!
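For a sense of scale on the drop/duplicate cadence described above: if two stages run at slightly different frame rates and nothing resamples, one frame has to be dropped or repeated roughly once per beat period, 1 / |f1 - f2|. A sketch with made-up rates:

```python
def seconds_between_corrections(source_hz: float, display_hz: float) -> float:
    """Approximate seconds between a forced frame drop or duplicate.

    Assumes both rates are constant and that the only correction mechanism
    is dropping or repeating whole frames.
    """
    diff = abs(source_hz - display_hz)
    if diff == 0:
        return float("inf")
    return 1.0 / diff


# Hypothetical mismatch: a 59.94 fps source shown on a 60.00 Hz display.
interval_s = seconds_between_corrections(59.94, 60.0)
print(f"one repeated frame about every {interval_s:.1f} s")
```

With a mismatch that large you would see a hiccup every 17 seconds or so; two nominally identical crystals that differ by tens of ppm give a much longer interval, but it never goes to infinity unless the clocks are actually locked.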
What system(s) are you having these problems on? I haven't encountered that type of issue for years, neither on web video on my computer or smartphone, nor on streaming video (VOD and livestreams) on various smart and not-so-smart home devices.
For VOD, your playback system's clock running slightly fast or slow should not be an issue, since all buffers are under its control (it can just send disk loads, HTTP requests etc. at the pace it needs them), and for livestreaming, the most common solutions (HLS and MPEG-DASH, ultimately using MPEG-TS or fragmented MPEG-4 container files) also explicitly specify clock synchronization.
> We should take away your quartz crystals and force you to have only inaccurate timing devices just so you are forced to properly sync everything!
For livestreams, that would ultimately require synchronizing your entire video pipeline to a remote source (i.e. running your video output slightly slow/fast as required) – I'm not sure if devices like set-top boxes or streaming receivers support that, but it seems entirely impossible for anything capable of displaying two video streams (synchronized to different source clocks) at once.
> So this is why I can never watch a whole movie and not have a single dropped frame!
No, it is because bandwidth is expensive.
That's why you get a 4k UHD HDR stream at 3MB/s. Your whole stream is calculated based on the few full frames that you receive.
A low bit rate normally does not cause dropped frames (unless it's so low that not even the required amount of I-frames can be encoded, but that would be a pretty extreme mismatch of resolution and bitrate).
MPEG-TS is incorporated in many digital TV standards. It will continue to be around for a long time simply because of that, regardless of technical points.
Fun fact. DOCSIS 1.0 through 3.0 cable Internet uses MPEG-2 Transport Streams to deliver the IP packets. It has to, because the QAM specification (ANSI/SCTE 07) is built around 188 byte TS packets.
And in those early days many cable providers didn't even implement the Baseline Privacy Interface, so you could sniff the entire neighborhood's downstream traffic with a modified modem or possibly even a DVB-C capable TV card.
There was literally a register bit in the cable modem MAC to enable promiscuous mode, and you could just set it: https://pastebin.com/18702Ziq
Sadly the Broadcom modem hardware I used to experiment with then seemed to lack the ability to get at the raw MPEG-TS packets, so I didn't manage to repurpose it as a TV receiver. The idea was to tune to a TV channel and stream it over multicast RTP on the LAN.
As used modems were very cheap, I wanted to have a whole bank of them, to demodulate every carrier on the CATV system. All fed into a giant, noisy 24 port 100Mb Ethernet switch that had IGMP snooping support. That was back in the 2000s...
You would have loved the old ATI/AMD demodulator evaluation board that I have. Does both ATSC 1.0 and QAM. TS output is DVB-ASI (which requires a pretty expensive DVB-ASI to USB converter).
I’ve always wondered if that was done to allow mixed video and DOCSIS channels, shared hardware on either end, or just to ensure that TVs and STBs can quickly and safely skip DOCSIS channels that they won’t be able to decode anyway.
If I remember correctly, the 188-byte packets are a result of using ATM AAL-1 cells, each 53-byte cell having 5 bytes of ATM header, 1 byte of AAL-1 header, and 47 bytes of payload (47 x 4 = 188).
The new ATSC 3.0 standard (NextGen TV) in the U.S. allows MPEG-TS, but prefers DASH using an OFDM-wrapped UDP stream, which after hardware decoding is consumed just like any other network stream. It's really well done. Digital TV in the U.S. will eventually be like a giant, one-way WiFi signal.
Right, but transport stream should only be used in broadcasting. It shouldn't be used on discs (ahem Blu-ray) or other storage and it shouldn't be used over the Internet. Program stream is usually a better choice.
> It shouldn't be used on discs (ahem Blu-ray) or other storage and it shouldn't be used over the Internet.
I thought Blu-Ray switched to program stream, but DVD was transport stream (sounds like I've got things backwards though). Transport stream for optical media isn't that insane though, because while you can seek, seeking isn't great, and if you have a failed read, it's probably better to just move on than to retry that sector a bunch of times. Of course, using transport streams allows for smaller buffers and lower BOM, so there's that too.
"The Transport Stream is a stream definition which is tailored for communicating or storing one or more programs of coded data according to ITU-T Rec. H.262 | ISO/IEC 13818-2 and ISO/IEC 13818-3 and other data in environments in which significant errors may occur. Such errors may be manifested as bit value errors or loss of packets."
"The Program Stream is designed for use in relatively error-free environments and is suitable for applications which may involve software processing of system information such as interactive multi-media applications. Program Stream packets may be of variable and relatively great length."
How, in a way that preserves all the mentioned properties of MPEG-TS?
There's RTP, but I would definitely not call translating MPEG-TS to RTP and back "dead simple" (unless you're just encapsulating MPEG-TS via RFC 2250, but then you're still using MPEG-TS).
I use it for storage because I can stream a recording, with full seeking, while I'm actively capturing it, or I can stream it over a websocket for a little less latency.
Having one thing that works just as well over WebSockets as it does over HLS makes things a lot easier on low performance devices where a format conversion would be a bad thing.
MPEG-TS is like a virtual analog RCA jack, it's easy to understand and work with and manipulate as long as you don't have to touch the encoding or decoding stuff, and lets you do lots of odd applications.
I started using mpeg_ts when I discovered that it streams well, that is, it can start randomly in the middle without much trouble. A few other formats I tried choked under this use case.
I don't know what this property is called, so I don't really know how to search for how well a format can start mid-stream. Any hints?
The term you are looking for is ‘random access point’.
This property is more a function of the encoding parameters of the video stream. It just happens that MPEG-TS is typically encoded so as to minimize the reasonable time between RAPs.
Also, MPEG-TS streams have sync bytes every 188 bytes, making it trivial to align the bitstream and then find the RAPs. Basically the use case you are describing is just the normal operation of a broadcast decode; get a stream at some arbitrary point of broadcast and decode it.
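The sync-byte alignment mentioned above can be sketched in a few lines. This looks for the first offset where 0x47 repeats every 188 bytes for several packets in a row; the function name and the choice of five confirming packets are my own assumptions, not anything mandated.

```python
from typing import Optional

TS_PACKET_SIZE = 188
SYNC_BYTE = 0x47


def find_ts_alignment(buf: bytes, confirm_packets: int = 5) -> Optional[int]:
    """Return the first offset where confirm_packets consecutive packets
    start with the sync byte, or None if no such offset exists in buf."""
    needed = TS_PACKET_SIZE * (confirm_packets - 1) + 1
    for offset in range(0, len(buf) - needed + 1):
        if all(buf[offset + i * TS_PACKET_SIZE] == SYNC_BYTE
               for i in range(confirm_packets)):
            return offset
    return None


# Hypothetical capture that starts 100 bytes into some unrelated data.
one_packet = bytes([SYNC_BYTE]) + bytes(TS_PACKET_SIZE - 1)
capture = bytes([0xFF]) * 100 + one_packet * 6
print(find_ts_alignment(capture))
```

Checking several packets in a row matters because 0x47 can also occur inside payload data; a single match is not enough to trust the alignment.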
> This property is more a function of the encoding parameters of the video stream
Not really. Formats like mp4 can't be parsed without reading the container header, which may be at the start or the end of the file. Thus you can't read an mp4 by starting in the middle; you have to read the header first and then seek to the middle of the file. This is the case regardless of encoding parameters. With mpeg-ts, on the other hand, you can seek to a random place in the middle and recover the stream at the next I-frame. Not many other file formats allow this.
It's also a property of TS being designed for exactly this use-case, which is why it has these sync bytes and mechanisms to allow decoders to randomly access the stream and still get all the necessary data.
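One of those mechanisms is visible right in the packet header. Below is a sketch that pulls out the PID, the payload_unit_start_indicator and the random_access_indicator flag from a single aligned packet, following the bit layout in ISO 13818-1. The function name and the sample bytes are hypothetical, and not every encoder sets random_access_indicator, so treat it as a hint rather than a guarantee.

```python
def parse_ts_header(packet: bytes) -> dict:
    """Extract a few header fields from one aligned 188-byte TS packet."""
    if len(packet) != 188 or packet[0] != 0x47:
        raise ValueError("not an aligned 188-byte TS packet")
    pid = ((packet[1] & 0x1F) << 8) | packet[2]          # 13-bit PID
    payload_unit_start = bool(packet[1] & 0x40)          # new PES/section starts here
    adaptation_field_control = (packet[3] >> 4) & 0x03   # 2 or 3 means field present
    has_adaptation = bool(adaptation_field_control & 0x02)
    random_access = False
    # packet[4] is adaptation_field_length; the flags byte only exists if it is > 0.
    if has_adaptation and packet[4] > 0:
        random_access = bool(packet[5] & 0x40)
    return {
        "pid": pid,
        "payload_unit_start": payload_unit_start,
        "has_adaptation_field": has_adaptation,
        "random_access": random_access,
    }


# Hypothetical packet: PID 0x100, unit start set, adaptation field with
# random_access_indicator and PCR_flag set, remaining bytes zeroed.
sample = bytes([0x47, 0x41, 0x00, 0x30, 0x07, 0x50]) + bytes(182)
info = parse_ts_header(sample)
print(info)
```

A player joining mid-stream can discard packets until it sees one like this on the video PID and start decoding from there.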
For a web dev, the most notable thing is that every block is the same length.
Have gstreamer output to a named pipe. Read from it and send it over websockets. Be sure that every packet or stream starts at a 188 byte boundary.
No matter where you start, as long as you're on a 188 byte line, you're good, the decoder knows what's up.
You don't have to decode or understand anything, you just need to count bytes.
Sub-second latency over LAN WiFi into a regular browser is trivial, even with your server written in pure Python.
I'm sure someday something might replace it, but right now it seems like one of those things that Just Works, and it's really nice when stuff actually works.
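The "just count bytes" relay described above could look something like this sketch. It only covers the chunking step: take whatever has been read from the pipe so far, emit messages that are whole multiples of 188 bytes, and keep the leftover for the next read. The function name and the 7-packets-per-message default are arbitrary choices of mine, and it assumes the input already starts on a packet boundary.

```python
from typing import List, Tuple

TS_PACKET_SIZE = 188


def split_aligned(buffer: bytes,
                  packets_per_message: int = 7) -> Tuple[List[bytes], bytes]:
    """Split buffer into messages of whole TS packets.

    Returns (messages, remainder). The remainder is the trailing partial
    packet and should be prepended to the next chunk read from the pipe.
    """
    chunk = TS_PACKET_SIZE * packets_per_message
    usable = len(buffer) - len(buffer) % TS_PACKET_SIZE
    aligned = buffer[:usable]
    messages = [aligned[i:i + chunk] for i in range(0, usable, chunk)]
    return messages, buffer[usable:]


# Hypothetical read: ten full packets plus 50 bytes of the eleventh.
messages, remainder = split_aligned(bytes(TS_PACKET_SIZE * 10 + 50))
print([len(m) for m in messages], len(remainder))
```

Each element of `messages` can then be handed to whatever WebSocket library you use as one binary message, and a client that joins late still starts on a packet boundary.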
I am not an expert in this area, but I've worked around its edges, and video has always struck me as one of tech's great HARD problems. It's a really frustrating combination of: meant for human consumption, difficult to characterize algorithmically, realtime, having a distinct future temporal envelope, etc. The problem is precisely that many people want to do many different things with video - and depending on what you want to do with it you may prefer an entirely different stack!
I almost want to compare it to making good vaccines in the medical world - some of the most beneficial work to all parties, but also some of the least commercially rewarding.
This feels like one of those classic "because it's very simply not bad enough for people to stop using it, given how only a little better the newer alternatives are".
For clock recovery, it's more about frame accuracy here. The drift and jitter of the PCRs (Program Clock References, one of the timestamps in the transport stream) are limited very strictly across a wide range of timescales. Here's a nice reference about it: https://download.tek.com/document/25W_14617_1.pdf (see, for example, Figure 8).
The limits here are surprisingly hard to achieve, both on the multiplexer side and on the decoder side. I've implemented clock recovery for the transport stream products at my company and it has been quite surprising how many multiplexers inject bad PCR data that's outside of the acceptable range.
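For anyone who has not looked at PCRs up close, here is a sketch of decoding the 6-byte PCR field (33-bit base at 90 kHz, 6 reserved bits, 9-bit extension at 27 MHz) and a deliberately simplified accuracy check that compares the PCR delta with the time implied by byte position in a constant-bitrate mux. The function names, the constant-bitrate assumption and the sample numbers are mine; real analysers do considerably more filtering than this.

```python
PCR_HZ = 27_000_000
PCR_MODULUS = (1 << 33) * 300  # base wraps at 2**33, each base tick is 300 PCR ticks


def decode_pcr(field: bytes) -> int:
    """Decode the 6-byte PCR field into 27 MHz ticks."""
    if len(field) != 6:
        raise ValueError("PCR field must be exactly 6 bytes")
    base = ((field[0] << 25) | (field[1] << 17) | (field[2] << 9)
            | (field[3] << 1) | (field[4] >> 7))
    extension = ((field[4] & 0x01) << 8) | field[5]
    return base * 300 + extension


def pcr_accuracy_error_ns(pcr_a: int, pcr_b: int, bytes_between: int,
                          mux_rate_bps: int) -> float:
    """PCR delta minus the time implied by byte distance, in nanoseconds.

    Assumes a constant-bitrate stream and fewer than one PCR wrap between
    the two samples.
    """
    pcr_elapsed_s = ((pcr_b - pcr_a) % PCR_MODULUS) / PCR_HZ
    expected_s = bytes_between * 8 / mux_rate_bps
    return (pcr_elapsed_s - expected_s) * 1e9


# Hypothetical field encoding base = 90000 (one second), extension = 0.
one_second = decode_pcr(bytes([0x00, 0x00, 0xAF, 0xC8, 0x7E, 0x00]))
# Hypothetical 10 Mbit/s mux where the second PCR is one 27 MHz tick late.
error_ns = pcr_accuracy_error_ns(0, one_second + 1, 1_250_000, 10_000_000)
print(one_second, f"{error_ns:.1f} ns")
```

A single 27 MHz tick is about 37 ns, which gives a feel for how little slack a multiplexer has when it stamps PCRs.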
Writing a T-STD compliant TS multiplexer is not trivial. When I was at LSI Logic, I spent many hours perfecting the TS multiplexer used on the MPEG-2 encoder in the Motorola DCT-6412 P3 STB.
Here's a short clip that's T-STD compliant with about 29 ns of PCR jitter.
This takes me back to when I was working on real-time digital video streaming for drones and one of the engineers I was working with had a prized mpeg-ts poster breaking out the protocol on a packet field level. We would frequently be found around that poster reviewing the interpretation of one of the packets making sure we were handling things correctly. Good times!
You mean those Predator drones used in the early 2000s that had a standard DVB-S uplink for the video feed, which were unencrypted back then? I think the video feed was IP encapsulated, inside DVB-MPE, but am not 100% sure about it?
* Defined latency
WebRTC provides playoutDelay. https://webrtc.googlesource.com/src/+/main/docs/native-code/.... This allows the sender to add delay to give a better experience.
* Legacy transport
You can transport anything you want via RTP or DataChannels. Maybe I am missing something with this one?