This is wonderful. I read the whole series in one sitting. It actually made video codecs feel way more approachable, rather than some patented black box magic I'll never understand.
It also reminded me of a recent article about how you can break encrypted voice traffic by guessing which quantizer entry each packet used, then running the codec in reverse to recover the speech! Which I suppose is obvious in retrospect: lossy codecs are trying to compress data by making it perceptually similar, whatever the domain.
I also appreciated the ties to video game networking. Gaffer on Games has had a long-running series on designing multiplayer networking protocols with UDP and you two approach bit-shaving very similarly (unsurprisingly I suppose - it's a very specific process with its own tools).
It was a blast to write. Glenn is a smart guy with some great content around game networking. There are good ways to do networking for games and other real-time applications and TCP isn't really one of them.
This is a variant of "should you compress or encrypt first?"
Compression relies on pattern matching, and the compressed size will leak details about what you compressed, even if the result is encrypted. (Unless you then pad the encrypted payload, but then what was the point of compressing? There are more or less secure ways to do this, like establishing a compression-ratio/bandwidth/entropy budget and padding every payload up to that limit so each encrypted packet looks roughly the same, but latency sensitivity makes this difficult.)
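For what it's worth, here's a minimal sketch of that bucket-padding idea (toy code of mine, nothing from the article; zlib and a 256-byte bucket are just placeholder choices): compress first, then pad the framed result up to the next bucket boundary before encrypting, so the ciphertext length only reveals which bucket the payload landed in.

    import os
    import zlib

    BUCKET = 256  # assumed bucket size; pick it from your bandwidth/latency budget

    def compress_and_pad(payload: bytes) -> bytes:
        compressed = zlib.compress(payload)
        # 4-byte length header so the receiver knows where the padding starts.
        framed = len(compressed).to_bytes(4, "big") + compressed
        pad_len = (-len(framed)) % BUCKET
        return framed + os.urandom(pad_len)  # encrypt this whole thing afterwards

    def unpad_and_decompress(framed: bytes) -> bytes:
        n = int.from_bytes(framed[:4], "big")
        return zlib.decompress(framed[4:4 + n])

The obvious trade-off is that you give back some of the bandwidth the compression bought you, which is exactly the tension described above.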
In the case of VoIP, the codec uses a lookup table for distinct parts of speech (tch, sp, buh, etc.). Then "all it has to send" is the table cell numbers (not literally all it sends, but close). On the receiving side, you just look the entries up in your copy of the speech table and reconstruct the audio.
These table values have distinct output patterns, particularly when compressed. If you can guess what table value was used better than 70% of the time (I forget the exact number they achieved) and then reconstruct from those guesses, you can listen in on what's being said without having to break the underlying encryption.
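To make that concrete, here's a toy sketch (entirely made-up data, not from the actual papers): with a variable-bit-rate voice codec, each table entry tends to produce a characteristic packet size, and encryption doesn't hide packet sizes, so you can train a size-to-entry profile offline and replay it against captured traffic.

    from collections import Counter, defaultdict

    def build_size_profile(training):
        """training: (packet_length, table_entry) pairs captured against a known codec."""
        by_size = defaultdict(Counter)
        for length, entry in training:
            by_size[length][entry] += 1
        # Most likely table entry for each observed packet length.
        return {length: counts.most_common(1)[0][0]
                for length, counts in by_size.items()}

    def guess_entries(profile, observed_lengths):
        # Map each encrypted packet's length to a best-guess table entry.
        return [profile.get(length, None) for length in observed_lengths]

    # Hypothetical usage:
    profile = build_size_profile([(52, "sh"), (52, "sh"), (61, "ah"), (47, "t")])
    print(guess_entries(profile, [52, 61, 47]))  # ['sh', 'ah', 't']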
Voice codecs are also awful at encoding music, which may explain why hold music sometimes just gets dropped and replaced with white noise once it hits some bandwidth cap. Cf. video encoding and falling snow.
Hey, cool. Always nice to see this work show up on HN. But I don't think this is the paper you're looking for. In '08, we could only spot phrases that we knew in advance, and they had to be at least a certain length.
The most impressive results -- going from encrypted VoIP to text -- were done by Andy White and others, a couple years after the paper you linked above. It's this one:
A.M. White, A.R. Matthews, K.Z. Snow, and F. Monrose. "Phonotactic Reconstruction of Encrypted VoIP Conversations: Hookt on fon-iks." In Proceedings of IEEE S&P, 2011.
http://www.cs.unc.edu/~fabian/papers/foniks-oak11.pdf
My absolute favorite kind of writing is the kind that leaves me itching to try coding something myself since it now seems so much more approachable than it did before. Though I knew a bit about encoding, I never would have thought to build something like this, and yet now I find myself wishing I could carve out some time to try.
If I may, I'd like to inject a plea for sanity here. Please, please, pretty please with sugar on top, don't reinvent the wheel when it comes to chat protocols. Right now, on my desktop, I have 5 different windows open dedicated to various chat networks and chat protocols. I have steam chat, a window with my irssi session, a pidgin session with connections to a slack session and an aim session, a telegram session, and a skype session. Yes, implementing someone else's protocols is complicated and sometimes painful, but when you end up needing a half dozen different /types/ of connections, there is something wrong and broken with how we're approaching the whole talk to other people thing.
There are business reasons why we have so many different protocols and apps. Everyone wants a piece of the pie and letting "your" users talk on someone else's network is seen as a risk. Your competitor might not want you on their network, either.
In the case of Apple, I'm pretty sure they aggressively ban anything that doesn't look like a real iDevice. I've heard the same of WhatsApp but I've successfully tested a couple of third party protocol implementations.
XMPP isn't great for mobile (bandwidth/battery usage) but I don't know enough to comment on that.
Have a browse back through some HN threads on chat and IM - these kinds of points come up every time. We've lost the war. One of my friends just asked me to install LINE and I'd be up to 6 different apps if I didn't just give up and go back to SMS.
The claim that XMPP on mobile necessarily means heavy bandwidth and battery usage is false, according to Daniel Gultsch, developer of the Android XMPP client Conversations: https://gultsch.de/xmpp_2016.html
Money quote: “XMPP is not suited for mobile devices. That’s a myth that has been around for ages. It is mostly spread by people who want to sell you their own proprietary instant messaging solution.”
From my own experience, high battery usage is the fault of the XMPP client. Regarding bandwidth, I have been chatting over throttled 3G and regular 2G connections. The initial connection takes noticeably longer, but otherwise everything except file transfers seems fine.
You do not need to convince them to use a single client. The one must-have feature for mobile XMPP that many clients have built-in is message receipts (XEP-0184).
I have successfully used this for voice calls between the chat clients Gajim/Pidgin (I do not remember which) and Google Talk (the now-discontinued XMPP client by Google). I have also done video calls between two Nokia N900 phones, to see if it works. Voice and video via XMPP has worked since approximately 2009 and is just neglected by the companies for business reasons.
XMPP has the draft Jingle extension, and there are Jingle extensions for things like file transfer. Similarly, SIP/SIMPLE could be used, or WebRTC. Any of the three could be used instead of the mess we have now.
I am not aware of a good, obvious default answer. And even when there is a reasonable one (i.e. XMPP for text chat), it doesn't appear that the open flavor has won in the marketplace.
Ever tried bitlbee? Not sure how well it's keeping up with the new protocols but years ago it was a way to consolidate at least a few of the older ones into a single application. Also not sure why this idea has not been replicated. Maybe it has.
The idea has been replicated in XMPP as so-called transports, which map other protocols onto XMPP. This arguably works better than bitlbee and other IRC gateways, since XMPP is a superset of many protocols, while IRC is a subset of many protocols.
https://en.wikipedia.org/wiki/XMPP#Connecting_to_other_proto...
I can join IRC channels from my XMPP client by joining a multi-user chat like maemo%irc.freenode.net@irc.netlab.cz. It may seem useless at first, but I have found that when I'm on a train and using IRC directly, I time out much more often than when using XMPP in the same situation.
In the "last generation" of chat apps, there were certainly a lot of multiprotocol messengers. Trillium and Pidgin come to mind, and were way more popular with bitlbee.
All four parts are available; just click the link in the last paragraph of each to go to the next.
I love projects/blogs like this, since it's "back to basics" and we all learn something by better understanding how things like codecs and compression work. This one is wonderful and one of the best reads this week.
While TCP is not ideal for this application by nature of it trying to be a fully reliable stream protocol, one often overlooked advantage of its congestion control is that it allows the stream to play nicely with others. For example, if you develop a datagram-based transport whose data rate results in congestion at some point in the network path, any TCP going through the same point would back off to nearly nothing in an attempt to save the link.
You can be greedy and take the bandwidth anyway at the expense of everything else, but in some conditions that may cause a worse outcome even for your own traffic. It's likely better to lower your data rate target and drop packets rarely than to send too much and have them dropped randomly at a higher rate.
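As a rough illustration of "change your data rate target" (my own thresholds and numbers, not anything from the article), a datagram sender can run a crude additive-increase/multiplicative-decrease loop on loss feedback so it backs off alongside TCP instead of steamrolling it:

    class RateController:
        """Crude AIMD on the target bitrate for a UDP media sender."""

        def __init__(self, target_bps=1_000_000, floor_bps=100_000, ceiling_bps=5_000_000):
            self.target_bps = target_bps
            self.floor_bps = floor_bps
            self.ceiling_bps = ceiling_bps

        def on_feedback(self, loss_fraction):
            # Call once per feedback interval with the observed packet-loss rate.
            if loss_fraction > 0.02:  # treat >2% loss as a congestion signal
                self.target_bps = max(self.floor_bps, int(self.target_bps * 0.85))
            else:
                self.target_bps = min(self.ceiling_bps, self.target_bps + 50_000)
            return self.target_bps

The encoder then gets asked to hit the new target for the next interval, which is the "drop rarely by design" option rather than "drop randomly in the network".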
The series was amazing. Could not stop clicking through. I was sorely let down at the end when there was no more to read. Also when it didn't turn into an open-source high-performing P2P video conferencing app :)
This intro (and series as a whole) was AWESOME! My background didn't really touch on compression at all and these parts were what I really loved learning from the post. More please! (Any other resources for compression are welcome :D)
Regarding the overhead of H.264 and VP9: using Intel's QuickSync or AMD's VCE would make sense in a production version, in order to have a fast implementation. That's VAAPI (and maybe VDPAU) on Linux and BSD. The encoders' output will look good enough for video streaming.
Thanks for writing this, I learned a lot! Last weekend I started a peer-to-peer group video calling project. Seeing your whole approach made the entire system easier to understand.
imgui is great, but I have a bit of an allergy to C++. There's Nuklear (https://github.com/vurtun/nuklear), which is a re-take on it, but in ANSI C. It's interesting that GUI rendering takes so much of your processing time slot, or is it that everything else takes so little?
The GUI is slow primarily because it naively uploads six video frames sixty times per second. Not a detail that mattered for this initial work; it should take almost no time with a smarter implementation.
Ah, so you're counting the video upload to texture and rendering it as part of the GUI cycle? The number makes sense then. You're basically redrawing the whole GUI each frame along with the video textures, and uploading/updating those textures at 60 fps.
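Not claiming this is how your demo does it — just a minimal sketch of the usual fix, in Python/PyOpenGL for brevity (assumes a GL context already exists, and the RGBA frame size is made up): allocate the texture storage once, then stream each new frame into it with glTexSubImage2D instead of recreating the texture, and reach for pixel-buffer objects if the copy itself ever shows up in the profile.

    from OpenGL.GL import (
        glBindTexture, glGenTextures, glTexImage2D, glTexSubImage2D, glTexParameteri,
        GL_TEXTURE_2D, GL_RGBA, GL_UNSIGNED_BYTE,
        GL_TEXTURE_MIN_FILTER, GL_TEXTURE_MAG_FILTER, GL_LINEAR)

    WIDTH, HEIGHT = 640, 480  # assumed per-participant frame size

    def create_video_texture():
        tex = glGenTextures(1)
        glBindTexture(GL_TEXTURE_2D, tex)
        glTexParameteri(GL_TEXTURE_2D, GL_TEXTURE_MIN_FILTER, GL_LINEAR)
        glTexParameteri(GL_TEXTURE_2D, GL_TEXTURE_MAG_FILTER, GL_LINEAR)
        # Allocate storage once, with no pixel data yet.
        glTexImage2D(GL_TEXTURE_2D, 0, GL_RGBA, WIDTH, HEIGHT, 0,
                     GL_RGBA, GL_UNSIGNED_BYTE, None)
        return tex

    def upload_frame(tex, rgba_bytes):
        # Per-frame path: overwrite the existing storage instead of reallocating.
        glBindTexture(GL_TEXTURE_2D, tex)
        glTexSubImage2D(GL_TEXTURE_2D, 0, 0, 0, WIDTH, HEIGHT,
                        GL_RGBA, GL_UNSIGNED_BYTE, rgba_bytes)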
When he wrote the video chat app, I was wondering if 'middle-out' compression would even work for a video stream... you'd have to buffer some amount of the data in order to get to the 'middle'.
"Top-down" and "bottom-up" in compression usually refer to which direction you build the prefix tree. In Shannon-Fano coding (top-down), you start with the set of all symbols and their frequency distributions and then recursively divide them into roughly even-sized subsets, assigning a 0 as the prefix to the first and a 1 as the prefix to the second. In Huffman coding (bottom-up), you start with a priority queue of each individual symbol/frequency pair, and then merge the two lowest frequency nodes together, building the tree from the bottom up. In middle-out coding, presumably, the algorithm decides at runtime whether to merge two existing codes into a single prefix tree, or to split an existing prefix tree in some other way. There's some speculation [1] that this is done in a probabilistic way.
All of these algorithms require known frequency distributions, which requires that you have the full data available. In typical DEFLATE compressors (gzip, pkzip, zlib, etc.), this is handled by dividing the input stream into blocks and compressing each block individually.
A similar approach could be used for video - you could easily make the block size a single packet - but in practice, you'd rarely want to use a lossless compression algorithm for video chat anyway. Most video compression is lossy; you can drop a lot of detail before the human eye notices. That's how you can stream a full widescreen movie (which has an uncompressed size of 2 megapixel * 4 bytes/pixel * 30 frames/second = 240 MB/sec) over a typical 5 Mb/sec broadband connection.
Anyway, thank you! I learned a lot.