- Everything is migrated from RTMP to HLS; RTMP couldn't scale across several consumer platforms
- This added massive delay (~30+s) early on, but the video team has been crushing it, getting latency back down to the now sub-2s
- Flash is dead (now HTML5)
- F5 storms ("flash crowds") are still a concern teams design around; e.g. 900k people hitting F5 after a stream blips offline due to the venue's connection
- afaik Usher is still alive and well, in much better health today
- Most teams are on AWS now; video was the holdout for a while because they needed specialized GPUs.
EDIT: "This isn't quite right; it has more to do with the tight coupling of the video system with the network (eg, all the peering stuff described in the article)" -spenczar5
- Realtime transcoding is a really interesting architecture nowadays (I am not qualified to explain it)
- No more Ruby on Rails because no good way was found to scale it organizationally; almost everything is now Go microservices back + React front
- No more Twice
- Data layer was split up to per-team; some use PostgreSQL, some DynamoDB, etc.
- Of course many more than 2 software teams now :P
- Chat went through a major scaling overhaul during/after Twitch Plays Pokemon. John Rizzo has a great talk about it here: https://www.twitch.tv/videos/92636123?t=03h13m46s
Twitch was a great place to spend 5 years at. Would do again.
> video was the holdout for a while because they needed specialized GPUs
This isn't quite right; it has more to do with the tight coupling of the video system with the network (eg, all the peering stuff described in the article).
This could be achieved if twitch allowed broadcasters to upload a 2 second loop as a 'placeholder' for connection drops, and twitch could mix that in if it detects too many frame-drops.
It would need to be a setting though because some streamers just stream over a bad connection (e.g. Hitchhiking streams) and wouldn't want that interruption.
Ugh, I just... I keep trying to pretend I don't need to learn Go, but every highly scalable system that gets written about recently seems to be using it. Maybe I just need to stay away from systems that need to scale? Heh...
Technically speaking you can build scalable systems using anything you want. But if you need to hire a couple hundred developers, you're better off going with Java 7 or Go than Ruby, Lisp, or Perl. The dumber and more uniform the better.
I very much doubt that anyone is using Crystal at the scale Ruby/Rails is used yet. It's still far away from a 1.0 release and when I tried out 0.23 it had several issues, major ones for me were lacklustre debugging support and the type checker needing to load the whole AST into memory. Concurrency story was also not very strong but I know people were working on it actively at least.
Anyway, I wrote a non-trivial project in Crystal and overall really enjoyed it. Improving the debugging tooling would go a long way towards making it a real option.
You're just seeing golang in articles because of hype.
What you're probably referring to is GC pauses. The golang GC is tuned for latency, at the expense of throughput. The JVM has several GCs, and is gaining several more like Shenandoah and ZGC, which allows you to select the GC that best fits your use case. You can tune for latency or throughput.
A lot of Java deployments these days are in the form of uber/shaded jars, which is basically one jar file that contains the entire app, and run with a single command, not much different than running a binary.
Two areas where Go is actually slower than Java are:
1) Memory allocation
2) Regular expression performance
But you must understand that your Java app won't see a real-world performance advantage from faster allocation speed: the GC will take lots of CPU, because Go's collector has a much easier time releasing memory. You only see the allocation advantage in microbenchmarks where the app exits before the GC kicks in.
The golang gc is tuned for latency at the expense of throughput, meaning if you look at the duration of time spent in GC over the course of the code execution, it would actually be longer compared to a GC tuned for throughput.
If you have a use case that requires high throughput, then you cannot change the GC behavior. Unlike on the JVM, where you have several GCs to choose from. The JVM is also getting two new low latency GCs for use cases that require low latency.
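For example, on the JVM the collector is just a startup flag, so the same app can be tuned per deployment (exact flags and defaults vary by JDK version; ZGC originally shipped behind an experimental flag):

```
# Pick the JVM GC per workload:
java -XX:+UseParallelGC -jar app.jar   # tuned for throughput
java -XX:+UseG1GC       -jar app.jar   # balanced (default in modern JDKs)
java -XX:+UseZGC        -jar app.jar   # tuned for very low pause times
```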
And it's not just microbenchmarks where Java does better than golang, it's especially longer running processes where the JVM's runtime optimizations really kick in. Not to mention that the JVM is getting value types as well to ease the load on the GC when required (it does an excellent job as it is even without value types).
I did a dummy port of the C# version of the Havlak code here to Java, preserving the same behavior and not making any data structure changes. On the machine I tested on, the C# version took over 70 seconds to run, while the Java version took ~17 seconds. In comparison, the C++ version took ~24 seconds, and the golang version took ~30 seconds.
Yes, you could most likely spend much more time tuning the C++ version, avoiding allocations, and so on, but at the expense of readability. This is what the JVM gives you, you write straight-forward, readable code, and it does a lot of optimizations for you.
The brainfuck2 benchmark is especially interesting. Kotlin tops the list, but I was able to get Java to the same performance as Kotlin by writing it in a similar manner to the Kotlin code. Again, Java/Kotlin beat out even C++ when I tested them, and by quite a margin.
Request / response servers which keep caches and other allocations prone to middle age death out of the GC heap are consistent with the generational hypothesis and ought to spend no more than a few (low single digit) percent in GC with a generational collector.
C#, Rust, Go, C++, Java, C, Nim... all tied at 7M.
again, this doesn't mean anything useful.
Crystal is still immature, Rust is more suited to use cases where you want to avoid garbage collection.
Hype is not the only factor but it makes hiring easier. And anything Google puts its weight behind will get hyped. More often than not it's better to choose a technology which suits your organisational (read: hiring) needs.
Still, you definitely do not need Go to scale systems. People scale everything; perhaps most impressively, PHP applications.
I'm curious: Do services like Twitch specify a specific desired codec/bitrate that doesn't get transcoded? Transcoding seems like a lot of effort for lower quality end result.
If I were streaming, I would want to avoid transcoding as much as possible. Since we're talking about live broadcasting, there is a unique ability for the streamer to choose the format they upload.
Excuse the simple question: When I hear "microservices", I think serverless backend. Is that right, or are they different? If they're the same, how do you stream video with serverless? (Seems like streaming, websockets, etc... shouldn't be possible in a serverless environment...)
"Serverless" describes how a deployment artifact is deployed and runs. Generally it refers to a class of technologies in multiple domains whereby intricate knowledge of the underlying host is abstracted behind a cleaner API, with things like scaling, security, patching, etc handled by an infrastructure provider. While the term rose in prominence alongside "functions as a service", which is certainly a technology that generally qualifies as serverless, there are many serverless products out there: AWS Fargate for running containers, DynamoDB for a database, S3 for object storage, all of these are "serverless". A good signal is: if I can SSH into it, its not serverless.
A microservice can certainly be deployed serverless (ECS/Fargate or Google Cloud Run comes to mind). A microservice can even refer to one or more logically related functions-as-a-service; the term more-so speaks to how the engineering teams organize their business domain into the code and how the APIs speak to each other, rather than the exact underlying technologies.
Like, instead of having the video decoding and the analytics code in the same monolith attached to same DB, you deploy a different server for each one, generally with a new DB for each. When the services need to talk to each other, they do it via network (REST, gRPC, etc.).
e.g. the Friends feature on Twitch is one microservice, running in its own autoscaling group, with internal APIs used by other microservices like Whispers.
Not necessarily a good idea, but one 'feature' of microservices is the ability to pick different stacks, languages and delivery methods on an individual service level.
We literally have services that are run entirely using AWS Lambda functions only.
This is a pretty big difference from teams I've worked on in the past, that have 8 engineers all working on a singular service.
"Microservices" is more of a philosophy than anything.
There may have been an element machine at one point that was used for testing/playing but I really don't think so, and know there wasn't one between 2010 and 2017.
By the time we decided transcoding was necessary, we had enough in-house video engineering knowledge to build our own system integrated with everything else.
Most people don't use the refresh button in the browser, so only a small amount of traffic will be uncontrolled.
That said, you certainly don't need your video streaming servers to handle those hundred-thousand refresh requests.
It also doesn't catch people watching on mobile web who refresh. And I don't believe their mobile app has a refresh button, whenever a stream glitches out for me I force kill and reload the mobile app.
> The point of having multiple datacenters is not for redundancy, it's to be as close as possible to all the major peering exchanges. They picked the best locations in the country so they would have access to the largest number of peers.
This is the kind of thing where I would have to hire some kind of network engineering expert, and he just figured this stuff out and made it work? I can't fathom other people's intelligence sometimes.
I've worked on projects with Kyle, and he often goes into bulldozer mode. It is no surprise to me that Kyle could "learn" all he needed in order to get something like this set up (or at least learn enough to orchestrate a small group in constructing it). Kyle is, by all means, a "force of nature" as YC tends to define it.
The downside to Kyle's optimism is that he often has very little concern for the humanity of others. He can set up decent optics around his actions and decisions in the wake of what many might consider failures, but he has consistently abused those who try to give him good-faith constructive feedback, often bringing co-workers to tears. This is all well-documented going back at least 4-5 years. (Kyle does actually explicitly ask for "direct" feedback, btw; he's just only capable of handling that feedback on a periodic, weekly or monthly basis.)
A key lesson of this article (and of glacials' post above) is what can be achieved very quickly when technical debt is of minor concern. Kyle's key strength is in building a proof of concept that supports rapid iteration. This appears to be something the Justin.tv / Twitch teams did very well.
A second lesson is in getting alignment among diverse engineers. Think about how the team might have debated the architecture presented. Think about how some of the choices might rub people the wrong way.
Finally, Kyle is a unique character in several ways but is not alone in possessing a transient "bulldozer" mentality. If you see yourself having the same pattern of behavior, get help before others get hurt. There are a variety of mitigations that can help, but they need explicit participation.
By this point in history, it wasn’t just him anymore and we’d done a few rounds of improvements already out of necessity. As I recall, he got us up and running at PAIX based mostly on research, but most of the other data centers were built out by a network engineer(1) we hired away from YouTube.
While he was working on the network engineering and keeping the original system afloat, I did a lot of the software work for the system described here.
(1) Name withheld out of courtesy
Bear in mind that I don't think this is some deliberate attempt to appear superhuman, I think it's just accidental
When your users start to complain, you tend to develop the domain knowledge necessary to solve their problems pretty quickly.
Especially back in 2010. I feel like I'd have a much better shot of being able to figure out that scale these days than a decade ago. (If I spent my free time studying and not watching Age of Empires 2 on Justin.tv/Twitch).
At the time PAIX had a reverse-billing setup: the more data you transferred, the cheaper your connection charge was; we managed to get all the way into the cheapest billing tier within the first billing cycle which was basically unheard-of at the time.
Funnily enough, that's pretty much how HLS, the modern live video standard, works - it's essentially a series of tiny video clips loaded & played one right after the other, distributed through the same CDN as normal video files.
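For illustration, a live HLS playlist looks roughly like this (segment names, durations, and sequence numbers are made up). The player re-fetches the playlist every few seconds; the media sequence number advances as old segments fall off the front, and there's no #EXT-X-ENDLIST tag because the stream is ongoing:

```
#EXTM3U
#EXT-X-VERSION:3
#EXT-X-TARGETDURATION:2
#EXT-X-MEDIA-SEQUENCE:8473
#EXTINF:2.000,
segment8473.ts
#EXTINF:2.000,
segment8474.ts
#EXTINF:2.000,
segment8475.ts
```

Each .ts entry is just a plain HTTP-fetchable file, which is why the whole thing can ride on an ordinary CDN.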
Thanks to HLS, live video latency is actually much worse than it was 10 years ago with RTMP. There have been some recent efforts to get it down, although they're generally not standardised, hard to scale (e.g. WebRTC), and/or a bit awkward.
I don't think anyone really follows Apple's spec for various technical reasons, though. Most do some sort of chunked-transfer encoding, along with pre-signaling segments in playlists, as outlined by the Periscope folks here: https://medium.com/@periscopecode/introducing-lhls-media-str...
I would have assumed it's because Apple only announced this 4 weeks ago, and the only clients that support it are beta software.
RTMP, on the other hand, maintains a live socket between the server and client, and the server can forward each packet as it becomes available.
Just curious if this is still a thing. I've watched an unhealthy amount of Twitch (not all with chat open) and never noticed this.
In fact, Evasyst is a blend of Twitch and Discord and uses Janus for this reason. https://evasyst.com/
A brand-new competitor, though, won’t have to deal with those scaling problems for a while. They have to figure out product-market fit first, which probably can be done off-the-shelf these days. And then they’ll need to figure out how to keep the lights on, which will take a similar amount of engineering work as it did for Justin/Twitch.
For 10k concurrent stream receivers in the US and Europe, what would you be looking at to get the stream under 30 seconds lag time?
I realize this is a hard problem but would like to explore it.
The web is a terrible architecture for video. Torrents solved this problem. If you're trying to make a camel walk on two legs, you don't deserve praise but a tomato.
If you're interested in Atrium, reach out to me (see profile). We're hiring!
Who's with me?