
I worked at Twitch from 2014 to 2018. I was never on the video team, but here are some details that have changed.

Video:

- Everything is migrated from RTMP to HLS; RTMP couldn't scale across several consumer platforms

- This added massive delay (~30+s) early on; the video team has been crushing it getting this back down to the current sub-2s

- Flash is dead (now HTML5)

- F5 storms ("flash crowds") are still a concern teams design around; e.g. 900k people hitting F5 after a stream blips offline due to the venue's connection

- afaik Usher is still alive and well, in much better health today

- Most teams are on AWS now; video was the holdout for a while because they needed specialized GPUs. EDIT: "This isn't quite right; it has more to do with the tight coupling of the video system with the network (eg, all the peering stuff described in the article)" -spenczar5

- Realtime transcoding is a really interesting architecture nowadays (I am not qualified to explain it)

Web:

- No more Ruby on Rails because no good way was found to scale it organizationally; almost everything is now Go microservices back + React front

- No more Twice

- Data layer was split up to per-team; some use PostgreSQL, some DynamoDB, etc.

- Of course many more than 2 software teams now :P

- Chat went through a major scaling overhaul during/after Twitch Plays Pokemon. John Rizzo has a great talk about it here: https://www.twitch.tv/videos/92636123?t=03h13m46s

Twitch was a great place to spend 5 years at. Would do again.



Hi glacials :) Small correction from someone at Twitch today:

> video was the holdout for a while because they needed specialized GPUs

This isn't quite right; it has more to do with the tight coupling of the video system with the network (eg, all the peering stuff described in the article).


Oh, edited! Thanks Spencer :)


F5 storm is an awesome name for the venue blip -> refresh reaction. I've certainly contributed my fair share to your storms. It's basically automatic.


Yeah, and the problem is that it often does work to fix issues. It's the web equivalent of "have you tried turning it off and turning it back on again?"...


Maybe something as simple as a textual overlay "Don't worry - the stream will be back soon" and a script to internally hammer the only element of the page that's actually needed for a video refresh


text overlays would still be clunky, what's really needed is dynamic video spliced directly into the stream so the viewer understands it's the broadcasters' connection that is poor not the viewer's.

This could be achieved if twitch allowed broadcasters to upload a 2 second loop as a 'placeholder' for connection drops, and twitch could mix that in if it detects too many frame-drops.

It would need to be a setting though because some streamers just stream over a bad connection (e.g. Hitchhiking streams) and wouldn't want that interruption.
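To make the idea concrete, here is a minimal sketch of the switching decision in Go. Everything in it is hypothetical (the `Settings` fields, the drop-ratio threshold, the per-window frame counts); nothing here reflects how Twitch actually detects a bad broadcaster connection.

```go
package main

import "fmt"

// Source is what viewers would be shown for one measurement window.
type Source int

const (
	Live Source = iota
	Placeholder
)

// Settings are hypothetical per-channel options.
type Settings struct {
	PlaceholderEnabled bool    // opt-in, so streams on unstable connections can leave it off
	MaxDropRatio       float64 // e.g. 0.5 = switch once more than half the frames are missing
}

// pickSource chooses between the live feed and the broadcaster's placeholder
// loop, given how many frames were expected and received in the window.
func pickSource(s Settings, expected, received int) Source {
	if !s.PlaceholderEnabled || expected <= 0 {
		return Live
	}
	// Clamp so odd counts cannot produce a ratio outside [0, 1].
	if received < 0 {
		received = 0
	}
	if received > expected {
		received = expected
	}
	dropRatio := 1 - float64(received)/float64(expected)
	if dropRatio > s.MaxDropRatio {
		return Placeholder
	}
	return Live
}

func main() {
	s := Settings{PlaceholderEnabled: true, MaxDropRatio: 0.5}
	fmt.Println("healthy window stays live:", pickSource(s, 60, 58) == Live)
	fmt.Println("bad window uses placeholder:", pickSource(s, 60, 10) == Placeholder)
}
```

With `PlaceholderEnabled` set to false the function always returns `Live`, which covers the hitchhiking-stream case.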


I guess this is actually being tested right now. I saw something like this last week, though I've also seen streams just drop.


Or, the live video equivalent of "retry" :)


I initially thought 'F5 storm' refers to people typing "F" in the chat.


On my first read-through I thought 'F5 Storm' referred to F5 load balancers, not hitting the F5 key to refresh the page.


> - No more Ruby on Rails because no good way was found to scale it organizationally; almost everything is now Go microservices back + React front

Ugh, I just... I keep trying to pretend I don't need to learn Go, but every highly scalable system I've read about recently seems to be using it. Maybe I just need to stay away from systems that need to scale? Heh...


The keyword here is organizationally.

Technically speaking you can build scalable systems using anything you want. But if you need to hire a couple hundred developers, you're better off with Java 7 or Go than Ruby, Lisp, or Perl. The dumber and more uniform the better.


The key distinguisher there is static typing, which is how Facebook manages to make do with PHP, and Dropbox with Python: by adding type annotations for a static type-checker...


Static typing doesn't help you much, when you have 20 teams hating each other and 20 different ways to write statically typed C++ code.


It would be interesting to hear a comparison from someone using https://crystal-lang.org/ at scale. It's basically Ruby + types, which would make it the closest for isolating that one feature. It can't run Rails, but there are very similar web frameworks available.


It seems Stripe solved this problem with https://sorbet.org/ which is actually Ruby + types.


It misses other benefits of static typing -> being able to compile to efficient code. Also Sorbet is still not popular enough to apply to many gems - it's a lot of work to implement it at the moment.


It's not Ruby + types, it just looks like it at first glance. The semantics are all different (for good reasons).

I very much doubt that anyone is using Crystal at the scale Ruby/Rails is used yet. It's still far away from a 1.0 release and when I tried out 0.23 it had several issues, major ones for me were lacklustre debugging support and the type checker needing to load the whole AST into memory. Concurrency story was also not very strong but I know people were working on it actively at least.

Anyway, I wrote a non-trivial project in Crystal and overall really enjoyed it. Improving the debugging tooling would go a long way towards making it a real option.


I've heard Crystal's Ruby resemblance fades once you get past the beginner stage.


Personally, I think it’s hugely worth learning. Aside from some eschewed de facto behaviors, Go is very easy to pick up and learn the entirety of in a week or two, because the language itself is really not that large. So I’d argue the time investment is a good one for what you get.

Still, you definitely do not need Go to scale systems. People scale everything, perhaps most impressively PHP applications.


Go isn't the only language that scales, it just happens to be popular amongst the scripting language crowd as a next step. You're by no means limited in your choice. You could do Java, C#, Rust...


The good news is that it takes like an hour to learn enough Go to be productive.


Before golang was a thing, there were highly scalable systems that handled way more traffic than anything written in golang today. Those systems were (and are) written in languages like C++ and Java and C#.

You're just seeing golang in articles because of hype.


Java and C# lag behind Golang on most performance metrics. Combine it with the awesome deployment story (single binary) and you'd be hard pressed to choose the former


I'd like to see those performance metrics. Other than that, this is not true in the slightest, not just from what I observed, but according to established performance people like Martin Thompson[1]. If you watch that talk, he mentions towards the end that they ported Aeron (originally Java) to C#, golang, and C++. The Java version was the fastest out of the box, but with some work, they were able to get the C# version to be faster. I suspect this mainly has to do with value types, which are being developed for the JVM as well.

What you're probably referring to is GC pauses. The golang GC is tuned for latency, at the expense of throughput. The JVM has several GCs, and is gaining several more like Shenandoah and ZGC, which allows you to select the GC that best fits your use case. You can tune for latency or throughput.

A lot of Java deployments these days are in the form of uber/shaded jars, which is basically one jar file that contains the entire app, and run with a single command, not much different than running a binary.

[1] https://www.youtube.com/watch?v=Pz-4co8IaI8


The opposite is true. Golang is on average 2-3 times slower than Java. On the plus side, it uses less memory.


No, of course not. It may lose at some benchmarks written by Java or Python/Ruby coders.

Two areas where it is actually slower are

1) Memory allocation 2) Regular expression performance.

But you must understand your Java app won’t have a performance advantage from faster allocation speed IRL: its GC will take lots of CPU, whereas Go’s has a much easier time releasing memory. You can only see allocation advantage in micro benchmarks where the app stops before the GC will start.


> You can only see allocation advantage in micro benchmarks where the app stops before the GC will start.

The golang gc is tuned for latency at the expense of throughput, meaning if you look at the duration of time spent in GC over the course of the code execution, it would actually be longer compared to a GC tuned for throughput.

If you have a use case that requires high throughput, then you cannot change the GC behavior. Unlike on the JVM, where you have several GCs to choose from. The JVM is also getting two new low latency GCs for use cases that require low latency.

And it's not just microbenchmarks where Java does better than golang, it's especially longer running processes where the JVM's runtime optimizations really kick in. Not to mention that the JVM is getting value types as well to ease the load on the GC when required (it does an excellent job as it is even without value types).

I did a dummy port of the C# version of the Havlak code here[1] to Java, preserving the same behavior and not making any data structure changes. On the machine I tested on, the C# version took over 70 seconds to run, while the Java version took ~17 seconds. In comparison, the C++ version took ~24 seconds, and the golang version took ~30 seconds.

Yes, you could most likely spend much more time tuning the C++ version, avoiding allocations, and so on, but at the expense of readability. This is what the JVM gives you, you write straight-forward, readable code, and it does a lot of optimizations for you.

The brainfuck2 benchmark is especially interesting. Kotlin tops the list, but I was able to get Java to the same performance as Kotlin by writing it in a similar manner as the Kotlin code. Again, Java/Kotlin beat out even C++ when I tested them, and by quite a margin.

[1] https://github.com/kostya/benchmarks


How much CPU GC takes for any given GC implementation is largely down to the design of the application, its data structures and allocation graph.

Request / response servers which keep caches and other allocations prone to middle age death out of the GC heap are consistent with the generational hypothesis and ought to spend no more than a few (low single digit) percent in GC with a generational collector.



these microbenchmarks don't say anything about real world use cases... but anyway, here are the latest results:

https://www.techempower.com/benchmarks/#section=data-r17&hw=...

c#, rust, go, c++, java, c, nim.. all tied at 7M.

again, this doesn't mean anything useful.


This benchmark isn't really useful as you pointed out. Microbenchmarks are always tricky, but check out the other two posts I just wrote here (about Martin Thompson and the benchmarks on GitHub) for hopefully more realistic benchmarks.


It's not about hype but about companies using more recent languages to solve the same problem. Why didn't Twitch pick Crystal, Rust, Scala, or JRuby?


Because of the use case. Go wins if all you need is the easiest way to write services with high concurrency requirements. I expect this is true for Twitch's systems.

Crystal is still immature, Rust is more suited to use cases where you want to avoid garbage collection.

Hype is not the only factor but it makes hiring easier. And anything Google puts its weight behind will get hyped. More often than not it's better to choose a technology which suits your organisational (read: hiring) needs.


I'd argue that Java or C# would have worked out just fine for Twitch. There was a recent post on Twitch's early architecture, and it seems they started out with Ruby. Unsurprisingly, they had to switch from it once they needed performance (similar story happened with Twitter).


Well deserved hype


> Realtime transcoding

I'm curious: Do services like Twitch specify a specific desired codec/bitrate that doesn't get transcoded? Transcoding seems like a lot of effort for lower quality end result.

If I were streaming, I would want to avoid transcoding as much as possible. Since we're talking about live broadcasting, there is a unique ability for the streamer to choose the format they upload.


In the RTMP days, the highest quality setting in the viewer was always a straight pass through from the broadcaster, and the reduced versions were transcoded in the data center to fit down lower-bandwidth last-mile pipes.


Same in the HLS days. What used to be called “source” was just a remux from rtmp to ts.
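As a rough illustration of that layout (one untouched "source" rendition plus transcoded lower-bitrate copies), here is a small Go sketch. The rendition names and bitrates are made up for the example, and `pickRendition` is a hypothetical helper, not anything from Twitch's player or backend.

```go
package main

import "fmt"

// Rendition is one quality option offered to viewers.
type Rendition struct {
	Name       string
	Kbps       int
	Transcoded bool // false = the broadcaster's bits passed through (remuxed only)
}

// pickRendition returns the highest-bitrate rendition that fits the viewer's
// bandwidth. The ladder must be sorted from highest to lowest bitrate.
// If nothing fits, it falls back to the lowest rendition.
// The bool is false only when the ladder is empty.
func pickRendition(ladder []Rendition, viewerKbps int) (Rendition, bool) {
	if len(ladder) == 0 {
		return Rendition{}, false
	}
	for _, r := range ladder {
		if r.Kbps <= viewerKbps {
			return r, true
		}
	}
	return ladder[len(ladder)-1], true
}

func main() {
	// Illustrative ladder; the numbers are assumptions, not real Twitch settings.
	ladder := []Rendition{
		{Name: "source", Kbps: 6000, Transcoded: false},
		{Name: "720p", Kbps: 3000, Transcoded: true},
		{Name: "480p", Kbps: 1500, Transcoded: true},
		{Name: "160p", Kbps: 300, Transcoded: true},
	}
	for _, bw := range []int{8000, 2000, 100} {
		r, _ := pickRendition(ladder, bw)
		fmt.Printf("%d kbps viewer -> %s (transcoded: %v)\n", bw, r.Name, r.Transcoded)
	}
}
```

Only viewers with enough bandwidth for the top entry get the passthrough; everyone else lands on a transcoded copy.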


Gotcha. For whatever reason, I had forgotten about lower-bandwidth copies.


> everything is now Go microservices back

Excuse the simple question: When I hear "microservices", I think serverless backend. Is that right, or are they different? If they're the same, how do you stream video with serverless? (Seems like streaming, websockets, etc... shouldn't be possible in a serverless environment...)


"Microservice" describes the size and scope of each deployment artifact. It answers the question "is the whole system just one big ball, is it broken up, how broken up is it?" It doesn't describe how it is deployed.

"Serverless" describes how a deployment artifact is deployed and runs. Generally it refers to a class of technologies in multiple domains whereby intricate knowledge of the underlying host is abstracted behind a cleaner API, with things like scaling, security, patching, etc. handled by an infrastructure provider. While the term rose in prominence alongside "functions as a service", which is certainly a technology that generally qualifies as serverless, there are many serverless products out there: AWS Fargate for running containers, DynamoDB for a database, S3 for object storage; all of these are "serverless". A good signal is: if I can SSH into it, it's not serverless.

A microservice can certainly be deployed serverless (ECS/Fargate or Google Cloud Run comes to mind). A microservice can even refer to one or more logically related functions-as-a-service; the term more-so speaks to how the engineering teams organize their business domain into the code and how the APIs speak to each other, rather than the exact underlying technologies.


I finally understand the difference. Thank you.


Great explanation!


Microservices are about splitting code into different servers instead of a monolithic codebase. You end up with different servers (probably virtualized) for each domain of the application.

Like, instead of having the video decoding and the analytics code in the same monolith attached to same DB, you deploy a different server for each one, generally with a new DB for each. When the services need to talk to each other, they do it via network (REST, gRPC, etc.).


They're different. Microservices are still stateful applications that run 24/7. They are just really small in scope.

e.g. the Friends feature on Twitch is one microservice, running in its own autoscaling group, with internal APIs used by other microservices like Whispers.


My team follows microservice patterns and has deployed services that use websockets on both serverless (Azure Functions and Lambda) and regular hosted services (k8s, EC2, Azure App Service, etc.). Nothing stopping you there. On the streaming video side we did an app that used Azure Media Services + Azure Functions. Works well enough.

Not necessarily a good idea, but one 'feature' of microservices is the ability to pick different stacks, languages and delivery methods on an individual service level.


I work at twitch. Let me put it this way. My team that I am on (VOD) has ~8 backend engineers and we are in charge of something like ~2 dozen services.

We literally have services that are run entirely using AWS Lambda functions only.

This is a pretty big difference from teams I've worked on in the past, that have 8 engineers all working on a singular service.

"Microservices" is more of a philosophy than anything.


“+ React front” to say the least! Hope you are well, glacials.


Hehe, great job on that Chris :)


Didn't Twitch use elemental machines for transcoding?


No. Elemental is more of a high end encoding system for quality. Twitch is more about bulk cheap transcodes of good quality. Think about it. MLB has maybe 18 concurrent events. Twitch is running minimum in the 10k range.


No we never had Elementals. In the early days there was no way we could afford them. In the later days I don't think we would want them as we needed to scale so many transcode jobs that it was easier to have a large farm of dumb machines to organise jobs across.

There may have been an Elemental machine at one point that was used for testing/playing but I really don't think so, and I know there wasn't one between 2010 and 2017.


Transcoding was a relatively late addition to the whole system— for a long time we only passed through the original video bits unchanged and tried to advise broadcasters about picking compromise settings.

By the time we decided transcoding was necessary, we had enough in-house video engineering knowledge to build our own system integrated with everything else.


We had transcoding as early as 2011; that is when I made my first commits to the video jobs codebase, specifically to the transcoding jobs. It was quite late when we had the resources ($$$) to provide widespread availability of transcodes to the community.


Yeah; my perspective on “late” is probably pretty skewed, since I left as the rebranding was still being developed. I think the favorite new name when I left was something like Xarth; Twitch was a much better choice.


miss you!


Miss you too buddy! Hit me up when you're in Seattle next.


of course :) will be easier when I've relocated to sf


"F5 storms" are easy to handle. Intercept all keypress combinations for refresh and do what you want with it client side. (spread it out over time, use a high-performance endpoint to check if live or a combination)

Most people don't use the refresh button in the browser, so only a small amount of traffic will be uncontrolled.


Do you have any data to support that? I personally don't have an F5 key on my keyboard (it requires pressing a modifier), so I pretty much always click the reload button to fix a stream blip. The impression I get from reading Twitch chat is that most people are using mobile. I doubt they have a keyboard plugged in and press F5 to refresh.

That said, you certainly don't need your video streaming servers to handle those hundred-thousand refresh requests.


Yeah anecdotally every non-tech person I know clicks the refresh button.

It also doesn't catch people watching on mobile web who refresh. And I don't believe their mobile app has a refresh button; whenever a stream glitches out for me I force kill and reload the mobile app.


You have any data on most people not using the refresh button in the browser?



