I worked at Twitch from 2014 to 2018. I was never on the video team, but here ar...

spenczar5 · on July 3, 2019

Hi glacials :) Small correction from someone at Twitch today:

> video was the holdout for a while because they needed specialized GPUs

This isn't quite right; it has more to do with the tight coupling of the video system with the network (eg, all the peering stuff described in the article).

glacials · on July 3, 2019

Oh, edited! Thanks Spencer :)

arcticfox · on July 3, 2019

F5 storm is an awesome name for the venue blip -> refresh reaction. I've certainly contributed my fair share to your storms. It's basically automatic.

ummonk · on July 3, 2019

Yeah, and the problem is that it often does work to fix issues. It's the web equivalent of "have you tried turning it off and turning it back on again?"...

Lorin · on July 4, 2019

Maybe something as simple as a textual overlay "Don't worry - the stream will be back soon" and a script to internally hammer the only element of the page that's actually needed for a video refresh

Sahhaese · on July 4, 2019

text overlays would still be clunky, what's really needed is dynamic video spliced directly into the stream so the viewer understands it's the broadcaster's connection that is poor, not the viewer's.

This could be achieved if twitch allowed broadcasters to upload a 2 second loop as a 'placeholder' for connection drops, and twitch could mix that in if it detects too many frame-drops.

It would need to be a setting though because some streamers just stream over a bad connection (e.g. Hitchhiking streams) and wouldn't want that interruption.
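
To make the idea concrete, here is a minimal sketch of the switching logic, not anything Twitch is known to run. The class name, the 60-frame window, and the 50% threshold are all assumptions for illustration; the `enabled` flag stands in for the per-channel setting mentioned above.

```python
from collections import deque


class PlaceholderSwitch:
    """Decide whether to show the broadcaster's placeholder loop.

    Hypothetical sketch: tracks the last `window` frame slots and switches
    to the placeholder when the dropped fraction exceeds `threshold`.
    """

    def __init__(self, window=60, threshold=0.5, enabled=True):
        self.frames = deque(maxlen=window)  # True = frame arrived, False = dropped
        self.threshold = threshold
        self.enabled = enabled  # per-channel opt-out for known-bad connections

    def observe(self, frame_arrived):
        """Record one frame slot and return which source to show."""
        self.frames.append(frame_arrived)
        # Stay on the live feed until a full window has been observed.
        if not self.enabled or len(self.frames) < self.frames.maxlen:
            return "live"
        dropped = self.frames.count(False) / len(self.frames)
        return "placeholder" if dropped > self.threshold else "live"
```

A streamer on a known-bad connection would construct this with `enabled=False` and always stay on the live feed.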

hatsix · on July 7, 2019

I guess this is actually being tested right now. I saw something like this last week, though I've also seen streams just drop.

spenczar5 · on July 3, 2019

Or, the live video equivalent of "retry" :)

pulkitsh1234 · on July 4, 2019

I initially thought 'F5 storm' refers to people typing "F" in the chat.

wrigby · on July 3, 2019

On my first read-through I thought 'F5 Storm' referred to F5 load balancers, not hitting the F5 key to refresh the page.

diminoten · on July 3, 2019

> - No more Ruby on Rails because no good way was found to scale it organizationally; almost everything is now Go microservices back + React front

Ugh, I just... I keep trying to pretend I don't need to learn Go, but every highly scalable system I read about that's recently been written about seems to be using it. Maybe I just need to stay away from systems that need to scale? Heh...

snaky · on July 3, 2019

The keyword here is organizationally.

Technically speaking you can build scalable systems using anything you want. But if you need to hire a couple hundred developers, you'd better go with Java 7 or Go than Ruby, Lisp or Perl. The dumber and more uniform the better.

ummonk · on July 3, 2019

The key distinguisher there is static typing. Which is how Facebook manages to make do with PHP, and Dropbox with Python. By adding type annotations for a static type-checker...

snaky · on July 3, 2019

Static typing doesn't help you much, when you have 20 teams hating each other and 20 different ways to write statically typed C++ code.

viraptor · on July 3, 2019

It would be interesting to hear a comparison from someone using https://crystal-lang.org/ at scale. It's basically Ruby + types, which would make it the closest for isolating that one feature. It can't run Rails, but there are very similar web frameworks available.

mcpherrinm · on July 3, 2019

It seems Stripe solved this problem with https://sorbet.org/ which is actually Ruby + types.

viraptor · on July 3, 2019

It misses other benefits of static typing -> being able to compile to efficient code. Also Sorbet is still not popular enough to apply to many gems - it's a lot of work to implement it at the moment.

pkd · on July 5, 2019

It's not Ruby + types, it just looks like it at first glance. The semantics are all different (for good reasons).

I very much doubt that anyone is using Crystal at the scale Ruby/Rails is used yet. It's still far away from a 1.0 release and when I tried out 0.23 it had several issues, major ones for me were lacklustre debugging support and the type checker needing to load the whole AST into memory. Concurrency story was also not very strong but I know people were working on it actively at least.

Anyway, I wrote a non-trivial project in Crystal and overall really enjoyed it. Improving the debugging tooling would go a long way towards making it a real option.

rhizome · on July 4, 2019

I've heard Crystal's Ruby resemblance fades once you get past the beginner stage.

jchw · on July 3, 2019

Personally, I think it’s hugely worth learning. Aside from some eschewed de facto behaviors, Go is very easy to pick up and learn the entirety of in a week or two, because the language itself is really not that large. So I’d argue the time investment is a good one for what you get.

Still, you definitely do not need Go to scale systems. People scale Everything, perhaps most impressively PHP applications.

echelon · on July 3, 2019

Go isn't the only language that scales, it just happens to be popular amongst the scripting language crowd as a next step. You're by no means limited in your choice. You could do Java, C#, Rust...

empath75 · on July 3, 2019

The good news is that it takes like an hour to learn enough go to be productive.

apta · on July 4, 2019

Before golang was a thing, there were highly scalable systems that handled way more traffic than anything written in golang today. Those systems were (and are) written in languages like C++ and Java and C#.

You're just seeing golang in articles because of hype.

hactually · on July 4, 2019

Java and C# lag behind Golang on most performance metrics. Combine it with the awesome deployment story (single binary) and you'd be hard pressed to choose the former

apta · on July 4, 2019

I'd like to see those performance metrics. Other than that, this is not true in the slightest, not just from what I observed, but from established performance people like Martin Thompson[1]. If you watch that talk, he mentions towards the end that they ported Aeron (originally Java) to C#, golang, and C++. The Java version was the fastest out of the box, but with some work, they were able to get the C# version to be faster. I suspect this mainly has to do with value types, which is being developed for the JVM as well.

What you're probably referring to is GC pauses. The golang GC is tuned for latency, at the expense of throughput. The JVM has several GCs, and is gaining several more like Shenandoah and ZGC, which allows you to select the GC that best fits your use case. You can tune for latency or throughput.

A lot of Java deployments these days are in the form of uber/shaded jars, which is basically one jar file that contains the entire app, and run with a single command, not much different than running a binary.

[1] https://www.youtube.com/watch?v=Pz-4co8IaI8

MrBuddyCasino · on July 4, 2019

The opposite is true. Golang is on average 2-3 times slower than Java. On the plus side, it uses less memory.

LOLOLOLO1 · on July 4, 2019

No of course. It may lose at some benchmarks made by Java or Python/Ruby coders.

Two areas where it is actually slower are

1) Memory allocation 2) Regular expression performance.

But you must understand your Java app won’t have performance advantage because of faster alloc speed IRL: GC will take lots of CPU, because Go’s one is much easier at memory release. You can only see allocation advantage in micro benchmarks where the app stops before the GC will start.

apta · on July 4, 2019

> You can only see allocation advantage in micro benchmarks where the app stops before the GC will start.

The golang gc is tuned for latency at the expense of throughput, meaning if you look at the duration of time spent in GC over the course of the code execution, it would actually be longer compared to a GC tuned for throughput.

If you have a use case that requires high throughput, then you cannot change the GC behavior. Unlike on the JVM, where you have several GCs to choose from. The JVM is also getting two new low latency GCs for use cases that require low latency.

And it's not just microbenchmarks where Java does better than golang, it's especially longer running processes where the JVM's runtime optimizations really kick in. Not to mention that the JVM is getting value types as well to ease the load on the GC when required (it does an excellent job as it is even without value types).

I did a dummy port of the C# version of the Havlak code here[1] to Java, preserving the same behavior and not making any data structure changes. On the machine I tested on, the C# version took over 70 seconds to run, while the Java version took ~17 seconds. In comparison, the C++ version took ~24 seconds, and the golang version took ~30 seconds.

Yes, you could most likely spend much more time tuning the C++ version, avoiding allocations, and so on, but at the expense of readability. This is what the JVM gives you, you write straight-forward, readable code, and it does a lot of optimizations for you.

The brainfuck2 benchmark is especially interesting. Kotlin tops the list, but I was able to get Java to the same performance as Kotlin by writing it in a similar manner to the Kotlin code. Again, Java/Kotlin beat out even C++ when I tested them, and by quite a margin.

[1] https://github.com/kostya/benchmarks

barrkel · on July 4, 2019

How much CPU GC takes for any given GC implementation is largely down to the design of the application, its data structures and allocation graph.

Request / response servers which keep caches and other allocations prone to middle age death out of the GC heap are consistent with the generational hypothesis and ought to spend no more than a few (low single digit) percent in GC with a generational collector.

Xelbair · on July 4, 2019

I'll just leave this here https://www.ageofascent.com/2019/02/04/asp-net-core-saturati...

dcu · on July 4, 2019

these microbenchmarks don't say anything about real world use cases... but anyway, here are the latest results:

https://www.techempower.com/benchmarks/#section=data-r17&hw=...

c#, rust, go, c++, java, c, nim.. all tied at 7M.

again, this doesn't mean anything useful.

apta · on July 4, 2019

This benchmark isn't really useful as you pointed out. Microbenchmarks are always tricky, but check out the other two posts I just wrote here (about Martin Thompson and the benchmarks on GitHub) for hopefully more realistic benchmarks.

theredbox · on July 4, 2019

It's not about hype but companies using more recent languages to solve the same problem. Why didn't Twitch pick Crystal or Rust or Scala or JRuby?

pkd · on July 5, 2019

Because of the use case. Go wins if all you need is the easiest way to write services with high concurrency requirements. I expect this is true for Twitch's systems.

Crystal is still immature, Rust is more suited to use cases where you want to avoid garbage collection.

Hype is not the only factor but it makes hiring easier. And anything Google puts its weight behind will get hyped. More often than not it's better to choose a technology which suits your organisational (read: hiring) needs.

apta · on July 4, 2019

I'd argue that Java or C# would have worked out just fine for Twitch. There was a recent post on Twitch's early architecture, and it seems they started out with Ruby. Unsurprisingly, they had to switch from it once they needed performance (similar story happened with Twitter).

hajhatten · on July 4, 2019

Well deserved hype

thomastjeffery · on July 4, 2019

> Realtime transcoding

I'm curious: Do services like Twitch specify a specific desired codec/bitrate that doesn't get transcoded? Transcoding seems like a lot of effort for lower quality end result.

If I were streaming, I would want to avoid transcoding as much as possible. Since we're talking about live broadcasting, there is a unique ability for the streamer to choose the format they upload.

kd5bjo · on July 4, 2019

In the RTMP days, the highest quality setting in the viewer was always a straight pass through from the broadcaster, and the reduced versions were transcoded in the data center to fit down lower-bandwidth last-mile pipes.

slimscsi · on July 4, 2019

Same in the HLS days. What used to be called “source” was just a remux from rtmp to ts.
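
As a rough illustration of that layout (one untouched "source" rendition plus cheaper transcoded ones), here is a small sketch. The ladder heights and bitrates are made-up values, and `Rendition` and `build_renditions` are hypothetical names, not anything from Twitch's system.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Rendition:
    name: str
    height: int
    bitrate_kbps: int
    transcoded: bool


# Hypothetical ladder; real heights and bitrates would be tuned by the service.
LADDER = [(720, 3000), (480, 1500), (360, 800), (160, 300)]


def build_renditions(source_height, source_bitrate_kbps):
    """Return the passthrough "source" rendition plus lower transcoded rungs."""
    # "source" is only remuxed, so it keeps the broadcaster's own settings.
    renditions = [Rendition("source", source_height, source_bitrate_kbps, False)]
    for height, bitrate in LADDER:
        # Only offer rungs that are strictly cheaper than what was uploaded.
        if height < source_height and bitrate < source_bitrate_kbps:
            renditions.append(Rendition(f"{height}p", height, bitrate, True))
    return renditions
```

For a 720p upload at 3500 kbps this yields `source`, `480p`, `360p` and `160p`, with only the last three marked as transcoded.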

thomastjeffery · on July 4, 2019

Gotcha. For whatever reason, I had forgotten about lower-bandwidth copies.

PullJosh · on July 4, 2019

> everything is now Go microservices back

Excuse the simple question: When I hear "microservices", I think serverless backend. Is that right, or are they different? If they're the same, how do you stream video with serverless? (Seems like streaming, websockets, etc... shouldn't be possible in a serverless environment...)

013a · on July 4, 2019

"Microservice" describes the size and scope of each deployment artifact. It answers the question "is the whole system just one big ball, is it broken up, how broken up is it?" It doesn't describe how it is deployed.

"Serverless" describes how a deployment artifact is deployed and runs. Generally it refers to a class of technologies in multiple domains whereby intricate knowledge of the underlying host is abstracted behind a cleaner API, with things like scaling, security, patching, etc handled by an infrastructure provider. While the term rose in prominence alongside "functions as a service", which is certainly a technology that generally qualifies as serverless, there are many serverless products out there: AWS Fargate for running containers, DynamoDB for a database, S3 for object storage, all of these are "serverless". A good signal is: if I can SSH into it, its not serverless.

A microservice can certainly be deployed serverless (ECS/Fargate or Google Cloud Run comes to mind). A microservice can even refer to one or more logically related functions-as-a-service; the term more-so speaks to how the engineering teams organize their business domain into the code and how the APIs speak to each other, rather than the exact underlying technologies.

NetOpWibby · on July 10, 2019

I finally understand the difference. Thank you.

UweSchmidt · on July 4, 2019

Great explanation!

leonidasv · on July 4, 2019

Microservices are about splitting code into different servers instead of a monolithic codebase. You end up with different servers (probably virtualized) for each domain of the application.

Like, instead of having the video decoding and the analytics code in the same monolith attached to same DB, you deploy a different server for each one, generally with a new DB for each. When the services need to talk to each other, they do it via network (REST, gRPC, etc.).
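
A toy version of that, using only the Python standard library: one "friends" service that owns its data, and a "whispers" function that can only reach it over HTTP. The service names, the `/friends/<user>` path, and the data are all invented for the sketch.

```python
import json
import threading
import urllib.request
from http.server import BaseHTTPRequestHandler, ThreadingHTTPServer

# Hypothetical data owned by the "friends" service; no other service touches it.
FRIENDS = {"alice": ["bob", "carol"]}

# Bypass any system proxy so calls go straight to the local service.
_opener = urllib.request.build_opener(urllib.request.ProxyHandler({}))


class FriendsHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        # Path shape /friends/<user> is an assumption for this sketch.
        user = self.path.rsplit("/", 1)[-1]
        body = json.dumps(FRIENDS.get(user, [])).encode()
        self.send_response(200)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, *args):
        pass  # keep the example quiet


def start_friends_service():
    """Start the friends service on a free local port and return the server."""
    server = ThreadingHTTPServer(("127.0.0.1", 0), FriendsHandler)
    threading.Thread(target=server.serve_forever, daemon=True).start()
    return server


def can_whisper(friends_url, sender, recipient):
    """The "whispers" side: ask the friends service over the network."""
    with _opener.open(f"{friends_url}/friends/{recipient}") as resp:
        return sender in json.load(resp)
```

The whispers code never imports the friends data directly; swapping the friends service for one written in another language would not change `can_whisper` as long as the HTTP contract stays the same.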

glacials · on July 4, 2019

They're different. Microservices are still stateful applications that run 24/7. They are just really small in scope.

e.g. the Friends feature on Twitch is one microservice, running in its own autoscaling group, with internal APIs used by other microservices like Whispers.

discordance · on July 4, 2019

My team follows microservice patterns, and has deployed services that utilise websockets over both serverless (Azure Functions and Lambda) as well as regular hosted services (on k8s, EC2 and Azure App service etc). Nothing stopping you there. On the streaming video side we did an app that used Azure Media services + Azure functions.. works well enough.

Not necessarily a good idea, but one 'feature' of microservices is the ability to pick different stacks, languages and delivery methods on an individual service level.

stale2002 · on July 4, 2019

I work at twitch. Let me put it this way. My team that I am on (VOD) has ~8 backend engineers and we are in charge of something like ~2 dozen services.

We literally have services that are run entirely using AWS Lambda functions only.

This is a pretty big difference from teams I've worked on in the past, that have 8 engineers all working on a singular service.

"Microservices" is more of a philosophy than anything.

conover · on July 3, 2019

“+ React front” to say the least! Hope you are well, glacials.

glacials · on July 4, 2019

Hehe, great job on that Chris :)

cantbecool · on July 4, 2019

Didn't Twitch use elemental machines for transcoding?

grogenaut · on July 4, 2019

No. Elemental is more of a high end encoding system for quality. Twitch is more about bulk cheap transcodes of good quality. Think about it. MLB has maybe 18 concurrent events. Twitch is running minimum in the 10k range.

Dobbs · on July 4, 2019

No we never had Elementals. In the early days there was no way we could afford them. In the later days I don't think we would want them as we needed to scale so many transcode jobs that it was easier to have a large farm of dumb machines to organise jobs across.

There may have been an Elemental machine at one point that was used for testing/playing but I really don't think so, and know there wasn't one between 2010 and 2017.

kd5bjo · on July 4, 2019

Transcoding was a relatively late addition to the whole system— for a long time we only passed through the original video bits unchanged and tried to advise broadcasters about picking compromise settings.

By the time we decided transcoding was necessary, we had enough in-house video engineering knowledge to build our own system integrated with everything else.

Dobbs · on July 4, 2019

We had transcoding as early as 2011. As that is when I made my first commits to the video jobs codebase, specifically to the transcoding jobs. It was quite late when we had the resources ($$$) to provide widespread availability of transcodes to the community.

kd5bjo · on July 4, 2019

Yeah; my perspective on “late” is probably pretty skewed, since I left as the rebranding was still being developed. I think the favorite new name when I left was something like Xarth; Twitch was a much better choice.

zemnmez · on July 3, 2019

miss you!

glacials · on July 3, 2019

Miss you too buddy! Hit me up when you're in Seattle next.

zemnmez · on July 3, 2019

of course :) will be easier when I've relocated to sf

NightlyDev · on July 4, 2019

"F5 storms" are easy to handle. Intercept all keypress combinations for refresh and do what you want with it client side. (spread it out over time, use a high-performance endpoint to check if live or a combination)

Most people don't use the refresh button in the browser, so only a small amount of traffic will be uncontrolled.
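
The "spread it out over time" part could look roughly like this. It is a sketch of randomized ("full jitter") backoff, shown in Python rather than browser code; the function names and the 1s base / 30s cap are assumptions, not anything Twitch is known to use.

```python
import random


def refresh_delay(attempt, base=1.0, cap=30.0, rng=random):
    """Seconds to wait before retry number `attempt` (0-based).

    Full jitter: pick uniformly in [0, min(cap, base * 2**attempt)] so clients
    that all lost the stream together do not all come back together.
    """
    ceiling = min(cap, base * (2 ** attempt))
    return rng.uniform(0.0, ceiling)


def schedule_refreshes(num_clients, attempt=0, seed=None):
    """Return sorted reconnect times for clients that all refreshed at t=0."""
    rng = random.Random(seed)
    return sorted(refresh_delay(attempt, rng=rng) for _ in range(num_clients))
```

With `attempt=3`, a thousand simultaneous refreshes get smeared across an 8 second window instead of landing in the same instant.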

jrockway · on July 4, 2019

Do you have any data to support that? I personally don't have an F5 key on my keyboard (it requires pressing a modifier), so I pretty much always click the reload button to fix a stream blip. The impression I get from reading Twitch chat is that most people are using mobile. I doubt they have a keyboard plugged in and press F5 to refresh.

That said, you certainly don't need your video streaming servers to handle those hundred-thousand refresh requests.

squeaky-clean · on July 4, 2019

Yeah anecdotally every non-tech person I know clicks the refresh button.

It also doesn't catch people watching on mobile web who refresh. And I don't believe their mobile app has a refresh button, whenever a stream glitches out for me I force kill and reload the mobile app.

cortesoft · on July 4, 2019

You have any data on most people not using the refresh button in the browser?