Lessons Operating a Serverless-Like Platform at Netflix, Part 2 (medium.com)
178 points by diab0lic on Oct 16, 2017 | 33 comments

"These fine grained units also meant that the composition of the final application was much more distributed. ... As long as business and client metrics remained unaffected, per unit health was ignored."

This was my big takeaway. I think this is the trajectory for monitoring in general as services become increasingly distributed.

This isn't so much a unique feature of serverless, though; any self-healing, cluster-based system is better monitored this way.
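As a rough illustration of the quoted idea, here is a minimal Python sketch that alerts on an aggregate business metric and ignores per-unit health. The 5% tolerance, the unit names, and the function name `should_alert` are assumptions made up for this example; nothing here comes from the Netflix article.

```python
from dataclasses import dataclass


@dataclass
class UnitStatus:
    name: str
    healthy: bool


def should_alert(baseline_rate: float, observed_rate: float,
                 tolerance: float = 0.05) -> bool:
    """Alert only when the business metric drops more than `tolerance` below baseline."""
    if baseline_rate <= 0:
        raise ValueError("baseline_rate must be positive")
    drop = (baseline_rate - observed_rate) / baseline_rate
    return drop > tolerance


# Hypothetical fleet: one unit is unhealthy.
units = [UnitStatus("fn-a", True), UnitStatus("fn-b", False), UnitStatus("fn-c", True)]
unhealthy = [u.name for u in units if not u.healthy]

# The (assumed) business metric is within tolerance, so per-unit health is ignored.
alert = should_alert(baseline_rate=1000.0, observed_rate=990.0)
print(f"unhealthy units: {unhealthy}, alert: {alert}")
```

The design choice is that unhealthy units are still recorded, but only the aggregate metric decides whether anyone gets paged.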

I have a hard time understanding the "Serverless" part... It sounds cool, but there are a bunch of servers there that they are using.

As far as I can understand, the name only has to do with pricing rather than actual servers, but then why is it not called something like "cheap" or "pay as you use"?

Maybe I am missing the point here, but is this just a buzzword? (Nano-services being the other one, much like micro-services; do we have pico-ones?)

It's not just a buzzword (though some might play it off as one), but the reality is, serverless has a meaning. And it's not that there aren't servers that are there to run your code. Rather, your code can be set to run in production and not be running on any server. It's just not running. It's only running when it needs to run. Another system sits there and when it has an event that comes in that needs to run your code, it starts up your code, passes your code the details, and lets your code run. Once that's done, your code is finished, and it shuts down.

This is different from having code running on a server, or deploying your code to a server where it runs. There, you are paying for the server regardless of whether your code runs.

That's what it means.
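A toy Python sketch of that lifecycle, for illustration only: `ToyPlatform`, `resize_image`, and the `image.uploaded` event name are all hypothetical, and a real provider would start a process or container rather than call a function in the same interpreter.

```python
from typing import Any, Callable, Dict

Handler = Callable[[Dict[str, Any]], Any]


def resize_image(event: Dict[str, Any]) -> str:
    """Example user function: it runs only for the duration of one event."""
    return f"resized {event['file']} to {event['width']}px"


class ToyPlatform:
    """Stand-in for the provider: holds deployed code, runs nothing while idle."""

    def __init__(self) -> None:
        self._registry: Dict[str, Handler] = {}
        self.running = 0  # number of handler invocations currently executing

    def deploy(self, event_type: str, handler: Handler) -> None:
        self._registry[event_type] = handler  # deployed, but not running

    def on_event(self, event_type: str, event: Dict[str, Any]) -> Any:
        handler = self._registry[event_type]
        self.running += 1          # the platform starts up your code
        try:
            return handler(event)  # and passes it the event details
        finally:
            self.running -= 1      # once it is done, nothing is running again


platform = ToyPlatform()
platform.deploy("image.uploaded", resize_image)
result = platform.on_event("image.uploaded", {"file": "cat.png", "width": 200})
print(result, "| running now:", platform.running)
```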


1) The code comes into existence when it is called, and ceases to exist after the request is handled

2) The server on which this happens is managed by someone else. To all intents and purposes there is no server on a logical diagram of your application

If only someone thought about this earlier oh wait shared hosting

Yep. Much more succinct than my explanation. Thanks!

No, thank you.

My understanding was vague until I read your comment, I was just summing up that new understanding.

I will answer here, as you kinda summed up what the other person said.

1. Same kinda happens with python modules for example

2. All my servers are managed by someone else.

You can downvote all you want, but this is in no way different than spawning a bunch of machines somewhere...

Also once you start having networking issues there is a server on your diagram... cos all that this is describing is booting up worker nodes. And you can stretch your imagination to call it whatever you like but in the end that is what you are doing.

It's different in that someone else is taking care of the infrastructure details, and to a really high degree.

Your function is effectively a single bit of binary executable that will execute once and at termination the fabric it existed on will effectively cease to be. You can definitely say that the lines between that and other forms of execution are blurry but there are quite different modes of operation going on.

I have dynamic workers that scale to handle load in our system. I need to know about how the Python binary runs in the system, the upstart jobs that keep the workers running, the queues that push the jobs in, hosted on another machine, the underlying autoscale system, the watchers to figure out how to scale the autoscale system, the network rules between the machines, etc.

I know it's kinda the same thing, but it's also kinda not.

So basically you cut the need for an in-house sysadmin? The way you are explaining it, that sounds OK when you have a very small team.

And if you look at the comments below, they seem to have an issue with how to handle warmup and cold function calls, which suggests that you have to know what the platform is doing anyway (has the JIT warmed up, do we need to replicate warmed-up functions, how do we cold execute fast, etc.).

So to sum up, it is serverless as long as a 3rd party is in charge of the actual servers and auto-scale infra, and if we do it in-house it is a normal server farm?

And sorry if I sound "edgy", but I am just trying to stretch the meaning of the actual word "serverless" in my brain to what is being talked about here. I am not a native speaker, so that might hinder me as well.

As ever, looking at some extreme examples can be instructive.

Imagine you have no infrastructure at all, and there's only a single function you need to run. In my case, I have a function that takes a file and produces a file and can take 5 minutes to an hour to run. The warm up time doesn't matter then. If I could just write one big old python function to do all the work and then push it to someone else so I can run as many instances of that function as required then that's a very different mental model from dealing with all the setup required to host that function myself.

There are other tasks too, where you just have some one off admin function (maybe some monitoring thing) where it's a pain to build all the infrastructure around it just to run it.

You're right though, it really doesn't matter what you use - you have to understand what's going on.

For reference, I don't use any serverless stuff in my stack — though I can see where the uses lie and could definitely imagine building for what they offer in some cases. It's a useful tool to have available, but not the only one you should reach for.

One of the best environments I ever used was PiCloud. We used to farm out our long running functions to them. Within your code you could say something like, run_fn_on_picloud(fn, args) and they would ship your code into their environment and execute it. It was absolutely wonderful and you could just fan out as much as required.
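A minimal local sketch of that fan-out pattern in Python. Note that `run_fn_on_picloud` above is the commenter's shorthand, not a documented API, so this example does not try to reproduce it: a thread pool stands in for the remote environment, and `fan_out` and `convert` are hypothetical names.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Iterable, List, TypeVar

T = TypeVar("T")
R = TypeVar("R")


def fan_out(fn: Callable[[T], R], inputs: Iterable[T], max_workers: int = 4) -> List[R]:
    """Run fn once per input and return the results in input order.

    A local thread pool stands in for the remote execution environment.
    """
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(fn, inputs))


def convert(path: str) -> str:
    """Placeholder for a long-running file-in, file-out job."""
    return path.rsplit(".", 1)[0] + ".out"


outputs = fan_out(convert, ["a.raw", "b.raw", "c.raw"])
print(outputs)
```

The caller only hands over a function and its inputs; how many workers exist, and where they run, stays behind the `fan_out` interface.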

The best way to look at "serverless" is that it's a deployment strategy. Instead of deploying a fully functioning service, you are deploying functions or files of code.

It took me a while to figure out what they were talking about, but it sounds a lot like AWS Lambda from what I can tell.

But aren't you spawning a bunch of worker nodes on demand to execute that code?

The concept of serverless, as I understand it, is that the application developer doesn't need to manage or think about the servers their applications run on. It's not that no one is managing the servers -- it's that the application developer doesn't need to. The developer is building on top of an interface that encapsulates those details.

With typical cloud hosting, I need to think about and manage a fleet of hosts from the kernel and operating system on up. I need to think about the hardware and choose a machine type, with a configuration of CPUs, RAM, and disk. I need to choose an OS, and the machine is my responsibility from the init system on upwards, including all of its daemons for SSH, NTP, etc. I need to think about how I'll distribute and install my software onto the machines, and how many machines I need. I need to think about how I'll handle machine failures and replacement.

With serverless, the interface to the developer is: specify what application to run, and it runs. I don't have to manage the execution environment that runs the application.

I'm trying to understand. This sounds kind of opposite to a description given to me for dev-ops. Is that accurate enough for a layman?

I wouldn't consider serverless to be the opposite of DevOps. I'd actually consider those concepts to be largely orthogonal.

Serverless and classic server environments are different kinds of system to manage. Classic operations vs. DevOps is about how you manage your systems, and who has responsibility for them.

In classic operations, software development and operations are separate teams with different responsibilities. DevOps combines these two functions into a single team (hence DevOps).

Serverless applications are still a system that needs to be managed; it's just less responsibility and effort than the traditional model. Consider: if newly released code contains a defect that's impacting the larger system, who has responsibility for detecting the problem and rolling it back? This concern exists with both server and serverless platforms, and DevOps vs. classic operations would have different answers.

Excellent, thanks. That sums it up wonderfully.

I'm retired and we had separate teams with a liaison between the two, pretty much since our inception. A part of me thinks it's a good idea, but I'd worry about depth of knowledge in both disciplines, but I digress.

To this layman it seems to fit devops to a tee, as it allows the devs to get the ops out of their hair by having them reside in a different company altogether.

Thus there is no ops to go "no way no how" when devs want to introduce some new shiny to the stack, as their only interaction is over a web API...

Seems like the extreme version of this would be a simple annotation in client code to run it on the server, like the old <script runat...> in ASP, e.g. in Java pseudocode:

    @RunAt(Platform.SERVER)
    public Observable<List<Customer>> findCustomersByName(String name) {
        // code to query DB and return a list of customers over the wire
    }

    // on client
    public void searchByCustomer(String name) {
        // findCustomersByName() and bind stream to UI
    }

The idea being that the RPC is automatic here. The client code makes an async call and constructs an observable around the deserialized result. The server code is pulled out and spun up as a cloud function on demand. Of course this only makes sense if the code needs to run on the server (e.g. the results are big and need to run a lot of code, or need to run code that can't be trusted on the client).

> like the old <script runat...> in ASP

Good memories! I remember how neat I found that as a script-kid back in the pre-.NET, VBScript-based ASP 3.0 days some 17 years ago. (MS Personal Web Server for the Win! The QBasic of the early web age.) But there wasn't any code-sharing and it was always clearly server-code, not client-code. Let alone the fact that only IE could run VBScript on the client.

I would be curious to see more detail about their warmup handler practices. Did they tune the JVM for it, or is there some way to deterministically invoke the right JIT'ing?

I'm not sure what Netflix is doing, but Alibaba has a neat JWarmup feature[0] in their AJDK that warms up the JVM and JITs based on recorded profiling data.

0: https://jcp.org/aboutJava/communityprocess/ec-public/materia...

Very neat, thanks. I didn't know they ran a custom jvm. About to go down a rabbit hole.

The easiest thing to do on AWS is just fire off a CloudWatch event. When your Lambda function gets updated, or even on a set schedule, you can invoke your function with whatever arguments you need. Presumably you could cook in a method that warmed everything up, and call that method.

That's fine for functions that go cold, but what about additional containers that spin up when usage spikes? Don't those also suffer from cold start delays? I never hear people talk about that.

Yes, they also suffer from cold starts. When people talk about it, they usually measure the percentage of cold-started function calls, which mostly ends up being around 0.1% of all calls.

In a sense, you have to have a good amount of baseline traffic to end up with that problem, which reduces the percentage impact. So the problem is less pronounced than with a function that is very rarely run, where a high percentage of function calls are cold.

(In theory, the worst case where a CloudWatch trigger would be mostly useless is that a function is totally idle and removed from warm-state, and then gets a huge concurrent spike, is idle again, gets a spike etc. But that is not a realistic traffic pattern, I assume. It would also break with existing auto-scaling that would be even slower with bringing up new instances.)
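The effect of baseline traffic on the cold-call percentage can be sketched with a deliberately simplified one-instance model in Python. The 600-second idle timeout and both traffic patterns are assumptions picked for illustration; they are not measurements from any platform.

```python
from typing import Sequence


def cold_fraction(arrivals: Sequence[float], idle_timeout: float) -> float:
    """Fraction of calls that hit a cold instance in a one-instance model.

    A call counts as cold if it is the first call, or if the gap since the
    previous call exceeds idle_timeout (the instance has been reclaimed).
    """
    if not arrivals:
        return 0.0
    cold = 1  # the very first call is always cold
    for prev, cur in zip(arrivals, arrivals[1:]):
        if cur - prev > idle_timeout:
            cold += 1
    return cold / len(arrivals)


IDLE_TIMEOUT = 600.0  # seconds; an assumed value for this sketch

busy = [float(t) for t in range(0, 10_000, 10)]     # one call every 10 seconds
rare = [float(t) for t in range(0, 360_000, 3600)]  # one call per hour

print(f"busy function: {cold_fraction(busy, IDLE_TIMEOUT):.1%} cold")
print(f"rare function: {cold_fraction(rare, IDLE_TIMEOUT):.1%} cold")
```

In this model the busy function pays for a single cold call across its whole run, while every call to the hourly function is cold.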

Depends on the container/instances. There are cold start delays but you can get it down to a minute or two if you really optimize (but... well, most people don't need to).

Those will still have a cold start delay. A simple warm-up script only solves the 0-to-1 problem, not the 1-to-n problem.
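A back-of-the-envelope Python sketch of the 0-to-1 versus 1-to-n distinction. It assumes each in-flight request occupies its own instance, and that a warm-up pinger keeps exactly one instance warm; the function name and the numbers are hypothetical.

```python
def cold_starts_in_burst(concurrent_requests: int, warm_instances: int) -> int:
    """Each in-flight request needs its own instance; the shortfall starts cold."""
    if concurrent_requests < 0 or warm_instances < 0:
        raise ValueError("counts must be non-negative")
    return max(0, concurrent_requests - warm_instances)


WARM = 1  # what a simple scheduled warm-up call keeps alive (assumed)

single = cold_starts_in_burst(concurrent_requests=1, warm_instances=WARM)
burst = cold_starts_in_burst(concurrent_requests=50, warm_instances=WARM)
print(f"single request: {single} cold, burst of 50: {burst} cold")
```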

You may have a look at OpsGenie's Sirocco https://github.com/opsgenie/sirocco/tree/master/sirocco-warm...

Awesome! Thanks.

If anyone is interested in learning more about serverless, AWS Lambda, etc., there's a webinar this Thursday: https://read.iopipe.com/announcing-the-things-to-know-about-...
