
Hi, I'm the tech lead of Workers.

Note that the core point here is multi-tenancy. With Isolates, we can host 10,000+ tenants on one machine -- something that's totally unrealistic with containers or VMs.

Why does this matter? Two things:

1) It means we can run your code in many more places, because the cost of each additional location is so low. So, your code can run close to the end user rather than in a central location.

2) If your Worker makes requests to a third-party API that is also implemented as a Cloudflare Worker, this request doesn't even leave the machine. The other worker runs locally. The idea that you can have a stack of cloud services that depend on each other without incurring any latency or bandwidth costs will, I think, be game-changing.
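
To sketch what (2) looks like in code -- the domain here is hypothetical, and this is just the standard Service Worker-style fetch API; nothing special is required in the calling code for the subrequest to be served locally:

    // Hypothetical example: api.vendor.example is itself served by a
    // Cloudflare Worker, so this subrequest can be handled on the same
    // machine instead of going back out over the network.
    addEventListener('fetch', event => {
      event.respondWith(handle(event.request));
    });

    async function handle(request) {
      const path = new URL(request.url).pathname;
      const upstream = await fetch('https://api.vendor.example/v1/lookup?q=' +
                                   encodeURIComponent(path));
      const data = await upstream.json();
      return new Response(JSON.stringify(data), {
        headers: { 'Content-Type': 'application/json' }
      });
    }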




Greetings Tech Lead of Workers!

I have yet to use this specific product, and after reading the blog post I'm actually really frustrated that I didn't fully understand the things I could do with Workers sooner: I had a proof-of-concept app recently that would have been nice to test with, since we've had issues with cold-start delays on Lambda in similar projects in the past. And that second point that you make, above, ... holy cow; that's going to change the way I design systems and applications for customers.

Personally speaking, I am a fan of CloudFlare like some of the more obsessed are fans of Apple[0]. Every time your company comes out with a new announcement, it solves some problem I'm experiencing on projects at that moment.

I could write paragraph after paragraph about all of the grief that using your services has eliminated from my life and my customers' lives, but I am certain you get that enough already. There is no way my customers appreciate your free tier as much as I, the developer, do. And your UX is awesome. Whichever one of your teammates designed the API button in the control panel, with instructions next to each option, as a way to onboard people to the API ... it's those stupid little things. Because that's there, when I'm setting up a client with a unique configuration for the first time, I start from a baseline, set it manually, test, and click into that link to see what commands I'd put into a script should I have another client with similar needs. I probably wouldn't have thought to even look to see if you had an API were it not for those links.

I'm excited to try out Workers next time I have a Lambda need that fits well. Keep up the good work -- it's taking a ton of grief out of my life.

[0] A gentleman from your company sent me a T-Shirt for providing feedback for the Argo tunnel (called something different back then in beta) product -- you guys even do the damn T-Shirts right ... subtle logo on the shoulder and front, no words, and really soft cotton. It's my favorite "free vendor t-shirt" shirt.

edit: A sentence ... got away from me and I had a dangling bullet


Thank you. Made my day.


Sorry, I just read the linked article so I don’t really know the exact specifics of your system, but I think I’m having a flashback. These are exactly the promises of the sandboxed JVM running in a browser, probably with a 10-20x performance deficit to account for JavaScript vs Java. I think that after about 20 years we all know why that kind of “security” model failed. I’m actually quite stunned to see that you are running code from thousands of different websites in the same process. The only game-changing part that I can see is giving a huge incentive to a lot of not-so-good people to try to break the sandbox. And I was hoping that after the Intel Spectre debacle we were finally going to move in the right direction... Apparently we are still going in the wrong one. I really hope that I misunderstood everything and that you are not really running code from thousands of different websites in the same process.


One of the fundamental truths behind any innovation is that it breaks at least one 'sacred' belief of the last generation. The JVM was not a technology built for this kind of multi-tenancy; it was bolted on later.

V8 is one of the most well-tested and secured pieces of software in existence, and it has a much smaller surface area than the Linux process isolation you're referring to.


On the other hand, containerization typically occurs on a virtual machine instance independent from other owners, often running on separate processor cores, which helps increase overall isolation despite residing on shared hardware. Exposed processor caches due to exploits like Meltdown are a significantly higher risk on a platform of this kind than in a containerized environment. V8 exists at a much higher level than hardware-level exploits. How does your platform mitigate these kinds of concerns? Presumably you have some kind of virtualization above this to manage rollout of your execution environment, but adding a shared execution context like V8 feels to me like it doubles the risk factor rather than reducing it.


> I’m actually quite stunned seeing that you are running code from thousands of different websites in the same process.

Are you also stunned every time you open your browser? Because it's the same thing.


Chrome has one process per tab. It seems exactly the opposite to me.


As I understand it, that was introduced only as a mitigation for Spectre.


It's a bit more complicated. Chrome has had the ability to run separate tabs in separate processes since day 1. However, quite often, JavaScript from separate web sites would end up running in the same process. Specifically, (i)frames, popup windows, and sometimes tabs created by a site would run in the same process as the creating site, even if the child window displayed a completely different domain. In this case, in fact, the sites would run in the same V8 isolate. The only separation was "context" separation, which essentially means that the objects from one site were prohibited from accessing the objects from another even though they occupied the same heap.
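
(For a rough feel of what "context" separation means in practice, Node's vm module exposes the same V8 concept -- multiple contexts in one process, one isolate, one heap. This is only an illustrative analogy, not how Chrome or Workers are implemented:)

    // Illustrative analogy using Node's vm module: two contexts share one
    // process and one V8 isolate/heap, but code in one context cannot reach
    // the globals of the other.
    const vm = require('vm');

    const siteA = vm.createContext({ secret: 'a-token' });
    const siteB = vm.createContext({});

    vm.runInContext('copied = typeof secret', siteB);
    console.log(siteB.copied);                     // "undefined"
    console.log(vm.runInContext('secret', siteA)); // "a-token"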

More recently, Chrome has introduced Site Isolation, which actually allows iframes to run in separate processes. On desktop, this is enabled by default as of Chrome 67. On Android, it is not enabled yet, because it has been shown to be too expensive. The Chrome team wants to fix this, but it's not clear (to me, at least) whether they'll be able to overcome the inherent barriers here.

Chrome started working on Site Isolation before Spectre, but Spectre accelerated interest in it. My take is that Spectre is probably not the main reason that the Chrome security people (who are awesome, BTW) want to do it, but it provided a great excuse to rally support behind it.

For Workers -- much like for Android Chrome -- process-per-tenant is still too costly to be feasible. So, we need to pursue other approaches, as I've described elsewhere in this thread.


Thank you for the very detailed reply, that was an interesting read. Much appreciated!


I appreciate the novel approach, but you should try to use more realistic numbers for the arguments in the post. A node process doing nothing does not consume 30MB of private memory (shared mappings are shared, just like the "shared runtime"), nor does it take 500ms-10s to start a process or 100us to context switch. You're off by 3+ orders of magnitude on the first and almost 2 orders on the second. Cold starts are a symptom of something else (likely shipping function data), which I don't think is really a property of the sandbox.
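
(For anyone who wants to check the baseline themselves, a bare Node process will report its own numbers -- though keep in mind that rss alone doesn't distinguish shared from private pages, which is the point above:)

    // Quick check in a bare Node process. Note that rss counts file-backed,
    // shareable pages (the node binary, libraries) as well as private
    // memory, so it overstates the marginal cost of each extra process.
    console.log(process.memoryUsage());
    // prints an object with rss, heapTotal, heapUsed, external fields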


35MB is what Amazon reports and bills you based on. If that number is wrong, a great question is what are they charging people for?

Launching a process is different than launching a language runtime or interpreter. They generally are not optimized for initial start time in the way an Isolate is. The duration of a context switch varies, but when dealing with tens of thousands of requests per second per machine they are very significant.


Using billing to measure the size or space of something is the worst kind of measure. For example, I recently got on a Greyhound with my bicycle, packaged in a box. My ticket was $27, but the bike cost $30, even though I weigh 3 to 4 times what my bike does. Humans also need thousands of liters of space, while the bicycle box only took up 200 liters or so. And the time the bus driver took to load the bicycle is negligible.


AWS uses virtualization domains to separate customer workloads, and Meltdown required that they all be HVM to provide strong guarantees: https://lists.xenproject.org/archives/html/xen-devel/2018-01...

Your account's Lambda functions won't share a kernel with anyone else's


I'm not sure why you think process start-up is such a heavyweight operation. Of course there's lots of optimization for start-up, it's the whole basis for Unix.

A context switch is O(10000) cycles, and 10k processes is totally doable.

I agree that the Isolate architecture is more efficient, I just object to the straw man comparison. (And would love some real data.)


> nor does it take 500ms-10s to start a process or 100us to context switch. You're off by 3+ orders of magnitude on the first

I'm relatively sure 10ms is not the worst case cold start time of lambda.

> Cold starts are a symptom of something else (likely shipping function data)

Which is all part of spinning up a containerised function in lambda.

About memory, despite being shared memory, does a container in lambda running a node process use 35mb of RAM? Because that's the claim.


> About memory, despite being shared memory, does a container in lambda running a node process use 35mb of RAM? Because that's the claim.

What do you mean by "use" in this context? Does every program that links to glibc "use" the memory for glibc? (The answer is no, because the page cache is shared. But if you just looked at memory usage you might be fooled into thinking it is copied for every program.)

Containers use a vanishingly small amount of in-kernel memory. We are talking less than a couple of kilobytes in the worst case.

It appears to me that the main worry was with using Kubernetes, and they've applied this to containers as a whole. I agree that in this sort of usecase Kubernetes really doesn't make sense (using container images at all makes no sense in this case), but the core kernel primitives and simple container runtimes available are more than suitable.

There is an argument that having no userland-kernel context switches has a performance improvement (and it does) but I would like to see data before accepting at face value that "context switches are expensive" is an appropriate thing to optimise for.

After all, they are still doing context switches, it's just in userspace. And if V8 is multi-threaded, they've just reinvented the N-to-M scheduling model that is well known to be fundamentally broken because of various pathologies.


Your argument is moot when it comes to the original point though, isn’t it? Can a single machine handle 10,000 containers with acceptable performance under load, versus a single process handling work for 10,000 customers? Substitute any n for 10k here; it’s an arbitrary number.


I'm not sure how I could, without having access to CloudFlare-scale compute and load, be able to answer your question.

My point is that the article in question makes claims that I'm not sure are correct, and I wonder whether the engineers behind this solution decided against containers because of their initial experience (this isn't a dig against them, this is a very common trait of almost everyone -- you don't want to waste your time if the first impression you have is negative). Several of the statements (especially about memory) appear to be based on misunderstandings of how containers would actually operate in such a scenario (or based on testing with sub-optimal configuration).

Maybe I'm completely wrong that containers would work under this kind of load in this scenario, but reading the article I didn't find many arguments I would expect to see if someone had really tried to make it work with containers and found the flaws. Instead it reads (at least to me) to be more of a "at first glance this doesn't appear to work" -- which is a fine thing to base product decisions on, but it isn't really okay to then spread (what appears to me to be) misinformation unless you have done significant testing to justify it.

So I disagree that my argument is moot -- and I would like to know how much cheaper the userland context switches are in V8 (I just looked and it appears that V8 does have a multi-threaded core now -- but hopefully CloudFlare doesn't use that with their userland threading+isolation model...).


So, I'm the architect of Cloudflare Workers. In a previous life, I created Sandstorm.io, including implementing its container engine from scratch. I do know some things about containers.

With Sandstorm.io, the rule of thumb we landed on was that a container takes 100MB of RAM. Some apps used more, some used less. This is real, empirical data.

With Workers, we're seeing an order of magnitude better.

There is no one, simple reason for this difference -- rather, it's a large number of factors working together. If I enumerated every one of them, you probably wouldn't find any of them compelling on its own.

Sure, if we could convince all our customers to write tiny C programs, with just the right constraints, maybe they could be as efficient in terms of RAM usage. Maybe. In Sandstorm, a raw C/C++/Rust app could indeed fit in a couple megabytes. But there's still context switching overhead. And no one wants to write C, they want to write JavaScript. ¯\_(ツ)_/¯


> I do know some things about containers.

Great! :D

> With Sandstorm.io, the rule of thumb we landed on was that a container takes 100MB of RAM. Some apps used more, some used less. This is real, empirical data.

I hate to keep harping on this, but what do you mean by "used" here? Are you saying that the RSS was 100MB per container, or that the sum of private mappings used was 100MB (/proc/$pid/smaps)? Did you use a filesystem like overlayfs that facilitates page-cache sharing by allowing read-only inode sharing between different containers, or a driver like {btrfs,devicemapper,aufs} that didn't? (Sandstorm was kicking around a while ago, so this might've been before overlayfs was in mainstream use.)

I know I probably look like an asshole, but I actually don't think I understand what you mean by used -- because saying that a container uses 100MB (which I take to mean that each new container spawned costs 100MB of real memory) simply doesn't sound right. It implies you could run fewer than 80 containers on an 8GB machine, and I doubt that Sandstorm programs were this big. It's like someone telling you that they spent $1500 on lunch -- I'm actually confused what the word "spent" means here.

I'm sure that you'd see less RSS memory usage by only having one process, but it is very possible that this memory usage benefit is not actually real -- I could be completely wrong, but I'm just having trouble understanding how putting the same code inside a single program could make such a difference (given that the page cache already shares the memory for the V8 binary). If you were to run 10k V8 processes, the summed RSS would look 10k times larger, but real memory usage would only increase slightly, because of the minor kernel memory cost of each "struct task_struct".

As for context switches, yeah okay that's a cost you pay by having more than one process on a machine. But you do still have context switch overhead (though obviously it's much smaller) if you're running more than one program's state in a single process -- you have to switch context to a different set of protected variables right?

> And no one wants to write C, they want to write JavaScript.

My point about page caches is that the code for V8 is in the page cache and thus running more V8 processes doesn't take up any more real memory than just running one. So that cost of running JavaScript on top should be similar (though probably slightly larger because V8 stores other program information in memory -- but definitely not 10x larger).


My understanding is that he’s talking about app-specific code. Like, yeah, the OS is smart enough to only load libc once, but php/node/python/rails/x are all going to load up their language runtimes and app-specific libraries into their processes -- the non-.so parts. Then the 100MB number makes sense.


In theory (if most of the runtime code being loaded is the same files underneath) then the page cache will also help with loading them -- you get benefits from the page cache as long as you're opening the same inode (which is why I'm talking about overlayfs -- because overlayfs allows page-cache sharing for base container images).


Even then, to get significant sharing every container would need to be deploying the same frameworks, libraries, and versions -- which the single-process, restricted environment moots anyway.


I'm confused -- the article is specifically about deploying JavaScript (or something that targets WebAssembly). From where I'm standing that seems to be a fairly homogeneous thing to be deploying (if you need to have many versions of JavaScript frameworks I don't see how that's not a problem with this solution either).

"Application containers" get a bad rap because Docker's model of a bunch of tar archives with separate root filesystems isn't actually the best way of getting density (nor is it what a lot of people really want).


A lambda cold start is not equivalent to starting a process. There's obviously a lot of things happening beyond that.

Does it take 500ms every time you type 'ls' in a console? That's starting a process.

> About memory, despite being shared memory, does a container in lambda running a node process use 35mb of RAM? Because that's the claim.

It's not the claim. I wouldn't object to someone saying "we charge X, Lambda charges Y". But the post seems to confound implementation decisions in Lambda with fundamental properties of containers and processes, and that's simply not correct.


Have you tried simply using a seccomp-isolated Linux executable for each tenant? (with rlimit or cgroup-based resource limits)

You have to create a process for each tenant, but that should be much faster and take much less memory than V8 JITting JavaScript or WebAssembly (Linux process creation is very well optimized).

You can try static linking, dynamic linking or even just forking plus an ad-hoc executable loader in userspace.


Clients are intentionally writing JavaScript against (a subset of) the Service Worker API, so you will have to spawn a V8 process & context anyway.

V8 isolates are the same separation as browser tabs, and thus are also designed to be initialized extremely quickly. I’d be surprised if spawning a new process is faster, and it’s not going to be faster if your goal is to run JS/WASM.


Basically CGIs. This would be great if worker functions were written in C, but these days Web programming is done in dynamic languages with fat runtimes where hello world uses 50 MB of RAM. They're saving all that RAM by not using multiple processes.


> If your Worker makes requests to a third-party API that is also implemented as a Cloudflare Worker, this request doesn't even leave the machine. The other worker runs locally.

Doesn't this undermine your assumption that external timers can't be used in a timing side channel attack because "the network is extremely noisy"[1]? Consider a third-party API implemented as a Cloudflare Worker that just returns Date.now(). How much noise will there be if that runs locally?

[1] https://news.ycombinator.com/item?id=18280156
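
Concretely, the timer oracle I have in mind would be nothing more than this (a sketch, not working exploit code):

    // Hypothetical "timer oracle": a third-party Worker whose only job is
    // to report the current time to whoever calls it.
    addEventListener('fetch', event => {
      event.respondWith(new Response(Date.now().toString()));
    });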


We control the target isolate's Date.now() too, and we can make it return the same value in both workers. Indeed, we can run the workers on the same thread, which makes the whole thing look more like a function call than a network request.


Nice! Thanks for your response.


Chrome recently started running each site's JavaScript in an isolated process, as a Spectre mitigation. [1] Why do you trust V8's isolation more than Chrome does?

[1] https://security.googleblog.com/2018/07/mitigating-spectre-w...


Never mind, I missed the previous discussion of that here: https://news.ycombinator.com/item?id=18418476


This isn’t true. Solaris 11 can have thousands of zones (containers) on one machine. Solaris zones are a very lightweight technology compared to many others on the market. Linux may not, but Solaris can. Yes those containers run Solaris not Linux, but I just wanted to point out that this ability is not unique.


It can also be done on Linux. Docker might not be able to do it, but that's because of its architecture and the language it's written in. There's no kernel limitation like that.


I think Joyent has pushed the limits of what is possible in that area with their tech. From the dev/ops point of view, I run Docker containers/images and they "translate" Docker API calls into SmartOS/OpenSolaris/illumos zones. NB: I don't know much about those OSs/projects.

Docker has a nice and evolving API and OS zones provide an amazing core technology.


Right, but that's because they don't actually run Docker. They have a much better designed manager process that creates LX-branded zones.

My point was that "Linux containers" (so tools like LXC, or runc, or others) are more than capable of being used in this way. Docker has a variety of architectural and other problems that result in it not being as friendly to this kind of use case.

So when you run "docker containers" on Joyent, they just untar the image and run it in an LX-branded zone. There are no Go (or other) problems, and they sure as hell don't run Docker as root in their control plane.


Yeah yeah, sorry for not being clear. I understand that they don't use Docker and just make their API quack like Docker's.


What are some container technologies that could be used for this right now?


Namespaces and cgroups (using overlayfs to get page cache sharing for the main process mappings in each container). This is a bit of a coy response, but if you have a usecase where other container solutions have a high start-up cost -- write your own. You can write a basic container runtime in tens of lines of shell script.

For existing container glue -- you could probably do this with runc (though unfortunately we are also written in Go) and the architecture is such that runc has no long-running process (only your container continues running after we've set it up). I believe LXC can also produce similar tenancy (and they're written in C which avoids quite a few of the Go problems).


chroot, cgroup, and the systemd run facilities (systemd-nspawn for example).


pivot_root, not chroot. chroot is vulnerable to many trivial escapes.


Probably, but for Solaris it’s native out-of-the-box technology.


What about the language limits the number of containers?


Hi,

Do you have plans to deploy this tech at cell towers in metro areas and support 5G-enabled applications in the future?


Yes. And a really interesting business model to incent mobile providers to install them. Stay tuned!


Yes, we would love to do that. Workers are lightweight enough that they appear to be a much more viable way to distribute compute than technologies like Kubernetes.


*More viable way to distribute JavaScript compute?


C/C++/Rust/Go all target WebAssembly.


Does V8 allow sandboxing on: cpu cycles, heap, stack and system resources? Are requests to these workers http requests? Is there any backend storage or file access (or blob) from workers?


> Does V8 allow sandboxing on: cpu cycles, heap, stack and system resources?

Yes, and that is critical to our use case.

> Are requests to these workers http requests?

Currently, yes.

> Is there any backend storage or file access (or blob) from workers?

Currently there is Workers KV (key-value storage), and we have some cooler stuff in the works.
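
A rough sketch of what KV usage looks like from inside a Worker -- the namespace binding name here (MY_KV) is just a placeholder you'd configure yourself:

    // Rough sketch; MY_KV stands in for a KV namespace binding configured
    // for this Worker.
    addEventListener('fetch', event => {
      event.respondWith(handle(event.request));
    });

    async function handle(request) {
      const key = new URL(request.url).pathname.slice(1);
      let value = await MY_KV.get(key);   // resolves to null if the key is missing
      if (value === null) {
        value = 'default';
        await MY_KV.put(key, value);
      }
      return new Response(value);
    }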


Being a non-JS person, it wasn't clear to me whether this is just for JavaScript or not. If it is, then it's not really fair to compare against containers or VMs, right?


That was my thought as well. In the section about the downsides, they do mention, though, that they should be able to run anything that compiles to WebAssembly; they specifically call out Rust and Go.

WASM really has quite a lot of interesting and unexpected applications! (Though of course non-WASM tech can accomplish the same thing, it’s nice that we are getting close to some form of universal binary format.)
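
For example, the JavaScript side can be roughly this -- how the .wasm bytes get into the Worker (a binding, a bundled buffer, a fetch) is hand-waved here, and the exported "add" function is just a made-up example:

    // Sketch only: assumes wasmBytes (an ArrayBuffer holding a module
    // compiled from Rust/Go/C) is already available to the script, and that
    // the module exports a function named "add".
    async function callWasmAdd(wasmBytes, a, b) {
      const { instance } = await WebAssembly.instantiate(wasmBytes);
      return instance.exports.add(a, b);
    }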


So hyperthreading and multicores are a security mess, but a JavaScript VM is going to safely allow "10,000 tenants" on a single machine?

Sure.


Hyperthreading, multicores, and rings were all stability boundaries hijacked to be security boundaries. It'll never work.


I think it's very reasonable that the best possible security is not implemented in microcode.


I've been closely following CF Workers and have KV beta access, but the mere fact that we had to ask and wait for access to KV essentially made our decision for us to not use workers.

When will KV be out of beta? It's hard to commit to a platform like CF Workers when it's obviously still not intended for mainstream usage and there is no public timeline.


Hi! I work on KV at Cloudflare. Thank you so much for your interest in it, and for considering Workers at all. I'm pretty convinced if you keep considering it when situations allow, you'll choose it eventually when the moment is right for the product and your company.

I'm happy to give you beta access for you to experiment. That said, we take the distinction between a beta and GA very seriously. We don't want you relying on software we're not, at the very least, using in production ourselves.

That day will come very soon! We're moving several projects into KV (allowing us to move them out of centralized data centers), and we expect KV to reach GA in the next few months.


By the way kentonv, thanks for Protobufs and Sandstorm. I work with Amir and I just wanted to provide a positive note: it's cool tech. Amir is correct that we still don't feel like the KV store is mainstream. When does the KV store go GA?


> we can host 10,000+ tenants on one machine -- something that's totally unrealistic with containers or VMs.

This is definitely possible with containers. Docker has historically had density problems for a variety of reasons (its architecture as well as Go being awful for systems programming in this context). But you can definitely get this density with a simple container runtime that doesn't have large long running processes.

Saying you can't run 10k containers is like saying you can't run 10k processes on Linux.

I'm sure that it might be somewhat faster to have it supported in V8 directly, but I don't want people to think such density is impossible with containers. (And with page-cache sharing you'd be able to not have to load 10k V8 programs into memory as well.)


I think you are missing the point. Regardless of the number of processes you can run, the overhead of starting them and context switching between them is dramatically higher than the equivalent operations within a single V8 process.


The part of the comment I was responding to was talking about it being impractical to have that many containers running, which isn't true. Later I mentioned that there are probably other benefits (having everything in one process does remove the need for TLB flushes and so on), but that isn't the only thing they said. They didn't say it's drastically faster, they said it's "unrealistic" which simply isn't true.

Also things like page cache sharing would make a huge difference to the real memory cost of having many processes. When you run 10k programs you don't have 10k copies of glibc.


Out of interest, why do you say that Go is awful for systems programming in this context?


How do you protect two isolates from interfering with one another if they land in the same process? If two of them happen to both start a calculation that takes 50ms of CPU time, do both calculations end up taking 100ms to complete? Or do you prevent concurrent execution somehow?


At 1/10,000th of a server per tenant (assuming 10,000 tenants), you can achieve “hard” isolation and far more CPU/memory per tenant by using a TinkerBoard or something like that per client and charging a flat $2/mo with no hourly fees. That’s a business model I see materializing that can eat into your business.

EDIT to clarify: I’m assuming they can do 10,000 tenants per server. If they did 1 tenant per TinkerBoard and charged a flat $2/mo with no hourly fee, that would be an interesting business model IMO, and it achieves hard isolation between tenants.


Scaleway out of France has been doing baremetal cloud nodes for years. I used them during the beta and they are fantastic (despite slight latency due to being in France). Their smallest baremetal node starts at €3/mo for 4 ARM cores/2GB RAM/50GB SSD. Pretty cool infrastructure to be able to auto provision physical nodes on demand, but I think it's a different use case from the whole lightweight serverless processes on demand thing.


If only they supported NixOS... I would use them if/when they make the leap.


Are you sure they don't? NixOS can be installed on top of any Linux distribution, unless the hardware is really weird.

ETA: It looks like it is complicated, but might work. See https://nixos.wiki/wiki/NixOS_on_ARM/Scaleway_C1


We replicate each tenant to thousands of servers across more than 150 locations worldwide. That's why we need to support so many tenants per server.


Ah. I see. TIL something about CDNs. Thanks. Good stuff.


I think you're assuming that Cloudflare has only one server?


Nope. I’m assuming they can do 10,000 tenants per server. If they did 1 tenant per TinkerBoard and charged a flat $2/mo with no hourly fee, that would be an interesting business model IMO, and it achieves hard isolation between tenants.


But a Workers customer gets to use more than one of their servers at once....


You can have multiple tenants per low-cost edge device. ARM servers are finding so many such applications. Not sure what the hang-up is?


(I sit next to Kenton)

I would actually argue with #1. The truth is, this appears to be a better tech even if it isn't globally distributed.

Lightweight serverless functions with no cold-start are a really great primitive to build on. We figured them out in this way to enable global distribution, but that's a bonus, not necessary to really love the concept.


Well done! Very creative solution to the problem and using the V8 system as a starting point is brilliant!


> The idea that you can have a stack of cloud services that depend on each other without incurring any latency or bandwidth costs will, I think, be game-changing.

This looks to me like just another form of vendor lock-in.


Running 10k containers on one machine is totally realistic though. This is just yet another language VM (e.g. JVM) isolation story all over again.


Does your implementation support vectorized SIMD or SSE instructions?


The marketing around “No VM” is interesting. It sounds like you’re relying on V8. What did you do instead of using its VM? Precompile? Or did you reimplement V8s isolates in another VM-less system? Or do you mean more specifically no OS-level VM?


That's cool. Do you use any of your tech from Sandstorm days?



