
So what I really want to know is: what happens at Cloudflare--which uses v8 to implement Cloudflare Workers in shared memory space--when this kind of stuff happens? (Their use is in some sense way more "out on a limb" than a web browser, where you would have to wait for someone to come to your likely-niche page rather than just push the attack to get run everywhere.)



This is what happens:

Within an hour of V8 pushing the fix for this, our build automation alerted me that it had picked up the patch and built a new release of the Workers Runtime for us. I clicked a button to start rolling it out. After quick one-click approvals from EM and SRE, the release went to canary. After running there for a short time to verify no problems, I clicked to roll it out world-wide, which is in progress now. It will be everywhere within a couple hours. Rolling out an update like this causes no visible impact to customers.

In comparison, when a zero-day is dropped in a VM implementation used by a cloud service, it generally takes much longer to roll out a fix, and often requires rebooting all customer VMs which can be disruptive.

> Their use is in some sense way worse than a web browser, where you would have to wait for someone to come to your likely-niche page rather than just push the attack to get run everywhere.

I may be biased but I think this is debatable. If you want to target a specific victim, it's much easier to get that person to click a link than it is to randomly land on the same machine as them in a cloud service.


> Within an hour of V8 pushing the fix for this, our build automation alerted me that it had picked up the patch and built a new release of the Workers Runtime for us. I clicked a button to start rolling it out. After quick one-click approvals from EM and SRE, the release went to canary. After running there for a short time to verify no problems, I clicked to roll it out world-wide, which is in progress now. It will be everywhere within a couple hours. Rolling out an update like this causes no visible impact to customers.

Great workflow! I long for the day when I can work for a company that actually has its automation as efficient as this.

A few questions: do you have a way of differentiating critical patches like this one? If so, does that trigger an alert for the on-call person, or do you still wait until working hours before such a change is pushed?


Look for a company whose business model includes uptime, security, and scalability, that is big enough not to outsource those parts, and that operates in a mature market where customers can tell the difference.


I once worked for a company that tried to set up a new service and asked for 99.99999% uptime. This worked really well for the 'ops' team, which focused on the AWS setup and automation, but meanwhile the developers (of which I was one, though I didn't have any say in things because I was 'just' a front-ender) fucked about with microservices, first built in NodeJS (with a Postgres database behind them storing mostly JSON blobs), then in Scala. Not because it was the best solution (neither microservices nor Scala), but because the developers wanted to, and the guys responsible for hiring were afraid they'd get mediocre developers if they went for 'just' Java.

I'm just so tired of the whole microservices and prima donna developer bullshit.


99.99999% uptime is about 3 seconds of downtime per year. Yikes! Does any service on Earth have that level of uptime?
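
(For the curious, here's the arithmetic as a quick TypeScript scratch calculation -- nothing service-specific, just seconds-per-year times the failure budget:)

    // Rough downtime budget per year for N nines of availability.
    // Assumes a 365.25-day year; purely illustrative.
    const SECONDS_PER_YEAR = 365.25 * 24 * 3600;

    for (const nines of [3, 4, 5, 7]) {
      const downtimeSeconds = SECONDS_PER_YEAR * Math.pow(10, -nines);
      console.log(`${nines} nines: ~${downtimeSeconds.toFixed(2)} s of downtime/year`);
    }
    // 7 nines works out to ~3.16 seconds/year -- one slow failover and the budget is gone.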


No. In some sense it doesn't matter though. There are plenty of services that have less than their claimed reliability:

* They set an easy measurement that doesn't match customer experience, so they say they're in-SLO when common sense suggests otherwise.

* They require customers jump through hoops to get a credit after a major incident.

* The credits are often not total and/or are tiered by reliability (so you could promise 100% uptime and still not give a 100% discount when you serve some errors). At the very most, they give the customer a free month; it's not as if they make the customer whole on their lost revenue.

With a standard industry SLA, you can have a profitable business claiming uptime you never ever achieve.


Also look at their job ads. If they are looking to hire a DevOps engineer to own their CI/CD pipeline, that means they don't have one (and, with that approach, never will).


My guess is that the main feature which enables this kind of automation is that they can take down any node without consequences. So they can just install an update on all the machines, and then reboot/restart the software on the machines sequentially. If you have implemented redundancy correctly, then software updating becomes simple.


We actually update each machine while it is serving live traffic, with no downtime.

We start a new instance of the server, warm it up (pre-load popular Workers), then move all new requests over to the new instance, while allowing the old instance to complete any requests that are in-flight.
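
(Rough sketch of the drain half of that in plain Node/TypeScript -- not our actual runtime code, just the general shape; the port is made up:)

    import http from "node:http";

    // The "old" instance: on a deploy signal, stop accepting new
    // connections but let in-flight requests run to completion.
    const server = http.createServer((_req, res) => {
      // ... handle the request, possibly slowly ...
      res.end("ok\n");
    });
    server.listen(8080);

    process.on("SIGTERM", () => {
      server.closeIdleConnections?.();   // Node 18.2+: drop idle keep-alive sockets
      server.close(() => {
        console.log("in-flight requests drained, exiting");
        process.exit(0);
      });
      // Whatever sits in front (router, load balancer, SO_REUSEPORT trickery)
      // is what actually steers *new* requests to the freshly started instance.
    });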

Fewer moving parts make it really easy to push an update at any time. :)


What happens if you have long running tasks in the worker?


Can you include me too?! I wish I could automate the hell out of everything like they do :P


You can!


Specifically you can do two things: 1) planned incremental improvements, 2) simpler designs.

For 1), write down the entire manual workflow. Start automating pieces that are easy to automate, even if someone has to run the automation manually. Continue to automate the in-between/manual pieces. For this you can use autonomation to fall back to manual work if complete automation is too difficult/risky.

For 2), look at your system's design. See where the design/tools/implementation/etc limit the ability to easily automate. To replace a given workflow section, you can a) replace some part of your system with a functionally-equivalent but easier to automate solution, or b) embed some new functionality/logic into that section of the system that extends and slightly abstracts the functionality, so that you can later easily replace the old system with a simpler one.

To get extra time/resources to spend on the automation, you can do a cost-benefit analysis. Record the manual processes' impact for a month, and compare this to an automated solution scaled out to 12-36 months (and the cost to automate it). Also include "costs" like time to market for deliverables and quality improvements. Business people really like charts, graphs, and cost saving estimates.
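
To make that last point concrete, here's a toy payback calculation in TypeScript (every number below is made up; plug in your own):

    // Back-of-the-envelope automation payback -- all figures hypothetical.
    const hoursPerManualRun = 2;
    const runsPerMonth = 20;
    const loadedHourlyCost = 100;   // $/engineer-hour
    const costToAutomate = 15_000;  // one-off build effort, $

    const monthlySavings = hoursPerManualRun * runsPerMonth * loadedHourlyCost;
    const paybackMonths = costToAutomate / monthlySavings;

    console.log(`saves ~$${monthlySavings}/month, pays for itself in ~${paybackMonths.toFixed(1)} months`);
    // Scaled out to 12-36 months, the savings dwarf the build cost -- and that's
    // before counting faster time-to-market or fewer fat-fingered deploys.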


Thank you for the rundown!


Thanks for the response! This didn't really answer what I was curious about, though: like, you answered what happens during the minutes after the fix being pushed, but I am curious about the minutes after the exploit being released, as the mention of "zero day" made me think that this bug could only have been fixed in the past few hours (and so there were likely hours of Cloudflare going "omg what now?" with engineers trying to help come up with a patch, etc.).

However... for this comment I then wanted to see how long ago that patch did drop, and it turns out "a week ago" :/... and the real issue is that neither Chrome nor Edge have merged the patch?!

https://therecord.media/security-researcher-drops-chrome-and...

> Agarwal said he responsibly reported the V8 security issue to the Chromium team, which patched the bug in the V8 code last week; however, the patch has not yet been integrated into official releases of downstream Chromium-based browsers such as Chrome, Edge, and others.

So uhh... damn ;P.

> I may be biased but I think this is debatable. If you want to target a specific victim, ...

FWIW, I had meant Cloudflare as the victim, not one of Cloudflare's users: I can push code to Cloudflare's servers and directly run it, but I can't do the same thing to a user (as they have to click a link). I appreciate your point, though (though I would also then want to look at "number of people I can quickly affect"). (I am curious about this because I want to better understand the mitigations in place by a service such as Cloudflare, as I am interested in the security ramifications of doing similar v8 work in distributed systems.)


> which patched the bug in the V8 code last week

This does not appear to be true. AFAICT the first patch was merged today:

https://chromium-review.googlesource.com/c/v8/v8/+/2820971

(It was then rapidly cherry-picked into release branches, after which our automation picked it up.)

> I am curious about this because I want to better understand the mitigations in place by a service such as Cloudflare, as I am interested in the security ramifications of doing similar v8 work in distributed systems.

Here's a blog post with some more details about our security model and defenses-in-depth: https://blog.cloudflare.com/mitigating-spectre-and-other-sec...


Thanks; FWIW, I'd definitely read that blog post, and watched the talk you gave a while back (paying careful attention to the Q&A, etc. ;P). (I had had a back-and-forth with you a while back, actually, surrounding how you limit the memory usage of Workers, and in the end I'm still unsure what strategy you went with.)

https://news.ycombinator.com/item?id=23975152

BTW: if there is any hope you can help put me in touch with people at Cloudflare who work on the Ethereum Gateway, I would be super grateful (I wanted to use it a lot--as I had an "all in on Cloudflare" strategy to help circumvent censorship--but then ran into a lot of issues and am not at all sure how to file them... a new one just cropped up yesterday, wherein it is incorrectly parsing JSON-RPC id fields). On the off chance you are interested in helping me with such a contact (and I understand if you aren't; no need to even respond or apologize ;P): I am saurik@saurik.com and I am in charge of technology for Orchid.


“The author of Cydia” is probably a more striking introduction for you :)

HN post on Orchid Protocol for curious: https://news.ycombinator.com/item?id=15576457


Haha, yeah... but that's mostly just "why I'm a bit famous" and not "why I care about this" ;P.


I would be interested to hear a response to your V8 memory limit question. Years before Cloudflare workers we isolated Parse Cloud Code workers in exactly the same way, at least at the beginning (multiple V8 isolates in the same process). One of the big issues was not really being able to set a true memory limit in V8. There was a flag, but it was pretty advisory--there were still codepaths that just tried to GC multiple times and then abort if not enough space was freed up. Not ideal when running multiple tenants in the same process.
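
(FWIW, if you're layering on top of Node rather than embedding V8 directly, worker_threads now exposes a similar knob -- still advisory-ish, but at least the host process survives. Rough sketch only, not what Parse or Cloudflare actually does; the entry point and limits are made up:)

    import { Worker } from "node:worker_threads";

    // Run tenant code in its own thread with a capped heap. If it blows
    // past the cap, V8 tries to GC and the worker is then terminated with
    // ERR_WORKER_OUT_OF_MEMORY -- the main process keeps running.
    const worker = new Worker("./tenant-code.js", {   // hypothetical tenant entry point
      resourceLimits: {
        maxOldGenerationSizeMb: 64,    // hypothetical per-tenant budget
        maxYoungGenerationSizeMb: 16,
      },
    });

    worker.on("error", (err) => {
      console.error("tenant worker died:", err.message);
    });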


We should continue our Ethereum discussion, good human. L2 is cray.


Right, the regression test appeared early (soon after Pwn2Own) and the patch was developed based on that.


Dynamic worker isolation is something I have dabbled with. I've been trying to figure out whether, once a misbehaving isolate is... isolated, it is possible to scrutinize its behavior and catch it in the act. What do you think? Would something like that even be useful? It seems to me that if an isolate is confirmed malicious, you could backtrack and identify data leaks.


I'd say that for the 0-day exploit to affect services like Cloudflare, someone would need to run the exploit on their V8 infrastructure instances first.

This would require advance knowledge of the vulnerability and someone either within Cloudflare or at one of its code dependencies to plant malicious code. Since a rolling upgrade seems to be fully automated at Cloudflare and can be done within a few hours for the complete infrastructure, I don't see CF being at high risk here.


You don’t need to know anyone at Cloudflare to run the exploit on their v8 infrastructure... you just need to sign up here: https://workers.cloudflare.com/


There is something so pleasing about this process being so fully automated that it can be managed in a few clicks. Kudos, I love reading things like this.


What checks do you perform on that upstream code before building and running it? I imagine a supply-chain attack would be devastating, even if the code only made it to canaries. Your build infrastructure at least could easily be compromised by a nefarious makefile addition.


Someone would have to get a malicious patch merged into Google's V8 release branch first. I also personally do a sanity-check review of the patches before rolling out the update.


Most VM providers now support live migration for common machine varieties, so forced reboots are uncommon.


That's impressive; in fact, I would argue that this is as close to best practice as you can get. I would love to read a blog post with details on how you set this up!


Pretty interesting, so people don't have to follow updates on those major projects manually; your pipeline actually does it for you.


The other benefit you have is that anything running in the workers has to go through CI/CD, so you’d see anyone actively trying to exploit this.

Hopefully there is also some scanning for this in place.


All code executed on Workers has to have been uploaded through our API -- we don't allow eval() or dynamically compiling Wasm at runtime. This ensures that we have a paper trail in case of an exploit.


I think you are right — it is debatable [1]. I would argue that it is easier to find ways to exploit determinism of scheduling/allocation than finding ways to exploit humans.

[1] https://hovav.net/ucsd/dist/cloudsec.pdf


Forgot to begin post with "Horse's Mouth here".


> which uses v8 to implement Cloudflare Workers in shared memory space

I'm curious how that works in practice. Specifically, the docs say (https://developers.cloudflare.com/workers/learning/how-worke...)

> Each isolate's memory is completely isolated, so each piece of code is protected from other untrusted or user-written code on the runtime.

But they don't quite specify if it's isolated at system level (separate threads with unshared memory) or something simpler ("you can't use native code, so v8 isolates your objects").


Here's a blog post about the Workers security model: https://blog.cloudflare.com/mitigating-spectre-and-other-sec...

And here's a talk I gave about how Workers works more generally (not security-focused): https://www.infoq.com/presentations/cloudflare-v8/


This page of documentation is also one of my favorite reads in the last month: https://developers.cloudflare.com/workers/learning/security-...

So many juicy details and nuanced takes that really make me appreciate the thought and care CF has put into securing workers.


What kind of exploit is it (it doesn't say)? Is it possible that further sandboxing levels would isolate the affected instance from other customers pending additional exploits (processes, VMs, etc)? This isn't really my area so apologies if what I'm saying is way off base :P


> you would have to wait for someone to come to your likely-niche page

There are a few ad networks which allow you to run JS. They may not work for this specific exploit, though.



