Within an hour of V8 pushing the fix for this, our build automation alerted me that it had picked up the patch and built a new release of the Workers Runtime for us. I clicked a button to start rolling it out. After quick one-click approvals from EM and SRE, the release went to canary. After running there for a short time to verify no problems, I clicked to roll it out world-wide, which is in progress now. It will be everywhere within a couple hours. Rolling out an update like this causes no visible impact to customers.
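For anyone curious, the stage progression can be modeled abstractly like this toy sketch (the stage names, health check, and rollback behavior are my own illustration, not Cloudflare's actual tooling):

```python
# Hypothetical model of a staged rollout: build -> approvals -> canary -> global.
# A canary must pass its health check before promotion; a failing canary rolls
# back. All names here are invented for illustration.
STAGES = ["built", "approved", "canary", "global"]

def promote(release, canary_healthy=lambda: True):
    """Advance a release one stage; a canary must pass its health check first."""
    i = STAGES.index(release["stage"])
    if release["stage"] == "canary" and not canary_healthy():
        release["stage"] = "built"   # roll back; wait for a rebuilt release
        return "rolled-back"
    release["stage"] = STAGES[min(i + 1, len(STAGES) - 1)]
    return release["stage"]
```

The key property is that "global" is only reachable through a healthy canary; there is no edge that skips it.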
In comparison, when a zero-day is dropped in a VM implementation used by a cloud service, it generally takes much longer to roll out a fix, and often requires rebooting all customer VMs which can be disruptive.
> Their use is in some sense way worse than a web browser, where you would have to wait for someone to come to your likely-niche page rather than just push the attack to get run everywhere.
I may be biased but I think this is debatable. If you want to target a specific victim, it's much easier to get that person to click a link than it is to randomly land on the same machine as them in a cloud service.
Great workflow! I long for the day when I can work for a company that actually has automation as efficient as this.
A few questions: do you have a way of differentiating critical patches like this one? If so, does that trigger an alert for the on-call person, or do you still wait until working hours before pushing such a change?
I'm just so tired of the whole microservices and prima donna developer bullshit.
* They set an easy measurement that doesn't match customer experience, so they say they're in-SLO when common sense suggests otherwise.
* They require customers to jump through hoops to get a credit after a major incident.
* The credits are often partial and/or tiered by the severity of the outage, so a provider can miss its uptime target (by serving some errors) and still owe far less than a 100% credit. At the very most, they give the customer a free month; they don't make the customer whole on lost revenue.
With a standard industry SLA, you can run a profitable business while claiming uptime you never actually achieve.
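To make the tiered-credit point concrete, here's a toy credit schedule in the style of common cloud SLAs (the tiers and numbers are illustrative, not any specific provider's):

```python
# Illustrative tiered SLA credit: the percentages below are made up, but the
# shape (no credit while "in SLO", partial credits below that, capped at one
# free month) mirrors common industry SLAs.
def monthly_uptime(error_minutes, minutes_in_month=30 * 24 * 60):
    """Uptime percentage for a 30-day month with the given error minutes."""
    return 100 * (1 - error_minutes / minutes_in_month)

def credit_pct(uptime):
    """Credit as a percentage of one monthly bill -- never of lost revenue."""
    if uptime >= 99.9:
        return 0       # "in SLO": no credit at all
    if uptime >= 99.0:
        return 10
    if uptime >= 95.0:
        return 25
    return 100         # capped at one free month, however bad the outage
```

Five hours of errors in a month (uptime about 99.31%) earns only a 10% credit, and even a catastrophic month tops out at one free month's bill.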
We start a new instance of the server, warm it up (pre-load popular Workers), then move all new requests over to the new instance, while allowing the old instance to complete any requests that are in-flight.
Fewer moving parts makes it really easy to push an update at any time. :)
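A minimal sketch of that drain-and-handover step, with toy names (the real runtime is of course far more involved than this):

```python
import threading
import time

class Instance:
    """Toy model of one runtime instance; names are illustrative only."""
    def __init__(self, name):
        self.name = name
        self.accepting = True
        self.in_flight = 0
        self._lock = threading.Lock()

    def handle(self, work):
        """Run one request, unless this instance has stopped accepting."""
        with self._lock:
            if not self.accepting:
                return False          # router must send this to the new instance
            self.in_flight += 1
        try:
            work()
        finally:
            with self._lock:
                self.in_flight -= 1
        return True

    def drain(self, poll=0.01):
        """Stop taking new requests, then wait for in-flight ones to finish."""
        with self._lock:
            self.accepting = False
        while True:
            with self._lock:
                if self.in_flight == 0:
                    return
            time.sleep(poll)

def rolling_update(old, new, warm_up):
    """Warm the new instance, then drain the old one of in-flight requests."""
    warm_up(new)   # e.g. pre-load popular Workers before taking traffic
    old.drain()    # old instance finishes what it has and gets nothing new
```

Because the old instance keeps serving until its last in-flight request completes, the swap is invisible to clients.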
For 1), write down the entire manual workflow. Start automating pieces that are easy to automate, even if someone has to run the automation manually. Continue to automate the in-between/manual pieces. For this you can use autonomation to fall back to manual work if complete automation is too difficult/risky.
For 2), look at your system's design. See where the design/tools/implementation/etc limit the ability to easily automate. To replace a given workflow section, you can a) replace some part of your system with a functionally-equivalent but easier-to-automate solution, or b) embed some new functionality/logic into that section of the system that extends and slightly abstracts the functionality, so that you can later easily swap the old system for a simpler one.
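Option (b) is essentially an adapter/strangler approach: put a thin interface in front of the legacy step so callers never see which implementation runs. A minimal sketch with made-up names:

```python
# Sketch of abstracting one workflow step behind an interface so the manual
# implementation can later be swapped for an automated one. All class and
# function names are invented for illustration.
class DeployStep:
    def run(self, artifact: str) -> str:
        raise NotImplementedError

class LegacyManualStep(DeployStep):
    def run(self, artifact: str) -> str:
        # today: page a human / follow the runbook; eventually: delete this class
        return f"manually deployed {artifact}"

class AutomatedStep(DeployStep):
    def run(self, artifact: str) -> str:
        return f"auto-deployed {artifact}"

def pipeline(step: DeployStep, artifact: str) -> str:
    # Callers depend only on the abstraction, so replacing the step
    # is a one-line change at the call site.
    return step.run(artifact)
```

The payoff is that migrating a step from manual to automated never touches the rest of the workflow.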
To get extra time/resources to spend on the automation, you can do a cost-benefit analysis. Record the manual processes' impact for a month, and compare this to an automated solution scaled out to 12-36 months (and the cost to automate it). Also include "costs" like time to market for deliverables and quality improvements. Business people really like charts, graphs, and cost saving estimates.
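As a worked example of that comparison (all numbers invented; substitute your own measurements):

```python
# Toy cost-benefit comparison: manual process vs. automating it.
# Every figure below is made up for illustration.
manual_hours_per_month = 20      # measured over one month
hourly_cost = 100                # fully-loaded engineer cost, $
automation_build_cost = 12_000   # one-time cost to build the automation, $
automation_run_cost = 100        # maintenance per month, $

def cumulative_cost(months, monthly, upfront=0):
    return upfront + months * monthly

def breakeven_month(horizon=36):
    """First month at which automation is no more expensive than manual work."""
    for m in range(1, horizon + 1):
        manual = cumulative_cost(m, manual_hours_per_month * hourly_cost)
        automated = cumulative_cost(m, automation_run_cost, automation_build_cost)
        if automated <= manual:
            return m
    return None
```

With these numbers the automation pays for itself in month 7, and over a 36-month horizon saves $56,400 before even counting time-to-market and quality improvements.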
However... for this comment I then wanted to see how long ago that patch dropped, and it turns out it was a week ago :/... and the real issue is that neither Chrome nor Edge has shipped the patch?!
> Agarwal said he responsibly reported the V8 security issue to the Chromium team, which patched the bug in the V8 code last week; however, the patch has not yet been integrated into official releases of downstream Chromium-based browsers such as Chrome, Edge, and others.
So uhh... damn ;P.
> I may be biased but I think this is debatable. If you want to target a specific victim, ...
FWIW, I had meant Cloudflare as the victim, not one of Cloudflare's users: I can push code to Cloudflare's servers and directly run it, but I can't do the same to a browser user, who has to click a link first. I appreciate your point, though (and I would also then want to look at "number of people I can quickly affect"). I'm curious because I want to better understand the mitigations a service like Cloudflare has in place, as I'm interested in the security ramifications of doing similar V8 work in distributed systems.
This does not appear to be true. AFAICT the first patch was merged today:
(It was then rapidly cherry-picked into release branches, after which our automation picked it up.)
> I am curious about this because I want to better understand the mitigations in place by a service such as Cloudflare, as I am interested in the security ramifications of doing similar v8 work in distributed systems.
Here's a blog post with some more details about our security model and defenses-in-depth: https://blog.cloudflare.com/mitigating-spectre-and-other-sec...
BTW: if there is any hope you can help put me in touch with people at Cloudflare who work on the Ethereum Gateway, I would be super grateful (I wanted to use it a lot--as I had an "all in on Cloudflare" strategy to help circumvent censorship--but then ran into a lot of issues and am not at all sure how to file them... a new one just cropped up yesterday, wherein it is incorrectly parsing JSON-RPC id fields). On the off chance you are interested in helping me with such a contact (and I understand if you aren't; no need to even respond or apologize ;P): I am email@example.com and I am in charge of technology for Orchid.
HN post on Orchid Protocol for the curious: https://news.ycombinator.com/item?id=15576457
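On the id-field point: per the JSON-RPC 2.0 spec, an `id` may be a String, a Number, or Null, and the response must echo it back unchanged (no coercing `"1"` to `1`, for instance). A toy conformance check (the gateway behavior here is assumed for illustration, not Cloudflare's actual code):

```python
import json

def echo_id(request_json):
    """Validate a JSON-RPC 2.0 id and echo it back unchanged in a stub reply.

    Per the JSON-RPC 2.0 spec, id must be a String, Number, or Null.
    The "result" value here is a placeholder, not a real RPC response.
    """
    req = json.loads(request_json)
    rid = req.get("id")
    if not (rid is None or isinstance(rid, (str, int, float))):
        raise ValueError("invalid JSON-RPC id type")
    return json.dumps({"jsonrpc": "2.0", "result": "0x0", "id": rid})
```

A string id like `"1"` must come back as the string `"1"`, not the number `1`; a gateway that coerces it breaks callers that match responses to requests by exact id.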
This would require advance knowledge of the vulnerability, plus someone either within Cloudflare or at one of its code dependencies planting malicious code. Since a rolling upgrade at Cloudflare seems to be fully automated and can cover the complete infrastructure within a few hours, I don't see CF being at high risk here.
Hopefully there is also some scanning for this in place.