
alexgartrell · 2026-03-17T01:15:24 1773710124

More for the peanut gallery: I worked with both of these guys at Meta on this.

The "servers are only on for a few hours" thing was like never true so I have no idea where that claim is coming from. The web performance test took more than a few hours to run alone and we had way more aggressive soaks for other workloads.

My recollection was that "write zeroes" just became a cheaper operation between '12 and '14.

A fun fact to distract from the awkwardness: a lot of the kernel work done in the early days was exceedingly scrappy. The port mapping stuff for memcached UDP before SO_REUSEPORT for example. FB binaries couldn't even run on vanilla Linux a lot of the time. Over the next several years we put a TON of effort into getting as close to mainline as possible and now Meta is one of the biggest drivers of Linux development.

ot · 2026-03-17T06:46:20 1773729980

It's not just that zeroing got cheaper, but also we're doing a lot less of it, because jemalloc got much better.

If the allocator returns a page to the kernel and then immediately asks back for one, it's not doing its job well: the main purpose of the allocator is to cache allocations from the kernel. Those patches are pre-decay, pre-background purging thread; these changes significantly improve how jemalloc holds on to memory that might be needed soon. Instead, the zeroing out patches optimize for the pathological behavior.

Also, the kernel has since exposed better ways to optimize memory reclamation, like MADV_FREE, which is a "lazy reclaim": the page stays mapped to the process until the kernel actually needs it, so if we use it again before that happens, the whole unmapping/mapping is avoided, which saves not only the zeroing cost, but also the TLB shootdown and other costs. And without changing any security boundary. jemalloc can take advantage of this by enabling "muzzy decay".

However, the drawback is that system-level memory accounting becomes even more fuzzy.

(hi Alex!)
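
To make the "lazy reclaim" point concrete, here is a minimal sketch in Python, assuming Linux and Python 3.8+ (for `mmap.madvise`). The function name `release_and_reuse` is made up for illustration, and this is not how jemalloc does it internally; it only shows that a range marked with MADV_FREE stays mapped and can be written again without a fresh mmap().

```python
import mmap

PAGE = mmap.PAGESIZE


def release_and_reuse(num_pages: int = 4) -> int:
    """Dirty some anonymous pages, hand them back lazily, then reuse them."""
    # Private anonymous mapping: the kind of memory an allocator manages.
    buf = mmap.mmap(-1, num_pages * PAGE,
                    flags=mmap.MAP_PRIVATE | mmap.MAP_ANONYMOUS)
    try:
        buf[:] = b"\xab" * len(buf)  # touch every page so it is really backed

        # Prefer the lazy MADV_FREE; fall back to the eager MADV_DONTNEED
        # when the constant is missing or the kernel rejects it.
        for name in ("MADV_FREE", "MADV_DONTNEED"):
            advice = getattr(mmap, name, None)
            if advice is None:
                continue
            try:
                buf.madvise(advice)
                break
            except OSError:
                continue

        # The range is still mapped, so no new mmap() is needed to reuse it.
        # Writing to a page cancels the pending lazy free for that page.
        buf[0:PAGE] = b"\xcd" * PAGE
        return buf[0]
    finally:
        buf.close()
```

Between the madvise call and the next write, the old contents are not guaranteed: a read may see the old data or zeros, which is why the sketch writes before it reads.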

menaerus · 2026-03-17T13:33:44 1773754424

I am trying to understand the reason behind why "zeroing got cheaper" circa 2012-2014. Do you have some plausible explanations that you can share?

Haswell (2013) doubled the store throughput to 32 bytes/cycle per core, and Sandy Bridge (2011) doubled the load throughput to the same, but the dataset being operated on at FB is most likely much larger than what L1+L2+L3 can fit. So I am wondering how much effect the vectorization engine might have had, since a bulk-zeroing operation on large datasets is going to be bottlenecked anyway by the single-core memory bandwidth, which at the time was ~20GB/s.

Perhaps the operation became cheaper simply because of moving to another CPU uarch with higher clock and larger memory bandwidth rather than the vectorization.
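
A back-of-the-envelope sketch of that comparison, using only the 32 bytes/cycle and ~20GB/s figures quoted above. The 3 GHz clock and the 4 KiB page size are assumptions added for illustration, and the helper names are hypothetical.

```python
PAGE_BYTES = 4096            # assumed 4 KiB page
STORE_BYTES_PER_CYCLE = 32   # peak store throughput quoted above
MEM_BANDWIDTH = 20e9         # ~20 GB/s single-core bandwidth quoted above
CLOCK_HZ = 3e9               # hypothetical 3 GHz clock, not from the thread


def cycles_at_peak_store_rate(nbytes: int) -> float:
    """Cycles to zero nbytes if every store hits cache at the peak rate."""
    return nbytes / STORE_BYTES_PER_CYCLE


def ns_at_peak_store_rate(nbytes: int) -> float:
    """Same as above, converted to nanoseconds at the assumed clock."""
    return cycles_at_peak_store_rate(nbytes) / CLOCK_HZ * 1e9


def ns_at_memory_bandwidth(nbytes: int) -> float:
    """Nanoseconds to zero nbytes if limited by single-core bandwidth."""
    return nbytes / MEM_BANDWIDTH * 1e9


if __name__ == "__main__":
    print(cycles_at_peak_store_rate(PAGE_BYTES))
    print(ns_at_peak_store_rate(PAGE_BYTES))
    print(ns_at_memory_bandwidth(PAGE_BYTES))
```

Under these assumptions a page costs 128 cycles (about 43 ns) at the peak store rate but about 205 ns at memory bandwidth, so for data that misses cache the wider stores alone would not explain much of the change.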

jcalvinowens · 2026-03-17T15:07:30 1773760050

My memory is that Ivy Bridge was when it started being different.

ahoka · 2026-03-17T15:20:00 1773760800

AVX maybe?

adsharma · 2026-03-17T01:30:15 1773711015

[ Edit: "servers" in this context meant the HHVM server processes, not the physical server, which of course had a longer uptime ]

People got promoted for continuous deployment

https://engineering.fb.com/2017/08/31/web/rapid-release-at-m...

I think it's fair to say the hardware changed, the deployment strategy changed and the patches were no longer relevant, so we stopped applying them.

When I showed up, there were 100+ patches on top of a 2009 kernel tree. I reduced the size to about 10 or so critical patches, rebased them at a 6-month cadence over 2-3 years. Upstreamed a few.

Didn't go around saying those old patches were bad ideas and I got rid of them. How you say it matters.

alexgartrell · 2026-03-17T01:50:18 1773712218

The linked article says they decided to do CD in 2016 fwiw so that's not inconsistent with what I said.

You reduced the number of patches a lot and also pushed very hard to get us to 3.0 after we sat on 2.6.38 ~forever. Which was very appreciated, btw. We built the whole plan going forward based on this work.

I'm not arguing that anyone should be nice to anyone or not (it's a waste of breath when it comes to Linux). I'm just saying that the benchmarking was thorough and that contemporary 2014 hardware could zero pages fast.

yalok · 2026-03-17T07:10:21 1773731421

Tangentially, on this CD policy - it leads to really high p99s for a long tail of rare requests which don’t get reliable prewarming due to these frequent HHVM restarts…

1bpp · 2026-03-17T04:24:08 1773721448

This is why I always read the comments here.

genxy · 2026-03-17T05:24:42 1773725082

That is, wow, a story.

At what point did you realize how different fb engineering was from what you expected?

hedayet · 2026-03-17T07:32:47 1773732767

For me it happened around my first week after the bootcamp, so about 6 weeks from joining.

An important nuance - most Facebook engineers don't believe that Facebook/Meta would continue to grow next year; and that disbelief had been there since as early as 2018 (when I joined).

Very few Facebook employees use their products outside of testing, which is a big contributor to that fear - they just can't believe that there are billions of people who would continue to use apps to post what they had for lunch!

And as a result of that lack of faith, most of them believe that Meta is a bubble and can burst at any point. Consequently, everyone works for the next performance review cycle, and most are just in a rush to capture as much money as they can before that bubble bursts.

specialist · 2026-03-17T10:57:56 1773745076

> don't believe that Facebook/Meta would continue to grow next year

Huh.

The time I worked at a hyper growth company, us working in the coal mine had much the same skepticism. Our growth rate seemed ridiculous, surely we're over building, how much longer can this last?!

Happily, the marketing research team regularly presented stuff to our department. They explained who our customers were, projected market sizes (regionally, internationally), projected growth rates, competitive analysis (incumbents and upstarts), etc.

It helped so much. And although their forecasts seemed unbelievable, we overperformed year over year. Such that you sort of start to trust the (serious) marketing research types.

eduction · 2026-03-17T03:27:27 1773718047

[flagged]

Salgat · 2026-03-17T04:26:32 1773721592

I'm personally appreciative of these comments. It's good that people make claims, get challenged, and that both sides walk away with informative points having been made. It's entirely possible both sides here are correct and wrong in their own way.

yalok · 2026-03-17T07:16:50 1773731810

Fwiw, this sounds like a healthy discourse - you don’t have to agree on everything, every approach has its merits, code that ends up shipping and supporting production wins the argument in some sense…

This is not special to Meta in any way, I have observed it in every team that has more than 1 strong senior engineer.

menaerus · 2026-03-17T13:02:57 1773752577

No, calling out your ex colleague in public years after is not a "healthy discourse" ...

bigstrat2003 · 2026-03-17T18:49:43 1773773383

There's nothing healthy about holding on to a work grudge from 10 years ago and then dragging it out in public. That's toxic AF.

debo_ · 2026-03-17T09:11:28 1773738688

This is literally how pretty much every conversation goes when you work with people close to the metal. It's a stylistic thing at this point.

For what it's worth, 20 years ago all programming newsgroups were like this. I grew my thick skin on alt.lang.perl lol

teiferer · 2026-03-17T14:35:44 1773758144

Except one is an employee and the other one is an ex employee. The bias this introduces is not just a minor nuance, it's what fuels the public conflict and causes everybody else to double check their popcorn reserves.

Of course technical discussions happen all the time at companies between competent people. But you don't do that in public, nor is this a technical debate: "I don't recall talking to you about it" - "I do, I did xyz then you ignored me" - "<changes subject>"

adsharma · 2026-03-17T15:52:44 1773762764

Important distinction yes. It also means I can't go back and check the thread on what was said and when. Nor do I want to.

Always good to talk face to face if you have strong feelings about something. When I said "talk" I meant literally face to face.

Spending a decade or so on lkml, everyone develops a thick skin. But mix it with the corporate environment, Facebook 2011, being an ex-employee adds more to the drama.

Having read through the comments here, I'm still of the opinion that any HW changes had a secondary effect and the primary contributor was a change in how HHVM/jemalloc interacted with MADV.

One more suggestion: evaluate more than one app and company wide profiling data to make such decisions.

One of the challenges in doing so is the large contingent of people who don't have an understanding of CPU uarch/counters and yet have a negative opinion of their usefulness to make decisions like this.

So the only tool you are left with is to run large scale rack level tests in a close to prod env, which has its own set of problems and benefits.

menaerus · 2026-03-18T06:35:42 1773815742

Perf counters are only indicative of certain performance characteristics at the uarch level, but when one improves one or more aspects of it, the result does not necessarily correlate positively with actual measurable performance gains in E2E workloads deployed on a system.

That said, one of the comments above suggests that the HW change was a switch to Ivy Bridge, when zeroing memory became cheaper, which is a bit unexpected (to me). So you might be more right when you say that the improvement was the result of memory allocation patterns and jemalloc.

alexgartrell · 2025-09-15T08:22:21 1757924541

I did something similar a long time ago https://github.com/facebookresearch/py2bpf

It was definitely a toy, I transliterated from python bytecode (a stack based vm) into bpf. I also wrote the full code gen stack myself (bpf was simpler back then)

But using llvm and not marrying things to cpython implementation makes this approach way better

varunrmallya · 2025-09-15T14:59:28 1757948368

Thank you! Ours is a toy for now as well, but I think the idea is pretty good, so we'll continue to work on it. (This was actually a hackathon project, so the code is pretty messy and not something I am proud of)

alexgartrell · 2025-06-14T09:05:01 1749891901

Not sure it’s relevant in the cloths these guys take

alexgartrell · 2025-05-25T05:01:57 1748149317

The cloud business model is to use scale and customer ownership to crush hardware margins to dust. They’re also building their own accelerators to try to cut Nvidia out altogether.

cbg0 · 2025-05-25T08:12:18 1748160738

I've always felt that the business model is nickel & diming for things like storage/bandwidth and locking in customers with value-add black box services that you can't easily replace with open source solutions.

Just took a random server: https://instances.vantage.sh/aws/ec2/m5d.8xlarge?duration=mo... - to get a decent price on it you need to commit to three years at $570 per month (no storage or bandwidth included). Over the course of 3 years that's $20520 for a server that's ~10K to buy outright, and even with colo costs over the same time frame you'll spend a lot less, so not exactly crushing those margins to dust.
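
The arithmetic behind that comparison, as a small sketch. It uses only the $570/month and ~10K figures quoted above, treats "~10K" as exactly $10,000, and leaves colo costs as the unknown; the function names are hypothetical.

```python
MONTHLY_RESERVED = 570   # USD per month on the 3-year commitment quoted above
MONTHS = 36              # three years
PURCHASE_PRICE = 10_000  # "~10K" quoted above, treated as exact here


def cloud_total() -> int:
    """Total reserved-instance spend over the commitment."""
    return MONTHLY_RESERVED * MONTHS


def breakeven_monthly_colo() -> float:
    """Monthly colo spend at which buying outright costs the same as cloud."""
    return (cloud_total() - PURCHASE_PRICE) / MONTHS


if __name__ == "__main__":
    print(cloud_total())
    print(round(breakeven_monthly_colo(), 2))
```

With these inputs the three-year total is $20,520, and buying outright stays cheaper as long as colo, power and everything else come to less than roughly $292 per month for that one server.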

shrubble · 2025-05-25T10:19:45 1748168385

Cloud is propped up by the tax laws.

Cloud bills can be written off in the month in which they are paid; while buying hardware has to be depreciated over years.

sokoloff · 2025-05-25T11:28:51 1748172531

Section 179 allows immediate expensing of equipment including computers, but is limited to $1.25M/yr. That’s enough for many small and medium businesses.

alexgartrell · 2025-05-25T04:58:06 1748149086

I’d imagine that these clouds are probably being incentivized to participate

alexgartrell · 2025-04-09T16:09:01 1744214941

I don’t think pivot_root is necessary for something like this, but a new mount namespace will definitely help avoid creating a mess on accident

alexgartrell · on Feb 8, 2025