Hacker News | ot's comments

I would guess to develop and test software that will ultimately run on a system with 64k page size.


Is there a fundamental advantage over other page sizes, other than the convenience of 64k == 2^16?


The reason to want small pages is that the page is often the smallest unit that the operating system can work with, so bigger pages can be less efficient: you need more RAM for the same number of memory-mapped files, and tricks like guard pages or mapping the same memory twice for a ring buffer have a bigger minimum size, etc.

The reason to want pages of exactly 4k is that software is often tuned for this and may even require it from not being programmed in a sufficiently hardware-agnostic way (similar to why running lots of software on big-endian systems can be hard).

The reasons to want bigger pages are:

- there is more OS overhead tracking tiny pages

- as well as caches for memory, CPUs have caches (TLBs) for the mapping between virtual memory and physical memory, and this mapping is page-size granularity. These caches are very small (as they have to be extremely fast), so bigger pages mean memory accesses are more likely to hit a cached mapping, which means faster memory accesses.

- CPU caches are often indexed by the address bits within the minimum page size, so the max size of such a cache is page-size * associativity (for instance, an 8-way L1 with 4k pages tops out at 32k). I think it can be harder to increase the associativity than the page size, so bigger pages could allow for bigger caches, which can make some software perform better.

These things you see in practice are:

- x86 supports 2MB and 1GB pages, as well as 4KB pages. Linux can either directly give you pages of these larger sizes (a fixed number are allocated at startup by the OS) or there is a feature called ‘transparent hugepages’ where sufficiently aligned contiguous smaller pages can be merged. This mostly helps with the first two problems

- I think the Apple M-series chips have a 16k minimum page size, which might help with the third problem, but I don’t really know about them


I believe this is true for x86 as a whole, but on NT any large page must be mapped with a single protection applied to the entire page, so if the page contains read-only code and read-write data, the entire page must be marked read-write.


Yes, there are.

(As a starting point, 4k is a "page size for ants" in 2025; 4MB might be too much, however.)

But the bigger the page, the fewer TLB entries you need, and the fewer entries in your OS data structures managing memory, etc.


4K seems appropriate for embedded applications. Meanwhile 4M seems like it would be plenty small for my desktop. Nearly every process is currently using more than that. Even the lightest is still coming in at a bit over 1M


1M is a huge waste of memory.

Imagine writing out a one sentence note in notepad and the resulting file being 1M on disk.


Yet when I reference the running processes on my desktop something like 90% of them have more than 16M resident. So it doesn't appear that even an 8M page size would waste much on a modern desktop during typical usage.

If I'm mistaken about some low level detail I'd be interested to learn more.


64k is the largest page size that the ARM architecture supports. The large page size provides advantages for applications which allocate large amounts of memory.


Yes! Data workloads fare considerably better with larger pages: less TLB pressure and a higher cache hit rate. I wrote a tutorial about this and how to figure out whether it will be a good trade-off for your use-case: https://amperecomputing.com/tuning-guides/understanding-memo...


> While the company insists that “nothing is shared unless you choose to post it,” the app nonetheless nudges people to share—and overshare—whether they fully realize it or not.


On Linux you'd do this by sending a signal to the thread you want to analyze, and then the signal handler would take the stack trace and send it back to the watchdog.

The tricky part is ensuring that the signal handler code is async-signal-safe (which pretty much boils down to "ensure you're not acquiring any locks and be careful about reentrant code"), but at least that only has to be verified for a self-contained small function.

Is there anything similar to signals on Windows?


The closest thing is a special APC enqueued via QueueUserAPC2 [1], but that's relatively new functionality in user-mode.

[1] https://learn.microsoft.com/en-us/windows/win32/api/processt...


The 2 implies an older API: its predecessor, QueueUserAPC, has been around since the XP days.

The older API is less like signals and more like cooperative scheduling, in that it waits for the target thread to enter an "alertable" state (i.e. the thread is executing a sleep or a wait) before it runs.


> The 2 implies an older API: its predecessor, QueueUserAPC, has been around since the XP days.

I wasn’t implying that APCs were new, I was implying that the ability to enqueue special (as opposed to normal) APCs from user-mode is new. And of course, that has always been possible from kernel-mode with NT.


Or SetThreadContext() if you want to be hardcore. (not recommended)


Why not recommended? As far as things close to signals go, this is how you implement signals in user land on Windows (along with suspending/resuming the thread). You can even take locks later in the handler, as long as you also took them before sending the signal. Those are the same restrictions as fork, actually, except that atfork hooks are not accessible here, and in my experience the popular libcs' own atfork handlers are often full of fork-unsafe data races and deadlock bugs anyway.


I’ve implemented them as you describe, but it’s still a bit hacky due to lots of corner cases — what if your target thread is currently executing in the kernel?

The special APC is nicer because the OS is then aware of what you’re doing— it will perform the user-mode stack changes while transitioning back to user-mode and handle cleanup once the APC queue is drained.


Karma is stored in a 16 bit integer. It overflowed.


That's cool


Being more robust to fragile compiler optimizations is also a nontrivial benefit. An interpreter loop is an extremely specialized piece of code whose control flow is too important to be left to compiler heuristics.

If the desired call structure can be achieved in a portable way, that's a win IMO.


The fact that they categorize FNV-1a as "Good all-rounder, decent speed and collision resistance" was an immediate red flag for me.

It does look like an article written more than 10 years ago.


Certainly when I saw FNV-1 and FNV-1a called out separately, I assumed we were going to see a point made about an early period in hashing, and then in a paragraph or two we'd get "real" options.


That is a ridiculously well made video, thanks for sharing!


Previous discussion (Jun 2024): https://news.ycombinator.com/item?id=40753989


Thanks! Macroexpanded:

Formal methods: Just good engineering practice? - https://news.ycombinator.com/item?id=40753989 - June 2024 (149 comments)

15 Years of Formal Methods at AWS: Just Good Engineering Practice? - https://news.ycombinator.com/item?id=40283052 - May 2024 (1 comment)


Thanks, I couldn't find it via Google search on this website for some reason.


Google search? Use the HN search on site here: https://hn.algolia.com/?q=https%3A%2F%2Fbrooker.co.za%2Fblog...


I’m glad you couldn’t, if finding it would have kept you from posting it today.


Yeah I wasn't going to post it if it was shared before. But I am happy that some are happy that it's reposted.


Nothing wrong with reposts, it's just useful to link to previous discussions for context :)


Arrogance being one of the main criticisms in the article is a little ironic, coming from bcantrill.


Speaking as a long-time watcher and enjoyer of his presentations: he has a long history of building (often) superior products to the mainstream alternatives, and passionately explaining why they are superior. The alternatives still end up being more popular, and there is no denying that there's sometimes an air of "those fools with their eBPF, while we have dtrace/ZFS/zones/etc" which verges on arrogance.

I think it's pretty hard to have a career where you are constantly involved with software that is better technically, and have worse alternatives constantly "winning", and not be a little arrogant sometimes.


I think you're confusing arrogance with confidence. Directness and openness paired with underlying respect for others can reflect confidence, not arrogance.

Tackling difficult conversations or addressing challenges head on may be seen as "close to arrogance" but they are in fact a net positive for everyone.


He described his response to watching a video where Gelsinger behaved as if his own work at Intel consisted of interesting successes and the company's failings were due to other people. What's ahem ironic about okay I'm just going to stop


Definitely not a totally unfair criticism (though do keep in mind that I was not an executive at Sun, whereas Gelsinger was Intel's CTO), but I did try to be forgiving of that as confidence. It was his line about NVIDIA that I felt went (well) over the line. You certainly won't find me saying that "AWS would be a fraction of its size if Oracle hadn't cancelled Sun's cloud initiative" or whatever the equivalent might be.


I appreciate your taking what was meant to be mostly a humorous comment in the spirit which was intended.


What that code does is a per-byte-pair popcount, which is not what the POPCNT instruction does (it computes the popcount for the whole word).

On processors with BMI2 the whole algorithm reduces to a PDEP as mentioned in another comment, but if you don't have that, this is pretty much the best you can do (unless you use lookup tables, which have their own pros and cons).

