On one hand it means that the run of replicated page table entries for a single mapping takes up half a cache line in the 16KB case, and two whole cache lines in the 64KB case. This really cuts down on the page walker hardware's ability to effectively prefetch TLB entries, leading to basically the same issues as in this classic discussion of why tree-based page tables are generally more effective than hash-based page tables (shifted forward in time to today's gate counts): https://yarchive.net/comp/linux/page_tables.html This is why ARM shifted from an Svnapot-like solution to the "translation granule queryable and partially selectable at runtime" solution.
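To make that footprint concrete, here's a quick C sketch of the arithmetic, assuming 8-byte RV64 PTEs, a 64-byte cache line, and a 4KB base granule (my numbers for a typical setup, not anything quoted from the spec):

```c
#include <stdio.h>

/* Under a Svnapot-style scheme, the translation for a larger page is a run
 * of contiguous replicated 8-byte leaf PTEs at the 4 KB base granularity. */
int main(void) {
    const unsigned pte_bytes  = 8;    /* RV64 PTE size */
    const unsigned line_bytes = 64;   /* typical cache line */
    const unsigned base_page  = 4096; /* base translation granule */

    unsigned page_sizes[] = { 4 << 10, 16 << 10, 64 << 10 };
    for (int i = 0; i < 3; i++) {
        unsigned ptes  = page_sizes[i] / base_page;  /* replicated PTEs */
        unsigned bytes = ptes * pte_bytes;           /* PTE footprint */
        printf("%3u KB page: %2u PTEs = %3u bytes = %.2f cache lines\n",
               page_sizes[i] >> 10, ptes, bytes, (double)bytes / line_bytes);
    }
    return 0;
}
/* 4 KB -> 8 bytes (1/8 of a line), 16 KB -> 32 bytes (half a line),
 * 64 KB -> 128 bytes (two whole lines). */
```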
Another issue is that a big reason to switch to 16KB or even 64KB pages is to allow a larger index range for VIPT caches. You want high-performance implementations to be able to look up the cache line in parallel with the TLB lookup, then compare the tag with the result of the TLB lookup. This means that, practically, only the untranslated bits of the address can be used by the set-selection portion of the cache lookup. With 12 untranslated bits in an address and 64-byte cache lines, you get 64 sets; multiply that by 8 ways and you get the 32KB L1 caches very common in systems with 4KB pages (sometimes with heroic effort to throw a ton of transistors/power at the problem and build a 64KB cache by essentially duplicating large parts of the cache lookup hardware for that extra bit of address). What you really want is for the architecture to be able to disallow 4KB pages, as on Apple silicon, which is the main thing that allows their giant 128KB and 192KB L1 caches.
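A small sketch of that set-count arithmetic (the 8 ways and 64-byte lines are assumptions, chosen to match the common configurations above):

```c
#include <stdio.h>

/* VIPT sizing constraint: only the page-offset bits are untranslated, so
 * sets * line_size must fit within one page, and the largest cache that
 * needs no alias tricks is page_size * ways. Illustrative parameters. */
int main(void) {
    const unsigned line = 64, ways = 8;
    unsigned page_sizes[] = { 4 << 10, 16 << 10 };
    for (int i = 0; i < 2; i++) {
        unsigned sets = page_sizes[i] / line;    /* index bits come from the page offset */
        unsigned max_cache = sets * line * ways; /* largest "free" VIPT cache */
        printf("%2u KB pages: %4u sets x %u ways x %u B = %3u KB L1\n",
               page_sizes[i] >> 10, sets, ways, line, max_cache >> 10);
    }
    return 0;
}
/* 4 KB pages  ->  64 sets -> 32 KB  (the classic L1 size)
 * 16 KB pages -> 256 sets -> 128 KB (Apple-silicon-class L1 sizes) */
```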
> What you really want is for the architecture to be able to disallow 4KB pages, as on Apple silicon, which is the main thing that allows their giant 128KB and 192KB L1 caches.
Minor nit, but they do allow 4k pages. Linux doesn't support 16k and 4k pages at the same time; macOS does, but is very particular about 4k pages being used only for scenarios like Rosetta processes or virtual machines, e.g. Parallels uses them for Windows-on-ARM, I think. Windows will probably never support non-4k pages, I'd guess.
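(Which is one reason portable userspace code should query the granule at runtime rather than assume 4k; a minimal POSIX sketch:)

```c
#include <stdio.h>
#include <unistd.h>

/* Don't hard-code 4 KB: the page size differs across kernels and
 * configurations, as described above. */
int main(void) {
    long page = sysconf(_SC_PAGESIZE); /* POSIX: runtime page size */
    printf("page size: %ld bytes\n", page);
    return 0;
}
```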
But otherwise, you're totally right. I wish RISC-V had gone with the configurable-granule approach like ARM did. Major missed opportunity, but maybe a fix will get ratified at some point...
> This means that, practically, only the untranslated bits of the address can be used by the set-selection portion of the cache lookup
It's true that this makes things difficult, but Arm have been shipping D-caches with way size > page size for decades. The problem you get is that virtual synonyms of the same physical cache block can become incoherent with one another. You solve this by extending your coherence protocol to cover the potential synonyms of each physical block (so for example with 16 kB/way and 4 kB pages, there are four potential indices for each physical cache block, and you need to maintain their coherence). It has some cost, and the cost scales with the ratio of way size to page size, so it's still desirable to stay under the limit, e.g. by just increasing the number of cache ways.
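For concreteness, a trivial sketch of that synonym count, using the same example numbers as above:

```c
#include <stdio.h>

/* The index bits above the page offset can differ between virtual aliases
 * of one physical block, giving way_size / page_size candidate indices
 * that the coherence protocol has to keep consistent. */
int main(void) {
    const unsigned way_size = 16 << 10, page_size = 4 << 10;
    unsigned synonyms = way_size / page_size; /* 4 potential indices */
    printf("%u potential synonym indices per physical block\n", synonyms);
    return 0;
}
```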