Modern machines can do the TLB lookup and the L1 lookup in parallel. Here is how it works on a traditional CPU (it is different on the M1).

The page size is 4 KB. This means the lower 12 bits of an address are the same in the logical (virtual) and physical address. The cache line is 64 bytes, so the lowest 6 of those bits are the offset within a cache line. The other 6 bits select one of 64 sets in the L1, and since the L1 is 8-way associative, each set holds 8 cache lines. This makes 64*8 cache lines of 64 bytes => 32 KB of L1 cache.
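To make the bit arithmetic concrete, here is a rough sketch in Python. The constant names and the `split` helper are made up for illustration; only the numbers (4 KB pages, 64-byte lines, 8 ways) come from the paragraph above.

```python
LINE_BYTES = 64      # cache line size
WAYS = 8             # associativity
PAGE_BYTES = 4096    # page size

def log2(n):
    # n is assumed to be a power of two
    return n.bit_length() - 1

OFFSET_BITS = log2(LINE_BYTES)          # 6 bits: byte within the line
PAGE_BITS = log2(PAGE_BYTES)            # 12 bits: untranslated part of the address
INDEX_BITS = PAGE_BITS - OFFSET_BITS    # 6 bits left over to pick a set
SETS = 1 << INDEX_BITS                  # 64 sets
CACHE_BYTES = SETS * WAYS * LINE_BYTES  # 64 * 8 * 64 bytes

def split(addr):
    """Split an address into (page number, set index, line offset)."""
    offset = addr & (LINE_BYTES - 1)
    index = (addr >> OFFSET_BITS) & (SETS - 1)
    page = addr >> PAGE_BITS
    return page, index, offset

print(CACHE_BYTES)        # 32768
print(split(0x12345ABC))  # (74565, 42, 60), i.e. page 0x12345, set 42, offset 60
```

The point is that `index` and `offset` only use bits below the page boundary, so they come out the same whether you feed in the logical or the physical address.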

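The CPU does a lookup in the TLB and a lookup in the L1 in parallel. The L1 lookup uses only the untranslated low bits and returns the 8 cache lines of the selected set; these are then filtered by comparing their tags against the physical address coming out of the TLB, to hopefully get a hit.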
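Here is a toy model of that lookup, again in Python. It is only a sketch: `ToyL1`, the dict standing in for the TLB, and the FIFO eviction are all invented for illustration, and real hardware does the two steps at the same time rather than one after the other.

```python
OFFSET_BITS, INDEX_BITS, PAGE_BITS, WAYS = 6, 6, 12, 8
INDEX_MASK = (1 << INDEX_BITS) - 1

class ToyL1:
    def __init__(self):
        # One list of physical tags (frame numbers) per set, at most WAYS long.
        self.sets = [[] for _ in range(1 << INDEX_BITS)]

    def fill(self, paddr):
        """Install the line containing physical address paddr."""
        ways = self.sets[(paddr >> OFFSET_BITS) & INDEX_MASK]
        tag = paddr >> PAGE_BITS
        if tag not in ways:
            if len(ways) == WAYS:
                ways.pop(0)  # simplistic FIFO eviction, an assumption
            ways.append(tag)

    def lookup(self, vaddr, tlb):
        """Return True on a hit. tlb maps virtual page -> physical frame."""
        # Step A: pick the set using only the untranslated low bits.
        candidates = self.sets[(vaddr >> OFFSET_BITS) & INDEX_MASK]
        # Step B (in parallel on real hardware): translate the page number.
        frame = tlb[vaddr >> PAGE_BITS]
        # Filter the (up to) 8 candidates by the physical tag.
        return frame in candidates

tlb = {0x12345: 0x00ABC}       # hypothetical mapping
l1 = ToyL1()
l1.fill((0x00ABC << PAGE_BITS) | 0xABC)
print(l1.lookup(0x12345ABC, tlb))  # True: same set, matching physical tag
print(l1.lookup(0x12345000, tlb))  # False: different set, nothing cached there
```

Note that step A never needs the result of step B, which is the whole reason the two can overlap.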

Now you'll note that, while most CPUs have 32 KB of L1, the M1 has 128 KB, which means it needs 2 extra bits that match between physical and logical addresses to pull the same trick. And what do you know, the M1 has 16 KB pages! What a coincidence (not!).
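The same arithmetic run backwards, as a quick Python sketch: for the trick to work, the set index and line offset have to fit inside the page offset, so one way of the cache can be at most one page. The helper name is made up, and the 8-way associativity for the M1 is an assumption carried over from the example above, not something stated here.

```python
def min_page_bytes(cache_bytes, ways):
    """Smallest page size that lets index + offset bits fit in the page offset."""
    return cache_bytes // ways

small = min_page_bytes(32 * 1024, ways=8)    # typical 32 KB L1
large = min_page_bytes(128 * 1024, ways=8)   # 128 KB L1, assuming 8 ways

print(small, large)  # 4096 16384
print((large.bit_length() - 1) - (small.bit_length() - 1))  # 2 extra bits
```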