It just takes some work.
Source: have used used webcam cube cams in a robotics application where latency needed to be accounted for.
But all kinds of post-processing can be (and is) done on readout (pixel by pixel) without needing to keep the whole frames and go through them. You just keep some calculated parameters from previous frames (and not the whole frames).
You can do a lot of processing this way, incl. scaling, white balance, color correction, effect filters, etc.
Web cams add usb interface and an additional controller chip, so maybe there's some added latency there. But if you use the camera sensor directly over CSI, you can get pretty low latency.
Also depending on the bus being used cameras often compress the image using JPEG to reduce the bandwith (very common with USB cameras, I doubt that the Mac camera does that).
Then on a computer you have the latency of whatever pipeline is being used and how much buffering is involved. For instance since typically the display doesn't run exactly synchronized with the camera you'll have some frames that'll end up "stuck" waiting for the monitor's vsync. If the app then does some internal buffering on top of that you can very easily reach a noticeable amount of latency.
In general you can consider than anything above 100ms is easily noticeable and that's only 3 frames at 30fps and 6 at 60, it's really not that much when you consider the complexity of modern video capture pipelines.
A pipeline might look something like demosaicing -> denoising -> light/exposure adjustment -> tone mapping -> encoding
Some of these steps will likely have hardware support (e.g. demosaicing, encoding) some won't. You may carry a few frames around for denoising, e.g. You have to sync up with the display at some point. Some of the steps might change order too, depending on what you are trying to do and what hardware support you have.
I haven't looked at available hardware specs recently. You used to be able to get very basic capture cameras that offloaded nearly everything to the computer - so that would give you minimal latency on the capture side (limited by your communication channel obviously); However keeping a stable image as resolution and frame rate grow is really hard that way, eventually impossible without a realtime system. More modern cameras are going to do much of that on board.
The alternative that usually gets used in productions with a moderate budget is a hardware camera with HDMI or SDI output into a Blackmagic capture device. There's still some latency, but it's better than a webcam. It's pretty much down to the minimum of what you can do with a computer -- the capture device still has to buffer a frame, transmit it to the computer, which copies it into RAM, then blits it to video RAM, where it waits for the next vertical refresh.
All that said, the human finger doesn't travel all that fast, so for a HID application like this, you could probably get the latency way down by extrapolating the near-future position of the finger, much like iPadOS does with Pencil.
You don't need memory to have latency. Speed of light / electrons is enough...
That said, webcams absolutely have extra processing (and thus latency).