If the 2x2 cell is moving horizontally and vertically by 1 block, then wouldn't that mean you miss the 2 corners you don't pass over for the cell you're currently calculating?
Unfortunately the demo video in the link you posted seems to be private, so I might be misunderstanding.
Yes, but the next iteration will take care of those interactions. You're not really "moving" anything; that was poor imagery on my part. The illustration on the Wikipedia article shows it well: https://en.wikipedia.org/wiki/Block_cellular_automaton
There are some diagonal interactions that are not directly expressible in this way, but they can propagate through the non-diagonal neighbours instead.
Do you know if this is a technique they're using in Noita? I've implemented falling sand simulators but I'm baffled how you can make a game like that performant.
Here's a talk from the creator of Noita about the underlying tech. From 9:20 he talks about their multithreading implementation. https://www.youtube.com/watch?v=prXuyMCgbTc
Typically a cellular automata simulation will have some edge condition like wrapping or mirroring an adjacent cell.
A nice optimization trick is to make the cell buffers 2 cells wider and taller (or two times whatever the neighborhood radius is), and then before each generation you update the "gutter" by copying just the wrapped (or mirrored) pixels. Then your run the rule on the inset rectangle, and the code (in the inner loop) doesn't have to do bounds checking, and can assume there's a valid cell to read in all directions. That saves a hell of a lot of tests and branches in the inner loop.
Also, you you can define the Margolus neighborhood in terms of the Moore neighborhood + horizontal phase (even/odd column x) + vertical phase (even/odd row y) + time phase (even/odd time). With that information the rule can tell if it's at an even or odd step, and which of the four squares of the grid it's in.
That's how the CAM6 worked in hardware: it used the x/y/time phases as additional bits to determine the Margolus neighborhood lookup table index.
Here's how my CAM6 emulator computes the CAM6-hardware-compatible Margolus lookup table index, based on the 9 Moore neighbors + phaseTime, phaseX, and phaseY:
Unfortunately the demo video in the link you posted seems to be private, so I might be misunderstanding.