Yes, you could implement it so that the first layer's inputs are streamed and accumulated into the output activations in parallel in memory. This would reduce the memory needed for the input activations, but would increase execution time, since more activations have to be shuffled around.
In this case I am streaming from ROM anyway, so it does not matter whether the inputs are read once or multiple times.
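To make the two loop orders concrete, here is a rough Python sketch, not the actual firmware. The function names, the plain-list weight layout (`weights[j][i]` for output `j`, input `i`), and the tiny example values are all assumptions for illustration. The first version finishes one output at a time and re-reads the inputs for every output; the second consumes each input exactly once and keeps one accumulator per output in memory.

```python
def dense_output_at_a_time(inputs, weights):
    """Finish each output before starting the next.

    Needs only one accumulator at a time, but re-reads every input
    once per output (cheap if the inputs sit in ROM anyway).
    """
    outputs = []
    for row in weights:                # one row of weights per output
        acc = 0
        for x, w in zip(inputs, row):  # inputs are read again for each output
            acc += w * x
        outputs.append(acc)
    return outputs


def dense_input_streamed(input_stream, weights, n_out):
    """Consume each input exactly once and update all outputs.

    No input buffer is needed, but all n_out accumulators live in
    memory and are read and written once per incoming input.
    """
    acc = [0] * n_out
    for i, x in enumerate(input_stream):  # each input is seen only once
        for j in range(n_out):
            acc[j] += weights[j][i] * x   # touch every accumulator
    return acc


# Hypothetical toy layer: 3 inputs, 2 outputs.
example_weights = [[1, 2, 3], [4, 5, 6]]
example_inputs = [1, 0, -1]

a = dense_output_at_a_time(example_inputs, example_weights)
b = dense_input_streamed(iter(example_inputs), example_weights, n_out=2)
```

Both orders compute the same sums; they differ only in which values are read repeatedly and which have to stay resident in RAM.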