MNIST is a handwritten digit database. Each 784-pixel image (28x28) corresponds to a digit from 0 to 9. As a pure mathematical construction, there are at most 256^784 possible inputs (2^784 if you threshold every pixel to black or white), and only 10 possible outputs.
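To get a feel for the size of that input space, here is a quick sketch. The constant names are mine, and the only assumption is that each pixel is either on/off or an 8-bit gray level (MNIST stores the latter).

```python
PIXELS = 28 * 28                 # 784 pixels per image
LEVELS = 256                     # 8-bit grayscale values, 0..255

binary_inputs = 2 ** PIXELS      # every pixel thresholded to on/off
gray_inputs = LEVELS ** PIXELS   # every pixel keeps its gray level

# Python integers are arbitrary precision, so these are exact counts.
print(f"binary images:    a {binary_inputs.bit_length()}-bit number")
print(f"grayscale images: a {gray_inputs.bit_length()}-bit number")
```

Either way the number of possible inputs dwarfs the 70,000 images in the dataset, so anything that works has to generalize rather than memorize.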
So if you have 784 completely different ways of analyzing the image, and you combine them in the right ways, you will get at least a rough approximation of the answer. This is a tautology if the 784 ways are "the value of each pixel" and the combinations are "magic", but the more "intelligent" the individual analyses are, the less magic the combinations need to be. And since humans can generally read a digit off a seven-segment display, it seems reasonable that there is some set of "intelligent" analyses whose combination forms a neural network that solves the problem of digit identification.
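As a toy version of "a few good analyses, combined simply", here is a sketch of a seven-segment (7-light) reader. The segment ordering, the PATTERNS table and the classify function are my own illustration, not anything from MNIST: each digit gets one row of weights, and the score is just a weighted sum of the seven segment values, which is what a single linear layer computes.

```python
# Segment order: a (top), b (upper right), c (lower right), d (bottom),
#                e (lower left), f (upper left), g (middle).
PATTERNS = [
    [1, 1, 1, 1, 1, 1, 0],  # 0
    [0, 1, 1, 0, 0, 0, 0],  # 1
    [1, 1, 0, 1, 1, 0, 1],  # 2
    [1, 1, 1, 1, 0, 0, 1],  # 3
    [0, 1, 1, 0, 0, 1, 1],  # 4
    [1, 0, 1, 1, 0, 1, 1],  # 5
    [1, 0, 1, 1, 1, 1, 1],  # 6
    [1, 1, 1, 0, 0, 0, 0],  # 7
    [1, 1, 1, 1, 1, 1, 1],  # 8
    [1, 1, 1, 1, 0, 1, 1],  # 9
]


def score(segments, template):
    """Weighted sum: +1 for every segment that agrees, -1 for every one that differs."""
    return sum((2 * s - 1) * (2 * t - 1) for s, t in zip(segments, template))


def classify(segments):
    """Return the digit whose weight row gives the highest score."""
    return max(range(10), key=lambda digit: score(segments, PATTERNS[digit]))


print([classify(p) for p in PATTERNS])
```

Here the weights are written down by hand because the seven inputs are already "intelligent". With raw pixels nobody knows the right weights in advance, which is where the search comes in.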
And that (still hand-wavy) explanation can also plausibly describe how a human would explain identifying a number. If I ask you "why is this a 1 and not a 3", you might say "because it's straight" or "because it's narrow" or "because it doesn't have a point in the middle" or any number of other descriptions of the object. So you can envision a 2-layer network where the middle layer calculates these properties. (Due to the structure of images, in practice it might work better as a 3- or 4-layer network, but the important point is that the search algorithms don't rely on anyone knowing ahead of time what these middle layers compute.)
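To make the "middle layer of describable properties" idea concrete, here is a hand-built sketch on seven-segment patterns rather than real images. Everything in it (the feature names, the thresholds, the segment order) is an assumption I picked for illustration. A trained network would have to find its own features, and they would rarely be this readable.

```python
# Segment order: a (top), b (upper right), c (lower right), d (bottom),
#                e (lower left), f (upper left), g (middle).
ONE = [0, 1, 1, 0, 0, 0, 0]
THREE = [1, 1, 1, 1, 0, 0, 1]


def step(z):
    """Threshold activation: fire (1) if the weighted sum is positive."""
    return 1 if z > 0 else 0


def middle_layer(x):
    """Two hand-picked hidden units, each a thresholded weighted sum of inputs."""
    a, b, c, d, e, f, g = x
    has_horizontal_bar = step(a + d + g - 0.5)  # "it's not just straight"
    has_middle_bar = step(g - 0.5)              # "it has a point in the middle"
    return [has_horizontal_bar, has_middle_bar]


def looks_like_one(x):
    """Output unit: fires only when neither hidden feature is present."""
    h = middle_layer(x)
    return step(0.5 - h[0] - h[1])


print(looks_like_one(ONE), looks_like_one(THREE))
```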
Which only leaves the question of how "neural network learning" is supposed to find this. There are a few heuristics which combine to be (in practice) a very effective search. We have back-propagation (which is much easier with automatic differentiation), so we can adjust every weight in the network based on the error in the output. (And it's more or less an axiom that if you have a lot of examples, they will be similar in the ways they are the same, and different [hopefully in some regular way] in the ways they are not.) We have dropout, where we randomly switch off units during training so the network can't lean too heavily on any one of them. We can prune connections that turn out to be irrelevant, or add new connections to see if they are relevant. We can run any number of hill-climbing algorithms on the output of the fitness function. None of this guarantees the best possible answer, but as a search it tends to converge to a good one.
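Here is a minimal sketch of that search, with several loud assumptions: the data are the ten seven-segment (7-light) patterns rather than MNIST, the model is a single softmax layer (so "back-propagation" collapses to one gradient formula), and the step count and learning rate are arbitrary choices of mine. The point is only that the weights start at zero and plain gradient descent moves them somewhere useful without being told what the segments mean.

```python
import math

# Seven-segment patterns for digits 0-9, segment order a, b, c, d, e, f, g.
PATTERNS = [
    [1, 1, 1, 1, 1, 1, 0], [0, 1, 1, 0, 0, 0, 0], [1, 1, 0, 1, 1, 0, 1],
    [1, 1, 1, 1, 0, 0, 1], [0, 1, 1, 0, 0, 1, 1], [1, 0, 1, 1, 0, 1, 1],
    [1, 0, 1, 1, 1, 1, 1], [1, 1, 1, 0, 0, 0, 0], [1, 1, 1, 1, 1, 1, 1],
    [1, 1, 1, 1, 0, 1, 1],
]


def softmax(scores):
    """Turn raw scores into probabilities (shifted by the max for stability)."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def predict(weights, biases, x):
    """One linear layer followed by softmax: a probability for each digit."""
    scores = [sum(w * xi for w, xi in zip(row, x)) + b
              for row, b in zip(weights, biases)]
    return softmax(scores)


def mean_loss(weights, biases):
    """Average cross-entropy over the ten patterns (the 'fitness function')."""
    return -sum(math.log(predict(weights, biases, x)[digit])
                for digit, x in enumerate(PATTERNS)) / len(PATTERNS)


def train(steps=500, lr=0.1):
    """Full-batch gradient descent starting from all-zero weights."""
    weights = [[0.0] * 7 for _ in range(10)]
    biases = [0.0] * 10
    for _ in range(steps):
        grad_w = [[0.0] * 7 for _ in range(10)]
        grad_b = [0.0] * 10
        for digit, x in enumerate(PATTERNS):
            probs = predict(weights, biases, x)
            for k in range(10):
                # Gradient of cross-entropy w.r.t. score k: probability minus target.
                err = probs[k] - (1.0 if k == digit else 0.0)
                grad_b[k] += err
                for i in range(7):
                    grad_w[k][i] += err * x[i]
        for k in range(10):
            biases[k] -= lr * grad_b[k] / len(PATTERNS)
            for i in range(7):
                weights[k][i] -= lr * grad_w[k][i] / len(PATTERNS)
    return weights, biases


start = mean_loss([[0.0] * 7 for _ in range(10)], [0.0] * 10)
weights, biases = train()
print(f"loss before: {start:.3f}, after: {mean_loss(weights, biases):.3f}")
```

With more layers the same idea applies, except the error has to be passed backwards through each layer in turn, which is the bookkeeping that automatic differentiation does for you.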
Obviously none of this is at all rigorous, but if you know the math well enough to care, I don't think you're asking the questions in this article.
Interesting intuitions; but if by "ways of analyzing the image" you mean functions, there are an awful lot more than 784...