With the right algorithm you can turn any certain string of bits into a certain other string of bits. So is the image in the data, or is it really in the algorithm?
If the decoder was "trained" on and only works with predictable data, then it might be the algorithm that's illegal, but if a completely new illegal image is created, hashed, fed into the decoder and the decoder produces a valid illegal image, then the illegal data must be in the input, not the algorithm.
This is basically rule 1 of testing neural networks: if the testing data is different from the training data and the results are still correct, your network is "reading" the data correctly and not just memorising a list of known values. I guess this means you'd also need to prove that the decoder doesn't turn most hashes of non-illegal images into illegal images, but if you also did that, you'd have a pretty strong case that the illegal data is in the hash.