You've learned that the system of coin + machine results in the same orientation it was loaded with 99% of the time. You can put error bars on that, investigate the discrepancies (did the 1% of flips where the orientation changed happen disproportionately with a certain side of the coin up?), and from that estimate whether the coin is fair. If the confidence intervals aren't small enough for you, you can do more experiments. The confidence interval will never shrink to 0 unless you do an infinite sequence of trials. (Only axiomatic logic can have a confidence interval of 0, and it doesn't make statements about the real world, only about the axiomatic system in use.)
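Concretely, here is a minimal sketch (the trial counts are hypothetical, and the 99% match rate is taken from the setup above) of how those error bars behave: a Wilson score interval for the observed match rate narrows as trials accumulate, but never reaches zero width.

```python
# Sketch: how a 95% confidence interval on the match rate shrinks with trials.
# The counts below are hypothetical; only the 0.99 match rate comes from the setup.
from math import sqrt

def wilson_interval(successes: int, trials: int, z: float = 1.96) -> tuple[float, float]:
    """95% Wilson score confidence interval for a binomial proportion."""
    p = successes / trials
    denom = 1 + z**2 / trials
    center = (p + z**2 / (2 * trials)) / denom
    half_width = z * sqrt(p * (1 - p) / trials + z**2 / (4 * trials**2)) / denom
    return center - half_width, center + half_width

# The interval narrows as the number of trials grows, but never collapses to a point.
for n in (100, 10_000, 1_000_000):
    matches = round(0.99 * n)  # hypothetical: 99% of results match the loading orientation
    print(n, wilson_interval(matches, n))
```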
So, let's say we continue the servo tests 1e99 times, with the coin loaded in each orientation equally often. We measure heads on 50.00% of flips, and continue to see the 0.99 correlation with the initial orientation. The 1% of flips where the result doesn't match the initial orientation show no apparent bias toward one side or the other.
So after an "infinite" number of tests, we continue to get a 50.00% frequency of heads, but with a 0.99 correlation with the orientation the coin had when loaded into the machine.
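A toy simulation makes it easy to see that these two statistics are compatible. The 0.99 figure comes from the thought experiment; everything else (the mechanism, the trial count) is assumed for illustration.

```python
# Toy model: a machine whose output matches the loading orientation 99% of the
# time, fed a coin loaded heads-up or tails-up with equal probability.
import random

def run_machine(loaded_heads_up: bool, match_prob: float = 0.99) -> bool:
    """Return True for heads. The result matches the loading orientation with match_prob."""
    if random.random() < match_prob:
        return loaded_heads_up
    return not loaded_heads_up

trials = 1_000_000
heads = matches = 0
for _ in range(trials):
    loaded_heads_up = random.random() < 0.5  # coin loaded in each orientation equally
    result_heads = run_machine(loaded_heads_up)
    heads += result_heads
    matches += (result_heads == loaded_heads_up)

print(f"heads frequency: {heads / trials:.4f}")   # ~0.50
print(f"match frequency: {matches / trials:.4f}")  # ~0.99
```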
Now I load a coin into the machine and ask you to name the true probability that the result is heads. I don't tell you the initial orientation, but I know it privately.
What's the true probability of heads? Our testing found a heads frequency of precisely 50.00%. But are you still sure the probability is an intrinsic property of the system, rather than a property of your state of knowledge of the system?
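The two observers' assignments follow directly from the same statistics; only their knowledge differs. A sketch of the arithmetic, using the 0.99 match rate and the equal loading frequencies from the setup:

```python
# Two probability assignments for the same flip, under the same observed statistics.
MATCH = 0.99  # observed correlation between loading orientation and result

# Observer who knows the coin was loaded heads-up:
p_heads_given_loaded_heads = MATCH                 # 0.99

# Observer who doesn't know the orientation (marginalise over both loadings):
p_heads_unknown = 0.5 * MATCH + 0.5 * (1 - MATCH)  # 0.50

print(p_heads_given_loaded_heads, p_heads_unknown)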
We can continue the pattern; maybe the 1% error itself correlates at 0.99 with someone running the microwave in the kitchen. The microwave drops the line voltage and causes the servo to impart a little less momentum to the coin, so it flips one fewer time on average. Neither of us has currently checked whether the microwave is running... And so on...
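To see how each extra piece of knowledge moves the probability again, here is a hedged extension of the same calculation. All of the specific numbers below are assumptions chosen only so that the overall mismatch rate comes out at 1%; the source specifies none of them.

```python
# Hypothetical: mismatches are driven almost entirely by the microwave.
p_on = 0.01                # assumed fraction of flips with the microwave running
p_mismatch_on = 0.99       # assumed P(mismatch | microwave on)
p_mismatch_off = 0.0001    # assumed P(mismatch | microwave off)

p_mismatch = p_on * p_mismatch_on + (1 - p_on) * p_mismatch_off
print(f"P(mismatch)              = {p_mismatch:.4f}")        # ~0.01, as observed
print(f"P(match | microwave off) = {1 - p_mismatch_off:.4f}")  # ~0.9999
print(f"P(match | microwave on)  = {1 - p_mismatch_on:.4f}")   # ~0.01
```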
But what do the error bars themselves mean? Aren't they probabilistic in nature too?
Say you conduct a thousand trials and calculate an error bar based on the results. If you then conduct a hundred such experiments (each consisting of a thousand trials) and one of them violates the error bar, does that invalidate it?
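It shouldn't, if the error bar is working as advertised: a 95% interval is expected to miss the true value about 5 times in 100. A sketch of that coverage check, with the true rate and the experiment counts assumed for illustration:

```python
# Run 100 experiments of 1,000 flips each against a known true rate and count
# how often the 95% interval misses it. Roughly 5 misses out of 100 is expected.
import random
from math import sqrt

TRUE_RATE = 0.5  # assumed true heads rate for the simulation
Z = 1.96         # 95% normal-approximation interval

misses = 0
for _ in range(100):
    n = 1000
    successes = sum(random.random() < TRUE_RATE for _ in range(n))
    p = successes / n
    half_width = Z * sqrt(p * (1 - p) / n)
    if not (p - half_width <= TRUE_RATE <= p + half_width):
        misses += 1

print(f"intervals that missed the true rate: {misses} / 100")  # expect about 5
```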