See http://elem.com/~btilly/effective-ab-testing/index.html#asli... for part of a presentation that I did where I actually set up some reasonable fake tests, and ran simulations. What I found is that if there is a significant difference, the probability of coming to the wrong conclusion was (as you would expect) higher, but not that high before the underlying difference made mistakes incredibly unlikely. Conversely if there is only a small real difference, the amount of data needed before you have a significant chance of having accidentally come to a erroneous conclusion is very, very long.
So avoid accepting any result where you don't have at least a few hundred successes and set your thresholds reasonably high. You will make fewer mistakes than you probably fear, and the ones that you make will almost always be very minor. (Oops, I had a 3% chance of accepting the 1% worse solution as probably better.)
Of course if you're aiming to publish academic research, your standards need to be higher. But if you're more interested in getting useful results than publishable ones, you can relax your standards. A lot.