First, bagged decision trees are a little hard to interpret; what is the advantage of a bagged model vs the plain trees? Are you using a simple majority vote for combination? What are the variances between the different bootstraps?
Second - what do you mean by 95% ? Do you mean that out of 99999 good parts 4999 are thrown away? and one bad one is picked out as bad ?
Third - what is this telling you about your process? Do you have a theory that has evolved from the stats that tells you why parts are failing? This is the real test for me.. If the ML is telling you where it is going wrong (even if it's unavoidable/too expensive to solve) then you've got something real.
Unfortunately my concern would be that as it stands.. you might find that in production your classifier doesn't perform as well as it did in test... My worry has been generated by the fact that this same thing has happened to me !
I'd hazard to guess that the 95% is the reduction in how many parts made it through the first test only to be caught later at the more expensive stage. So instead of binning 100 parts a month at that second stage, they now bin 5 parts a month and catch way more early on.
It sounds like the OP is using ML to identify flaws that simply just occur due to imperfections in the manufacturing process. That's life, it happens. You can know that they will occur without necessarily being able to prevent them because maybe there's some dust or other particulates in the air that deposit into the resin occasionally, or maybe the resin begins to cure and leaves hard spots that form bond flaws. There's heaps of possible reasons. It sounds more like the ML is doing classification of 'this too much of a flaw in a local zone' vs 'this has some flaws but it's still good enough to pass', which is how we operate with casting defects. For example, castings have these things called SCRATA comparitor plates, where you literally look at an 'example' tactile plate, look at your cast item, then mentally decide on a purely qualtative aspect which grade of plate it matches. Here  are some bad black and white photos of the plates.
To clarify on that 95% value because it is admittedly really vague: That's actually a 95% correct prediction rate. So far we get ~2.5% false-positives and ~2.5% false-negatives. 2.5% of the parts evaluated will be incorrectly allowed to continue and will subsequently fail downstream testing (no big deal). More importantly, 2.5% of parts evaluated will be wrongly identified as scrap by the model and tossed, but this still works out to be a massive cost savings because a lot of expensive material/labor is committed to the device before the downstream test.
Malik 'Poot' Carr: Naw, man, that ain't right.
D'Angelo Barksdale: Fuck "right." It ain't about right, it's about money. Now you think Ronald McDonald gonna go down in that basement and say, "Hey, Mista Nugget, you the bomb. We sellin' chicken faster than you can tear the bone out. So I'm gonna write my clowny-ass name on this fat-ass check for you"?
D'Angelo Barksdale: Man, the nigga who invented them things still workin' in the basement for regular wage, thinkin' up some shit to make the fries taste better or some shit like that. Believe.
Wallace: Still had the idea, though.
2.5% of what, though? if only 1 in a million parts are actually bad, you're still tossing many more good parts than bad parts.