We use Mechanical Turk a lot when adding a new feature. Before releasing it, we want to know whether the UI presents the new feature clearly.
Turk is great overall, and usually 7 out of 10 Turk testers are able to figure out the new feature through our UI. Still, I wonder: should I worry about the 3 (across various tests) for whom our UI failed?
Would you not release an update until your UI passed the 10 out of 10 test?
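One thing worth keeping in mind before agonizing over the 3: with only 10 testers per run, the uncertainty around 7/10 is very wide. A quick sketch in Python (assuming scipy is available; the 7/10 figure is just the number reported above, and independence between testers is an assumption):

    from scipy.stats import binomtest

    # 7 of 10 Turk testers found the new feature unaided (the reported numbers).
    result = binomtest(k=7, n=10)

    # Exact (Clopper-Pearson) 95% confidence interval for the true discovery rate.
    ci = result.proportion_ci(confidence_level=0.95)
    print(f"observed rate: {result.statistic:.0%}")   # 70%
    print(f"95% CI: {ci.low:.0%} to {ci.high:.0%}")   # roughly 35% to 93%

In other words, at that sample size the true discovery rate could plausibly be anywhere from about a third of users to nearly all of them, so the 7-vs-3 split by itself says less than it seems.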
It seems like a clear case where the different parties have very different incentives. The hackers want the best possible sample of users, while the Turks want to get their respective tasks done as quickly as possible so they can move on to the next paying one. I would think this would lead to Turks specializing in very specific (and relatively common and repetitive) tasks so they could maximize their throughput.
In your case, I would prefer to "watch" users out in the wild, rather than "professional" Turks, to truly assess the quality of a feature implementation.
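(To make that concrete: "watching in the wild" can be as lightweight as logging an exposure event and a first-use event for the feature, then comparing the two. A rough sketch, where the event names and fetch_events() are hypothetical placeholders for whatever analytics pipeline you already have:)

    # Rough sketch: measure real-world discovery instead of a Turk pass rate.
    # `events` is an iterable of (user_id, event_name) pairs; the event names
    # and fetch_events() below are placeholders, not a specific product's API.
    def discovery_rate(events):
        exposed, discovered = set(), set()
        for user_id, event_name in events:
            if event_name == "new_feature_visible":
                exposed.add(user_id)
            elif event_name == "new_feature_used":
                discovered.add(user_id)
        return len(discovered & exposed) / len(exposed) if exposed else 0.0

    # e.g. discovery_rate(fetch_events()) -> 0.63 means 63% of users who saw
    # the entry point actually found their way to the feature.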