Ever since CrowdFlower (now Figure Eight) decided to terminate its SaaS offering and focus only on enterprise, Amazon Mechanical Turk has been impossible to use.
CrowdFlower maintained statistics about the accuracy of individual workers. Additionally, they made it easy to include gold-standard "test" questions, which weed out workers who are not doing the specific task correctly.
Without this sort of quality-control platform, mechanical turk is just unreliable crowdsourced-labor infrastructure. I understand that AWS doesn't want to build too much on the services they provide but, really, mechanical turk sucks.
I'm working with a researcher who is conducting experiments using turk, and he has a half-written shoddy version of what crowdflower offered. It seems that everyone using turk has to reinvent the wheel. Why can't someone provide decent quality control over turk results?
Because you're forgetting the other side of the equation: what sort of people are doing Mechanical Turk?
Are they 1:1 your peers, colleagues?
Or are they underpaid, very overworked laborers, most likely in emerging-economy countries?
You can browse the Mechanical Turk subreddits and hear the other side of the table: they spend 15-20 hours on a task to get what, a few dollars?
Pay more, get better results. The system itself is designed to exploit.
I have definitely upped the pay and not achieved better results. I don’t think it’s quite as simple a solution as you’re making it out to be. Realistically no matter how much you pay people who don’t have the medium to advanced skills to do the job for whatever reason, they won’t have those skills, at least in the short term.
Yep! The old saying "you get what you pay for" applies to Mechanical Turk like everything else. We tried outsourcing a project to Mechanical Turk and ended up spending more time on reliability and quality controls than it saved us in doing the manual work ourselves.
> I'm working with a researcher who is conducting experiments using turk, and he has a half-written shoddy version of what crowdflower offered
For doing research, I think Prolific [1] is a popular alternative.
They include features like pre-screening based on demographic features (including participants' approval rates in past studies) [2], and attention check questions [3].
I created Cogmint.com ("cognition minting") to solve this problem for myself.
You can submit known correct answers for questions, and those questions are then used as ground truth to score worker accuracy. Workers are then scored on their similarity to known correct answers and other workers that have accurately answered questions. It works surprisingly well for how simple it is. It's been a fun challenge to create simple methods of scoring similarity across different task types.
It's a side project, so don't rely on it for mission critical things, but I rely on it for some production tasks, so it's stable.
It currently supports classification (choose from a set of possible answers) and has beta support for bounding box task types. String input task types are coming very soon.
I'd love to see if it can help you out, and I'll waive the fees: I'm not in it for the money, I just like making things useful and reliable. Reach out and say hi!
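For anyone who wants to try the gold-question idea described above, here is a minimal sketch in Python. It is not Cogmint's or CrowdFlower's actual implementation: the function names (`score_workers`, `trusted_workers`), the tuple layout, and the thresholds are assumptions made for illustration. It covers only exact-match scoring against known answers, not similarity to other accurate workers.

```python
from collections import defaultdict


def score_workers(responses, gold, min_gold_answers=3):
    """Return {worker_id: accuracy}, computed over gold questions only.

    responses: iterable of (worker_id, question_id, answer) tuples.
    gold: dict mapping question_id -> known correct answer.
    Workers who answered fewer than min_gold_answers gold questions are
    left out, because their accuracy estimate would be too noisy.
    """
    correct = defaultdict(int)
    seen = defaultdict(int)
    for worker_id, question_id, answer in responses:
        if question_id not in gold:
            continue  # ordinary task: no ground truth to compare against
        seen[worker_id] += 1
        if answer == gold[question_id]:
            correct[worker_id] += 1
    return {
        worker: correct[worker] / seen[worker]
        for worker in seen
        if seen[worker] >= min_gold_answers
    }


def trusted_workers(accuracy, threshold=0.8):
    """Return the set of workers whose gold accuracy meets the threshold."""
    return {worker for worker, acc in accuracy.items() if acc >= threshold}
```

Typical use would be to mix a small share of gold questions into each batch, call `score_workers` on everything that comes back, and keep only answers from `trusted_workers`. Both cutoffs are arbitrary here and would need tuning per task.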
The problem with providing quality control is that there are a lot of edge cases; even known “high accuracy” turkers may have bad judgement sometimes, which means that every piece of data needs to be validated anyway, whether by the researcher themselves or by another paid contractor.
My undergrad thesis was to build https://tagbull.com, where we tried to have turkers validate the work of other turkers by breaking up a label into sub tasks, and getting multi-turker consensus on those before moving forward.
The main issue we ran into is that the incentive system is incredibly misaligned with the responsibility that the turkers have. It’s very difficult to build trust, especially with a crowd of people who haven’t signed contracts, and who face virtually no repercussions for doing bad work, whether intentionally or unintentionally.
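The multi-turker consensus step mentioned above can be sketched roughly like this in Python. This is not the TagBull code; `consensus`, `min_votes`, and `min_agreement` are hypothetical names and defaults chosen for illustration, and it assumes each sub-task yields one hashable label per worker.

```python
from collections import Counter


def consensus(labels, min_votes=3, min_agreement=2 / 3):
    """Return the agreed label for one sub-task, or None if there is no consensus.

    labels: list of labels submitted by different workers for the same sub-task.
    A label wins only if at least min_votes workers answered and the most
    common label's share of the votes is at least min_agreement.
    """
    if len(labels) < min_votes:
        return None  # not enough independent answers yet
    top_label, top_count = Counter(labels).most_common(1)[0]
    if top_count / len(labels) >= min_agreement:
        return top_label
    return None  # workers disagree: send the sub-task out again or escalate
```

A sub-task that returns `None` would go back into the queue for more votes before the pipeline moves on. Note that this only measures agreement, so it does nothing about the incentive problem: workers who all cut the same corner will still reach consensus.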
> You can submit known correct answers for questions, and those questions are then used as ground truth to score worker accuracy. Workers are then scored on their similarity to known correct answers and other workers that have accurately answered questions. It works surprisingly well for how simple it is.
Have you noticed problems that show up with questions whose answers have a bimodal distribution (i.e., the gold-standard question actually has two or more correct answers)?
In one sense, this is just a labeling quality problem with the 'gold standard' data, but to a lesser extent these same issues may crop up in the data being labeled when using similarity or clustering to rate or classify the workers and transitively apply that to the other results they produce.
Anecdotally yes it's a problem if two classes (button choices) are similar, resulting in two "top answers" for a given task. This seems most common for "yes/no" task types where there are only two options, and distinguishing between them is the hard part.
I haven't dug into the data on this across the platform, but you've given me the idea to go see if I can find evidence of this, and see if I can improve somehow. There are only a few hundred projects, so I might be able to find some that have this problem.
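One rough way to go looking for the "two top answers" pattern discussed above is to flag tasks where the two most common answers have nearly equal vote shares. The sketch below is an assumption about how that search could be done, not anything the platform is known to run; `ambiguous_tasks`, `max_gap`, and `min_answers` are made-up names and defaults.

```python
from collections import Counter


def ambiguous_tasks(answers_by_task, max_gap=0.2, min_answers=5):
    """Return {task_id: (top_answer, runner_up)} for tasks with two close top answers.

    answers_by_task: dict mapping task_id -> list of worker answers.
    A task is flagged when the difference in vote share between its two
    most common answers is at most max_gap.
    """
    flagged = {}
    for task_id, answers in answers_by_task.items():
        if len(answers) < min_answers:
            continue  # too few answers to say anything about the distribution
        top_two = Counter(answers).most_common(2)
        if len(top_two) < 2:
            continue  # unanimous: only one distinct answer was given
        (first, n_first), (second, n_second) = top_two
        if (n_first - n_second) / len(answers) <= max_gap:
            flagged[task_id] = (first, second)
    return flagged
```

Flagged tasks are candidates for either a second accepted gold answer or a clearer task description, so that workers are not penalized for picking the other defensible option.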
Given that a business pivoted away from exactly this, when they had this up and running, I'd advise putting in some effort to find out why they did so first.
Sounds sweet. I bet a lot of companies relying on mturk built this for themselves and then sell a higher value service with better margins. You could build something right in the middle.
I know Stanford's research teams all use a common interface to mturk that keeps profiles of turkers on their side so they know who to solicit for upcoming surveys, conduct longitudinal studies, etc. I've always wondered why more universities didn't follow suit.
I built a side project called cogmint based on the insight that simple scoring and ranking of workers was valuable. I ended up building my own worker interface instead of using mturk because it wasn't much additional effort on top of the scoring logic I was building anyway. Perhaps other serious companies came to the same conclusion I did with my hobby project.
Technical and demographic user metrics are federated across the platform (we supply tasks to other labor markets online, essentially where bots and farms typically attack), so we leverage priors and shared similarity weights. Behavioral user metrics aren’t the only way to evaluate a respondent.
Sounds a little meta! Managed managed labeling workflow. I do wish Amazon would do a better job with this, and I also have no doubt if a company like this sprang into existence and was successful, Amazon would clamp down on it in some way — whether by changing terms to disallow such a business, cutting off MT access to the business, or just copying it.
Mechanical Turk custom tasks are insanity as well. I was doing an object detection project and threw together a labeling UI. Integrated the Turk custom task, create and complete, but then I realized these tasks don't show up in the UI anywhere. You start implementing an entire UI for Amazon before you just bail and contract the one good Turk worker you've found.
I've been saying this for years as a critique to psychology studies. Do you really expect desperate people getting paid peanuts to yield quality results?
I mentioned this above, but at least in the short term my experience has been that paying more doesn’t yield better results on most tasks, which I naively attribute to those skills not being available at any price from the current population of MTers.
If higher prices were paid across the board, doubtlessly more workers with the sought after skills would sign up though. So maybe that was the thrust of your comment.
> If higher prices were paid across the board, doubtlessly more workers with the sought after skills would sign up though. So maybe that was the thrust of your comment.
Maybe a couple? But you're also going to get millions more unskilled workers. Without good filtering, no pay level is going to work.
> I mentioned this above, but at least in the short term my experience has been that paying more doesn’t yield better results on most tasks, which I naively attribute to those skills not being available at any price from the current population of MTers.
This is plausible but doesn't change the basic problem. Of course, workers on AMT have years of experience trying to game it. Just more pay isn't going to create a sudden sense of solidarity with the employer.
I would guess that quality results at scale would require a system built from the ground-up to cultivate some degree of loyalty or something.
It's kind of a dilemma of capitalism itself.
At scale, you can pay people to accomplish a task for peanuts but they lack loyalty and will cut-corners whenever it's possible and convenient. And that still gets a whole lot of things done for cheap in this world.
But if you want to get loyalty, people you don't have to watch, etc., you have to organize an entire enterprise for that, and maybe show loyalty yourself (Turkers are inherently throw-away, so why should they be loyal for just today's wages?). And the cost of that arrangement tends to multiply.
I'm sorry, but are we talking about the post docs ~running the experiment~ managing the routine tasks beneath the dignity of the department/lab head or the people participating in the experiment?
I am not privy to the state of psychology research, but I'm pretty sure the OP is referring to the reliance on Mechanical Turk in the field of psychology [1].
Even with more pay, there aren’t any real repercussions for doing bad work. Sure there are different “tiers” of turkers, but realistically anyone could recreate their account once their rating gets low enough.
> AMT workers use VPNs to work around geographical restrictions
Interesting. When I try to log in to Amazon with a VPN, it puts me in a CAPTCHA loop and never lets me access my account. I have to log in from a residential IP address to access my account.
The whole company was shady from the get go...it was effectively slave labour....it's Amazon's bread and butter of just disregarding the quality in favour of massive availability
this service always made me feel uncomfortable, even the origin of the name “mechanical turk”… an illusion of automation meant to fool others, but really just the hidden labor of a foreigner
You're technically not wrong, but from your tone and wording I think you have a misconception about the machine, or are trying to induce such a misconception in others, to inspire others to take offense at the premise. There was never a Turkish person hidden inside the Mechanical Turk. The "foreigners" inside were a variety of chess masters from Germany, Austria, France and the UK. Not the oppressed immigrants a modern reader might imagine when speaking of Turks and unspecified foreigners. And I suspect the chess-master operators of the original Mechanical Turk were not remotely representative of the demographics of modern Amazon MT users. A Frenchman participating in a scheme to bamboozle some Austrian princes is not exactly something worth getting bent out of shape over.
(Furthermore I think in at least some of the cases, the chessmasters were operating the machine in their home country and weren't foreigners at all.)
The name was meant to inspire dread in those playing against it, much like how you know shit has gotten real when you have to fight robot Hitler in an American game.
The name “mechanical turk” comes from a fake chess-playing machine that hid a person inside of it, who actually operated said machine. So I think the name is quite appropriate for this service.
> Why would it be uncomfortable? Those hidden inside the (original) machine were not exploited by the machine owner
Do you have a source for that assertion? As far as I can tell, several of the chess players known to have operated the Turk had substance abuse (chiefly alcoholism), health, and money problems (the words 'debt', 'penniless', and 'destitute' come up a few times). While not proof of abuse, it does suggest a strategy of recruiting the vulnerable.
In late-1700s and early-1800s Europe, being a heavy drinker, having some kind of health issue, and/or being penniless was the norm for most of the population. If performing a task because of those issues means being exploited, then we must assume that almost everyone in Europe at that time was exploited in one way or another.
I understand what you mean if we judge it from modern western standards, but I don’t agree if we judge it by the standards of that era.