Saturday, July 2, 2011

Crowdsourcing Best Practices: Twenty points to consider when designing a HIT

Your very first glance at worker responses on the very first task you crowdsource tells you that there are very different kinds of workers out there in the crowdsourcing sphere, for example, on Mechanical Turk. Some of the responses are impressive in the level of dedication and insight that they reflect; others appear to flatly fail the Turing test.

It is also quite striking that there are different kinds of requesters. Turker Nation gives us insight into the differences between one requester and the next, some better behaved than others.

What is particularly interesting are the differences among requesters who are working in the area of crowdsourcing for information retrieval and related applications. One might expect there to be some homogeneity or consensus here. At the moment, however, I am reviewing some papers involving crowdsourcing, and no one seems to be asking themselves the same questions that I ask myself when I design a crowdsourcing task.

It seems worthwhile to get my list of questions out of my head and into a form where other people can have a look at it. These questions are not going to make HIT (Human Intelligence Task) design any easier, but I do strongly feel that asking them should belong to crowdsourcing best practices. And if you do take time to reflect on these aspects, your HIT will in the end be better designed and more effective.
  1. How much agreement do I expect between workers? Is my HIT "mechanical" or is it possible that even co-operative workers will differ in opinion on the correct answer? Do I reassure my workers that I am setting them up for success by signaling to them that I am aware of the subjective component of my HIT and don't have unrealistic expectations that all workers will agree completely?
  2. Is the agreement between workers going to depend on workers' background experience (familiarity with certain topics, regions of the world, modes of thought)? Have I considered setting up a qualification HIT to do recruitment? Or have I signaled to workers what kind of background they need to be successful on the HIT?
  3. Have other people run similar HITs, and have I read their papers to avoid making the same mistakes again?
  4. Did I consider using existing quality control mechanisms, such as Amazon's Mechanical Turk Masters?
  5. Is the layout of my HIT 'polite'? Consider concrete details: Is it obvious that I did my best to minimize non-essential scrolling? But all in all: Does it look like I have ever actually spent time myself as a worker on the crowdsourcing platform that I am designing tasks for?
  6. Is the design of my HIT respectful? Experienced workers know that it is necessary for requesters to build in validation mechanisms to filter spurious responses. However, these should be well designed so that they are not tedious or insulting for conscientious workers who are highly engaged in the HIT: a clumsy check is annoying and breaks the flow of work.
  7. Is it obvious to workers why I am running the HIT? Do the answers appear to have a serious, practical application?
  8. Is the title of my HIT interesting, informative and attractive?
  9. Did I consider how fast I need the HIT to run through when making decisions about award levels and also about when I will be running the HIT (for example, on the weekend)?
  10. Did I consider what my award level says about my HIT? High award levels can attract treasure seekers. However, award levels that are too low are bad for my reputation as a requester.
  11. Can I make workers feel invested in the larger goal? Have I informed workers that I am a non-profit research institution or otherwise explained (to the extent possible) what I am trying to achieve?
  12. Do I have time to respond to individual worker mails about my HIT? If not, then I should wait until I have time to monitor the HIT before starting it.
  13. Did I consider how the volume of HIT assignments that I am offering will impact the variety of workers that I attract? (Low-volume HITs attract workers who are less interested in rote tasks.)
  14. Did I give examples that illustrate what kind of answers I expect workers to return for the HIT? Good examples will let workers concerned about their reputations judge in advance if you are likely to reject their work.
  15. Did I inform workers of the conditions under which they could be expected to earn a bonus for the HIT?
  16. Did I make an effort to make the HIT intellectually engaging in order to make it inherently as rewarding as possible to work on?
  17. Did I run a pilot task, especially one that asks workers for their opinions on how well my task is designed?
  18. Did I take a step back and look at my HIT with an eye to how it will enhance my reputation as a requester on the platform? Will it bring back repeat customers (i.e., people who have worked on my HITs before)?
  19. Did I consider the impact of my task on the overall ecosystem of the crowdsourcing platform? If I indiscriminately accept HITs without a responsible validation mechanism, I encourage workers to give spurious responses, since they have been reinforced in the strategy of attempting to earn awards while investing a minimum of effort.
  20. Did I consider the implications of my HIT for the overall development of crowdsourcing as an economic activity? Does my HIT support my own ethical position on the role of crowdsourcing (that we as requesters should work towards fair work conditions for workers and that they should ultimately be paid the US minimum hourly wage for their work)? It's a complicated issue.
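The first question on the list, how much agreement to expect between workers, can be checked on pilot data before scaling up. The following is a minimal sketch, not a prescribed method: it assumes each worker labeled the same items in the same order, and the function names are hypothetical. It computes raw agreement and Cohen's kappa, a standard chance-corrected agreement measure for two annotators.

```python
from collections import Counter
from itertools import combinations


def percent_agreement(labels_a, labels_b):
    """Fraction of items on which two workers gave the same label."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("label lists must be non-empty and of equal length")
    matches = sum(a == b for a, b in zip(labels_a, labels_b))
    return matches / len(labels_a)


def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa: agreement between two workers, corrected for chance."""
    observed = percent_agreement(labels_a, labels_b)
    n = len(labels_a)
    counts_a = Counter(labels_a)
    counts_b = Counter(labels_b)
    # Chance agreement: probability that both workers pick the same label
    # if each labels independently according to their own label frequencies.
    expected = sum(counts_a[label] * counts_b[label] for label in counts_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both workers used one identical label throughout
    return (observed - expected) / (1.0 - expected)


def mean_pairwise_kappa(labels_by_worker):
    """Average Cohen's kappa over all pairs of workers (hypothetical helper)."""
    pairs = list(combinations(labels_by_worker.values(), 2))
    if not pairs:
        raise ValueError("need at least two workers")
    return sum(cohens_kappa(a, b) for a, b in pairs) / len(pairs)
```

A low kappa on a pilot run does not by itself mean that workers are careless; it may simply indicate that the HIT has a subjective component that the instructions should acknowledge.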
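The questions about award levels and minimum wage come down to simple arithmetic once a pilot has told you how long an assignment takes. The sketch below is only an illustration under stated assumptions: it uses the US federal minimum wage of 7.25 USD per hour (state rates differ), it takes the median completion time from a pilot as the time estimate, and the function names are hypothetical.

```python
# Assumption: US federal minimum wage in USD per hour (in effect since 2009).
US_FEDERAL_MINIMUM_WAGE = 7.25


def effective_hourly_wage(reward_usd, median_seconds_per_assignment):
    """Hourly wage implied by a per-assignment reward and a typical completion time."""
    if median_seconds_per_assignment <= 0:
        raise ValueError("completion time must be positive")
    return reward_usd * 3600.0 / median_seconds_per_assignment


def reward_for_target_wage(median_seconds_per_assignment,
                           target_hourly_wage=US_FEDERAL_MINIMUM_WAGE):
    """Per-assignment reward needed to reach a target hourly wage."""
    if median_seconds_per_assignment <= 0:
        raise ValueError("completion time must be positive")
    return target_hourly_wage * median_seconds_per_assignment / 3600.0


if __name__ == "__main__":
    # Example: a 0.05 USD reward for an assignment that takes about one minute.
    wage = effective_hourly_wage(0.05, 60)
    needed = reward_for_target_wage(60)
    print(f"effective wage: {wage:.2f} USD/hour")
    print(f"reward needed for minimum wage: {needed:.3f} USD per assignment")
```

The calculation ignores platform fees and the time workers spend searching for and reading HITs, so the real hourly rate is likely to be lower than the figure it returns.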
The workers on Mechanical Turk refer to themselves as "turkers". This act of self-naming signals a sense of community, of a common understanding of what they are doing, the commonality of the activity that they are all engaged in.

What do we as requesters call ourselves? Do we have a sense of community, too? Do we enjoy the strength that derives from a shared sense of purpose?

The classical image of Wolfgang von Kempelen's automaton, the original Mechanical Turk, is included above since I think it sheds some light on this issue. Looking at the image, we ask ourselves who should most appropriately be designated "turker". Well, it's not the worker, who is the human in the machine. Rather, it is the figure who is dressed as an Ottoman and is operating the machine: if workers consider themselves turkers, then we the requesters must be turkers, too.

The more that we can foster the development of a common understanding of our mission, the more that we can pool our experience to design better HITs, the more effectively we can hope to improve information retrieval by using crowdsourcing.