Monday, June 6, 2011

"Grades are back"

Now that the Generate Newsworthy HITs have been released, our main objective is to write an evaluation script that decides whether to approve or reject each Turker's submission. (It's also disconcerting to see that not many Turkers are interested in these HITs; we may need to figure out how to address that.)

Chris and I had written a very similar script back in the winter during an event called mini-SCALE, where all the researchers were brought into one conference room with the common purpose of making progress on their projects and presenting it at the end of the week-long event. My focus was the Haitian-English Anonymization HIT, in which the task was to anonymize (cross out phone numbers, names, and other personal information) messages that were sent for aid during the Haiti earthquake. The evaluation script for that HIT would read in a .csv file downloaded from MTurk, parse the answers, and then write another .csv file with the Approve and Reject columns filled out: an x for 'Yes, this Turker did a great job; we'll accept and pay for the submitted work,' and a message for 'Sorry, this Turker did not submit quality work.' A rough sketch of that read-parse-write pattern is below.
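As an illustration, here is a minimal sketch of that pattern using Python's standard csv module. The answer column name, the is_acceptable() check, and the file names are hypothetical stand-ins, not the actual script's logic:

```python
import csv

REJECT_MESSAGE = "Sorry, this Turker did not submit quality work."

def is_acceptable(row):
    # Placeholder quality check: just require a non-empty answer field.
    # The actual script analyzed the text of the submitted answers.
    return row.get("Answer.anonymized_text", "").strip() != ""

def grade(infile, outfile):
    # Read the results .csv downloaded from MTurk.
    with open(infile, newline="") as f:
        reader = csv.DictReader(f)
        rows = list(reader)
        fieldnames = list(reader.fieldnames)

    # Make sure the Approve/Reject columns exist before writing.
    for col in ("Approve", "Reject"):
        if col not in fieldnames:
            fieldnames.append(col)

    for row in rows:
        if is_acceptable(row):
            row["Approve"] = "x"   # 'x' = accept and pay for the work
            row["Reject"] = ""
        else:
            row["Approve"] = ""
            row["Reject"] = REJECT_MESSAGE  # explanation for the Turker

    # Write the graded .csv, ready to upload back to MTurk.
    with open(outfile, "w", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=fieldnames)
        writer.writeheader()
        writer.writerows(rows)

if __name__ == "__main__":
    grade("anonymization_results.csv", "anonymization_graded.csv")
```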


Similarly, the new script, grading-wikitopics.py (link here), fills out the Approve and Reject columns for the Generate Newsworthy HIT. You can see the output when it is run:


[Screenshot: .csv file with HITs that I filled out :D]

Chris recommended that I do some of my own HITs to collect data for grading-wikitopics.py, since we need to know where to set the threshold percentages for accepting or rejecting a Turker's HIT. In other words, if a Turker's submission is x% blank, do we reject it or accept it? If a submission gets some of the control questions (questions we included whose answers we already know) wrong, how many can the Turker miss before we reject the work? We allow some leeway, since not every Turker will submit a perfect submission, and we don't need every submission to be perfect: we can manage quite well with submissions that have missing fields. Of course, we still want the collected data to be as accurate as possible, so we are shooting for a balance between accuracy and quantity; a rough sketch of this thresholding decision follows.
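Here is a minimal sketch of that decision, assuming two cutoffs. The names and numbers (should_approve, BLANK_THRESHOLD, MIN_CONTROL_CORRECT) are illustrative guesses, since choosing these values is precisely what the collected data is for:

```python
BLANK_THRESHOLD = 0.5       # reject if over half the fields are blank
MIN_CONTROL_CORRECT = 0.75  # reject if under 75% of controls are right

def should_approve(answers, controls):
    """answers: field name -> Turker's response.
    controls: field name -> known correct answer."""
    # First test: how much of the submission was left blank?
    blank = sum(1 for v in answers.values() if not v.strip())
    if blank / len(answers) > BLANK_THRESHOLD:
        return False
    # Second test: did the Turker get enough control questions right?
    correct = sum(1 for field, expected in controls.items()
                  if answers.get(field, "").strip() == expected)
    return correct / len(controls) >= MIN_CONTROL_CORRECT
```

Doing my own HITs should show whether cutoffs like these are too strict or too lenient.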

It was also very interesting to be in the position of doing the HITs myself, because it gave me insight into how I could improve the HIT design and instructions. Having both perspectives on the Generate Newsworthy HIT is a great advantage, since I gain a stronger understanding of how to communicate my goals and minimize any ambiguity or misunderstandings. These are some of the updates I would like to make in the next version of the HIT:

     1. Tell Turkers to follow these exact instructions:
         a. Google the topic through both the Web tab and News tab
         b. Look for any news articles around ${date} that mention the topic, and select options as follows
         c. If few or no articles appear on the topic, then you can regard it as a non-newsworthy topic
     2. Figure out how to make the HIT more appealing (perhaps more concise instructions will do the job)
     3. Possibly include a segment about why we want to label these topics as newsworthy or not - Turkers might be more interested if we do.

After a quick push to the git repository, I'm off to do some more HITs for the grading-wikitopics.py script.
