Friday, May 27, 2011

Catching Devilish Scammers on MTurk

The past two days I've been working on creating positive and negative controls for the 'Generate Newsworthy HIT'. These are important for catching cheaters who randomly choose options for the tasks; the controls act a bit like proctors for the SATs. At the same time, they won't punish the good Turkers who work hard on our HITs. :)

Generate Newsworthy HIT
The goal of this HIT is to ask Turkers to help us decide whether or not our top 1000 Wikipedia pages (based on pageviews) are truly newsworthy, i.e. whether each article is or was current. Sometimes random articles make it into our top 1000 because we are only relying on page views and relatively simple algorithms at the moment. Automating this check will probably be one of the more difficult parts of the project, so to learn more about how 'newsworthy' our current set of articles is, we submit this HIT to MTurk.

Chris sent me new introductions for the HIT. Pretty!




Positive Controls
The overall idea of the positive controls is that we collect the articles Wikipedia features on its current news page each day. We then intersect Wikipedia's current news with our top 1000 articles [collected over a period of days, so perhaps 2000-15000+] by checking which of Wikipedia's current news articles match our generated top 1000. We assume the articles in this intersection are positively current, so Turkers should always choose 'Yes, the article is current' or 'Yes, the article was current' for them. We will include one of these intersection articles somewhere within the 12 articles (we've only set up 10 currently, but will soon add in 12) that we give the Turkers to label for newsworthiness.

This is the generated output of the generate_positive.py script (a rough sketch of the intersection step follows the list). The blocks of articles appear as follows:

1. Articles pulled from Wikipedia's Current News Portal (Portal!)
2. Aggregated articles that we generated by page views over a period of time
3. Intersecting articles mentioned both in the Wikipedia Current News Portal and our top aggregated articles
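
Here's a minimal sketch of the intersection step, just to make the idea concrete. The file names (current_news.txt, top_articles.txt) and the one-title-per-line format are assumptions for illustration; the real generate_positive.py may read its data differently.

# Sketch of the positive-control step: intersect the Current News Portal
# titles with our pageview-based top articles (file names are assumed).
def load_titles(path):
    # one article title per line, normalized to Wikipedia's Title_Case form
    with open(path) as f:
        return set(line.strip().replace(' ', '_') for line in f if line.strip())

current_news = load_titles('current_news.txt')
top_articles = load_titles('top_articles.txt')

# Articles in both sets are assumed to be 'positively current'.
positive_controls = sorted(current_news & top_articles)
for title in positive_controls:
    print(title)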


Negative Controls
The second quality control we have for the HIT is the negative control: articles that should be labeled 'No, the article is not current'. We generate the negative controls by simply taking random articles from Wikipedia and checking that they're not in our top 1000 articles [over a period of days] (and possibly also not mentioned in Wikipedia's Current News Portal - I'm thinking about adding this check). This way, we'll have a collection of articles that are almost surely not current! Along with the positive control articles, we will include the negative control articles somewhere within the 12 articles given to Turkers. This combination of controls sets a quality standard for Turker submissions so that we can better decide which submissions to accept.

This is the generated output of the generate_negative.py script (a rough sketch of the set-difference step follows the list). The blocks of articles appear as follows:

1. Articles randomly generated from Wikipedia
2. Aggregated articles that we generated by page views over a period of time
3. (Set difference) Randomly generated articles from step 1 that are not in our top aggregated articles
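
And a matching sketch for the set-difference step, under the same assumptions about file names and one-title-per-line format (random_articles.txt holding the randomly sampled titles); again, this is an illustration rather than the actual generate_negative.py.

# Sketch of the negative-control step: keep random articles that are NOT in
# our top articles (and optionally not in the Current News Portal either).
def load_titles(path):
    with open(path) as f:
        return set(line.strip().replace(' ', '_') for line in f if line.strip())

random_articles = load_titles('random_articles.txt')
top_articles = load_titles('top_articles.txt')
current_news = load_titles('current_news.txt')

negative_controls = random_articles - top_articles
negative_controls -= current_news   # the optional extra check mentioned above
for title in sorted(negative_controls):
    print(title)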




I was glad to hear that both Chris and Byung Gyu were proud to see me finish off these controls. ^^ The project seems to be making great progress, and I imagine that we'll soon be able to release the HIT live for Turkers to do. The plan is to automate generate_positive.py and generate_negative.py to pick two control articles whose answers we are fairly sure of, and release the HIT with those controls mixed in among 10 unknown articles we want labeled for newsworthiness. This will be a daily HIT for Turkers to complete! I wonder if we can even automate judging newsworthiness without MTurk in the future!
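
If it helps to picture the daily HIT, here's a rough sketch of how the 12-article batch could be put together; the function name and the shuffle are my own illustration, not the actual release code.

import random

def build_hit(unknown_articles, positive_control, negative_control):
    # Mix one positive and one negative control in with the 10 unknown
    # articles, then shuffle so Turkers can't tell which items are controls.
    assert len(unknown_articles) == 10
    articles = list(unknown_articles) + [positive_control, negative_control]
    random.shuffle(articles)
    return articles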

MediaWiki
You can use the MediaWiki software to create wiki pages (it also has an API), and its markup gives you some interesting ways to write articles on Wikipedia without needing to know HTML. For instance (pulled from Wikipedia - so meta :P):

MediaWiki syntax:

"Take some more [[tea]]," the March Hare said to Alice, very earnestly.

"I've had nothing yet," Alice replied in an offended tone: "so I can't take more."

"You mean you can't take ''less''," said the Hatter: "it's '''very''' easy to take ''more'' than nothing."

Equivalent HTML:

<p>"Take some more <a href="/wiki/Tea" title="Tea">tea</a>," the March Hare said to Alice, very earnestly.</p>

<p>"I've had nothing yet," Alice replied in an offended tone: "so I can't take more."</p>

<p>"You mean you can't take <i>less</i>," said the Hatter: "it's <b>very</b> easy to take <i>more</i> than nothing."</p>

Rendered output:

"Take some more tea," the March Hare said to Alice, very earnestly.

"I've had nothing yet," Alice replied in an offended tone: "so I can't take more."

"You mean you can't take less," said the Hatter: "it's very easy to take more than nothing."

Python
+ import datetime (the datetime module, for working with dates and times)
   + date = datetime.date(2011, 5, 11) (creates a date object)
+ import simplejson | import json as simplejson (using the JSON format)
+ path = os.environ["HOME"] (environment variables in Python)
+ u'The_Pirate_Bay' (unicode strings are different from regular strings! You can't get rid of the leading u by slicing, e.g. string[-2:], because the u isn't actually part of the string's contents)
+ var = some_unicode_string.encode("utf-8") (the right way to convert a unicode string to a regular string in UTF-8)
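
A tiny sketch putting these notes together (the path, date, and title are just placeholders):

import os
import datetime
import json as simplejson

date = datetime.date(2011, 5, 11)      # a date object
path = os.environ["HOME"]              # an environment variable
title = u'The_Pirate_Bay'              # a unicode string
encoded = title.encode("utf-8")        # the regular (byte) string version

print(simplejson.dumps({'date': str(date), 'home': path, 'title': title}))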

How I broke down problems
+ Really great to figure things out; I learned a whole lot of stuff as I made progress through it. One thing I realized is that it's important to remember that there exist many types of objects and concepts even though they may appear similar to one another (unicode vs. regular strings, for example). By keeping an open mind, you have a direction for solving those odd bugs/problems.

Todo List
+ generate the first section and first sentence of the randomly chosen negative controls
+ geolocation services for the 'Generate Newsworthy HIT' (45-60 mins)
