Friday, May 27, 2011

Catching Devilish Scammers on MTurk

The past two days I've been working on creating positive and negative controls for the 'Generate Newsworthy HIT'. These are important in case we happen upon cheaters who randomly choose options for the tasks - think of the controls as proctors for the SATs. At the same time, they will not punish the good Turkers who work hard on our HITs. :)

Generate Newsworthy HIT
The goal of this HIT is to ask Turkers to help us decide whether or not our top 1000 Wikipedia pages (ranked by page views) are truly newsworthy, that is, whether each article is current now or was current in the past. Sometimes random articles make it into our top 1000 because we are only relying on page views and relatively simple algorithms at the moment. Automating this check will probably be one of the more difficult parts of the project, so to get more information on how 'newsworthy' our articles currently are, we submit this HIT to MTurk.

Chris sent me new introductions for the HIT. Pretty!




Positive Controls
The overall idea of the positive controls is that we collect the articles Wikipedia puts on its current news page each day and intersect them with our top 1000 articles [over a period of days, so maybe 2000-15000+]. Any article that shows up in both lists is assumed to be genuinely current, so Turkers should always choose 'Yes, the article is current' or 'Yes, the article was current' for it. We will include one of these intersection articles somewhere within the 12 articles (we've only set up 10 currently, but will soon move to 12) we give the Turkers to label for newsworthiness.

This is the generated output of the generate_positive.py script (a rough Python sketch of the intersection step follows the list). The blocks of articles appear as follows:

1. Articles pulled from Wikipedia's Current News Portal (Portal!)
2. Aggregated articles that we generated from page views over a period of time
3. Intersecting articles mentioned both in the Wikipedia Current News Portal and in our top aggregated articles
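
Here is roughly what that intersection boils down to. The file names and the one-title-per-line format are my assumptions for illustration, not necessarily how generate_positive.py actually reads its input:

# Sketch of the intersection step (assumed file names/formats, Python 2).
def read_titles(path):
    # one article title per line -> set of titles
    with open(path) as f:
        return set(line.strip() for line in f if line.strip())

portal_titles = read_titles('current_news_portal.txt')  # block 1: Current News Portal articles
top_titles = read_titles('top_articles.txt')            # block 2: our top articles by page views

# block 3: articles in both lists - Turkers should always mark these as current
positive_controls = portal_titles & top_titles
for title in sorted(positive_controls):
    print title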


Negative Controls
The second quality control we have for the HIT is the negative control: articles that should be labeled 'No, the article is not current'. We generate them by simply taking random articles from Wikipedia and checking that they are not in our top 1000 articles [over a period of days] (and possibly also not mentioned in Wikipedia's Current News Portal - I'm thinking about adding this check). This way, we'll have a collection of articles that are surely not current! Along with the positive control article, we will include a negative control article somewhere within the 12 articles given to the Turkers. This combination of controls sets a quality standard for Turker submissions so that we can better decide which submissions to consider.

This is the generated output of the generate_negative.py script (a rough Python sketch of the set-difference step follows the list). The blocks of articles appear as follows:

1. Articles randomly generated from Wikipedia
2. Aggregated articles that we generated from page views over a period of time
3. (Set difference) Randomly chosen articles from step 1 that are not in our top aggregated articles
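
And the matching sketch of the set-difference step, again with assumed file names:

# Sketch of the set-difference step (assumed file names/formats, Python 2).
import random

def read_titles(path):
    with open(path) as f:
        return set(line.strip() for line in f if line.strip())

random_titles = read_titles('random_articles.txt')  # block 1: random Wikipedia articles
top_titles = read_titles('top_articles.txt')        # block 2: our top articles by page views

# block 3: random articles NOT in the top list - Turkers should mark these as not current
negative_controls = random_titles - top_titles
print random.choice(list(negative_controls))        # e.g. pick one negative control for a HIT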




I was glad to hear that both Chris and Byung Gyu were proud to see me finish off these controls. ^^ The project seems to be making great progress, and I imagine that we'll soon be able to release the HIT live for Turkers to do. The plan is to automate generate_positive.py and generate_negative.py so that they pick out two control articles whose answers we are fairly sure of, and to release the HIT with those controls plus the 10 unknown articles we want labeled for newsworthiness (a toy sketch of that assembly is below). This will be a daily HIT for Turkers to complete! I wonder if we can even automate judging newsworthiness without MTurk in the future!
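
Just to make the plan concrete, here's a toy version of the daily assembly step. All of the names and titles are made up for illustration, not the real scripts' interfaces:

# Toy sketch: 10 unknown articles plus one positive and one negative control,
# shuffled so Turkers can't tell which items are the controls.
import random

def build_daily_hit(unknown_articles, positive_control, negative_control):
    articles = list(unknown_articles) + [positive_control, negative_control]
    random.shuffle(articles)
    return articles

unknowns = ['Article_%d' % i for i in range(10)]   # the 10 articles we actually want labeled
hit_articles = build_daily_hit(unknowns, 'Osama_bin_Laden', 'Marilyn_Monroe')
print hit_articles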

MediaWiki
You can use the MediaWiki software to create wiki pages (it has an API, too), and it has some interesting markup for writing articles on Wikipedia without needing to know HTML. For instance (pulled from Wikipedia - so meta :P):

MediaWiki syntax:

"Take some more [[tea]]," the March Hare said to Alice, very earnestly.

"I've had nothing yet," Alice replied in an offended tone: "so I can't take more."

"You mean you can't take ''less''," said the Hatter: "it's '''very''' easy to take ''more'' than nothing."

Equivalent HTML:

<p>"Take some more <a href="/wiki/Tea" title="Tea">tea</a>," the March Hare said to Alice, very earnestly.</p>

<p>"I've had nothing yet," Alice replied in an offended tone: "so I can't take more."</p>

<p>"You mean you can't take <i>less</i>," said the Hatter: "it's <b>very</b> easy to take <i>more</i> than nothing."</p>

Rendered output:

"Take some more tea," the March Hare said to Alice, very earnestly.

"I've had nothing yet," Alice replied in an offended tone: "so I can't take more."

"You mean you can't take less," said the Hatter: "it's very easy to take more than nothing."

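Since the API came up: here's roughly how you could pull an article's raw wikitext (the markup shown above) from Python using the standard MediaWiki query API. This is just a generic example, not something our scripts necessarily do:

# Fetch the raw wikitext of an article via the MediaWiki API (action=query, prop=revisions).
import urllib, urllib2
import simplejson

params = urllib.urlencode({
    'action': 'query',
    'titles': 'Twitter',
    'prop': 'revisions',
    'rvprop': 'content',
    'format': 'json',
})
data = simplejson.load(urllib2.urlopen('http://en.wikipedia.org/w/api.php?' + params))

page = data['query']['pages'].values()[0]     # one page entry, keyed by page id
wikitext = page['revisions'][0]['*']          # the raw MediaWiki markup
print wikitext[:200].encode('utf-8')
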
Python
+ import datetime (Python module for working with dates and times)
   + date = datetime.date(2011, 5, 11) (creates a date object; careful - a leading zero like 05 makes the number an octal literal in Python 2)
+ import simplejson | import json as simplejson (using the JSON format)
+ import os; path = os.environ["HOME"] (reading environment variables in Python)
+ u'The_Pirate_Bay' (unicode strings are different from regular strings! You can't get rid of the leading u by slicing, because the u is part of the type, not the text)
+ var = [string].encode("utf-8") (the right way to convert a unicode string to a regular UTF-8 byte string - see the short sketch below)
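
A tiny Python 2 sketch of the unicode point (the title is just an example):

title = u'The_Pirate_Bay'

print repr(title[2:])                # u'e_Pirate_Bay' - slicing can't remove the u,
                                     # since the u is part of the type, not the text
byte_title = title.encode('utf-8')   # the right way to get a plain byte string
print type(title), type(byte_title)  # <type 'unicode'> <type 'str'>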

How I broke down problems
+ It was really great to figure things out - I learned a whole lot of stuff as I made progress. One thing I realized is that it's important to remember that there exist many types of objects and concepts, even though they may appear similar to one another. By keeping an open mind, you have a direction for solving those odd bugs/problems.

Todo List
+ generate the first section and first sentence of the randomly chosen negative controls
+ geolocation services for the 'Generate Newsworthy HIT' (45-60mins)

Wednesday, May 25, 2011

Fresh Start

I've been talking with my parents a lot lately about school and our ambitions as a whole to be successful and happy. I haven't had a chance to do that and just think about the future in a while because of the time put into this semester's courses, but it's a lucky feeling to have been able to since summer started. My boyfriend and I even walked for an hour and a half to the Inner Harbor in Baltimore - I love things like this. I DID come out with blisters between my toes (because someone wore flip-flops for the trip ...), but I'd trade that any day for an amazing trip like that. It's only been a few days into summer but I feel like so much has happened already, and I'm constantly learning every day about myself and natural language processing at my research lab.

These are all the tasks I completed yesterday:

Todo List
+ (check) learn about ssh/secure shell
+ (not yet!) complete 'Generate Newsworthy HIT'
   + (check) get a more solid understanding of JavaScript/CSS/HTML
+ (check) correct escaped string in newsworthy.csv
+ (check) getting acquainted with GitHub
+ (check) start learning Perl

CSS/HTML
<!-- COMMENT --> (how to comment)

JavaScript
How to Show/Hide text
1. create a controlling link with <a href="javascript:function(id);">Click</a>
2. give elements an id <div id = ...
3. javascript function: getElementsByTagName("div")
4. use the .style property to get the element's style object, e.g. if (item.style.display == 'block') ...
(I followed this tutorial: http://webdesign.about.com/od/dhtml/a/aa101507.htm)

Linux Commands
+ ln -s [target] [link] (create a symbolic link, e.g. a shortcut into another user's directory, with the target's read/write permissions still in effect)
+ scp [file] username@domain.name:directory (copy a file to a remote machine; swap the arguments to pull a remote file onto the local one)

SSH/Secure Shell
Secure Shell is a program (originally developed by SSH Communications Security Ltd.) that allows users to log into other computers over a network. It uses public-key cryptography, matching a public key against a private key, to let users into the system (and may also prompt for a password). Through this, I can log into another computer from my laptop, store files there, move them to my own machine, and make whatever other edits I want.

Generate Newsworthy HIT

+ added paragraph that Chris wrote, fixed up little errors in value options
+ fixed the tables (much more quickly than before). The time I spent consolidating some CSS concepts (tables) really helped :)
+ switched the document to a JavaScript-generated HIT template, which could then be used to unescape the strings AND is much, much neater

How to approach problems

+ Sometimes it's a good idea to make a fresh start. Perhaps you've got a bunch of code with a bug in it somewhere, but you just can't find it. Opt for pulling up an older version or starting from scratch if you feel like debugging is taking way too long. I frequently do this for bugs that are right in front of my nose but overlooked out of familiarity.
+ It's a good idea to design your code beforehand if it has many things going on at once. I had to rewrite the Generate Newsworthy HIT, and I felt that I could have been more efficient if I had planned out what needed to be done beforehand. All in all, I did learn much more about CSS/HTML/HIT design in general, so the first draft served as a template for me to write even better code.
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

These are the tasks Chris sat down and figured out with me today



Perl
+ $ (a scalar - a single value, like an integer or a string) sigil
+ @ (an array - multiple values) sigil
+ substr([string], [starting character], [length of substring])
+ open FILE, "<$path" (opening a file for reading)
+ <FILE> (read the file one line at a time)
  - (tutorial at http://www.perlfect.com/articles/perlfile.shtml)

Yay, I've learned some Perl and fixed the generate_newsworthy.pl code so that it can also generate the HIT with each article's first sentence. So if the Wikipedia article is "Twitter", the variable lead_sentence# will hold "Twitter is a website, owned and operated by Twitter Inc., which offers a social networking and microblogging service, enabling its users to send and read messages called tweets."
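
The real code is in Perl, but here's a rough Python illustration of the first-sentence idea. It's a crude heuristic (abbreviations like 'U.S.' can still fool it), not the actual logic in generate_newsworthy.pl:

# Cut at the first period that is followed by whitespace and a capital letter.
import re

def lead_sentence(text):
    match = re.search(r'^(.+?\.)\s+[A-Z"]', text, re.S)
    return match.group(1) if match else text

paragraph = ("Twitter is a website, owned and operated by Twitter Inc., which offers "
             "a social networking and microblogging service, enabling its users to send "
             "and read messages called tweets. It was created in March 2006.")
print lead_sentence(paragraph)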

This is the HIT currently:

(HIT template)
  

(example HIT)


Todo List
+ make worksheets since Theresa asked me to teach Molly and Amy about Mechanical Turk

(need to add more............)

Tuesday, May 24, 2011

Back on Campus

First day on the job! I'm participating in natural language processing research this summer, and I decided to keep a blog of my thoughts and what I've learned throughout. It'll be really nice to have something to look back on in the future, plus it lets my family know that I'm doing alright! I also have my 'Summer List' of things I want to do all ready, such as brainstorming beforehand for my game design course this fall, practicing my Chinese more, and finally learning guitar. I'll write about them, too! :D Anyways, here's how my first day went.

My supervisors Chris Callison-Burch and Ben Van Durme assigned me a project called 'WikiTopics' that Ph.D. student Byung Gyu Ahn had begun working on (he'll be guiding me as well while doing research at Microsoft this summer). Our main project goal is basically to create a completely automated news aggregator of current news articles, clustered into groups and categories, based on a data repository of the number of page views of all Wikipedia articles (it's been monitored hourly for the past 3 years and is still going!) along with other sources, including Twitter and other social networking websites. For instance, let's say that the top 5 articles of the day are 'Rebecca Black', 'Japan Earthquake', 'Osama Bin Laden', 'Waterboarding', and 'Marilyn Monroe'. We want to be able to determine why the articles received such large numbers of hits, whether they actually pertain to any presently occurring events or are essentially random, and what category each article belongs in - these are the main focuses, except that we'll be doing it with the top 1000 articles as opposed to the top 5. (Rebecca Black, Japan Earthquake, Osama Bin Laden, and Waterboarding should all be part of our collection, while Marilyn Monroe should not.)

So after figuring out some logistics, such as setting up computer accounts at the Human Language Technology Center of Excellence (COE) and the Center for Language and Speech Processing (CLSP), filing for a building access card, and figuring out the summer schedule, we dove right in, beginning with the 'Generating Newsworthy' articles portion of the project. As mentioned above, one significant focus of the project is to determine which articles are 'current' or 'newsworthy' and which are not. I spent some time just getting comfortable with what Byung Gyu had worked on and syncing it to my account. This included the generate_newsworthy.sh shell script.

Success! We got it working on my account, and I began moving on to improving a Human Intelligence Task (HIT) for Amazon Mechanical Turk (MTurk) that asks Turkers to help us make an initial decision about whether the articles are newsworthy or not. In layman's terms, I'm coding a task in JavaScript/CSS/HTML and uploading it to a website where many, many people can complete it and get paid (by us!). It's a pretty neat site, and Chris also wrote a paper analyzing it (reading it was one of the first things he asked me to do back in November 2010).

I also got a crash course on simple Linux and emacs commands (dual wield with vi!). Honestly, I felt pretty dumb initially doing things the long way. I've put a list of the new commands I learned today below so you don't have to feel that way. :P Lastly, I'm starting to get comfortable with GitHub, and can hopefully import the WikiTopics-related parts of this blog over there with more coding bits. Overall, it was a great first day! I called my Mom shortly after, telling her how happy I was and wishing my sister the best of luck on finals, whilst walking in the wrong direction for a good 10 minutes lol. Cheers to an awesome summer!

Linux Commands
ssh [address] (secure shell .. on the to-do list to understand more)
du -hs [path] | sort -nr > [file] (calculating disk usage and sorting it from largest to smallest; drop the -h flag if sort -n gets confused by the human-readable sizes)
nohup (process can run even after logging out! man I needed this for some of my classes)
screen  (can detach/attach terminal sessions and also share them with others)
tail -f [file] (can watch a file live as it's being written!)

(following are for dummies)
control + r (search)
control + k (delete rest of line)
control + a (go to beginning)
control + e (go to end)
control + b (backwards one space; a.k.a. the left arrow key)
  
Emacs commands (hm no need to type 'i' before typing)
control + x - control + s (save)
control + x - control + c (exit)
control + g (ignore partially typed command)
control + xu (undo)
control + y (paste, a.k.a. yank)
control + space - move cursor - control + w (cut)

GitHub commands
git add [file] (stage the file for the next commit)
git commit -m 'first commit' (record the staged changes with a message)
git remote add origin git@github.com:[username]/[repository] (point the local repo at GitHub)
git push -u origin master (upload the commits to GitHub and track the remote branch)
(I'm still not completely sure what these commands do, so gotta do that!)

Todo List
+ learn about ssh/secure shell
+ complete 'Generate Newsworthy HIT'
   + get a more solid understanding of JavaScript/CSS/HTML
+ correct escaped string in newsworthy.csv
+ getting acquainted with GitHub
+ start learning Perl

How I broke down problems
I definitely ran into problems here and there, such as Permission Denied and No such file or directory messages, CLSP account-related bits, and figuring out new concepts such as environment variables. My first thought whenever I ran into an obstacle was that perhaps I'd already been given the answer to getting past it. For instance, when I received the message WIKITOPICS environment variable not set, it meant complete baloney to me at first, but after a few quick searches on Wikipedia as well as going through a few emails, there was an aha-moment, and I realized that I just needed to type 'WIKITOPICS=/home/...' to set the variable. I also went to Chris to ask questions, not only to help me resolve the problems, but to feel comfortable that it's okay to get stuck. Just remember, there's most likely a very logical path through the problem, so don't give up! *looks at the people IT support gets mad at*
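
For the curious, here's my guess at the kind of check that produces that message and how the variable gets used once it's set. This is not the actual WikiTopics code, just an illustration:

# Guessing at the shape of the environment-variable check (not the real script).
import os, sys

if 'WIKITOPICS' not in os.environ:
    sys.exit('WIKITOPICS environment variable not set')

wikitopics_home = os.environ['WIKITOPICS']  # e.g. set beforehand in the shell
print 'using data under', wikitopics_home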