Saturday, June 4, 2011

COMPLETE AUTOMATION

My sister's pretty awesome. She's a smart girl, and she's not afraid to say what she thinks or believes. She always lights up the place with her weirdness and laughs, and I always have a blast doing something new or remembering the good old times whenever I'm with her. We'll frequently recall old TV shows that we used to get a kick out of watching, like Courage the Cowardly Dog or Flapjack, and then naturally burst out into role-play. Sometimes it feels like we're part of the same person; we can finish each other's sentences and know what the other is thinking before they say it. Sometimes it's hard to be away from Julie, and it's always a treat to talk with her on the phone or visit home again. Mom's been telling me how she's doing so much better in school now, staying after class to ask her teachers questions before the upcoming finals, and studying until she's confident of the material and concepts. I'm extremely proud of her, and it's sort of weird to see her grow up. I'm rooting for her to be the artist and game designer she dreams of becoming. ^_^

 <.jpg picture of her artwork>

Today was a big success! We posted our Generate Newsworthy HIT on Amazon Mechanical Turk just in time for the weekend. (We're doing a pilot study at the moment, so we're trying to get feedback on our HIT design and instructions. LET US COLLECT YOUR INFO! Come do our HITs :D) It took a lot of debugging, but I'm happy that Chris and I stuck with it to finally get this HIT up and running.

Moment it went up

Here's a quick run-down of the steps taken to get where we are now, as well as an assessment of the problems we had.

1. (check) Incorporate the positive and negative control outputs into generating the .csv file (a spreadsheet-style file where each column is a variable; we upload it onto MTurk to work with our HIT template)
2. (check) Fix any bugs in the whole spreadsheet, including (I sent this list to Chris last night):
      a. (check) The script prints outside of the columns <-- this should be a quick fix, the for loop is probably iterating more than needed

          The problem was that the loop was iterating twice, so it printed the controls twice, hence the two extra columns. Once we fixed that boundary mistake, the extra columns went away.

 
      b. (check) The script only uses one article for the positive control and one article for the negative control for all 100 <-- I can fix this by initially generating more control articles, putting them into a directory, and then using a for loop to iterate over each one
           We changed the format of the positive and negative controls to tab-separated lines where each line represents an article. The order of the properties of the article is as follows: title, trending score (I'm not too sure what this is, but we don't really use it o_O), the first sentence of its Wikipedia page, and the first section of its Wikipedia page. 
           This allowed us to read the file into an array and use a separate counter (e.g. $positive_counter = 0) for each control array. Each counter resets to 0 once it equals or exceeds the size of its array (well, it can never exceed the size of the array, but I guess we put it in for good measure. :))

Showing the one article and its tab separations (vi shows on multiple lines but everything in the outer pink is just one line)

      c. The csv files had bugs from the earlier, original script (I attached the old csv file we used), in that sometimes it doesn't print the lead_section variable, so the columns get messed up. I'm not sure how this error came to be, but I'm guessing that it probably came from within the .pl script
           What happened was that some of the files we were trying to read in simply did not exist! So the printing statement was skipped and thus, those lines of the .csv file were offset. We made it such that the loops would skip over any of the articles that did not have the necessary existing files. This is what that part of the code looks like:
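
           (The screenshot of our Perl code isn't included here. As a stand-in, here's a minimal sketch in Python, not our actual script, of the two fixes from 2b and 2c: cycling through the control articles with a wrap-around counter, and skipping any article whose files don't exist so the columns never get offset. The file names (`<title>.first_sentence.txt`, `<title>.lead_section.txt`), the function names, and the output column order are all made up for illustration; only the tab-separated control format comes from what we actually did.)

```python
from pathlib import Path


def load_controls(text):
    """Parse control articles, one per line, tab-separated as:
    title, trending score, first sentence, lead section."""
    controls = []
    for line in text.splitlines():
        if not line.strip():
            continue  # ignore blank lines
        title, trending, first_sentence, lead_section = line.split("\t", 3)
        controls.append({
            "title": title,
            "trending": trending,
            "first_sentence": first_sentence,
            "lead_section": lead_section,
        })
    return controls


def next_control(controls, counter):
    """Return the control at `counter` and the counter to use next time.
    The counter wraps back to 0 once it reaches the end of the list."""
    control = controls[counter]
    counter += 1
    if counter >= len(controls):
        counter = 0
    return control, counter


def build_rows(article_dir, titles, positives, negatives):
    """Build one row per article, skipping articles with missing files.
    File naming here is hypothetical."""
    rows = []
    positive_counter = 0
    negative_counter = 0
    for title in titles:
        first_path = article_dir / f"{title}.first_sentence.txt"
        lead_path = article_dir / f"{title}.lead_section.txt"
        # Skip the whole article instead of writing a short row.
        if not (first_path.exists() and lead_path.exists()):
            continue
        pos, positive_counter = next_control(positives, positive_counter)
        neg, negative_counter = next_control(negatives, negative_counter)
        rows.append([
            title,
            first_path.read_text().strip(),
            lead_path.read_text().strip(),
            pos["title"],
            neg["title"],
        ])
    return rows
```

           The rows could then be written out with Python's `csv.writer`, which also takes care of quoting commas inside the article text. Since an article is only added once all of its files are confirmed to exist, every row has the same number of columns.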


3. (check) Check for any bugs in the actual HIT template (the JavaScript/CSS/HTML I wrote to make the HIT look like what it looks like)
           We're good to go! We added some extra features to the HIT, including a separate column asking Turkers to indicate when the article was newsworthy. So they have two tasks: one, tell us whether this article is related to some newsworthy event, and two, tell us when this article was newsworthy. We also made it so that clicking the article takes you to its respective Wikipedia page.

Updated 'Generate Newsworthy HIT'

A look into the options for the second column

 A look into the csv generation :o
 
     a. Spend 45-60mins researching geolocation for the Turkers
            Turns out it really wasn't needed, so we didn't spend time on this. Maybe I can on the weekend. :P

-Random picture to separate different content- 

Another big thing that happened today was that Omar F. Zaidan, one of Jason's and Chris' Ph.D. students, gave me a rundown of Amazon's MTurk API (so much to MTurk!). Basically, what you can do with the API is completely automate the processes of uploading data, publishing HITs, paying workers - anything you can think of that you can do with the web interface. In other words, instead of going to the website and clicking on the actual buttons, you can run a command-line program to do everything for you. Of course, you have to program the features you want as well as the decision processes using MTurk's API (which is in Java), and Omar showed me all the code he wrote (located here) that I could take to use for the 'Generate Newsworthy HIT.' Apparently, once everything's set to go, a 'Generate Newsworthy HIT' will be created and released daily along with the 1000 top Wikipedia articles of the day, and something-related-with-cron-job will do this job for us at a set time every day. Chris told me it'll cost $50 per day, but he said to let him take care of the financial side of things. :)

MTurk Notes
+ RequesterServiceRaw and RequesterService (which extends RequesterServiceRaw) are the two classes you'll probably use the most
+ Anything on MTurk that you can imagine to be a class (e.g. a HIT, a Turker) is probably a class
+ Omar's Java code gives us many, many options when running it (such as 'don't approve or reject anything if set to false')
+ His code is relatively reusable and extensible, and I might not need to make any changes when using it since uploading the data and publishing the HIT (these sort of steps) are the same for every kind of HIT.
   - the layout follows: write custom script to generate file to upload + write the HIT template -> pass to Omar's Java code -> write a script to decode and analyze the data -> pass to Omar's Java code
+ There are a few things you can do with the MTurk API that you cannot do with the regular website interface, such as change what the HIT looks like once it has already been published
+ Other related links: 
    - "There's also a tool kit for interaction with Mechanical Turk, called TurKit, which you might consider researching as well:"  http://groups.csail.mit.edu/uid/turkit/ (from Chris)
    - And here's the javadoc for the Java API: http://people.csail.mit.edu/glittle/MTurkJavaAPI/ (from Omar)

Todo List
+ people are telling me to keep away from Perl; the only way to beat the enemy is to learn about it!
+ get feedback from the Generate Newsworthy HIT and improve upon it
+ create a script to approve/reject the submissions (pretty neat, first automation thing on MTurk that I learn about! I worked on something similar with Chris last winter, so I'll show the previous script written)

Ah there's so much more to write about! I gotta wait until tomorrow. :)

1 comment:

  1. you beat it, then you'll want to keep away from it. ;-) -Xuchen
