Friday, June 24, 2011

Living in a big world

Our apartment walls still smell of paint; the painter left 10 minutes ago. I'm lying on a pile of blankets and pillows in a pensive mood. Pensive about what? Probably a combination of games and traveling, a mix of life. Here's a video of a penguin I met at the beginning of freshman year at the National Aquarium in the Inner Harbor, while I reflect.


(yes, he's swimming into the glass xD)

Generate Newsworthy HIT 

1. (check) Instead of random articles on the HIT, make the HIT display articles from clusters (email BA - figured it out :D)

One thing that Chris suggested to improve the accuracy of the Turkers in rating the newsworthiness of articles was to display articles in related clusters, according to a k-means clustering script that Byung Gyu wrote. In other words, instead of giving Turkers a random assortment of 10 articles, we give them 10 articles, some of which are related to one another. If Turkers detect the connection, they will be more likely to label 'Yes, this is newsworthy' for all of the related articles, rather than dismissing some as miscellaneous and randomly popular (such as the connection between Osama Bin Laden and waterboarding). By reading in the file containing the pre-generated clusters, it was simple to create HITs according to the clumps.

Sample cluster file (2011-06-02)

I've started to add more comments to the Wikitopics code

The overall idea was to first read the cluster articles into one array, and then read the regular topics into another array if and only if they were not already in the cluster array (in Perl, instead of iterating over the cluster array multiple times to check whether an element exists, it is more efficient to build a hash out of the array, which makes lookups virtually instant: O(1)). We then shuffle the articles not in the clusters (or else they would be in alphabetical order, which I guess isn't a problem, but I wanted to maintain Byung Gyu's original behavior). With those articles in some arbitrary order, we then append the cluster articles to the array, so the clusters are maintained in their groups (the shuffle call in the print_articles method was also commented out).
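
The actual script is Perl, but here's a rough Python sketch of the same ordering logic (the function and variable names are placeholders, not the real ones from the Wikitopics code):

   import random

   def order_articles(cluster_articles, regular_topics):
       # Hypothetical sketch: shuffle the non-cluster topics, then append the clusters intact
       in_cluster = set(cluster_articles)  # O(1) membership test, like the Perl hash
       rest = [a for a in regular_topics if a not in in_cluster]
       random.shuffle(rest)                # arbitrary order instead of alphabetical
       return rest + cluster_articles      # cluster articles stay grouped together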

Yay clusters. :) Their trending scores are 2 for identification purposes.

Of course, this means that some of the clusters will be broken up. Would it be more efficient to place one cluster in each HIT? This feature can be accomplished in a reasonable amount of time. Perhaps we'll work on these details once the more important parts of the HIT are completed (HIT instructions, design, and any other areas suggested by Turkers).



2. (check) Run the fetch_sentences.py script within the generate_newsworthy.sh shell script to output individual files containing corresponding Wikipedia articles
3. (check) Edit the parallelize_serif_part.sh shell script that Byung Gyu uses for the top 1000 articles to work for the negative controls

And the recent date sentences for the negative controls are finished! The previous error I encountered, where Serif could not find some files, occurred because the file names were in a different format - specifically, URL format, where the file names contained a bunch of %__ escape characters. With a quick fix, Serif began generating the .xml files (well, .xml.xml files for some reason).
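
I didn't paste the exact fix here, but the shape of it was to normalize the escaped names before handing them to Serif - something like this hypothetical Python sketch (the directory handling and the direction of the conversion are my assumptions):

   import os
   import urllib  # Python 2; in Python 3 unquote lives in urllib.parse

   def unescape_filenames(directory):
       # Rename URL-escaped files (e.g. 'Foo%20Bar') to their plain forms ('Foo Bar')
       for name in os.listdir(directory):
           plain = urllib.unquote(name)
           if plain != name:
               os.rename(os.path.join(directory, name),
                         os.path.join(directory, plain))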


 

Uh oh, Serif crashed on me! PuTTY actually freezes during the Serif .xml file generation, though conveniently only after all the .xml files have been generated. Chris recommended that I ask Ed Loper, the developer of Serif, about this issue - I'll give it a shot! Who knew he was in our lab!

The pick_recent_xml2.py script is working

Here is the .csv output:


(before clustering) ==> (after clustering + debugging)

As you can see, there are still some missing cells. This happens when Wikipedia articles do not contain dates at all.

Note the missing value in cell BI-5

We have yet to determine whether to leave these cells empty or to feed in pre-written sentences with dates, since the negative controls will be relatively easy to spot if the right eye is looking for them (then again, if you're spending that much time analyzing the HIT, you're probably doing it right ... well, up until that point at least).

I've also added randomization to the articles, so that the 12 articles received won't always be in the order: articles needing labeling, positive control, negative control.
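
In sketch form (Python again, with made-up names rather than the actual HIT code), the mixing is just a shuffle over the combined batch:

   import random

   unlabeled = ['article_%d' % i for i in range(10)]  # the 10 articles needing labels
   controls = ['positive_control', 'negative_control']
   hit_articles = unlabeled + controls
   random.shuffle(hit_articles)  # controls no longer sit in fixed positions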


This is what the most recent draft of the HIT looks like (how many have we been through now?):


Sentence Summary HIT

I've got the .csv file generation script written and part of the highlighting functionality working for the HIT template now. Here's the .csv file generation script (called generate_summary.py). (All of this is pretty much brute force ... I'm hoping I'll learn more Python so I can make this code more efficient.)

Code here

Phew, large file

The .csv file actually looks like this because of the ~32k character limit for each Excel cell. I wonder if this will affect the HITs, since regardless of the appearance, the cells are correctly formatted as comma-separated values (a.k.a. csv). Anyways, Amazon Mechanical Turk isn't accepting the input for some currently unknown reason. Y!!
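
For what it's worth, Python's csv module quotes fields that contain commas or newlines, so the Excel display limit shouldn't change what's actually stored in the file. A minimal sketch (the column names are made up, and this isn't the actual generate_summary.py):

   import csv

   rows = [
       ('article_title', 'article_text'),
       ('Some Article', 'a very long body, full of commas,\nand even newlines'),
   ]
   f = open('summary_input.csv', 'wb')  # Python 2; in Python 3 use open(..., 'w', newline='')
   writer = csv.writer(f)               # quotes fields containing commas or newlines
   writer.writerows(rows)
   f.close()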

Quick peek into the highlighting functionality.

Problems
+ I've come to realize that a lot of the problems I run into are decision-based rather than technical. For instance, should I adhere to standard code or write potentially more efficient code? Should I spend time thoroughly reading code, or skim it and deal with problems as I encounter them (the latter is what happened with the negative controls, when Serif couldn't find some of the files)? It's an ongoing cycle of deciding between one or the other, and sometimes you'll try both choices before settling on one, or even combine both approaches.

Python
+ str.startswith({SUBSTRING}) (method to check if string 'str' starts with {SUBSTRING})
+ zip(*zip({ARRAY1}, {ARRAY2})) (the 'unzip' of the builtin zip; recovers the original sequences as tuples, not lists; more explanation; demonstrated below)
   - zip will return a list of length equal to the shorter of the two arrays (the longer one is truncated)
+ import random
   random.shuffle({ARRAY}) (shuffles the array in place into an arbitrary order)
+ urllib
   - urllib.quote({STRING}) (returns the proper URL format of {STRING} by using %xx escapes (such as the common %20 = whitespace))
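
A tiny self-contained demo of the notes above (Python 2, which is what the urllib.quote usage implies; the article names are just examples):

   import random
   import urllib

   print 'Wikipedia'.startswith('Wiki')          # True

   pairs = zip([1, 2, 3], ['a', 'b', 'c'])       # [(1, 'a'), (2, 'b'), (3, 'c')]
   nums, letters = zip(*pairs)                   # 'unzip': (1, 2, 3) and ('a', 'b', 'c')

   articles = ['Osama_bin_Laden', 'Waterboarding', 'Memorial_Day']
   random.shuffle(articles)                      # in-place, arbitrary order

   print urllib.quote('Grand Central Station')   # Grand%20Central%20Station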

Perl
+ s/#.*//;           (deletes everything from '#' to the end of the line, stripping comments; '#.*' is a regular expression)
   next if /^(\s)*$/; (skips the line if it is empty or all whitespace; more info)
+ Given @array1 and @array2, @array3 = (@array1, @array2) makes @array3 the concatenation of array1 and array2! (Simple)
+ split(/{DELIMITER}/, {STRING}) (splits {STRING} on {DELIMITER} and returns an array of the pieces)
+ @array refers to the whole array; $array[0] refers to a single element
+ $hash{{KEY}} = {VALUE} (creates or updates a hash entry)
   - exists($hash{{KEY}}) (tells whether {KEY} exists in the hash)
   - my %hash = map { $_ => {VALUE} } @array (creates a hash out of an array)
