Monday, June 20, 2011

Wikitopics

I remember that before Chris left for Oregon last week, he and I sat down not only to create a list of possible things for me to work on, but also to revisit the larger purpose behind the work: why we created the 'Generate Newsworthy HIT', or why he and I spent hours one day grabbing Python modules, beyond simply accepting that 'it's what we needed to do.' It was eye-opening to hear him explain our overall purpose, and I'd like to write it here again as a standing reminder.

"Wikitopics is a project Ph.D. student Byung Gyu began under supervisor Chris Callison-Burch. Equipped with 3 years (and counting) data on the top 1000 most popular Wikipedia articles by pageviews, Byung Gyu and Chris sought to create a completely automated website that displayed current events all over the world from pop-culture to natural disasters and everything in between. The finished website would also possess functions to aggregate articles from other news websites beyond the sole Wikipedia articles, cluster related articles into categories beyond just the standard 'sports, politics, etc.', and provide summaries detailing why such topics are current. The overarching goal is that Wikitopics will become a prime resource to collect data, branching into other research projects in fields such as sentiment analysis and translation. The current website is located here."


I was also extremely happy to discover that Chris and I were on the same page regarding certain aspects of the project: specifically, that the big dream is to have Wikitopics fully automated, while for now we rely on Turkers from Amazon Mechanical Turk for help, and that the ideal mindset for approaching Wikitopics is to pay close attention to detail while keeping the big picture in view. We also spoke about how to proceed with certain tasks, such as the remaining HITs, and I found myself providing feedback and opinions on certain decisions, with Chris listening eagerly. Overall, it was a great experience to sit down with Chris and talk through these points.

This post will cover activities related to Wikitopics last week.

Generate Newsworthy HIT

- generate the recent date sentences for the positive (check) and negative controls
   - (check) changed the get_positive_articles.py script to print the date of the article as well
   - (check) create bullet points for every recent date sentence extracted
   - (check) add a Google News link at the bottom (so that there is no ambiguity) instead of linking the entire recent date sentences
   - (check) go back to just one column of drop-downs
   - instead of random articles on the HIT, make the HIT display articles from clusters (email BA)
 
The positive controls are completely finished! Here is the HIT with the sentences containing the most recent dates pulled for the positive controls.


I managed to do this by (1) adding code to get_positive_articles.py to store a list of the positive control articles, then (2) changing the pick_recent_xml2.py script to include this list when generating the individual files containing the recent date sentences, and (3) finally changing generate_newsworthy.pl to read the sentences and print them into the .csv file. I also soon realized that I had to backtrack and store not only the titles of the articles but also the dates on which the articles were considered newsworthy according to the existing cron jobs (the behind-the-scenes activity that Byung Gyu had already implemented). The date is essential for determining which folder/directory the text contents of each article are located in.

pink: (first addition) stores the list of titles of the positive control articles
yellow: (second addition) stores the date-directory alongside the title if the date is different from the running date (which is manually fed (by you!) into the generate_newsworthy.sh script)

purple: sets the path of the list of positive control articles and confirms its existence
blue: reads in the articles and appends the title and corresponding date to the stored list (isn't it amazing how Python lists can store multiple types of items? O_O)
gold: sets the variables for positive controls (you can see how it requires a specific date)
pink: sets the variables for the regular articles
red: a little check that allows me to skip the 40 minutes it takes to generate all the files if they've already been generated

blue: prints out the recent date sentences during the .csv file generation
pink: HTML code for a bullet point (discussed next)

The next step after getting the recent date sentences was to format them for readability. The proposed idea was to bullet-point each extracted sentence rather than smushing the sentences together into a nonsensical paragraph. Here's how that worked:

As shown in green, the JavaScript code splits the sentences at the bullet-point character and then reinserts it, along with a break tag, so that each sentence starts on a new line.
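
Here is a rough sketch of that splitting logic, assuming the recent date sentences arrive as a single string joined by the '&#149;' entity (the function name and variable names below are my own illustration, not the exact code from the HIT template):

    // Minimal sketch: split the combined string on the bullet entity that
    // generate_newsworthy.pl inserted, then rebuild it so each sentence
    // gets its own bullet and line break.
    function formatRecentSentences(raw) {
        var pieces = raw.split('&#149;');
        var formatted = '';
        for (var i = 0; i < pieces.length; i++) {
            var sentence = pieces[i].replace(/^\s+|\s+$/g, ''); // trim whitespace
            if (sentence.length === 0) continue;                // skip empty pieces
            formatted += '&#149; ' + sentence + '<br />';
        }
        return formatted;
    }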

Here is the finished result:

Also notice how the 'Google News Link' has been separated from the recent date sentences.

Then merging the two columns ...

Simplicity = :)

The remaining task for the 'Generate Newsworthy HIT' is the negative controls. Byung Gyu wrote an extremely helpful email detailing how to run SERIF to pull out the recent date sentences for the negative controls. With the received information, I wrote down the following steps:

1. (check) Change the get_negative_articles.py to output a list of the negative control articles just like for the positive controls
2. (debugging) Run the fetch_sentences.py script within the generate_newsworthy.sh shell script to output individual files containing corresponding Wikipedia articles
3. (waiting on step 2) Edit the parallelize_serif_part.sh shell script that Byung Gyu uses for the top 1000 articles to work for the negative controls

1.
Straightforward code to generate the list of negative articles

Appearance of the output file

2.
Generated files containing Wikipedia articles - with some mysteriously missing

3.
Note the red comments - there might be more changes to be made

You can see the error produced by the still-unfinished recent date sentence extraction for the negative controls:


My goal is to have the negative controls for the 'Generate Newsworthy HIT' finished by tomorrow.

[Important] Sentence HIT

   - sentences HIT (Wikipedia interface, highlight/click sentences to tell why current)

With that said, I began working on a new HIT. This one is called the '[Important] Sentence HIT' (we'll call it the 'Sentence HIT' from now on), and its input is determined by the 'Generate Newsworthy HIT'.

Here is a dusty draft of the HIT:

If you can guess which YouTube video that is from ... (it's a song)
green: yep, we're going to make video instructions :D
red: this is where the text instructions will go

As indicated in the 'Sentence HIT', we want Turkers to choose the one sentence from the Wikipedia article that best illustrates the article's importance on a specific date (the ${date} variable will be fed into this HIT template just as in the 'Generate Newsworthy HIT'). Turkers will be able to click on one sentence from the entire article, which will then be highlighted to show that it has been selected. There is an existing 'Highlight Dialectal Arabic' HIT that I have been studying in order to reproduce the highlight function. Here is a screenshot of that HIT:

(My mouse is over ${srcSeg1}.) As you can see, the variable turns green when I mouse over it. If I click it, it will turn yellow.
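
For my own notes, here is a rough jQuery sketch of how I imagine the 'Sentence HIT' highlighting could work; this is not the actual code from the 'Highlight Dialectal Arabic' HIT, and the class names and hidden input are assumptions of mine:

    // Assume every sentence of the article is wrapped in <span class="sentence">.
    $(document).ready(function() {
        // Tint a sentence green while the mouse is over it.
        $('.sentence').hover(
            function() { $(this).addClass('hover-green'); },
            function() { $(this).removeClass('hover-green'); }
        );
        // Clicking marks exactly one sentence yellow and records the choice
        // in a (hypothetical) hidden form field so the HIT can submit it.
        $('.sentence').click(function() {
            $('.sentence').removeClass('selected-yellow');
            $(this).addClass('selected-yellow');
            $('#chosenSentence').val($(this).text());
        });
    });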

MTurk Dropbox

+ (check - and ongoing) manage the Dropbox for all the previous HIT batches (because Amazon is deleting them after a certain number of days!!)

As noted, Amazon will be deleting the results collected from Amazon Mechanical Turk after a certain number of days. I remember when Chris gave me this assignment, he looked at me and said that a few of these batches are worth a few thousand dollars ... and my eyes went O_O

HITs that will disappear :(

But no fear! That is where Dropbox comes in.

Yay we're saved!

It took a few hours to save and double-check all these files, but you could say it was worth it. :)

HTML
+ HTML codes (&#149; produces a bullet point)
+ <span> (no visual change unless styled via a class/id; good tutorial)
+ <hr /> (inserts a horizontal rule, as shown below)

+ empty elements (for XHTML; elements that cannot contain content, such as <br /> or <hr />, end in ' />' (space, slash, angle bracket). You cannot do <a href="{URL}" />, you have to do <a href="{URL}"></a>; good tutorial)

jQuery
+ .select() (binds an event handler to the select event, fired when the user selects text inside a text input or textarea; jQuery doc)
+ .click() (binds an event handler to the click event, fired when the user clicks an element)
+ $(document).ready() (runs code once the DOM has finished loading; still need to read up on this)
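
A tiny example tying these together (the element IDs here are made up, just to remind myself of the syntax):

    $(document).ready(function() {             // wait for the DOM to load
        $('#answerBox').select(function() {    // fires when text in the box is selected
            console.log('text selected');
        });
        $('#submitButton').click(function() {  // fires when the button is clicked
            console.log('button clicked');
        });
    });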

Problems Faced
Chris encouraged me to keep track of any problems I face, so here are more obstacles I encountered this past week:
+ The first noticeable struggle was that, upon picking the recent date sentences for the positive controls back up, I could not figure out where to start because I had no idea where I had last left off. The first thing I did was simply look for evidence of the recent date sentences for the positive controls - any files that I was sure I had last worked on - and try to create a map of how all the scripts depended on each other. I think my main problem was that there were just so many processes going on that I couldn't make heads or tails of it. Eventually, I did manage to stumble upon the correct file that I had previously overlooked. There are a few solutions I came up with to prevent situations like this from happening again:
   - (will do) create a visual tree of all the scripts, which will show the dependencies and overall start-to-finish route
   - (will do) simply write a note where you last ended for the day!
Additionally, the decisions I made throughout the situation that really helped me find my way were to keep asking, 'Are there any new files I haven't looked at yet?' and to make basic deductions about which scripts I knew couldn't be where I last left off. This whole situation reminded me of a treasure hunt or something. :P
+ Similar to the problem of not knowing where I last left off, the next problem I encountered was not knowing how or where to start on coding the highlighting function for the Sentence HIT. While I had the 'Highlight Dialectal Arabic' HIT to work from, it was quite overwhelming to even figure out how the functions worked together, since everything was new. Additionally, the way the 'Highlight Dialectal Arabic' HIT works is not exactly how the 'Sentence HIT' will work (of course). It was a push-and-pull situation between sticking to studying the existing HIT and starting from scratch. I decided to do a combination of both: first outline how I would do the HIT, research relevant methods and look on forums/Stack Overflow, and then break down the JavaScript functions written in the 'Highlight Dialectal Arabic' HIT if I hit a dead end. We'll see how it works out tomorrow. ^^ This also leads into the final problem that prevailed throughout all of these activities.
+ The third major problem I ran into was deciding between fitting my code to existing conventions and creating my own conventions (that I think would work better) and making the existing code fit them. The biggest consideration is: how will people best be able to use my code in the future? I usually make a decision that combines both approaches, but I have yet to settle on any patterns or final rules on the matter.
+ I also wanted to note that it's a very good idea to be on the lookout for any new concepts that you should research. It seems like common sense, but you would not believe how easy it is to just brute-force or work your way around unfamiliar code and topics (just like vocabulary). I know there are a lot of new things to learn at the start, but I assure you the extra effort is worth it.

I'm ending this post with the song I've been playing many times over the past week. =)
