Friday, June 24, 2011

Living in a small world

Can you believe it? It just hit me that we're a third of the way through summer. I thought I would have already learned my lesson about leaving my 'Summer Todo List' tucked away in the cabinet. *smacks head* But as I look back, I like to think that it was okay. It was okay to lie around and stay up late watching Anime and have simple time to yourself. For the longest time, I felt like classes and work were carrying me in a direction I didn't initially choose. But I feel like I have that choice now. And you know what, there's something else I have more of now, and that's confidence, to take bigger steps and reach for waiting dreams and goals. Gotta make the most of life. :)

I wanted to write this post to remember some of the things that have happened this summer:

PlayStation 3
I got a PS3. :) Wayne and I went to Towson mall and picked it up along with two games.

Games we bought ^_^ (we've had Heavy Rain since winter and have been waiting forever to play) (Final Fantasy XIII, Heavy Rain, Atelier Rorona)

The visual art book that came with 'Atelier Rorona'

'Atelier Rorona' is an Anime-based game, and as perfectly mentioned by this review, it's a 5/5 if you match this style of gaming (and you can match many). You play as a clumsy but responsible character named Rorona, and you've just been given ownership of the Alchemy Workshop by your lazy and snarky alchemy master Astrid. What's more, the king is trying to shut down that very workshop, and he's given you 12 assignments, completed through synthesis, battling, and item gathering, to prove yourself. You also complete quests for the locals to raise your friendship with them and gain the trust of the town (since Astrid's personality rubbed them the wrong way :P). Very straightforward and simple game; no complicated story plots. The personalities of the characters are also very developed and dynamic! I do cringe at some character designs whose remarks are cliché and overly dramatic; this happened a few times while playing Final Fantasy XIII, a beautiful game nonetheless. Anyhow, if you want to take a break from the action squeezers, 'Atelier Rorona' is a great fit. It tends to fall under the 'girly' category, but who cares. It's great comic relief, simple, and cute. :)

Wayne and I have only played 'Heavy Rain' for about an hour lol; I really want to spend some more time on it *drags him to the PS3*.

Steiff Silver Building
Okay this isn't really an event, I just took some pictures of the workplace. :D These pictures are from two Fridays ago (June 10th).

If you notice the USB cable, I came back to get it after forgetting it and biking away like a dummy. Also I'm not sure if that computer in the background works ... hmmm *investigates*

Pretty photograph they had hanging

APERTURE ... okay I stop

Coming back home June 20th - June 22nd
YUMMMY tart! And my Dad made us zongzi. :)




Haha, not to be a crazy cat lady (seriously), but I love my kitties so much. I think about them frequently and how they're probably doing dumb things like getting stuck in a box or meowing with toys in their mouths. It's pretty ridiculous how smart they can be, too. The minute the cages come out, the white one, Kitty, knows that it's time to get into the car, which usually means a trip to the vet; she dashes under the bed. As for the grey one, Kitten, he doesn't notice a thing at first, but soon follows suit and joins Kitty under the bed. Too silly. :P

Japan
My sister earned a scholarship to Japan!! She'll be joining YFU (Youth For Understanding) for 6 weeks and living with a host family off the coast of Japan. She brought my Canon Rebel on the trip to capture her adventure there. :) Here we are before her departure:
Yeah we were being weird at the airport as always lol

Cooking + Camera
My parents bought me a replacement Nikon camera for the time being while my sister's in Japan with the DSLR. ^^ I feel very lucky! I've been using it mostly to take pictures of food I've cooked. xD By the way, I cooked steak twice the past two weeks (and I still have one more in the freezer)! The first time it came out medium well-done, and the next time was actually worse 'cause I kept flipping it lol! That kept the insides from cooking through and let the outside dry out a bit. Why am I talking about steak.

Test picture - current state of the apartment lol, everything's jammed into the corner since we're getting the walls painted

*cough**cough*dry*cough* dumplings :D

Rice pudding!

Living in a big world

Our apartment walls still smell of paint since the painter left 10 minutes ago. I'm lying on a pile of blankets and pillows in a pensive mood. Of what? Probably a combination of games and traveling, a mix of life. Here's a video of a penguin I met at the beginning of freshman year at the National Aquarium in Inner Harbor while I reflect.


(yes he's swimming into the glass xD)

Generate Newsworthy HIT 

- (check) instead of random articles on HIT, make HIT display articles from clusters (email BA - figured it out :D)

One thing that Chris suggested to improve the accuracy of the Turkers in rating the newsworthiness of articles was to display articles in related clusters according to a script that Byung Gyu wrote using k-means clustering. In other words, instead of giving Turkers a random assortment of 10 articles, we give them 10 articles, some of which are related to one another. If Turkers detect the connection, they will be more likely to label 'Yes, this is newsworthy' for all related articles, rather than omitting some as miscellaneous and randomly popular (such as the connection between Osama Bin Laden and waterboarding). By reading in the file containing the pre-generated clusters, it was simple to create HITs according to the clumps.

Sample cluster file (2011-06-02)

I've started to add more comments to the Wikitopics code

The overall idea was that we first read the cluster articles into one array, and then read the regular topics into another array if and only if they were not already in the cluster array (in Perl, instead of iterating over the cluster array multiple times to detect whether an element exists, it's more efficient to create a hash out of the array, which makes lookups virtually instant: O(1)). We would then shuffle the articles not in the clusters (or else they would be in alphabetical order, which I guess isn't a problem, but I wanted to maintain Byung Gyu's original code). With those articles in some arbitrary order, we then appended the cluster articles to the array, meaning that the clusters stay together in their groups (the shuffle call in the print_articles method was also commented out).
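The real code lives in the Perl scripts, but here's a rough Python sketch of the same idea, with made-up article names and a made-up cluster format, just to show the grouping trick:

import random

def order_articles(cluster_articles, all_articles):
    """Sketch: shuffle the non-clustered articles, then append the clusters
    so each cluster's articles stay adjacent and land in the same HIT."""
    # Flatten the clusters into a set for O(1) membership tests
    # (same trick as the Perl hash).
    in_cluster = set(title for cluster in cluster_articles for title in cluster)

    # Regular topics that aren't already covered by a cluster.
    rest = [title for title in all_articles if title not in in_cluster]
    random.shuffle(rest)  # arbitrary order instead of alphabetical

    # Append the clusters last, keeping each group together.
    for cluster in cluster_articles:
        rest.extend(cluster)
    return rest

if __name__ == "__main__":
    clusters = [["Osama bin Laden", "Waterboarding"], ["E3 2011", "Wii U"]]
    topics = ["Lady Gaga", "Osama bin Laden", "NBA Finals"]
    print(order_articles(clusters, topics))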

Yay clusters. :) Their trending scores are 2 for identification purposes.

 Of course this means that some of the clusters will be broken up. Would it be more efficient to place one cluster in each HIT? That feature could be added in a reasonable amount of time. Perhaps we'll work on these details once the more important parts of the HIT are completed (HIT instructions, design, and any other areas suggested by Turkers).



2. (check) Run the fetch_sentences.py script within the generate_newsworthy.sh shell script to output individual files containing corresponding Wikipedia articles
3. (check) Edit the parallelize_serif_part.sh shell script that Byung Gyu uses for the top 1000 articles to work for the negative controls

And the recent date sentences for the negative controls are finished! The previous error I encountered, where Serif could not find some files, occurred because the file names were in a different format - specifically, URL format, where the file names contained a bunch of %__ escape characters. With a quick fix, Serif began generating the .xml files (well, .xml.xml files for some reason).
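The actual fix lives in the pipeline scripts, but the gist of converting between the two filename formats (in Python 2, which is what we're running) is something like this; the normalize helper is just an illustration:

import urllib

title = "Croatia national football team"

# Titles on disk sometimes show up URL-escaped, e.g. with %20 for spaces.
escaped = urllib.quote(title)        # 'Croatia%20national%20football%20team'
restored = urllib.unquote(escaped)   # back to the plain title

# Hypothetical normalization step before handing filenames to Serif:
def normalize(filename):
    return urllib.unquote(filename)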


 

Uh oh, Serif crashed on me! PuTTY actually freezes during the Serif .xml file generation, though conveniently after all the .xml files have been generated. Chris recommended that I ask Ed Loper, the developer of Serif, about this issue - I'll give it a shot! Who knew he was in our lab!

The pick_recent_xml2.py script working

Here is the .csv output:


(before clustering) ==> (after clustering + debugging)

As you can see, there are still some missing cells. This happens when Wikipedia articles do not contain dates at all.

As seen missing in cell BI-5

We still have yet to determine whether to leave these cells empty or feed in pre-written sentences with dates, since the negative controls will be relatively easy to spot if the right eye is looking for them (then again, if you're spending that much time analyzing the HIT, you're probably doing it right ... well, up until then at least).

I've also added randomization to the articles so that the 12 articles received won't always be in the order: articles needing labeling, positive control, negative control.
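The shuffle itself is tiny; here's a minimal sketch with made-up titles, where each article carries a role tag so the controls can still be graded after submission:

import random

# Hypothetical titles, just to show the shuffle step.
articles_to_label = ["Article %d" % i for i in range(10)]
positive_title = "Known newsworthy article"
negative_title = "Known non-newsworthy article"

# Tag each entry with its role, then shuffle so the controls
# don't always sit in the same positions within the HIT.
rows = [("label_me", t) for t in articles_to_label]
rows.append(("positive_control", positive_title))
rows.append(("negative_control", negative_title))
random.shuffle(rows)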


This is what the most recent draft of the HIT looks like (how many have we been through now?):


Sentence Summary HIT

I've got the .csv file generation script written and part of the highlighting functionality working for the HIT template now. Here's the .csv file generation script (called generate_summary.py). (All of this is pretty much brute force ... I'm hoping I'll learn more Python to make this code more efficient.)

Code here

Phew large file

The .csv file actually looks like this because of the ~32k character limit for each Excel cell. I wonder if this will affect the HITs, since regardless of the appearance, the cells are correctly formatted as comma-separated values (a.k.a. csv). Anyways, Amazon Mechanical Turk isn't accepting the input for some currently unknown reason. Y!!
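For reference, here's a minimal sketch of the kind of CSV writing involved (field names are made up; the real generate_summary.py does much more). The csv module handles the quoting of commas, quotes, and newlines; the ~32k limit is purely an Excel display thing:

import csv

# Hypothetical row: one Wikipedia article's title and its (long) HTML body.
rows = [
    {"title": "Example Article", "article_html": "<p>" + "x" * 40000 + "</p>"},
]

with open("summary_hit_input.csv", "wb") as f:   # 'wb' is the Python 2 convention for csv
    writer = csv.DictWriter(f, fieldnames=["title", "article_html"])
    writer.writeheader()
    for row in rows:
        writer.writerow(row)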

Quick peek into the highlighting functionality.

Problems
+ I've come to realize that a lot of the problems I run into are more decision-based than technical. For instance, should I adhere to standard code or write potentially more efficient code? Should I spend time thoroughly reading code, or skim it over and deal with the problems as I encounter them (this is what happened with the negative controls, when Serif couldn't find some of the files)? It's an ongoing cycle of deciding between one or the other, and sometimes you'll try both choices before settling on one, or even combine both approaches together.

Python
+ str.startswith({SUBSTRING}) (method to check if string 'str' starts with {SUBSTRING})
+ zip(*zip({ARRAY1}, {ARRAY2})) ('unzips' the builtin zip, giving back {ARRAY1} and {ARRAY2} as tuples; more explanation)
   - zip will return a list of length equal to the shorter of the two arrays (truncated)
+ import random
   random.shuffle({ARRAY}) (shuffles the array into an arbitrary order)
+ urllib
   - urllib.quote({STRING}) (returns the proper URL format of {STRING} by using %xx escapes (such as the common %20 = whitespace))
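A quick sanity check of the zip notes above (Python 2, where zip returns lists); nothing project-specific here:

a = [1, 2, 3]
b = ["x", "y", "z", "extra"]

pairs = zip(a, b)       # [(1, 'x'), (2, 'y'), (3, 'z')] - truncated to the shorter list
unzipped = zip(*pairs)  # [(1, 2, 3), ('x', 'y', 'z')] - the 'unzip' trick (tuples, not lists)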

Perl
+ s/#.*//;           (strips '#' and everything after it from the line, leaving comment-only lines blank; '#.*' is a regular expression)
   next if /^(\s)*$/; (skips the line if it is empty or whitespace-only; more info)
+ Given @array1 and @array2, @array3 = (@array1, @array2) makes @array3 the concatenation of the two! (Simple)
+ split(/{DELIMITER}/, {STRING}) (splits {STRING} by {DELIMITER} and returns an array of the pieces)
+ @array, $array[0] (@array refers to the whole array; $array[0] refers to a single element)
+ $hash{{KEY}} = {VALUE} (creating a hash)
   - exists($hash{{KEY}}) (returns true if {KEY} exists in the hash)
   - my %hash = map { $_ => {VALUE}} @array (creates a hash out of an array)

Monday, June 20, 2011

Wikitopics

I remember before Chris left for Oregon last week, he and I sat down not only to create a list of possible things for me to work on, but also to reiterate the grand purpose and importance of why we created a 'Generate Newsworthy HIT' or why he and I spent hours one day grabbing Python modules, beyond simply accepting that 'it's what we needed to do.' It was eye-opening to hear him tell me what our overall purpose was, and I'd like to write it here again as a prevailing reminder.

"Wikitopics is a project Ph.D. student Byung Gyu began under supervisor Chris Callison-Burch. Equipped with 3 years (and counting) data on the top 1000 most popular Wikipedia articles by pageviews, Byung Gyu and Chris sought to create a completely automated website that displayed current events all over the world from pop-culture to natural disasters and everything in between. The finished website would also possess functions to aggregate articles from other news websites beyond the sole Wikipedia articles, cluster related articles into categories beyond just the standard 'sports, politics, etc.', and provide summaries detailing why such topics are current. The overarching goal is that Wikitopics will become a prime resource to collect data, branching into other research projects in fields such as sentiment analysis and translation. The current website is located here."


I was also extremely happy to discover that Chris and I were on the same page regarding certain aspects of the project, specifically that the big dream is to have Wikitopics fully automated, while for now we resort to Turkers from Amazon Mechanical Turk for help, and that the ideal perspective to have in approaching Wikitopics is one of paying close attention to detail while maintaining the big picture. We also spoke about how to proceed with certain tasks such as the remaining HITs, and I found myself providing feedback and opinions on certain decisions with Chris eagerly listening. Overall, it was a great experience to sit down with Chris and hit these points. 

This post will cover activities related to Wikitopics last week.

Generate Newsworthy HIT

- generate the recent date sentences for the (check) positive and negative controls
     - (check) changed get_positives_article.py script to print the date of the article as well
   - (check) create bullet points for every recent date sentence extracted
   - (check) add google news link at the bottom (so that there is no ambiguity) instead of linking the entire recent date sentences
   - (check) go back to just one column of drop-downs
   - instead of random articles on HIT, make HIT display articles from clusters (email BA)
 
The positive controls are completely finished! Here is the HIT with the sentences containing the most recent dates pulled for the positive controls.


I managed to do this by (1) adding code to get_positive_articles.py to store a list of the positive control articles, then (2) changing the pick_recent_xml2.py script to include this list when generating the individual files containing the recent date sentences, and (3) finally changing generate_newsworthy.pl to read and print the articles into the .csv file. I also soon realized that I had to backtrack to store not only the titles of the articles but also the dates on which the articles were considered newsworthy according to the existing cron jobs (the behind-the-scenes activity that Byung Gyu had already implemented). This is essential in determining which folder/directory the text contents of the articles are located in.

pink: (first addition) stores the list of titles of the positive control articles
yellow: (second addition) stores the date-directory alongside the title if the date is different from the running date (which is manually fed (by you!) into the generate_newsworthy.sh script)

purple: sets the path of the list of positive control articles and confirms its existence
blue: reads in the articles and appends the title and corresponding date to the stored list (isn't it amazing how python lists can store multiple types of items? O_O)
gold: sets the variables for positive controls (you can see how it requires a specific date)
pink: sets the variables for the regular articles
red: a little check that allows me to skip the 40 minutes it takes to generate all the files if they've already been generated

blue: prints out the recent date sentences during the .csv file generation
pink: html-code for a bullet point (talked about next)
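To give a feel for the data being passed around, here's a hypothetical sketch of the (title, date) bookkeeping described above; the directory layout, file names, and example titles are all made up for illustration:

import os

RUN_DATE = "2011-06-20"          # the date manually fed into generate_newsworthy.sh
DATA_ROOT = "data/sentences"     # hypothetical root of the per-date directories

# Each entry pairs an article title with the date it was flagged as newsworthy.
positive_controls = [
    ("2011 Joplin tornado", "2011-05-23"),
    ("E3 2011", RUN_DATE),
]

# The article text lives under the directory for the date it was considered
# newsworthy, which may differ from the running date.
for title, date in positive_controls:
    path = os.path.join(DATA_ROOT, date, title + ".sentences")
    print(path)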

The next step after getting the recent date sentences was to format them for readability. The proposed idea was to bullet point each extracted sentence rather than smush the sentences together into a nonsensical paragraph. Here's how that worked:

As shown in green, the JavaScript code splits the sentences on the bullet-point character and then reinserts each one along with a break tag so that each sentence starts on a new line.

Here is the finished result:

Also notice how the 'Google News Link' has been separated from the recent date sentences

Then merging the two columns ...

Simplicity = :)

The remaining task for the 'Generate Newsworthy HIT' is the negative controls. Byung Gyu wrote an extremely helpful email detailing how to run SERIF to pull out the recent date sentences for the negative controls. With the received information, I wrote down the following steps:

1. (check) Change the get_negative_articles.py to output a list of the negative control articles just like for the positive controls
2. (debugging) Run the fetch_sentences.py script within the generate_newsworthy.sh shell script to output individual files containing corresponding Wikipedia articles
3. (waiting on step 2) Edit the parallelize_serif_part.sh shell script that Byung Gyu uses for the top 1000 articles to work for the negative controls

1.
Straightforward code to generate the list of negative articles

Appearance of the output file

2.
Generated files containing Wikipedia articles - with some mysteriously missing

3.
Note the red comments - there might be more changes to be made

You can see the produced error from the unfinished recent date sentences extraction for the negative controls:


My goal is to have the negative controls for the 'Generate Newsworthy HIT' finished by tomorrow.

[Important] Sentence HIT

   - sentences HIT (Wikipedia interface, highlight/click sentences to tell why current)

With that said, I began working on a new HIT. This one is called the '[Important] Sentence HIT' (we'll call it 'Sentence HIT' from now on), which is fed input as determined by the 'Generate Newsworthy HIT'.

Here is a dusty draft of the HIT:

If you can guess which YouTube video that is from ... (it's a song)
green: yep, we're going to make video instructions :D
red: this is where the text instructions will go

As indicated in the 'Sentence HIT', we want Turkers to choose the one sentence from the Wikipedia article that best illustrates its importance on a specific date (the ${date} variable will be fed into this HIT template similarly to the 'Generate Newsworthy HIT'). Turkers will be able to click on one sentence from the entire article, which will then be highlighted to show that it has been selected. There is an existing 'Highlight Dialectal Arabic' HIT that I have been studying to reproduce the highlight function. Here is a screenshot of that HIT:

(My mouse is over ${srcSeg1}) As you can see, the variable turns green when I mouse over it. If I click it, it will turn yellow.

MTurk Dropbox

+ (check - and ongoing) manage DropBox for all the previous HIT batches (because Amazon is deleting them after a certain amount of days!!)

As noted, Amazon will be deleting the results collected from Amazon Mechanical Turk after a certain number of days. I remember when Chris gave me this assignment, he looked at me and said that a few of them are worth a few thousand dollars ... and my eyes went O_O

HITs that will disappear :(

But no fear! That is where Dropbox comes in.

Yay we're saved!

Took a few hours to save and double check all these files but you could say it was worth it. :)

HTML
+ HTML codes (&#149; stands for the bullet point)
+ <span> (no visual change unless indicated by class/id; good tutorial)
+ <hr /> (introduces a horizontal rule as shown below)

+ empty elements (for XHTML; ends in ' />' (space, slash, angle bracket) for elements that cannot contain content such as <br /> or <hr />. You cannot do <a href="{URL}" />, you have to do <a href="{URL}"></a>; good tutorial)

jQuery
+ .select() (bind event handler to when user selects something; jQuery doc)
+ .click() (binds to when user clicks something)
+ $(document).ready() (runs the given function once the DOM is ready; need to read more)

Problems Faced
Chris encouraged me to keep track of any problems I faced so here are more obstacles I encountered the past week:
+ The first noticeable struggle was that upon picking up the recent date sentences for the positive controls to work on, I could not figure out where to start, as I had no idea where I had last ended. I remember the first thing I did was simply look for evidence of the recent date sentences for the positive controls - any files that I was sure I had last worked on - and try to create a map of how all the scripts depended on each other. I think my main problem was that there were just so many processes going on that I couldn't make heads or tails of it, literally. Eventually, I did manage to stumble upon the correct file that I had previously overlooked. There are a few solutions I came up with to prevent situations like these from happening again:
   - (will do) create a visual tree of all the scripts, which will show the dependencies and overall start-to-finish route
   - (will do) simply write a note where you last ended for the day!
Additionally, the decisions I made throughout the situation that really helped me find my way were to keep asking, are there any new files I haven't looked at yet?, and to make basic deductions about scripts I knew couldn't be where I last left off.  This situation reminds me of a treasure hunt or something. :P
+ Similar to the problem of not knowing where I last left off, the next problem I encountered was not knowing how or where to start coding the highlighting function for the Sentence HIT. While I had the 'Highlight Dialectal Arabic' HIT to work off of, it was quite overwhelming to even figure out how the functions worked with each other since everything was new. Additionally, the way the 'Highlight Dialectal Arabic' HIT works is not exactly how the 'Sentence HIT' will work (of course). It was a push-and-pull situation between sticking to studying the existing HIT or starting from scratch. I decided to do a combination of both: first outline how I would do the HIT, research relevant methods and look on forums/Stack Overflow, and then break down the JavaScript functions written in the 'Highlight Dialectal Arabic' HIT if I hit a dead end. We'll see how it works out tomorrow. ^^ This also leads into the final problem that prevailed throughout all the activities.
+ The third major problem I ran into was making the decision between fitting my code to existing standards or making my own standards (that I think would work better) and fitting the existing standards to mine. The largest consideration is: how will people best be able to use my code in the future? I usually make a decision that combines both properties, but I have yet to figure out any patterns or final decisions on the matter.
+ I also wanted to include that it's a very good idea to be on the lookout for any new concepts that you should research. It's seemingly common sense, but you would not imagine how easy it is to just brute-force or work your way around foreign code/topics (just like vocabulary). I know that there are a lot of new things to learn at the start, but I assure you the extra effort is worth it.

I'm ending this post with the song I've been playing many times over the past week. =)

Saturday, June 11, 2011

The Drawing Board

I'm on the train back home to Pennsylvania now; it's my first time taking Amtrak. I'm looking through the window at the outside scenery as Wayne suggested that I do. It's completely different from a regular car ride, where you're surrounded by other cars, houses and cities. Even though there are hundreds of strange faces around you, it is completely silent. Instead, you feel like you're by yourself, leveled with the infinite trees and passing green. Surreal - that's what you call it.

I'm coming back home to take care of a few things: (we're riding past an enormous lake now, and there's nothing but trees surrounding us on both sides) groceries, furniture, living essentials, as well as creating an ideal schedule for the rest of the summer. It's about time I set things right; our apartment, even after promises from the landlord to clean up the place by last weekend, is still covered in the previous tenants' trash and spoils. The kitchen repairs are an unfinished mess, the locks are broken, and there are stains covering the walls, especially in the bathroom. We can't even bring in furniture or unpack because the repairs and cleaning have yet to be done. You can imagine that my mom was furious when I told her what sort of living conditions we were in; it seemed like she wanted to give the landlord a piece of her mind. She's a tough mom, and I've learned a lot from her about responsibility and taking things into your own hands. That's why we're not going to wait anymore for someone to clean up the place; we'll do it ourselves. ;)

Things I need to bring back
+ (check) vacuum cleaner
+ (will buy tomorrow, Friday, at IKEA) beds, mattresses, sheets
+ (will buy) desks, (check) chairs, (check) lamps
+ (check) dust wipes and (check) stain remover
+ (check) air freshener
+ (check) sponges and mops
+ posters, (check) lights, plants (you wouldn't believe how much of a difference plants add)
+ (check) laundry detergent, (check) towels
+ pencils, pens, paper
+ (check) cute things ^.^ (just because I like :D)
+ (check) movies and video games! (I find that just having these around adds to quality of living .. somehow)

Additionally, now that the kitchen should soon be refurbished (and if not, there's a stove and oven in the lounge room), I'd like to start cooking. We've been living on cold cut sandwiches and ramen for way too long now (not that I'm complaining about the ramen).

Kitchen things to bring back
+ (check) saucepan
+ (check) pot
+ (check) rice cooker
+ (check) kettle
+ (check) spatula, (check) ladles, (check) forks/spoons/knives, (check) chopsticks
+ (check) regular plates and cups
+ (check) strainers
+ (check) cutting boards
+ (check) KNIVES muahahhaa (jk o_o)
+ (check) oven mittens (and apron? dunno)
+ (check) trays, (check) waxed sheets, (check) Pan
+ (check) measuring cups
+ (check) other assorted baking containers for cuter and sweeter things :)

Food to bring back
+ (check) flour, (check) (brown, white, and confectioner) sugar, (check) eggs
+ (check) vanilla/almond extract
+ (check) butter, cream, (check) milk
+ chocolate, cocoa powder, (check) chocolate chips
+ (check) baking powder, (check) custard powder
+ mint leaves and pineapple (In grade school, I remember making mint pineapple juice for my classmates, and it turns out it was pretty good! Time to test it again)
+ (check) rice
+ (check) soy sauce, (check) rice vinegar, (check) sesame oil, (check) oil
+ (check) salt, (check) pepper, cinnamon, spices
+ (check) dried noodles, tomato sauce, basil, cheeses
+ (check) raw beef, (check) chicken, pork
+ (check) spinach, bok choy, green beans, other Asian vegetables I don't know the names of x)
+ onion, garlic, (check) mushroom, sweet potato
+ (check) blueberries, (check) strawberries, (check) bananas, (check) apples, (check) oranges
+ (check) bread (lots!), (check) cereal, (check) orange juice
+ chicken broth (you can make it by boiling chicken bones and meat in hot water :))


(closes laptop to get off the train)

Recent work on the Wikitopics project:

With the goal to improve the Generate Newsworthy HIT underway, our most recent addition is a column that displays the k-sentences containing the most recent dates from the corresponding Wikipedia page since we found that Wikipedia pages are regularly updated when a notable event directly related to the page occurs. The updates do not happen for all Wikipedia pages, but for the ones that do, Turkers are able to distinguish the newsworthy articles on the spot. Here is what the 'Generate Newsworthy HIT' looks like now:

The recent changes are in gold (Missing sentences for positive and negative controls)

The first step towards generating the k-sentences containing the most recent dates was to understand Byung Gyu's pick_recent_xml.py script. After installing a few python modules and rearranging data:


This script outputs the sentence containing the most recent date; what we want is to change the script to output the k most recent.

Error (start)
I actually had a misunderstanding with the instructions, so what I had originally done was output the sentence with the most recent date in addition to the k-surrounding sentences (oops sorry Chris!) Here's a peek into that. >.>

Outputting the line number and the sentences

Here's how that script worked

Error (finished)

Okay! So you can ignore everything in that section. :D Here's how I extracted the k-sentences containing the most recent dates (there's a rough sketch of the idea after the color notes below).


blue: confirms the proper format to run the script
pink: sets the paths and prepares iteration over multiple files
gold: initializes variables containing sentences with dates
orange: writes the k-sentences containing the most recent dates to the variable 'result'
purple: writes 'result' to a file called [ARTICLE].sentences and closes the file
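And as promised, a stripped-down sketch of the core idea (the real pick_recent_xml2.py parses Serif's .xml output and iterates over many files; the example sentences and dates here are made up):

import datetime

K = 3  # number of most-recent-date sentences to keep

# Hypothetical input: (date, sentence) pairs pulled out of one article;
# sentences without any date never make it into this list.
dated_sentences = [
    (datetime.date(2011, 5, 1), "In May 2011, the team won the cup."),
    (datetime.date(2011, 6, 20), "On June 20, 2011, the stadium reopened."),
    (datetime.date(2010, 8, 9), "The club was refounded in August 2010."),
]

# Sort by date, newest first, and keep the top k.
dated_sentences.sort(key=lambda pair: pair[0], reverse=True)
result = "\n".join(sentence for _, sentence in dated_sentences[:K])

# The real script writes 'result' to a file called [ARTICLE].sentences.
print(result)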

Directory where the files are stored

Once pick_recent_xml2.py is run through the generate_newsworthy.sh shell script, we also modify generate_newsworthy.pl to read the corresponding files and write the sentences containing recent dates to the .csv file.

Works for just a few articles

Works for all articles
 
Side note: generate_newsworthy.sh takes 40 minutes to run o_o ... we'll have to see if we can fix that

Here's another run of the graded-wikitopics.py script now that we have a few more submissions from Turkers:

Turkers who failed the evaluation

Shows the Approve/Reject decisions
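I won't paste graded-wikitopics.py here, but the grading boils down to something like this sketch: compare each Turker's answers on the control articles to the expected labels and approve or reject based on accuracy (the threshold, worker names, and answers below are all made up):

# Hypothetical per-Turker answers on the control articles only.
# Each entry: (expected label, label the Turker gave)
control_answers = {
    "WORKER_A": [("yes", "yes"), ("no", "no"), ("yes", "yes")],
    "WORKER_B": [("yes", "no"), ("no", "yes"), ("yes", "no")],
}

PASS_THRESHOLD = 0.75  # made-up cutoff

for worker, answers in control_answers.items():
    correct = sum(1 for expected, given in answers if expected == given)
    accuracy = float(correct) / len(answers)
    decision = "Approve" if accuracy >= PASS_THRESHOLD else "Reject"
    print("%s %.2f %s" % (worker, accuracy, decision))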

With the recent date sentences extracted for the articles needing labeling, here is my progress so far, which includes the next steps. (Chris and a few other researchers will be leaving for Oregon for the next two weeks, so we created a huge list of possible things to do.)

Todo-List
+ Complete Generate Newsworthy HIT
   - (check) combine the 'Google News' and 'Recent sentences' column
   - (check) change the width of the lower table to 1000px instead of 800px
   - (check) change the width of the last two columns to 300px each
   - generate the recent date sentences for the positive and negative controls
     - changed get_positives_article.py script to print the date of the article as well
   - create bullet points for every recent date sentence extracted
   - add google news link at the bottom (so that there is no ambiguity) instead of linking the entire recent date sentences
   - go back to just one column of drop-downs
   - instead of random articles on HIT, make HIT display articles from clusters (email BA)
   - sentences HIT (Wikipedia interface, highlight/click sentences to tell why current)
+ manage DropBox for all the previous HIT batches (because Amazon is deleting them after a certain amount of days!!)
+ JQuery Cookies
   - cookie to automatically fill out the 'Age, Location' items
   - cookie to collapse the instructions if done once
+ create a third parallel file with citations -> name, date, link citation (wpextractor parses xml/wikimarkup) (interesting sentences with references to articles)
    - wget to get the link citation, 'Beautiful Soup' pulls out the text from the html (python), and then you run nltk to sentence-split -> serif on the results (marks the date + coreference solutions, markup names of people and organizations and generates parse trees) (see the sketch after this list)
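That last item is still just a plan, but the fetch-and-split part might look roughly like this (Python 2, assuming the urllib2, Beautiful Soup (bs4), and nltk packages are available; the citation URL is a placeholder):

import urllib2
import nltk
from bs4 import BeautifulSoup   # assuming the bs4 flavor of Beautiful Soup

url = "http://example.com/some-cited-news-article"   # placeholder citation link

# Fetch the cited page (the plan says wget; urllib2 does the same job here).
html = urllib2.urlopen(url).read()

# Pull the visible text out of the HTML.
soup = BeautifulSoup(html)
text = soup.get_text()

# Split into sentences; these would then be handed off to Serif.
# (nltk.sent_tokenize needs the 'punkt' data downloaded once via nltk.download())
sentences = nltk.sent_tokenize(text)
for s in sentences[:5]:
    print(s)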

Miscellaneous
+ (check) Download DropBox and accept Chris' past MTurk files
  - For some reason I happened upon a case of the 'Malware Protection' virus right after this .. o_o All my windows closed, and tray popups were flying onto the screen like crazy. Of course, a 'System Scan' window popped up, so I tried to access the internet, and then realized that the virus (it's called a rogue, I believe) would not let me open any other applications besides itself - this was also the case even after restarting. So after restarting in safe mode for the first time in a thousand years (really, I've never used it before so I gave it a shot), the virus surprisingly wasn't making any active attacks. I ran Malwarebytes' Anti-Malware and Spy++ ASAP and flushed it out of the system - everything's perfect now! Best free software ever.
    I looked on the internet to see if downloading DropBox caused this, but it doesn't seem like there's a connection - DropBox should be completely safe. It's weird ... I'll have to run system scans more often now - you should go run one now (and download Malwarebytes' Anti-Malware/Spy++ if you haven't already. I swear they look fishy, but they're on your side. :))

Python
+ variable = [{FORMAT} for {EVERY_ELEMENT} in {THIS_SET} {CONDITIONS}] (list comprehension, a quick way to write lists)
+ dict({LIST}) (creates a python dictionary from the list inside)
+ utils.convert_date({STRING}) (converts string that represents a date to a datetime object)
+ for i, a in enumerate({LIST}) (i stands for the index and a stands for {LIST}[i])
+ for a, b in zip({LIST_A}, {LIST_B}) (iterates over two lists in parallel)
+ for i, (a, b) in enumerate(zip({LIST_A}, {LIST_B})) (i stands for the index, a stands for {LIST_A}[i], and b stands for {LIST_B}[i])
   + supposedly faster way: for i, a, b in izip(count(), {LIST_A}, {LIST_B})
      
   from itertools import izip, count
   alist = ['a1', 'a2', 'a3']
   blist = ['b1', 'b2', 'b3']

   for i, a, b in izip(count(), alist, blist):
      print i, a, b
   ------------------------------------------
   >>> def foo():
   ...  for i, x, y in izip(count(), a, b):
   ...   pass
   ...
   >>> def bar():
   ...  for i, (x, y) in enumerate(zip(a, b)):
   ...   pass
   ...
   >>> delta(foo)
   0.0213768482208
   >>> delta(bar)
   0.180979013443 
   (source)  


+ lambda (creating anonymous functions (not bound to a name))
+ min({LIST}, key={ARBITRARY_FUNCTION}) (finds the minimum element of a list, where key determines how elements are compared)
+ {LIST}.remove({ELEMENT}) (removes an element from a list)

Perl
+ $counter++; (just like Java, Perl has the '++' shortcut, and statements need a semicolon)

SERIF
+ marks the date + coreference solutions (meaning that SERIF can match pronouns to their actual subjects, superscript references ...)
+ markup names of people and organizations and generates parse trees 

History behind the 'Hello World' tradition (from Wikipedia)
The first known instance of the words "hello" and "world" being used together in computer literature is in Kernighan's 1972 Tutorial Introduction to the Language B,[1] with the following code:
main( ) {
  extrn a, b, c;
  putchar(a); putchar(b); putchar(c); putchar('!*n');
}
a 'hell';
b 'o, w';
c 'orld';