Monday, July 11, 2011

New Discoveries

Oof it's going to be difficult remembering everything that happened these past two weeks. That's what I get for not writing thoughts down. :P The first thing that comes to mind is Bitcoins!

Logo is still changing

I came across it a month or so ago on Reddit, and by chance I bumped into it again through another link; that was when I began to look into it. In simple terms, Bitcoin is a digital currency, and its most notable characteristic is that no central authority controls it, unlike conventional currencies such as the U.S. dollar. Bitcoins are stored in a 'wallet.dat' file on your computer and transferred electronically from one account to another using 'keys' represented by strings of numbers and letters. Additionally, Bitcoins are generated through a hashing algorithm by ordinary users like you and me on our own computers. This is called mining. The difficulty of the hashing problem adjusts so that a block of 50 BTC (Bitcoins) is generated roughly every 10 minutes regardless of the number of miners; this prevents Bitcoin inflation or excess 'printing'. Currently, Bitcoin is still very young, with an estimated total of fewer than 100,000 users. However, it's already starting to make its way into people's everyday lives, including coffee shop sites and third-party transactions for sites like Newegg, eBay, and Amazon. Probably the largest objective for the Bitcoin community today is to get more people interested in Bitcoin and to get businesses to begin accepting it for their products and services. The current conversion rate is 1 BTC for $14.40. Here are some further resources on Bitcoin (and a toy mining sketch follows the list):

  1. Introductory video and website
  2. Bitcoin Forum and Bitcoin on Reddit
  3. Bitcoin Mining Guide
  4. Building a Bitcoin Mining Rig
  5. Bitcoin Prices
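
To make the 'hashing' part a little more concrete, here's a toy Python sketch of the proof-of-work idea. Real mining hashes block headers with double SHA-256 against a much harder difficulty target, so treat everything here (the function name, the 'difficulty as leading zeros' shortcut) as a simplification I made up for illustration:

    import hashlib

    def toy_mine(block_data, difficulty=4):
        """Find a nonce so the double SHA-256 hash of (data + nonce) starts
        with `difficulty` zero hex digits -- a toy stand-in for Bitcoin's
        real difficulty target."""
        nonce = 0
        while True:
            payload = ("%s%d" % (block_data, nonce)).encode("utf-8")
            digest = hashlib.sha256(hashlib.sha256(payload).digest()).hexdigest()
            if digest.startswith("0" * difficulty):
                return nonce, digest
            nonce += 1

    nonce, digest = toy_mine("previous-block-hash + transactions")
    print("nonce=%d hash=%s" % (nonce, digest))

The more leading zeros you demand, the longer the loop runs, which is exactly why miners throw GPUs at it (more on that below).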

.... and in the midst of the entire Bitcoin fascination, I decided to learn how to build a computer!!! That was the biggest accomplishment of the past couple days - it must have taken hours at a time watching videos and reading guides on forums and websites about the process and components. I've wanted to learn how to build a computer since I'm majoring in C.S., and I'm very glad to have at least the basics and the overall concept in mind. I'd like to write a guide in the near future to help anyone else interested in building computers (seriously, whether you game or not, knowing how to build a computer gives you a huge advantage in customizing the features you want - it's something you'd otherwise have to pay a computer-savvy person to do, and believe me, it pays off). Here are some short notes that I took on the process:

  • ATX - Advanced Technology eXtended (motherboard)
  • AMD - Advanced Micro Devices
  • PCI Express 2.0 x16 - For GPUs
  • Asus vs MSI -> MSI
  • overclocking
  • SDRAM vs RAM

Yeah, it looks like a bunch of mumbo jumbo at first, but no fear, this is just plain English, not some foreign language. I'm getting excited to write the guide now. :D

The reason for going through all this trouble is that in order to mine Bitcoins, you need the proper hardware, especially GPUs (Graphics Processing Units), a.k.a. video cards, since they do the actual hashing/mining. Yep, this is for mining Bitcoins! It might be fun to get into. :)

The Sapphire Radeon HD 5830 (an example GPU)

Monday was the 4th of July, and I got to see fireworks with my family on Sunday and with Wayne and friends on Monday. :) You should have seen the smiles during the entire half hour the fireworks went off in the Inner Harbor. Later that Monday evening, we discovered that there had been a report of violence during the fireworks while we were still there!! I've been lucky to grow up in safe neighborhoods, and this is the first time the lightbulb has gone off that it's truly necessary to be aware and stay safe. Be safe!! Seriously :P

I hopped on the Amtrak train on Saturday, July 9th to visit my family in Pennsylvania and my best friend Kathie in New Jersey for her birthday party!! We played Cranium, ate lots of pesto pizza and cake, and then played a few rounds of mahjong. I have no idea how the time went by so quickly. When I took a second to look at the clock, four hours had already passed! It was a simple get-together and I really loved hanging out with Kathie, and I also got to meet her awesome college friends, Regina and Deep. My parents and I then drove back around 9pm, and I chatted with Wayne. He gifted me my first game on Steam, called Worms Reloaded, which has a similar structure to Gunbound (you and the opponent are on a 2D, hilly map and you can choose from a variety of weapons to damage the opponent), and I played that for two hours before falling asleep. I returned to Johns Hopkins and peeled crabs with Wayne for dinner, battled him in Worms Reloaded and WON, and fell asleep.

Natural Language Processing
The CLSP summer school began two weeks ago and ended last Friday. Each day consisted of two lectures from 9am - 12pm on topics in computing such as computer vision and, of course, natural language processing. It was a great mix of introductory to advanced content presented by passionate and well-known lecturers, including Ben Van Durme, one of my supervisors at the lab, and Jason Eisner, my Declarative Methods (and future Natural Language Processing) professor. When I look back at the experience, I wish I could have recorded the lectures because some of the material definitely flew past me, and it would have been nice to rewatch and piece the lectures together. I'm looking forward to the new contact lenses being developed, which may conveniently possess this recording ability. :)

The second half of the day was devoted to mini-projects such as jQuery cookies, capturing references on Wikipedia pages (more info on this later), and researching about Twitter Firehose.

Notes about Twitter Firehose, a service which provides a constant stream of tweets to developers (so yes, if you tweeted about your boyfriend or your cat, some researcher or developer probably has it).

Twitter Streams

Firehose (100%): Google paid $15 million, Microsoft paid $10 million (released to a select 15-20 developers)
Halfhose (50%): $360k/year (estimate)
Decahose (10%): $60k/year (estimate)
Gardenhose (10%): Free [no longer available, as indicated here:
                        http://captico.com/the-twitter-firehose-is-up-for-grabs-if-you-have-the-cash/2010/11
                        http://www.thegurureview.net/tag/garden-hose
                   Gardenhose users were then transferred to Decahose on Gnip]
Spritzer (2%): Free

More information found here: http://gnip.com/twitter

'Contact info@gnip.com for pricing and details'
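
For the curious, the free Spritzer tier was just an HTTP stream of JSON tweets that you could read with an ordinary HTTP library. Here's a rough Python sketch using the requests library; the endpoint URL and the basic-auth login are my recollection of the 2011-era API, so treat them as assumptions rather than a recipe:

    import json
    import requests

    # Assumed 2011-era sample ("spritzer") endpoint with HTTP basic auth
    STREAM_URL = "https://stream.twitter.com/1/statuses/sample.json"

    response = requests.get(STREAM_URL, auth=("username", "password"), stream=True)
    for line in response.iter_lines():
        if not line:
            continue  # keep-alive newlines
        tweet = json.loads(line)
        if "text" in tweet:
            print(tweet["text"])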

But most of all, the second half of each day was devoted to the 'Sentence Summary HIT'. There were two main revisions:

1st revision

2nd revision

Just to give a quick overview of what the 'Sentence Summary HIT' is: we developed this HIT to ask actual people to help us isolate the sentences of a Wikipedia article that explain why that page received so many page views - was it a newly premiered video game? Was it the death of a celebrity? Here is an example:

This is an example of what a submission for the 'Sentence Summary HIT' should look like. The sentences can be anywhere in the article and do not need to be in the same place (though they usually are).

The first revision was made to accommodate five articles per HIT, while the second revision was made for three (so that the Turkers don't get bored!). Additionally, the second revision had a few 'upgrades' and tweaks to the features. The smaller ones included a [clean] button to clear the highlighted sentences, extra detail in the instructions, adjustments to the number of sentences allowed to be highlighted (which will be removed), and separation of sentences with tab spaces instead of bullet points (standardization). The largest addition was a 'Recent Sentences' box, which shows the three sentences containing the most recent dates (taken from the 'Generate Newsworthy HIT'). This is because sentences containing dates usually describe an event on that date in a condensed format, which is just what we need for this HIT. Moreover, the recent sentences are clickable: clicking on a sentence scrolls the text box containing the article on the right to that exact sentence and bolds it. These sentences give users a place to look for 'summarizing sentences' without needing to skim the entire article, as some of the articles can be quite long.
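
The actual recent-date extraction happens upstream with Serif in the 'Generate Newsworthy HIT' pipeline, but the idea behind the 'Recent Sentences' box is roughly this (a simplified sketch with a deliberately naive date regex, not the real code):

    import re
    from datetime import datetime

    DATE_RE = re.compile(r"(January|February|March|April|May|June|July|August|"
                         r"September|October|November|December) \d{1,2}, \d{4}")

    def recent_sentences(sentences, top_n=3):
        """Return the sentences containing the most recent full dates."""
        dated = []
        for sentence in sentences:
            match = DATE_RE.search(sentence)
            if match:
                when = datetime.strptime(match.group(0), "%B %d, %Y")
                dated.append((when, sentence))
        dated.sort(reverse=True)  # newest dates first
        return [s for _, s in dated[:top_n]]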

To further explain this HIT, I will also be creating a video complete with instructions, examples, and plenty of information on how we would like users to approach the HIT and, even more importantly, why. A deep understanding of this HIT will prevent any misunderstandings and allow Turkers to complete it efficiently and accurately.

Here is the current dialogue of the video:

Hello, this is an instructional video for the Sentence Summary HIT. My name is Katherine, one of Chris' research students, and I will be walking you through this simple and easy HIT. The page I'm currently on displays an actual HIT from the batch of 100 HITs. My goal today is to provide you with plenty of information, along with examples, so you can complete this HIT. I've also provided a table of contents for the video so you can skip to the parts that best appeal to your interests.

The first thing we'll do is just read out the background information and short guidelines on the Sentence Summary HIT in this paragraph.

[Read Instructions]

In summary, the goal of this HIT is to identify the sentences that summarize a current event relating to a Wikipedia article, for each of three Wikipedia articles. Simple enough, right?

Let's go over the interface of the HIT and then do some quick practice runs.

You'll notice if I scroll down to the first article, you see two text boxes. The first text box contains the title of the article and the article itself. That's what this large box of text is right here. The second text box contains clickable sentences from the article that contain the 3 most recent dates. Clicking each sentence takes you to the section of the article with the corresponding sentence.

Once again, the purpose of this HIT is to distinguish the sentences that tell us why this topic was newsworthy or popular around this date, which happens to be ____. To do that, you click on the sentences, which will become highlighted in sea blue. You can also click on highlighted sentences or this [clean] button to unselect one or all of them, respectively. Initially, the tricky part is deciding which sentences to highlight. But believe me, you'll get the hang of it once you know what you're looking for.

The first step you should take is to skim over the first paragraph or two. Usually, the summary sentence is staring at you right there, no need to skip all the way to the bottom. The second step is to look at this text box of recent sentences. When it comes to current events, summary sentences are inclined to contain dates though that is not always the case. The third step is just 'thinking on your feet' and making logical conclusions based on the characteristics of the topic. You'll see what I mean in a bit - let's take a look at the first article.
          Note* remember to provide transcript for video

This is a look into the data we'll be uploading for this HIT:

The columns are offset because Excel cells hold a maximum of ~32k characters each. It looks wrong, but the data is fine and in the right format - the offset is only visual.

These are more notes that I took regarding the current goals and to-do list:

- (check) \t -> sentence separator
- (check) alternate design for HIT (one article per HIT)
  - (change to infinite) let them click on any sentences they want (not just one)
- post the current events one (download the current date)
- ask Byung Gyu how the clustering algorithm works -> see if we can have another clustering algorithm that works on all 1000
- record groups of clusters in a new column called cluster number
- try to keep real trending scores (for wikitopics.org)
- (in progress) refining the instructions
- (check) point out recent sentences somehow in the Sentence Summary HIT

Information
-----------
- nohup [command] &   (keeps the command running after you log out; the '&' puts it in the background)
  - screen (a terminal multiplexer; look up if curious)

Over the past two weeks, I've definitely gotten more attached to the entire Wikitopics project. I keep thinking about possible website designs, including extra features such as forums and commenting. There's just so much we can do with it!! The main focus right now is just to get the nitty-gritty down to collect good data for the website.

Friday, June 24, 2011

Living in a small world

Can you believe it? It just hit me that we're a third of the way through summer. I thought I would have already learned my lesson about leaving my 'Summer Todo List' tucked away in the cabinet. *smacks head* But as I look back, I like to think that it was okay. It was okay to lie around and stay up late watching Anime and have some simple time to yourself. For the longest time, I felt like classes and work were carrying me in a direction I didn't initially choose. But I feel like I have that choice now. And you know what, there's something else I have more of now, and that's confidence, to take bigger steps and reach for waiting dreams and goals. Gotta make the most of life. :)

I wanted to write this post to remember some of the things that have happened this summer:

PlayStation 3
I got a PS3. :) Wayne and I went to Towson mall and picked it up along with two games.

Games we bought ^_^ (we've had Heavy Rain since winter and have been waiting forever to play) (Final Fantasy XIII, Heavy Rain, Atelier Rorona)

The visual art book that came with 'Atelier Rorona'

'Atelier Rorona' is an Anime-based game, and as perfectly mentioned by this review, it's a 5/5 if you match this style of gaming (and many people do). You play as a clumsy but responsible character named Rorona, and you've just been given ownership of the Alchemy Workshop by your lazy and snarky alchemy master, Astrid. What's more, the king is trying to shut down that very workshop, and he's given you 12 assignments, completed through synthesis, battling, and item gathering, to prove yourself. You also complete quests for the locals to raise your friendship with them and gain the trust of the town (since Astrid's personality rubbed them the wrong way :P). Very straightforward and simple game; no complicated story plots. The personalities of the characters are also very developed and dynamic! I cringe at some character designs whose remarks are clichéd and overly dramatic; this happened a few times while playing Final Fantasy XIII, a beautiful game nonetheless. Anyhow, if you want to take a break from action-heavy games, 'Atelier Rorona' is a great fit. It tends to fall under the 'girly' category, but who cares. It's great comic relief, simple, and cute. :)

Wayne and I have only played 'Heavy Rain' for about an hour lol; I really want to spend some more time on it *drags him to the PS3*.

Steiff Silver Building
Okay this isn't really an event, I just took some pictures of the workplace. :D These pictures are from two Fridays ago (June 10th).

If you notice the USB cable, I came back to get it after forgetting it and biking away like a dummy. Also I'm not sure if that computer in the background works ... hmmm *investigates*

Pretty photograph they had hanging

APERTURE ... okay I stop

Coming back home June 20th - June 22nd
YUMMMY tart! And my Dad made us zongzi. :)




Haha, not to be a crazy cat lady (seriously), but I love my kitties so much. I think about them frequently and how they're probably doing dumb things like getting stuck in a box or meowing with toys in their mouths. It's pretty ridiculous how smart they can be, too. The minute the cages come out, the white one, Kitty, knows that it's time to get into the car, which usually means a trip to the vet; she dashes under the bed. As for the grey one, Kitten, he doesn't notice a thing at first, but soon follows suit and joins Kitty under the bed. Too silly. :P

Japan
My sister earned a scholarship to Japan!! She'll be joining YFU (Youth For Understanding) for 6 weeks and living with a host family off the coast of Japan. She brought my Canon Rebel on the trip to capture her adventure there. :) Here we are before her departure:
Yeah we were being weird at the airport as always lol

Cooking + Camera
My parents bought me a replacement Nikon camera for the time being while my sister's in Japan with the DSLR. ^^ I feel very lucky! I've been using it mostly to take pictures of food I've cooked. xD By the way, I cooked steak twice in the past two weeks (and I still have one more in the freezer)! The first time it came out medium well-done, and the next time was actually worse because I kept flipping it lol! That prevented the insides from getting cooked and let the outside become a bit dry. Why am I talking about steak.

Test picture - current state of the apartment lol, everything's jammed into the corner since we're getting the walls painted

*cough**cough*dry*cough* dumplings :D

Rice pudding!

Living in a big world

Our apartment walls still smell of paint; the painter only left 10 minutes ago. I'm lying on a pile of blankets and pillows in a pensive mood. Of what? Probably a combination of games and traveling, a mix of life. Here's a video of a penguin I met at the beginning of freshman year at the National Aquarium in Inner Harbor while I reflect.


(yes he's swimming into the glass xD)

Generate Newsworthy HIT 

- (check) instead of random articles on HIT, make HIT display articles from clusters (email BA - figured it out :D)

One thing that Chris suggested to improve the accuracy of the Turkers in rating the newsworthiness of articles was to display articles in related clusters according to a script that Byung Gyu wrote (k-means clustering). In other words, instead of giving Turkers a random assortment of 10 articles, we give them 10 articles, some of which are related to one another. If Turkers detect the connection, they will be more likely to label 'Yes, this is newsworthy' for all related articles, rather than dismissing some as miscellaneous and randomly popular (such as the connection between Osama Bin Laden and waterboarding). By reading in the file containing the pre-generated clusters, it was simple to create HITs according to the clumps.

Sample cluster file (2011-06-02)

I've started to add more comments to the Wikitopics code

The overall idea was that we first read the cluster articles into one array and then the regular topics into another array if and only if they were not already in the cluster array (in Perl, instead of iterating over the cluster array multiple times to check whether an element existed, it was more efficient to build a hash out of the array, which makes the lookups virtually instant: O(1)). We would then shuffle the articles not in the clusters (or else they would be in alphabetical order, which I guess isn't a problem, but I wanted to maintain Byung Gyu's original behavior). With those articles in some arbitrary order, we then appended the cluster articles to the array, meaning that the clusters are kept together in their groups (the shuffle call in the print_articles method was also commented out).
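
Since that paragraph is a bit of a mouthful, here's the same idea sketched in Python rather than the actual Perl (a set plays the role of the hash; the names are made up):

    import random

    def order_articles(cluster_articles, all_topics):
        """Shuffle the non-cluster topics, then append the cluster articles
        so each cluster stays together as a contiguous group."""
        in_cluster = set(cluster_articles)                    # O(1) membership checks
        rest = [t for t in all_topics if t not in in_cluster]
        random.shuffle(rest)                                  # arbitrary order for the rest
        return rest + list(cluster_articles)                  # clusters keep their grouping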

Yay clusters. :) Their trending scores are 2 for identification purposes.

Of course this means that some of the clusters will be broken up. Would it be more efficient to place one cluster in each HIT? This feature could be implemented in a reasonable amount of time. Perhaps we'll work on these details once the more important parts of the HIT are completed (HIT instructions, design, and any other areas suggested by Turkers).



2. (check) Run the fetch_sentences.py script within the generate_newsworthy.sh shell script to output individual files containing corresponding Wikipedia articles
3. (check) Edit the parallelize_serif_part.sh shell script that Byung Gyu uses for the top 1000 articles to work for the negative controls

And the recent date sentences for the negative controls are finished! The previous error that I encountered where Serif could not find some files occurred because the file names were in a different format, specifically, URL format where the file names contained a bunch of %__ escape characters. With a quick fix, Serif began generating the .xml files (well, .xml.xml files for some reason).
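
The 'quick fix' was essentially making the %__ escapes consistent so the file names Serif looked for matched the ones on disk. Something along these lines (a sketch of the idea, not the actual patch; the 'articles' directory is just a placeholder):

    import os
    import urllib

    def normalize_filename(name):
        """Decode %xx escapes (e.g. %20 -> space, %27 -> apostrophe)."""
        return urllib.unquote(name)

    for name in os.listdir("articles"):
        fixed = normalize_filename(name)
        if fixed != name:
            os.rename(os.path.join("articles", name), os.path.join("articles", fixed))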


 

Uh oh, Serif crashed on me! PuTTY actually freezes during the Serif .xml file generation, though conveniently after all the .xml files have been generated. Chris recommended that I ask Ed Loper, the developer of Serif, about this issue - I'll give it a shot! Who knew he was in our lab!

The pick_recent_xml2.py script working

Here is the .csv output:


(before clustering) ==> (after clustering + debugging)

As you can see, there are still some missing cells. This happens when Wikipedia articles do not contain dates at all.

As seen missing in cell BI-5

We still have yet to determine whether to let these cells go empty or feed in pre-written sentences with dates, since the negative controls will be relatively easy to spot if the right eye is looking for them (then again, if you're spending that much time analyzing the HIT, you're probably doing it right ... well, up until then at least).

I've also added randomization to the articles such that the 12 articles received won't be in the order: articles needing labeling, positive control, negative control.
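
The randomization itself is just a shuffle over each HIT's combined article list before the row is written out, e.g. (a tiny sketch; the lists are placeholders):

    import random

    to_label = ["Article %d" % i for i in range(10)]   # articles needing labels
    controls = ["Positive control", "Negative control"]

    row = to_label + controls
    random.shuffle(row)   # controls no longer sit predictably at the end
    print(row)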


This is what the most recent draft of the HIT looks like (how many have we been through now?):


Sentence Summary HIT

I've got the .csv file generation script written and part of the highlighting functionality working for the HIT template now. Here is the .csv file generation script (called generate_summary.py). (All of this is pretty much brute force ... I'm hoping I'll learn more Python to make this code more efficient.)

Code here

Phew large file

The .csv file actually looks like this because of the ~32k character limit for each Excel cell. I wonder if this will affect the HITs, since regardless of the appearance, the cells are correctly formatted as comma-separated values (a.k.a. CSV). Anyways, Amazon Mechanical Turk isn't accepting the input for some currently unknown reason. Y!!
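
For what it's worth, the CSV itself can hold fields far longer than 32k characters; it's only Excel's per-cell display that gives up. A sketch of the kind of row generate_summary.py produces (the column names here are made up for illustration):

    import csv

    row = {
        "title": "Some Wikipedia article",
        "article": "A very long article body... " * 5000,   # well past Excel's 32k display limit
        "recent_sentences": "Sentence one.\tSentence two.\tSentence three.",
    }

    with open("sentence_summary_input.csv", "wb") as f:      # use "w", newline="" on Python 3
        writer = csv.DictWriter(f, fieldnames=["title", "article", "recent_sentences"])
        writer.writeheader()
        writer.writerow(row)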

Quick peek into the highlighting functionality.

Problems
+ I've come to realize that a lot of the problems I run into are decision-based rather than technical. For instance, should I adhere to standard code or write potentially more efficient code? Should I spend time thoroughly reading code, or skim it and deal with problems as I encounter them (this is what happened with the negative controls when Serif couldn't find some of the files)? It's an ongoing cycle of deciding between one or the other, and sometimes you'll try both choices before settling on one, or even combine both approaches.

Python
+ str.startswith({SUBSTRING}) (method to check if string 'str' starts with {SUBSTRING})
+ zip(*zip({ARRAY1}, {ARRAY2})) gives back the two sequences (as tuples) - the 'unzip' of the builtin zip; more explanation
   - zip returns a list of length equal to the shorter of the two arrays (the longer one is truncated)
+ import random                  (shuffles the array in arbitrary order)
   random.shuffle({ARRAY})
+ urllib
   - urllib.quote({STRING}) (returns the proper URL format of {STRING} by using %xx escapes (such as the common %20 = whitespace))
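
A tiny example pulling these notes together (Python 2 syntax, matching the notes above; the titles and dates are placeholders):

    import random
    import urllib

    titles = ["Osama bin Laden", "Waterboarding", "E3 2011"]
    dates  = ["2011-05-02",      "2011-05-02",    "2011-06-07"]

    pairs = zip(titles, dates)             # [('Osama bin Laden', '2011-05-02'), ...]
    random.shuffle(pairs)                  # arbitrary order (zip returns a list in Python 2)
    back_titles, back_dates = zip(*pairs)  # 'unzip' back into two tuples

    print(urllib.quote("Osama bin Laden"))   # Osama%20bin%20Laden
    print("2011-05-02".startswith("2011"))   # True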

Perl
+ s/#.*//;           (strips everything from '#' to the end of the line, i.e. removes comments; '#.*' is a regular expression)
   next if /^(\s)*$/; (skips the line if it is empty; more info)
+ Given @array1 and @array2, @array3 = (@array1, @array2) makes @array3 the merging of the two arrays! (Simple)
+ split(/{DELIMITER}/, {STRING}) (splits {STRING} by {DELIMITER} and returns an array of the pieces)
+ @array, $array[0] (the whole array vs. a single element)
+ $hash{{KEY}} = {VALUE} (creating a hash entry)
   - exists($hash{{KEY}}) (tells whether {KEY} exists in the hash)
   - my %hash = map { $_ => {VALUE} } @array (creates a hash out of an array)

Monday, June 20, 2011

Wikitopics

I remember before Chris left for Oregon last week, he and I sat down not only to create a list of possible things for me to work on, but also to reiterate the grand purpose and importance of why we created a 'Generate Newsworthy HIT' or why he and I spent hours one day to grab Python modules, beyond simply accepting that 'it's what we needed to do.' It was eye-opening to hear him tell me what our overall purpose was; and I'd like to write it here again as a prevailing reminder.

"Wikitopics is a project Ph.D. student Byung Gyu began under supervisor Chris Callison-Burch. Equipped with 3 years (and counting) data on the top 1000 most popular Wikipedia articles by pageviews, Byung Gyu and Chris sought to create a completely automated website that displayed current events all over the world from pop-culture to natural disasters and everything in between. The finished website would also possess functions to aggregate articles from other news websites beyond the sole Wikipedia articles, cluster related articles into categories beyond just the standard 'sports, politics, etc.', and provide summaries detailing why such topics are current. The overarching goal is that Wikitopics will become a prime resource to collect data, branching into other research projects in fields such as sentiment analysis and translation. The current website is located here."


I was also extremely happy to discover that Chris and I were on the same page regarding certain aspects of the project, specifically that the big dream is to have Wikitopics fully automated, while for now we resort to Turkers from Amazon Mechanical Turk for help, and that the ideal perspective to have in approaching Wikitopics is one of paying close attention to detail while maintaining the big picture. We also spoke about how to proceed with certain tasks such as the remaining HITs, and I found myself providing feedback and opinions on certain decisions with Chris eagerly listening. Overall, it was a great experience to sit down with Chris and hit these points. 

This post will cover activities related to Wikitopics last week.

Generate Newsworthy HIT

- generate the recent date sentences for the (check) positive and negative controls
   - (check) changed get_positives_article.py script to print the date of the article as well
   - (check) create bullet points for every recent date sentence extracted
   - (check) add google news link at the bottom (so that there is no ambiguity) instead of linking the entire recent date sentences
   - (check) go back to just one column of drop-downs
   - instead of random articles on HIT, make HIT display articles from clusters (email BA)
 
The positive controls are completely finished! Here is the HIT with the sentences containing the most recent dates pulled for the positive controls.


I managed to do this by (1) adding code to get_positive_articles.py to store a list of the positive control articles, then (2) changing the pick_recent_xml2.py script to include this list when generating the individual files containing the recent date sentences, and (3) finally changing generate_newsworthy.pl to read and print the articles into the .csv file. I also soon realized that I had to backtrack and store not only the titles of the articles but also the dates on which the articles were considered newsworthy according to the existing cron jobs (the behind-the-scenes activity that Byung Gyu had already implemented). This is essential for determining which folder/directory contains the text contents of the articles.
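
The stored list itself can be as simple as tab-separated title/date pairs, one per line, which the downstream scripts read back in. A rough sketch of the idea (the file name and example entries are just illustrative, not the actual scripts shown in the screenshots below):

    # Write the positive control articles along with the date each one was
    # considered newsworthy (so later scripts know which date-directory to read).
    positives = [("Osama_bin_Laden", "2011-05-02"), ("Super_8_(film)", "2011-06-10")]

    with open("positive_controls.txt", "w") as out:
        for title, date in positives:
            out.write("%s\t%s\n" % (title, date))

    # Reading it back (e.g. in pick_recent_xml2.py):
    with open("positive_controls.txt") as f:
        positives = [line.rstrip("\n").split("\t") for line in f]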

pink: (first addition) stores the list of titles of the positive control articles
yellow: (second addition) stores the date-directory alongside the title if the date is different from the running date (which is manually fed (by you!) into the generate_newsworthy.sh script)

purple: sets the path of the list of positive control articles and confirms its existence
blue: reads in the articles and appends the title and corresponding date to the stored list (isn't it amazing how Python lists can store multiple types of items? O_O)
gold: sets the variables for positive controls (you can see how it requires a specific date)
pink: sets the variables for the regular articles
red: a little check that allows me to skip the 40 minutes it takes to generate all the files if they've already been generated

blue: prints out the recent date sentences during the .csv file generation
pink: html-code for a bullet point (talked about next)

The next step after getting the recent date sentences was to format them for readability. The proposed idea was to bullet point each extracted sentence rather than smush the sentences together into a nonsensical paragraph. Here's how that worked:

As shown in the green, the JavaScript code separates the sentences by the bullet-point and then reintroduces it along with a break tag to make each sentence start on a new line.
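
On the generation side, the CSV field stores the sentences joined by the bullet entity (that's the 'html-code for a bullet point' mentioned above); the JavaScript then splits on it and re-joins with a break tag. A sketch of that field being built (not the actual code; the sentences are placeholders):

    BULLET = "&#149;"   # HTML entity for a bullet point (see the HTML notes below)

    sentences = [
        "On May 2, 2011, the first summarizing sentence goes here.",
        "On May 3, 2011, the second summarizing sentence goes here.",
    ]

    # One CSV field, bullet-separated; the HIT's JavaScript splits on BULLET
    # and re-inserts it with "<br />" so each sentence starts on its own line.
    recent_field = BULLET.join(sentences)
    print(recent_field)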

Here is the finished result:

Also notice how the 'Google News Link' has been separated from the recent date sentences

Then merging the two columns ...

Simplicity = :)

The remaining task for the 'Generate Newsworthy HIT' is the negative controls. Byung Gyu wrote an extremely helpful email detailing how to run Serif to pull out the recent date sentences for the negative controls. With the received information, I wrote down the following steps:

1. (check) Change the get_negative_articles.py to output a list of the negative control articles just like for the positive controls
2. (debugging) Run the fetch_sentences.py script within the generate_newsworthy.sh shell script to output individual files containing corresponding Wikipedia articles
3. (waiting on step 2) Edit the parallelize_serif_part.sh shell script that Byung Gyu uses for the top 1000 articles to work for the negative controls

1.
Straightforward code to generate the list of negative articles

Appearance of the output file

2.
Generated files containing Wikipedia articles - with some mysteriously missing

3.
Note the red comments - there might be more changes to be made

You can see the produced error from the unfinished recent date sentences extraction for the negative controls:


My goal is to have the negative controls for the 'Generate Newsworthy HIT' finished by tomorrow.

[Important] Sentence HIT

   - sentences HIT (Wikipedia interface, highlight/click sentences to tell why current)

With that said, I began working on a new HIT. This one is called the '[Important] Sentence HIT' (we'll call it 'Sentence HIT' from now on), which is fed input as determined by the 'Generate Newsworthy HIT'.

Here is a dusty draft of the HIT:

If you can guess which YouTube video that is from ... (it's a song)
green: yep, we're going to make video instructions :D
red: this is where the text instructions will go

As indicated in the 'Sentence HIT', we want Turkers to choose the one sentence from the Wikipedia article which best illustrates its importance on a specific date (the ${date} variable will be fed into this HIT template similar to the 'Generate Newsworthy HIT'). Turkers will be able to click on one sentence from the entire article, which will then be highlighted to show that it has been clicked. There is an existing 'Highlight Dialectal Arabic' HIT that I have been studying to reproduce the highlight function. Here is a screenshot of that HIT:

(My mouse is over ${srcSeg1}) As you can see, the variable becomes green when I mouse over it. If I click it, it will turn yellow.

MTurk Dropbox

+ (check - and ongoing) manage Dropbox for all the previous HIT batches (because Amazon is deleting them after a certain number of days!!)

As noted, Amazon will be deleting the results collected from Amazon Mechanical Turk after a certain number of days. I remember when Chris gave me this assignment, he looked at me and said that a few of them are worth a few thousand dollars ... and my eyes went O_O

HITs that will disappear :(

But no fear! That is where Dropbox comes in.

Yay we're saved!

Took a few hours to save and double check all these files but you could say it was worth it. :)

HTML
+ HTML codes (&#149; is the entity for a bullet point)
+ <span> (no physical change unless indicated by class/id; good tutorial)
+ <hr /> (introduces a horizontal line break as shown below)

+ empty elements (for XHTML; ends in ' />' (space, slash, angle bracket) for elements that cannot contain content such as <br /> or <hr />. You cannot do <a href="{URL}" />, you have to do <a href="{URL}"></a>; good tutorial)

jQuery
+ .select() (bind event handler to when user selects something; jQuery doc)
+ .click() (binds to when user clicks something)
+ $(document).ready() (need to read)

Problems Faced
Chris encouraged me to keep track of any problems I faced so here are more obstacles I encountered the past week:
+ The first noticeable struggle was that upon picking up the recent date sentences for the positive controls to work on, I could not figure out where to start because I had no idea where I had last ended. I remember the first thing I did was simply look for evidence of the recent date sentences for the positive controls - any files that I was sure I last worked on - and try to create a map of how all the scripts depended on each other. I think my main problem was that there were just so many processes going on that I couldn't make heads or tails of it. Eventually, I did manage to stumble upon the correct file that I had previously overlooked. There are a couple of solutions I came up with to prevent situations like these from happening again:
   - (will do) create a visual tree of all the scripts, which will show the dependencies and overall start-to-finish route
   - (will do) simply write a note where you last ended for the day!
Additionally, the decisions that really helped me find my way throughout the situation were to keep asking are there any new files I haven't looked at yet? and to make basic deductions based on scripts I knew couldn't be where I last left off. This situation reminds me of a treasure hunt or something. :P
+ Similar to the problem of not knowing where I last left off, the next problem I encountered was not knowing how or where to start on coding the highlighting function for the Sentence HIT. While I had the 'Highlight Dialectal Arabic' HIT to work off of, it was quite overwhelming to even figure out how the functions worked with each other since everything was new. Additionally, the way the 'Highlight Dialectal Arabic' HIT works is not exactly how the 'Sentence HIT' will work (of course). It was a push-and-pull situation between sticking to studying the existing HIT and starting from scratch. I decided to do a combination of both: first outline how I would do the HIT, research relevant methods and look on forums/Stack Overflow, and then break down the JavaScript functions written in the 'Highlight Dialectal Arabic' HIT if I hit a dead end. We'll see how it works out tomorrow. ^^ This also leads into the final problem that prevailed throughout all the activities.
+ The third major problem I ran into was deciding between fitting my code to existing standards or creating my own standards (that I think would work better) and fitting the existing code to mine. The largest consideration is how people will best be able to use my code in the future. I usually make a decision that combines both, but I have yet to figure out any patterns or final rules on the matter.
+ I also wanted to include that it's a very good idea to be on the lookout for any new concepts that you should research. It's seemingly common sense, but you would not imagine how easy it is to just brute-force or work your way around foreign code/topics (just like vocabulary). I know that there are a lot of new things to learn at the start, but I assure you the extra effort is worth it.

I'm ending this post with the song I've been playing many times over the past week. =)