Tuesday, May 24, 2011

Back on Campus

First day on the job! I'm participating in research for natural language processing this summer and I decided to keep a blog of my thoughts and what I've learned throughout. It'll be really nice to have something to look back on in the future, plus let my family know that I'm doing alright! I also have my 'Summer List' of things I want to do all ready, such as brainstorming beforehand for my game design course this fall, practicing my Chinese more, or finally learning guitar. I'll write about them, too! :D Anyways, here's how my first day went.

My supervisors, Chris Callison-Burch and Ben Van Durme, assigned me a project called 'WikiTopics' that Ph.D. student Byung Gyu Ahn had begun working on (he'll be guiding me as well while doing research at Microsoft this summer). Our main project goal is to create a completely automated news aggregator of current news articles, clustered into groups and categories, based on a data repository of the number of pageviews of all the Wikipedia articles (it's been monitored hourly over the past 3 years and is still going!) along with other sources including Twitter and other social networking websites. For instance, let's say that the top 5 articles of the day are 'Rebecca Black', 'Japan Earthquake', 'Osama Bin Laden', 'Waterboarding', and 'Marilyn Monroe'. We want to be able to determine the reasons why the articles received such a large number of hits, whether they actually pertain to any presently occurring events or are reasonably random, and what category each article belongs in. These are the main focuses, except that we'll be doing it with the top 1000 articles as opposed to the top 5. (Rebecca Black, Japan Earthquake, Osama Bin Laden, and Waterboarding should all be part of our collection, while Marilyn Monroe should not.)
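To make the "top articles by pageviews" idea a bit more concrete, here's a tiny sketch. The counts are made up and the "title, tab, views" layout is just an assumption for illustration; it is not the real data format or the actual WikiTopics code.

```shell
# Made-up hourly counts in "title<TAB>views" form (an assumed format, not the real data).
pageviews=$(printf 'Rebecca_Black\t1200\nMarilyn_Monroe\t300\nJapan_Earthquake\t900\nRebecca_Black\t800\nJapan_Earthquake\t1500\n')

# Add up the views per title, sort by the total (largest first), keep the top 2.
top_articles=$(printf '%s\n' "$pageviews" |
    awk -F '\t' '{ total[$1] += $2 } END { for (t in total) print total[t] "\t" t }' |
    sort -nr |
    head -n 2)

printf '%s\n' "$top_articles"
```

Ranking by raw hits is only the starting point, of course. Deciding whether a popular article is actually tied to a current event is the hard part.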

So after figuring out some logistics such as setting up computer accounts at the Human Language Technology Center of Excellence (COE) and Center for Language and Speech Processing (CLSP), filing for a building access card, and figuring out the summer schedule, we dove right in, beginning with the 'Generating Newsworthy' articles portion of the project. As mentioned above, one significant concentration of the project is to determine which news articles are 'current' or 'newsworthy' and which are not. I spent some time just getting comfortable with what Byung Gyu worked on along with syncing it to my account. This included the generate_newsworthy.sh shell script.

Success! We got it working on my account, and I moved on to improving a Human Intelligence Task (HIT) for Amazon Mechanical Turk (MTurk) to ask Turkers to help us initially decide whether the articles are newsworthy or not. In layman's terms, I'm coding a task in JavaScript/CSS/HTML and uploading it onto a website where many, many people can complete it and get paid (by us!). It's a pretty neat site, and Chris also wrote a paper analyzing it (reading it was one of the first things he asked me to do back in November 2010).

I also got a crash course on simple Linux and Emacs commands (dual wield with vi!). Honestly, I felt pretty dumb initially doing things the long way. I put a list of the new commands I learned today below so you don't have to feel that way. :P Lastly, I'm starting to get comfortable with GitHub, and can hopefully import parts of this blog related to WikiTopics over there with more coding bits. Overall, it was a great first day! I called my Mom shortly after, telling her how happy I was and wishing my sister the best of luck on finals, whilst walking in the wrong direction for a good 10 minutes lol. Cheers to an awesome summer!

Linux Commands
ssh [address] (secure shell .. on the to-do list to understand more)
du -hs [path] | sort -h > [file] (calculating disk usage and sorting from smallest to largest size; -h on GNU sort understands human-readable sizes like 200K or 1.5G, and adding -r flips it to largest first)
nohup (process can run even after logging out! man I needed this for some of my classes)
screen  (can detach/attach terminal sessions and also share them with others)
tail -f [file] (can view file as being created!)
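
A quick sketch of why the sort flag matters when du prints human-readable sizes. The sizes and names below are made up for illustration, and it assumes GNU sort (for the -h option):

```shell
# Sample sizes in the style of `du -h` output (made-up values for illustration).
sizes=$(printf '1.5G\tvideos\n200K\tnotes\n30M\tphotos\n')

# Plain numeric sort only reads the leading number, so 1.5G lands first.
numeric_order=$(printf '%s\n' "$sizes" | sort -n | cut -f2 | paste -sd ' ' -)

# Human-numeric sort (-h, GNU sort) understands the K/M/G suffixes.
human_order=$(printf '%s\n' "$sizes" | sort -h | cut -f2 | paste -sd ' ' -)

echo "sort -n: $numeric_order"
echo "sort -h: $human_order"
```

So with du -h in the pipeline, sort -h (or sort -hr for largest first) is the safer pairing. If you'd rather stick with sort -n, one option is to have du print plain numbers, e.g. du -sk for kilobytes.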

(following are for dummies)
control + r (search)
control + k (delete rest of line)
control + a (go to beginning)
control + e (go to end)
control + b (back one character; a.k.a. the left arrow key)
  
Emacs commands (hm no need to type 'i' before typing)
control + x - control + s (save)
control + x - control + c (exit)
control + g (ignore partially typed command)
control + x - u (undo)
control + y (paste)
control + space - move cursor - control + w (cut)

Git commands (for GitHub)
git add [file]
git commit -m 'first commit'
git remote add origin git@github.com:[username]/[repository]
git push -u origin master
(I'm still not completely sure what these commands do, so gotta do that!)
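
For reference, here's a rough annotated sketch of what those four commands do, run in a throwaway directory. The user name, email, and repository URL are placeholders invented for the demo, and the push is left commented out since it needs network access and a real repository.

```shell
# Work in a throwaway directory so nothing real is touched.
demo_dir=$(mktemp -d)
cd "$demo_dir" || exit 1

git init -q .                          # create an empty repository here
echo "hello" > notes.txt

git add notes.txt                      # stage the file for the next commit

# Record the staged snapshot; -c supplies a placeholder identity for this demo only.
git -c user.name="Demo" -c user.email="demo@example.com" \
    commit -q -m 'first commit'

# Give the (hypothetical) GitHub URL the short name "origin".
git remote add origin git@github.com:example-user/example-repo.git

commit_count=$(git rev-list --count HEAD)
remote_url=$(git remote get-url origin)
echo "commits: $commit_count, origin -> $remote_url"

# The last step uploads the local 'master' branch and, because of -u,
# remembers origin/master as its upstream for later pushes and pulls:
# git push -u origin master
```

In short: add stages changes, commit records them locally, remote add names the server, and push sends the commits there.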

Todo List
+ learn about ssh/secure shell
+ complete 'Generate Newsworthy HIT'
   + get a more solid understanding of JavaScript/CSS/HTML
+ correct escaped string in newsworthy.csv
+ getting acquainted with GitHub
+ start learning Perl

How I broke down problems
I definitely ran into problems here and there, such as 'Permission denied' and 'No such file or directory' messages, CLSP account related bits, or figuring out new concepts such as environment variables. My first thought whenever I ran into an obstacle was that perhaps I'd already been given the answer to getting past it. For instance, when I received the message 'WIKITOPICS environment variable not set', it meant complete baloney to me at first, but after a few quick searches on Wikipedia as well as going through a few emails, there was an aha-moment, and I realized that I just needed to type 'WIKITOPICS=/home/...' to set the variable. I also went to Chris to ask questions, not only to help me resolve the problems, but also to feel comfortable that it's okay to get stuck. Just remember, there's most likely a very logical path to solving the problem, so don't give up! - looks at people IT support get mad at -
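
A small sketch of how setting a variable like that plays out in the shell. The path below is a made-up stand-in (the real one is elided above), and the assumption here is that a script run as a separate process only sees variables that have been exported:

```shell
# Start clean, then set a hypothetical path used only for this demo.
unset WIKITOPICS
WIKITOPICS=/tmp/wikitopics-demo

# A child process (like a script) does not see a plain shell variable...
before_export=$(sh -c 'echo "${WIKITOPICS:-not set}"')

# ...until the variable is exported into the environment.
export WIKITOPICS
after_export=$(sh -c 'echo "${WIKITOPICS:-not set}"')

echo "before export: $before_export"
echo "after export:  $after_export"
```

If a script still complains that a variable is not set right after you typed the assignment, a missing export is one of the first things worth checking.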
