Lab 6: The Sentiment of a Superstorm


For this tutorial, we’ll explore how we can use social data (tweets from Twitter) to enrich our understanding of one of the most destructive hurricanes in US history: “Superstorm Sandy”.

The first part of this lab sheet contains reading material on both sentiment analysis and hashing; the second part contains the exercise questions. Skip directly to the practical part if you are impatient.

Also note that there are five scoring parts in total: Exercises 6.1–6.4, plus the visualisation.

Table of Contents

Part I - Preliminaries

  • Making Sense of Social Media
  • Concept 1: Sentiment Analysis
    • A Simplified Approach to Determining Sentiment
    • Sentiment Lexicon
    • Steps for Determining Tweet Sentiment
  • Interlude: The Mailbox Problem
  • Concept 2: Hash Tables (not #HashTags)
    • Hash Functions
    • Buckets and Hash Collisions

Part II - Programming tasks


Making Sense of Social Media

The prevalence of social media is changing not only the way society communicates and interacts, but also how we think about social data. For instance, can real-time social data be used to better detect the occurrence of an earthquake?[1] Can social media aid disaster response to wildfires?[2] What other kinds of activities could be enriched by the analysis of social data?

The abundance of data presents us with the unique opportunity to better understand and utilise social behaviour and phenomena. This has led to the rise of Computational Social Science: an interdisciplinary field which explores how computational techniques from computer science can be applied to various fields of social science.

To do this we’ll first review two concepts: sentiment analysis and hash tables.

Concept 1: Sentiment Analysis

There are many ways we could make sense of textual data from tweets. One way is to consider whether the emotional state or attitude a tweet expresses is positive, negative or neutral. This is called sentiment polarity.

For example this tweet would be considered positive in sentiment:

So happy exams are over. Finally done with second year! Now can look forward to my holiday :D #summmerrr

Conversely this tweet would express negative sentiment:

I hope i do good on my finals tommorow #worried

But what about this one? Is the person actually happy? Can we be sure?

i totally failed math finals so that's good

Try determining whether the following tweets about Superstorm Sandy are positive, neutral or negative:

So happy to wake up to #sandy gone!

Everyone Panic!!!1 RT @NBCNews Full moon could make #Sandy's impact worse http://t.co/hHs5l8EY

#sandy i love you. even if your last names not cheeks*

RT @NewYorkPost: Rainbow touches down in lower Manhattan #hurricane #sandy http://t.co/M6MGqGN7 via @kurtdietrich*

RT @NickpcWx: #Hurricane #Sandy cloud cover is covering nearly 1 million square miles of water/land. @reedtimmerTVN http://t.co/STGZxi2X

RT @GovChristie: I don't give a damn about Election Day after what has happened here. I am worried about the people of New Jersey. #Sandy*

Discussion Questions:

  • How did you determine their sentiment polarity?
  • Could you write out a list of steps used to determine this?
  • Would others following your steps produce the same results?
  • What makes determining sentiment difficult? Why is it relatively easy for humans to classify sentiment but not computers?

A Simplified Approach to Determining Sentiment

A simple way to determine sentiment polarity for a tweet would be to determine how many words in the tweet are positive, negative or neutral in sentiment polarity. If there are more positive words than negative words, then the tweet is considered positive. Conversely if we have more negative words than positive words, then the tweet is negative. Otherwise if they are equal, then the tweet is neutral. But how do we determine the sentiment polarity of a word?

Sentiment Lexicon

Determining the sentiment of a word can be accomplished by using a sentiment lexicon. This is a collection of words (or phrases) along with their associated sentiment polarity. For instance a lexicon of positive words could contain the words “happy” and “good” whilst a negative word lexicon could contain the words “sad” and “bad”. Thus the sentiment of a word can be determined by checking whether it is in either lexicon.

Such a sentiment lexicon can be described as an example of labeled data. This is because the words have been assigned a “correct” label (sentiment polarity). This is accomplished either by experts manually assigning a “correct” value or by using machine learning techniques to determine this.

Discussion Questions:

  • What are some of the limitations of this approach?
  • Is the labelled data “correct” across different social domains?
    • Compare how the words “bull” and “bear” are used in general language and in the context of financial markets
  • Is the labelled data “correct” in different contexts?
    • Compare the word ‘low’ in the context of “low cost” and “low salary”
  • Has social media made it easier to create or access labeled data?

Steps for Determining Tweet Sentiment

Since we are interested in whether a tweet contains more (or fewer) positive words than negative words, we can assign positive words a value of 1 and negative words a value of -1. Thus if we sum these values for all words in a tweet, we get a score that tells us whether a tweet is positive, negative or neutral in sentiment. We can describe these steps informally as follows:

input is tweetText
set sentimentScore to 0
for each word in tweetText
    if positiveWordLexicon contains word then
        add 1 to sentimentScore
    end                  
    if negativeWordLexicon contains word then
        subtract 1 from sentimentScore
    end
end
output is sentimentScore

Thus a sentimentScore greater than zero indicates that the tweet is positive. If it is below zero, then the tweet is negative. Otherwise, if the score is zero, the tweet is neutral in sentiment polarity. Note that we ignore neutral words since they won’t change the score.

This description of the steps used to determine sentimentScore from tweetText is an example of an algorithm. The informal description is known as pseudo-code, as it cannot be read by a computer; however, to be useful it should loosely relate to a language you’re familiar with. We use pseudo-code to focus on how the algorithm works rather than language (Snap!) syntax and rules.
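The pseudo-code above translates almost line-for-line into Python. A minimal sketch (the mini-lexicons here are illustrative placeholders, not the lab’s actual lexicon):

```python
# Illustrative mini-lexicons; the lab uses a much larger sentiment lexicon.
positive_word_lexicon = {"happy", "love", "relief", "good"}
negative_word_lexicon = {"sad", "worried", "damage", "afraid"}

def sentiment_score(tweet_text):
    """Sum +1 for each positive word and -1 for each negative word."""
    score = 0
    for word in tweet_text.lower().split():
        if word in positive_word_lexicon:
            score += 1
        if word in negative_word_lexicon:
            score -= 1
    return score

sentiment_score("so happy sandy is gone")  # 1, i.e. positive
```

Note that splitting on whitespace is itself a simplification: punctuation and hashtags (e.g. “#worried”) would not match the plain word in the lexicon.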

Discussion Questions:

  • Can you tie back all the pseudo-code to the Snap! blocks you have seen/used?
  • How would you determine the words of a tweet?
    • Look back at the example tweets, what would you need to exclude or consider?
  • What concerns would you have if the sentiment lexicon contained a very large number of words?
  • Suppose that you have a physical copy of all 20 volumes of the Oxford English Dictionary (2nd Edition), which comprises over 400,000 entries and 22,000 pages.[4] How would you quickly look up a definition for a particular word?

Interlude: The Mailbox Problem

Imagine that you are a post-office clerk looking for a letter in a pile that has been sorted alphabetically. You’d flick to where you’d estimate the letter to be, then flick left or right through the letters to get closer to the desired letter (or determine that it is not present). This type of searching is fast, but is there a faster way?

Instead imagine that you live in a residential hall or college and you are trying to find your mail. There are a number of letter-boxes, one for ‘A’, another for ‘B’, another for ‘XYZ’, etc. To find mail for a person, you simply take the first letter of the person’s last name, go to the respective letter-box and search through the letters to find the person’s mail. This scenario roughly describes the idea of a hash table. 'mailbox image'

Discussion Questions:

  • Some college kids are pranksters. One has randomly redistributed the mail into the letter-boxes. You face the same problem of finding your letter.
    • What is the maximum number of letters you would have to look through to find your letter?
    • What is the least number of letters?
    • On average, how many letters would you have to look through to find your letter?
    • How many letters would you have to look through to organise all the mail?

Concept 2: Hash Tables (not #HashTags)

Similar to how a residential college stores a letter associated with a person, a hash table stores a value associated with a key. Hash tables store entries using a hash function for faster look up.

Hash Functions

A hash function is simply a function that takes an input value of arbitrary size and outputs a value of fixed size. For example, consider the following hash function which takes an integer k as input (for positive numbers, mod is the same as remainder):

$$hash(k) = k \mod 3$$
$$hash(1) = 1$$
$$hash(2) = 2$$
$$hash(3) = 0$$
$$hash(23412517) = 1$$

No matter how large the input integer is, the function will always return a value between 0 and 2 inclusive. This is called a hash value, and it is useful because, just as a physical residential college has a finite number of mailboxes to put letters in, computers have a finite amount of memory that can be used to store data. Where we store our entries in memory is called a bucket.

Thus if we have n buckets to store our entries in, our hash function should output a value corresponding to a bucket:

$$hash(k) = k \mod n$$

Since we want to associate a key with a value, we hash the key to determine the bucket. We then store the key along with the value in the bucket. This is just like having letters with people’s names on them. Otherwise, when we flick through the letters in a mailbox, we wouldn’t be able to tell which ones were ours.

In the mailbox example, we already defined an intuitive hash function. The key/input is an arbitrary person. To find the hash value, you take the first letter of the person’s surname. This outputs a fixed character between ‘A’ and ‘Z’. Thus, we have a hash function.
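Both hash functions described in this section can be written down in a few lines. A Python sketch (the 3-bucket count follows the example above):

```python
def hash_int(k, n=3):
    """k mod n: any integer maps to a bucket number between 0 and n-1."""
    return k % n

def mailbox_hash(person_surname):
    """The mailbox analogy: the first letter of the surname is the 'bucket'."""
    return person_surname[0].upper()

hash_int(23412517)      # 1, matching the worked example
mailbox_hash("Wrigley") # 'W'
```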

Discussion Questions:

  • Why would it be undesirable for a hash function to give a different output value each time it is applied to the same given input?
  • What would happen if we have 100 words to store but only 99 buckets to put them in?
  • What would happen if we have two different input values that have the same hash value? How would we store them?

Buckets and Hash Collisions

A hash collision is when two different input values have the same hash value. Collisions are problematic because we need to somehow store and look up multiple values in the same bucket. One way to handle this is to use an addressing method called separate chaining.

The idea of separate chaining is to treat buckets as a list of entries, where one entry is linked to another in a chain. Thus if we want to look up an entry in a bucket, we would need to go through the chain until we find it or reach the end of the chain. If we want to insert an entry we would need to go through the chain to check if it is already there and if so, we update its value. Otherwise we add the entry to the end of the chain.

To illustrate this, simply observe the mailbox example. When there is a hash collision, as in the case of ‘Bruce Wayne’ and ‘Andrew Wrigley’, both letters are stored in the same mailbox. If Bruce wanted to find his letter he would have to look through the pile/list of letters in the box. In this case, if you wanted to add a letter to a mailbox, you would look through the pile to find a letter addressed to the person and put the new letter next to it; otherwise you just put it on top of the pile.

Discussion Questions:

  • Are hash collisions avoidable? What would be needed to avoid any collisions?
  • Would it be better to have more or fewer buckets? What would happen if we only have one bucket?
  • Why would it be undesirable for our hash function to put more entries into one particular bucket over others?
  • Why might a residential college in Australia have a ‘XYZ’ mailbox rather than a box for ‘X’,‘Y’ and ‘Z’? Would this differ in other countries?

Practical: Sentiment Analysis of Superstorm Sandy

Files Needed:

  • The Snap! workspace with the Hashset project file is here
  • The Snap! workspace Sandy Tweet Visualiser project file is here

To demonstrate both of these concepts, we’ll use Snap! to analyse the sentiment of tweets about Superstorm Sandy from the United States of America and visualise their polarity on a map.

Screenshot of visualisation

To do this we’ll need to use a sentiment lexicon of positive and negative words and store them in a hash table for fast look up.

But since our sentiment analysis does not take into account different degrees of positive or negative polarity (i.e. +0.5 or -0.6), it isn’t necessary to store both a word and its sentiment polarity value. Instead if we store all positive words in one hash table, and all negative words in another, then we only need to store the word itself since we will know the polarity (positive or negative) of all words we look up in that particular hash table. This is the idea of a Set, which is simply a collection of distinct elements.

Here’s an example of a set of positive words and negative words:

$$S_{positive} = \left\{ \text{happy, love, relief} \right\}\\ S_{negative} = \left\{ \text{angry, damage, afraid} \right\}$$

Note that there can be no duplicate elements in a set, and that the size of the two example sets is 3. The size of a set is often referred to as its cardinality.

Since we are using hash tables to implement a set, we call this a hash set.
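Python has sets built in, which makes the idea easy to demonstrate with the example sets above:

```python
S_positive = {"happy", "love", "relief"}
S_negative = {"angry", "damage", "afraid"}

len(S_positive)        # cardinality: 3
"happy" in S_positive  # True — membership is all a set needs to answer
# duplicates collapse automatically:
{"happy", "happy", "love", "relief"} == S_positive  # True
```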

Implementing a Hash Function

As you’ve read, a hash set (or hash table) depends on a hash function for all its operations. However, hash collisions are bound to happen, and handling these collisions adversely affects the performance of our operations. Consider the example where our hash set uses a hash function that puts all words into the same bucket. To look up or remove a word we would possibly need to go through all the words we have inserted, because the bucket contains a chain that is as long as the number of words inserted. This is akin to looking for your letter when the letters have been randomly placed in boxes.

At the other extreme, if we have at least as many buckets as we have words, then a hash function that allocates each word to its own bucket would perform extremely fast: with no collisions, every chain is as short as possible. Such a hash function is called a perfect hash function.

However, for most practical purposes hash collisions are unavoidable. Instead the goal is to create a hash function that is as uniform as possible. This means that our hash function should distribute roughly the same number of words into each bucket. Thus if we have m words and n buckets, we would expect each bucket to contain m/n words.

Since the input of our hash function is a word, we need to devise a method (or algorithm) of turning that word into some number. Ideally this number should be determined by using all the characters in a word. We call this a hashing scheme and it is simply a function that takes a word as input and outputs a number.

Once we define a suitable hashing scheme our hash set will use it to define its hash function as shown below:
Snap block setting hash function

Notice that `hash function` uses another function `hashing scheme` to transform the element (in our case a word) into a number so it can perform modulo on it using `modulo` in order to allocate it to a bucket. The addition of 1 is because modulo can return 0 but Snap! counts positions from 1 (indexing begins at 1).

Thus the ability of a hash function to distribute elements uniformly across the buckets depends on devising a ‘good’ hashing scheme.

Exercise 6.1: Implement a hashing scheme (3 points)

To illustrate a bad hashing scheme, we’ve implemented a very simple one `example word hashing scheme` that simply uses the first letter of a word. This is bad because it means that all words starting with the same letter would go to the same bucket resulting in a lot of collisions. The goal for this task is to try to come up with a better hashing scheme that results in a hash function that is uniform.

Programming Instructions:
Complete this task by editing `hash scheme block` and completing the definition:
`hash scheme block definition` .

Your code is expected to:

  • Use the `unicode` block to transform a letter of a word into a number so we can perform calculations on it. This table outlines the mapping from characters to numbers – it covers not only English (ASCII) characters, but also alphabets from every written language in the world.
  • Using all the letters of a word would help to avoid collisions.
  • Using both the letters and their positions in the word will help distinguish words that are anagrams of each other, e.g. dog and god.

  • Use an appropriate hashing scheme so that your hash set on a modest list of English words is as uniform as possible (i.e. passes the uniformity test block below.)

    Hint: Prime numbers are useful for preventing collisions. The third answer on this page reports an observation from a collision test on 50,000 English words.

Test this block: After finishing your implementation, type a few different words in the input slot. Do you get a different (large) integer for each?
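For reference, one scheme consistent with these hints is a polynomial hash with a prime base, sketched below in Python (base 31 is a common but arbitrary prime choice; this is one possible scheme, not the lab’s reference solution):

```python
def word_hashing_scheme(word):
    """Combine every character with its position:
    h = ((c1 * 31 + c2) * 31 + c3) * 31 + ..."""
    h = 0
    for ch in word:
        h = h * 31 + ord(ch)  # ord() is the Python analogue of Snap!'s `unicode`
    return h

word_hashing_scheme("dog")  # differs from word_hashing_scheme("god")
```

Because each character’s contribution is multiplied by a different power of 31, anagrams such as “dog” and “god” hash to different values.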

Implementing a Hash Set

A hash set is a specialised version of a hash table. A hash table stores both a key and a value using a hash function. In our mailbox analogy, a person was the key and their associated letters were the value. However, for hash sets we only store the key. Thus if we don’t need to store a value associated with a key, it is more efficient to use a hash set. Whilst both use hash functions to look up a key, the difference is that a hash table returns the value associated with a key, whereas a hash set simply tells you whether or not it contains the key.

A hash set and a hash table are both examples of what is called a data structure. A data structure is simply a particular way of storing and organising data so that we can perform certain operations on it in an efficient manner. For example, a list is also a data structure. In this case, hash functions are used to store and organise the data for fast look up.

For the purposes of this practical we only need to implement two operations: one that inserts words into the hash set and another that checks if the hash set contains a particular word.

Exercise 6.2: Implement the insert operation (3 points)

The insert operation allows you to add additional elements to the set. For our use case, this will be to add additional words into the set.

To insert an element into the set using separate chaining, we’ll need to first create a hash value in order to determine the bucket we will insert it into. Then we’ll need to check if the bucket is empty; if so we can simply add the element, otherwise we’ll need to add it alongside the other elements already in the bucket. But since sets do not contain duplicates, we should only add the element if the bucket doesn’t already contain that element.

Programming Instructions:

  • Complete the definition for `insert function` by filling in the stub with working code:
    `set_insert_function`

Hints:

  • There is a special constant value called `empty bucket sentinel` , which is set as a global variable. Use this to check if a bucket is empty.
  • Buckets need to be able to hold multiple elements (due to hash collisions). What Snap! block can we use to hold multiple elements?
  • A set cannot contain duplicates of an element. Check that an element is not already in the set before you actually insert it.

  • Inside a hash set there is a variable called `entry count` that keeps track of the number of elements entered into the set. Be sure to increment this by one every time you insert an element.

  • At the end of the ‘insert’ we report the block `report list hash set` . This is because this representation of the hash set is implemented using lists. You don’t strictly need to know why this is for your function to work; if that didn’t make sense, don’t worry about it.

Testing this function:

This group of blocks can be used to visualise the hash set buckets. Use it to test your insert and remove functions.

`peek_buckets`
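The insert steps described above can be sketched in Python (names such as `EMPTY` and the toy length-based scheme are placeholders for the lab’s own `empty bucket sentinel` and your hashing scheme):

```python
EMPTY = None  # stands in for the `empty bucket sentinel`

class HashSet:
    def __init__(self, n_buckets, scheme):
        self.buckets = [EMPTY] * n_buckets
        self.scheme = scheme        # word -> integer
        self.entry_count = 0

    def insert(self, word):
        i = self.scheme(word) % len(self.buckets)  # pick the bucket
        if self.buckets[i] is EMPTY:
            self.buckets[i] = [word]               # start a new chain
            self.entry_count += 1
        elif word not in self.buckets[i]:          # sets hold no duplicates
            self.buckets[i].append(word)
            self.entry_count += 1

s = HashSet(5, scheme=len)  # toy scheme: hash a word by its length
s.insert("cat"); s.insert("dog"); s.insert("cat")
s.entry_count  # 2 — the duplicate "cat" was not inserted
```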

Exercise 6.3: Implement the contains operation (2 points)

The contains operation is used to check if an element is a member of a set. We will use it to check if a word is in the positive lexicon or the negative lexicon.

To check if a word is in a hash set, again we must determine which bucket it belongs to using 'hash function' . Once we have determined which bucket to check, we need to check if our word is contained in that bucket. If we find it then we report 'true' , otherwise we report 'false' .

Snap Instructions:
Complete this task by completing the definition for `contains block` by filling in the stub with working code:
`contains function`

Hints:

  • If the bucket is empty then it clearly cannot contain the element. How can we check this?

  • This lab includes a utility block in the ‘variables’ group called foreach: `foreach`

Testing this function: You can test that your contains function is working by using the 'global hashset contains' on your customised hashset, like when testing insert.
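In Python terms, `contains` is the same hash-then-scan pattern. A standalone sketch with a hand-built bucket list and a toy length-based scheme:

```python
EMPTY_BUCKET = None

def hashset_contains(buckets, scheme, word):
    i = scheme(word) % len(buckets)   # same hash function as insert
    if buckets[i] is EMPTY_BUCKET:    # empty bucket clearly lacks the word
        return False
    return word in buckets[i]         # walk the chain

# 3 buckets, as if already built by insert with scheme = word length
buckets = [["cat", "dog"], EMPTY_BUCKET, EMPTY_BUCKET]
hashset_contains(buckets, len, "cat")   # True
hashset_contains(buckets, len, "bird")  # False (hashes to an empty bucket)
```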

Testing

We have provided a number of test blocks, sometimes called a test harness in Software Engineering, to ensure your blocks (code) works.

To use them to test your blocks, simply provide the test harness with a hash set that has a provided hash scheme as shown below:


For the uniformity test, you need to provide the block with a set of words. You can use any of the word sets provided. Notice that each word set will produce a different histogram. You can use this histogram to evaluate how uniform your hash function is. If buckets have a roughly equal number of elements, then your hashing scheme is uniform.
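The histogram is simply a count of how many elements land in each bucket. A Python sketch of the same idea:

```python
from collections import Counter

def bucket_histogram(words, scheme, n_buckets):
    """Count how many words land in each bucket; a uniform scheme
    gives roughly equal counts across all buckets."""
    return Counter(scheme(w) % n_buckets for w in words)

# toy scheme = word length: even lengths -> bucket 0, odd -> bucket 1
bucket_histogram(["an", "ox", "cat", "dog", "bird"], len, 2)
```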

The hash scheme can be `example word hashing scheme` , `hash scheme block` or any other suitable block. Make sure that it is put in the correct place.

Run the test by clicking on the block. It will then tell you whether it passed or why it failed.

Your implementation should pass all three tests.

Discussion Questions:

  • Does testing ensure that your code is completely correct?
  • Why do different word sets produce different histograms?
  • How does changing the hashing scheme affect the histogram produced?
  • Why might Dr Seuss, Rhyme and Alliteration cause particularly bad results for the example hashing scheme (remember how the example hashing scheme works)?
  • Is it possible to make a set of words which fail the uniformity test, with your hashing scheme?
    • Is this true of any hashing scheme in this scenario?

Exporting blocks for Exercise 6.1-6.3:

This assignment has two parts and you must submit them separately. You should export the blocks that you defined in this lab:

  • “word hashing scheme”,
  • “insert __ into __”,
  • “__ contains __”,

along with any other custom blocks you created that are used by those.

Test before you submit! Always test your submission file – to make sure everything that should be included is indeed included!

  • The first test for this assignment is included in the test harness above.
  • The second test can be run via the Superstorm Sandy visualizer below.

Sentiment Analysis and Visualisation of Superstorm Sandy Tweets (1 point)

Once you’ve finished implementing your hash set and sentiment score, you can use them to analyse the sentiment of tweets about Superstorm Sandy[3] over time and visualise them on a map of the continental United States of America (excluding Alaska and Hawaii).

Exercise 6.4: Computing the sentiment score of a Tweet (1 point)

With the hashset implemented, you are ready to compute the sentiment score of a tweet!

First use your own hash set and hashing scheme implementation to construct the positive and negative word lists, by executing the following blocks. You need to import all of your blocks from the first project to the current project.

You will need to put your blocks into their respective locations.

hashsets

You then modify the block that computes the sentiment score, sentiscore, to loop over the words in the tweet and compute its sentiment score. A screenshot of the function stub for you to fill in is shown below.

sentiment block

After implementing it, you should run it through a few test cases. Use the examples from the beginning of this lab sheet to test out how well your scoring function works. For example, the block below should return 1, recognising the positive sentiment.

sentiment test

Snap Instructions:

This part assumes that you have saved a copy of the blocks from Exercises 6.1–6.3.

  1. Open a new version of Snap and import the Sandy Tweet visualiser project file (available here)
  2. Import your blocks into the project.
  3. If you’ve imported the visualiser for the first time you need to reset the visualiser:
    `reset visualiser`
  4. Similar to testing, simply drag your implemented hash set along with its hashing scheme into the load lexicon blocks and your insert function, and run them:

`load positive` `load negative`
5. Run the visualiser from your desired starting tweet, where 1 denotes the first tweet in the dataset and 582269 the last.
`run visualiser`

The tweets in our dataset were tweeted between 15 Oct 2012 and 12 Nov 2012; however, the bulk of the tweets come from a very specific time period, as shown below:

Frequency of Tweets over Time (Hours)
`tweet_graph`

Discussion Questions:

  • What do you think is the significance of the peaks?
  • The peak number of the tweets occurs between tweets numbered 115000 and 275000. Try visualising and comparing their sentiment.
  • Does the sentiment of tweets before the peak differ from the sentiment of tweets after the peak?

Enjoy!

Submitting your sentiment score block:

Export your sentiment computation block and submit it as a separate xml file

  • “sentiment score (tweet text, positive word set, negative word set)”

Test before you submit! Always test your submission file – to make sure everything that should be included is indeed included!

Submitting a screenshot of the visualization:

Let the visualizer run for a while, and submit a screenshot of the Snap! stage when you are satisfied with the result. Capturing part of the screen can be done via a keyboard shortcut on a Mac, the Snipping Tool on Windows, or the Take Screenshot tool in Ubuntu.

Here is mine, what does yours look like?

stage


Extension Tasks

For those that have completed all the previous tasks, here’s an opportunity to further your understanding, improve your programming skills as well as to earn bonus marks! These tasks have less guidance and support than the previous ones so you are expected to do your own reading or research if needed. You only need to implement one of the following:

Extension Task 1: Implement the remove operation

The remove operation is used to take elements out of the set. In sentiment analysis we can assume all words loaded into our hash sets are correct, and thus we didn’t need remove. Our hash set implementation is incomplete without it, though.

Snap Instructions:
Complete this task by completing the definition for `remove function` by filling in the stub with working code:
`set_remove_function`

Hints:

  • The remove operation is the opposite of the insert operation. Determine what steps you need to take to remove an element with this in mind. Writing pseudo-code might help.
  • You can’t remove an element that is not already part of the set.
  • Use the provided test harness to check that it works:
    `test remove`
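As a Python sketch of the mirror image of insert (restoring the empty-bucket sentinel when a chain empties is one reasonable choice; the lab’s test harness may expect something slightly different):

```python
EMPTY_SENTINEL = None

def hashset_remove(buckets, scheme, word):
    """Remove word from its bucket's chain; return True if it was removed."""
    i = scheme(word) % len(buckets)
    if buckets[i] is EMPTY_SENTINEL or word not in buckets[i]:
        return False                   # can't remove what isn't in the set
    buckets[i].remove(word)
    if not buckets[i]:
        buckets[i] = EMPTY_SENTINEL    # chain is now empty again
    return True
```

The caller would also decrement `entry count` whenever the function reports a successful removal.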

Extension Task 2: Implement dynamic resizing (rehashing)

Ideally, the more buckets we have, the less likely we are to have collisions. If we could estimate how many elements we’re processing, then we could set the number of buckets to at least this estimate. However, if we estimate wrong and have too few buckets, then we’ll have unnecessary additional collisions.

Our hash set would be more robust if it could resize itself depending on the load. To do this we will rehash all elements into a larger number of buckets once the number of elements in our hash set exceeds some threshold. This threshold is the load factor.

Take a look inside the `insert function` and you will notice that before any elements are inserted, there is a condition being checked to determine if the hash set is exceeding the load factor:
`rehash condition` Unfortunately the `rehash function` is not currently doing anything useful.

Your task is to edit `rehash function` so that it rehashes all the elements into a new, larger number of buckets.

Hints:

  • What variables do you need to reset?
  • How many elements in the hash set do you need to go through to rehash?
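A Python sketch of the rehash step (doubling the bucket count is one common growth policy, not necessarily the one the lab expects; the load-factor check itself stays inside insert):

```python
def rehash(buckets, scheme):
    """Move every element into a fresh, twice-as-large set of buckets."""
    new_buckets = [None] * (2 * len(buckets))
    for chain in buckets:
        for word in (chain or []):               # skip empty buckets
            i = scheme(word) % len(new_buckets)  # recompute each bucket index
            if new_buckets[i] is None:
                new_buckets[i] = [word]
            else:
                new_buckets[i].append(word)
    return new_buckets
```

Note that every element’s bucket must be recomputed, because the modulus has changed.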

Extension Task 3: Implement the dictionary data structure (hash table)

As our approach to sentiment analysis is very simple (words are either positive or negative), we didn’t need the full capabilities of a hash table to create our lexicon. But if the words in our lexicon were associated with a polarity value between -1 and +1, using hash sets would be infeasible, since we would need a hash set for words with a polarity of -1, another for -0.9, another for -0.85… all the way to +1.

In this case we should actually use a hash table (dictionary) to implement our lexicon. Since hash tables store a key and an associated value, we can use it to store a word (key) along with its sentiment polarity (value). We can then look up the sentiment polarity for all words very quickly from a single data structure.

In Computer Science, this idea of storing a key associated with a value is called a dictionary data structure, and it is useful for a wide variety of applications. One important area of application is indexing, which is used to quickly retrieve information for a query – for example, web pages for a search, or the usage of certain words or phrases over time in the Google Ngram Viewer.

Hints:

  • A hash table is very similar to a hash set; the only difference is that we now need to store both a key and its value. As a starting point, make a copy of your hash set implementation and try to tweak it into a hash table.
  • Keep the concept of a key separate from the value; we only hash the key, but we need to store both the key and the value together in a bucket.
  • An example of how you could store the key and the value: `dictionary hint` .
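A minimal Python sketch of the key/value version, storing `[key, value]` pairs in each bucket as the hints suggest (the polarity values below are made up for illustration):

```python
def table_insert(buckets, scheme, key, value):
    """Hash only the key, but store the (key, value) pair in the bucket."""
    i = scheme(key) % len(buckets)
    if buckets[i] is None:
        buckets[i] = [[key, value]]
        return
    for entry in buckets[i]:
        if entry[0] == key:      # key already present: update its value
            entry[1] = value
            return
    buckets[i].append([key, value])

def table_lookup(buckets, scheme, key):
    """Return the value stored for key, or None if absent."""
    i = scheme(key) % len(buckets)
    for k, v in (buckets[i] or []):
        if k == key:
            return v
    return None

lexicon = [None] * 7
table_insert(lexicon, len, "happy", 0.8)    # toy scheme: word length
table_insert(lexicon, len, "damage", -0.6)
table_lookup(lexicon, len, "happy")         # 0.8
```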

References:

20 March, 2016