Compsci 101, Fall 2012, Lab 10: Word Clouds

This is a word cloud of the tags used on the blog Universities and the Web.

During today's lab, we'll be creating word clouds. A word cloud (or tag cloud) represents the most popular words in a body of work and shows which of them are more popular than others by drawing each word at a size proportional to its importance (e.g., its number of occurrences). Perhaps the most famous word cloud is the one that collects all important Presidential Speeches and puts them in a scrollable interface so that you can compare important issues from the most recent presidents all the way back to the first one. This tool has been referenced in a number of scholarly papers because of the ease with which it supports comparisons. You can also create your own word clouds online at TagCrowd, the site we used on the first day of class to get to know the class.

To modify the textual representation of a word we'll be creating our output in HTML. The actual code to generate HTML is accessible via the module HTMLWriter. The HTML is written to a separate file which you can view in a browser (or in Eclipse!).

To write formatted HTML you could call the following three functions (note that this is already done for you in the code):

Use the hand-in sheet and browse the code

Writing Code in Lab

During today's lab, you will fill in the functions in the WordCloud module. You will not need to worry about the support module HTMLWriter; the code that uses it is already written for you.

  1. Complete the loop in the function count_words. The purpose of the loop is to count the frequency of each word in the file passed to the function (similar to this week's classwork). Outside the loop, the code opens the file, gets all the words in the file, and returns a list of (word, count) pairs for the words in the file. You will need to use a dictionary for this task, so think about how to update your counts in two cases: the word is already in the dictionary, or it is being counted for the first time.
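The two dictionary cases described above can be sketched as follows. This is only an illustration, not the lab's starter code: the helper name `count_occurrences` and its single `words` parameter are assumptions, and the real count_words also opens and reads the file.

```python
def count_occurrences(words):
    """Return a list of (word, count) pairs for a list of words.

    Hypothetical helper for illustration; not part of the WordCloud module.
    """
    counts = {}
    for word in words:
        if word in counts:
            # Case 1: the word is already in the dictionary, so add one.
            counts[word] += 1
        else:
            # Case 2: the word is being counted for the first time.
            counts[word] = 1
    return list(counts.items())


pairs = count_occurrences("the cat and the hat".split())
print(sorted(pairs))
```

The pairs are sorted before printing only to make the output easy to read; the counting itself does not depend on any ordering.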

  2. Complete the function sanitize_word. In dealing with arbitrary data, there is often a need to sanitize it, i.e., clean up trivial inconsistencies so that it can be used more effectively. Currently, many words that are similar are being counted as different words even though they differ only in trivial features: capitalization and punctuation. Discuss other ideas for cleaning up words and try to implement at least one of them. One idea is based on the number of characters in a word: for example, any word with fewer than four characters is too common.

    Add a call to this function in the function count_words and see how much better the results are.
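One possible way to remove differences in capitalization and punctuation is sketched below. The helper name `sanitize` is an assumption made for this example, and stripping punctuation only from the ends of a word is one design choice among several.

```python
import string


def sanitize(word):
    """Lowercase a word and strip punctuation from both ends.

    Hypothetical helper for illustration. Punctuation inside the word,
    such as the apostrophe in "don't", is left alone.
    """
    return word.lower().strip(string.punctuation)


print(sanitize("Hello,"))
print(sanitize('"World!"'))
print(sanitize("don't"))
```

A word made up entirely of punctuation becomes the empty string, so the caller would probably want to skip empty results rather than count them.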

  3. Complete the function top_words. A word cloud typically contains only the most frequently occurring words, not the entire collection. The problem is that the list of words cannot easily be sorted by count, because the count is the second value in each tuple. Sorting them properly requires a three-step process:
    1. transpose the (word, count) pairs into (count, word) pairs since sorting occurs primarily on the first value in the tuple
    2. sort the resulting list and save only the numToKeep top pairs
    3. transpose the smaller list to get back (word, count) pairs for the remaining functions to work with (because the final output will be ordered alphabetically)
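The three steps above can be sketched as follows. The helper name `keep_top` and the parameter `num_to_keep` (standing in for numToKeep) are assumptions made for this example; the starter code's actual signature may differ.

```python
def keep_top(pairs, num_to_keep):
    """Return the num_to_keep most frequent (word, count) pairs.

    Hypothetical helper for illustration; not the lab's top_words.
    """
    # Step 1: transpose (word, count) into (count, word) so that
    # sorting compares the counts first.
    by_count = [(count, word) for (word, count) in pairs]
    # Step 2: sort from largest to smallest count and keep the top pairs.
    by_count.sort(reverse=True)
    top = by_count[:num_to_keep]
    # Step 3: transpose back into (word, count) pairs.
    return [(word, count) for (count, word) in top]


pairs = [("cat", 3), ("the", 10), ("hat", 1), ("and", 7)]
print(keep_top(pairs, 2))
```

In this sketch, words with equal counts are ordered by the word itself, because tuples compare their second values when the first values tie.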

  4. Complete the function word_is_taggable. Looking at the resulting top words, most of them are trivial words (such as "the" and "it") that are not content relevant. Get rid of those unimportant words in the word cloud by comparing them to a premade list of common words (stored in the file common.txt). Discuss other ideas for determining that a word is common or generally not taggable and try to implement at least one of them.

    Add a call to this function in the function count_words and see how much better the results are.
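A minimal sketch of the common-word check follows. It assumes that common.txt holds one word per line, which the handout does not state, and the helper names `load_common_words` and `is_taggable` are hypothetical. The example passes in a list of lines instead of opening the file so that it runs on its own.

```python
def load_common_words(lines):
    """Build a set of common words from lines of text.

    Assumes one word per line; blank lines are ignored.
    """
    return set(line.strip().lower() for line in lines if line.strip())


def is_taggable(word, common_words):
    """Return True if the word is non-empty and not a common word."""
    return len(word) > 0 and word not in common_words


# With the real file, the lines could come from open("common.txt").
common = load_common_words(["the\n", "it\n", "and\n", "\n"])
print(is_taggable("the", common))
print(is_taggable("cloud", common))
```

A set is used here instead of a list because membership tests on a set stay fast even when the list of common words is long.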

  5. Complete the function size_words. If you use the number of occurrences as the size of each word written in HTML, many of the words are too big. Rather than using the number of occurrences directly as the font size, convert the number of occurrences to a font size in the range 10pt to 48pt.

    To do this you will need to bucket or group the occurrences (if words occur a "similar" number of times, make them the same font size in the word cloud). Calculate the size of the buckets based on numDivisions and the count of the most frequently occurring word in the file. For example, suppose numDivisions is 6 and you calculate the bucket size to be 15: any word occurring up to 15 times will have a font size of 10. Words occurring 16-30 times have a font size of 17. Words occurring 31-45 times have a font size of 24, and so on. Effectively you will use the division operator to see how many times 15 goes into the number of occurrences and use that number to determine the font size. Be sure that the most frequently occurring word does not get a font size greater than 48, even if it occurs a million times.
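The bucketing can be sketched as follows, under several assumptions that are not in the handout: the helper name `font_size`, its parameters, and the way the step between font sizes is computed are all hypothetical. The step of 7 is chosen so that the sketch reproduces the sizes 10, 17, 24 from the example, and the bucket index uses `count - 1` so that a count of exactly 15 stays in the first bucket.

```python
MIN_SIZE = 10  # smallest font size, in points
MAX_SIZE = 48  # largest font size allowed, in points


def font_size(count, max_count, num_divisions):
    """Map a word's count to a font size between MIN_SIZE and MAX_SIZE.

    Hypothetical helper for illustration; not the lab's size_words.
    """
    # Width of each bucket, e.g. 90 // 6 == 15.
    bucket_size = max(1, max_count // num_divisions)
    # Points added per bucket, e.g. (48 - 10) // (6 - 1) == 7.
    step = (MAX_SIZE - MIN_SIZE) // max(1, num_divisions - 1)
    # Bucket 0 holds counts 1..bucket_size, bucket 1 the next range, etc.
    bucket = (count - 1) // bucket_size
    # Clamp the bucket so the largest counts never exceed the last bucket.
    bucket = max(0, min(bucket, num_divisions - 1))
    return min(MIN_SIZE + bucket * step, MAX_SIZE)


for count in [1, 15, 16, 30, 31, 90]:
    print(count, font_size(count, 90, 6))
```

With a maximum count of 90 and 6 divisions, the sizes produced are 10, 17, 24, 31, 38, and 45, so even the most frequent word stays under the 48pt limit.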