Compsci 6/101, Spring 2011, Lab 9: Word Clouds

Use the Hand-In sheet and browse the code


(Image: a word cloud of the tags used on the blog Universities and the Web.)
During today's lab, we'll be creating word clouds. A word (or tag) cloud represents the most popular words from a body of text, showing each word at a size proportional to how important it is (e.g., its number of occurrences), so that the most frequent words stand out. To control how each word is displayed, we'll create our output in HTML. You won't write the HTML yourself; the code that generates it is accessible via function calls in the module HTMLWriter.py. The HTML is written to a separate file which you can view in a browser (or in Eclipse!).

To start out, snarf the code for today's lab. You can also browse the code here. It includes the HTMLWriter module, which contains functions that write HTML for you. Each of these functions takes an open file as a parameter, and it is to this file that the HTML is written. Some functions take additional parameters, as described below.

To write formatted HTML you do three things in the code you write: call start to write the beginning of the HTML document, call write_sized_word once for each word (passing the size at which the word should appear), and call finish to write the end of the document. Each call takes the open output file as a parameter.
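
For example, a minimal use of the module might look like the sketch below. The parameter order shown for write_sized_word (the open file, the word, then the size) is an assumption based on the description above, so check the snarfed HTMLWriter.py for the exact signatures.

    import HTMLWriter

    # A minimal sketch of the three-step pattern, assuming the signatures
    # start(file), write_sized_word(file, word, size), and finish(file);
    # check HTMLWriter.py for the exact parameter lists.
    out = open("demo.html", "w")
    HTMLWriter.start(out)                           # 1. begin the HTML document
    HTMLWriter.write_sized_word(out, "hello", 36)   # 2. write each word at some size
    HTMLWriter.write_sized_word(out, "world", 6)
    HTMLWriter.finish(out)                          # 3. end the HTML document
    out.close()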

Writing Code in Lab

During today's lab, you will fill in the functions in the WordCloud module, calling the HTMLWriter module helper functions.

  1. Write get_popular_words. It opens the given filename, counts the frequency of each word in the file (similar to something you saw in the FUN in lecture this week), and returns a list of (count, word) pairs for the words that occur most often (i.e., the list returned contains only the number of pairs specified by the function's parameter -- these are the most frequently occurring words). When counting words, be sure to convert them to lower case so that capitalization isn't taken into account when differentiating words. You'll need to use a dictionary for this task. The expression below builds a list of every word in the file and can be a starting point, though you can also loop over file.read().split() explicitly rather than using a list comprehension. A sketch of one possible implementation appears after this list.

        [w.lower() for w in file.read().split()]

    You'll want to print the returned list first and answer the questions in the handin document related to the values printed.

  2. Write the function make_word_cloud, which generates a word cloud. It takes an input file (from which words are read), an output file (to which the word cloud is written), and a number denoting how many words (the most frequent ones) should be written. Your code will write the most popular words to the file, in random order, with sizes proportional to how often each word occurs. Remember to call the helper functions in the HTMLWriter.py module where they are useful: call start, finish, and write_sized_word, passing a size that is larger for more frequently occurring words. You can use the number of occurrences as the size, though that will likely be too large (see the next step, and don't forget to complete the handin pages). A sketch of make_word_cloud appears after this list.

  3. If you use the number of occurrences as the size of each word written in HTML, many of the words are too big. Rather than using the number of occurrences directly, you'll modify the code you wrote in the previous step to convert the number of occurrences to a font size in the range 6pt to 36pt. You'll use only even font sizes, so you'll map occurrences to a font size that is one of 6, 8, 10, 12, ..., 32, 34, 36.

    To do this you'll bucket or group the occurrences (if words occur a "similar" number of times, i.e., the counts are "close", they get the same font size in the word cloud). Use a bucket size of 15, so any word occurring up to 14 times has a font size of 6, words occurring 15-29 times have a font size of 8, words occurring 30-44 times have a font size of 10, and so on. Effectively you'll use integer division to see how many times 15 goes into the number of occurrences and use that quotient to determine the font size. For example, if the number of occurrences is 46, you can calculate 46 // 15 = 3 and use the 3 to index into [6, 8, 10, 12, 14, ..., 36] to get a font size of 12. Be sure that the most frequently occurring word doesn't get a font size greater than 36, even if it occurs a million times.

    To make this change, write and call the function clean_up_size, which converts a number of occurrences to a font size as described above.

    Call this function to get the font size you pass to write_sized_word. A sketch of clean_up_size appears after this list.

  4. Finally, let's try getting rid of the unimportant words in the word cloud (such as "the" and "it"). Add a new function is_common(word) that returns True if word is a common word (a word that could occur very frequently but carries little content). You should come up with a few ideas and implement at least one based on the number of characters in a word --- for example, any word with fewer than four characters is too common. Come up with some other ideas as well and document them on the handin pages.

    Use is_common(word) in get_popular_words to throw out any common words. Remember to make sure you still write the correct number of words to the file! A sketch of is_common appears below.
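
Sketch for step 1: one way get_popular_words could be written. This is a sketch only; the parameter names (a filename and the number of pairs to return) are assumptions based on the description above, so match them to the function header in the snarfed WordCloud module.

    def get_popular_words(filename, number):
        """
        Return a list of (count, word) pairs for the 'number' most
        frequently occurring words in the named file.
        """
        file = open(filename)
        words = [w.lower() for w in file.read().split()]
        file.close()
        counts = {}                        # maps each word to its number of occurrences
        for w in words:
            if w in counts:
                counts[w] += 1
            else:
                counts[w] = 1
        pairs = [(count, word) for word, count in counts.items()]
        pairs.sort(reverse=True)           # largest counts first
        return pairs[:number]              # keep only the top 'number' pairs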
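
Sketch for step 2: a possible make_word_cloud. random.shuffle provides the random order, and the HTMLWriter parameter order is again an assumption. As written it uses the raw occurrence count as the size, which step 3 replaces with a call to clean_up_size.

    import random
    import HTMLWriter

    def make_word_cloud(input_name, output_name, number):
        """
        Write an HTML word cloud of the 'number' most frequent words
        in the file input_name to the file output_name.
        """
        pairs = get_popular_words(input_name, number)
        random.shuffle(pairs)                    # words appear in random order
        out = open(output_name, "w")
        HTMLWriter.start(out)
        for count, word in pairs:
            # step 3: pass clean_up_size(count) here instead of count
            HTMLWriter.write_sized_word(out, word, count)
        HTMLWriter.finish(out)
        out.close()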
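
Sketch for step 3: clean_up_size maps an occurrence count to one of the even font sizes 6 through 36 using buckets of size 15, capping very frequent words at 36.

    def clean_up_size(occurrences):
        """
        Convert a number of occurrences to an even font size between
        6 and 36, using buckets of size 15.
        """
        sizes = [6, 8, 10, 12, 14, 16, 18, 20, 22, 24, 26, 28, 30, 32, 34, 36]
        bucket = occurrences // 15         # e.g., 46 // 15 == 3, so sizes[3] == 12
        if bucket >= len(sizes):           # cap extremely frequent words at 36
            bucket = len(sizes) - 1
        return sizes[bucket]

With this in place, the loop in make_word_cloud calls write_sized_word(out, word, clean_up_size(count)) rather than passing count directly.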
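
Sketch for step 4: an is_common based only on word length, the example idea from the step above. The threshold of four characters is just that example; implement and document your own ideas on the handin pages.

    def is_common(word):
        """
        Return True if word is too common to belong in the cloud.
        This version uses only word length; another idea is to check the
        word against a list of "stop words" such as "the" and "it".
        """
        return len(word) < 4

Inside get_popular_words, wrap the counting code in "if not is_common(w):" so that common words are never counted; that way the function still returns the requested number of pairs after the common words are thrown out.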