Introduction to Computer Science
CompSci 101 : Spring 2014

Word Clouds

During lab, we will be creating word clouds. A word cloud (or tag cloud) is a way of representing the most popular words from a body of work. It shows which of these popular words are more popular than others by drawing each word at a size proportional to its importance (e.g., its number of occurrences). Perhaps the most famous word cloud is the one that collects all important Presidential Speeches and puts them in a scrollable interface so that you can compare important issues from today's president all the way back to the first one. This tool has been referenced in a number of scholarly papers because of the ease with which it supports comparisons. You can also create your own word clouds online at TagCrowd; this is the site we used on the first day of class to get to know the class.

To change how a word is displayed, we will output the text as HTML. The actual code to generate HTML is accessible via the module HTMLWriter. The HTML is written to a separate file, which you can view in a browser (or in Eclipse!).
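As a rough illustration of what sized text can look like in HTML, the sketch below wraps a word in a span whose font size is set in points. The helper name html_for_word is hypothetical and is not part of HTMLWriter; the module's real functions may produce different markup.

```python
def html_for_word(word, size):
    """Return an HTML snippet that shows word at the given font size (in points).

    Hypothetical helper for illustration only; it is not part of HTMLWriter.
    """
    # The style attribute tells the browser how large to draw the text.
    return '<span style="font-size: %dpt">%s</span>' % (size, word)


print(html_for_word("cloud", 24))
```

Opening a file that contains several such snippets in a browser shows each word at its own size.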

To write formatted HTML you could call the following three functions:

Writing Code in Lab

During today's lab, you will fill in the functions in the WordCloud module; you will not need to modify the support module HTMLWriter. Start by snarfing the lab code from within Eclipse (alternatively, you can browse code here).

  1. Complete the loop in the function printWords so that it formats each word using HTML. Currently, the function simply writes each word to the file, which, in HTML, looks like one long word with no formatting. You do not need to understand how HTML files are formatted to complete this step, only how to call the function formatWords in the module HTMLWriter for each (word, size) pair in the list.

    Now you should be able to view the HTML output file to see the formatted, sized words.

  2. Complete the function isTaggable. If you look at the resulting top words, most of them are trivial words (such as "the" and "it") that are not relevant to the content. Remove those unimportant words from the word cloud by comparing each word to a premade list of common words (stored in the file common.txt). Discuss other ideas for deciding that a word is common or generally not taggable, and try to implement at least one of them.

    Now you should be able to view the HTML output file to see a more interesting set of words.

  3. Complete the function sizeWords. If you use the number of occurrences as the size of each word written in HTML, many of the words are too big. Rather than using the number of occurrences directly as the font size, convert the number of occurrences to a font size in the range 10pt to 48pt.

    To do this you will need to bucket, or group, the occurrences: if words occur a "similar" number of times, give them the same font size in the word cloud. Calculate the size of the buckets based on numDivisions and the number of occurrences of the most frequently occurring word in the file. For example, suppose numDivisions is 6 and you calculate the bucket size to be 15: any word occurring up to 15 times will have a font size of 10. Words occurring 16-30 times have a font size of 17. Words occurring 31-45 times have a font size of 24, and so on. Effectively, you will use the division operator to see how many times 15 goes into the number of occurrences and use that number to determine the font size. Be sure that the most frequently occurring word does not get a font size greater than 48, even if it occurs a million times.
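The bucketing in step 3 can be sketched as follows. This is one possible reading, not the lab's solution: the function name font_size, its parameters, the rounded-up bucket size, and the step of 7 points per bucket (chosen to reproduce the 10, 17, 24 sequence in the example) are all assumptions.

```python
def font_size(count, max_count, num_divisions=6, min_size=10, max_size=48):
    """Map an occurrence count to a font size in points using equal-width buckets.

    Illustrative sketch only; the lab's sizeWords may be organized differently.
    """
    # Bucket width, rounded up so that max_count lands in the last bucket.
    bucket_size = max(1, -(-max_count // num_divisions))
    # Subtract 1 so counts 1..bucket_size share bucket 0 (15 stays with 1-15).
    bucket = (max(count, 1) - 1) // bucket_size
    # Never go past the last bucket, however large the count is.
    bucket = min(bucket, num_divisions - 1)
    # Points added per bucket (assumption: spread the range over the gaps).
    step = (max_size - min_size) // max(1, num_divisions - 1)
    return min(max_size, min_size + bucket * step)


for count in [1, 15, 16, 30, 31, 45, 90, 1000000]:
    print("%d -> %d" % (count, font_size(count, max_count=90)))
```

With max_count=90 and six divisions the bucket size works out to 15, so counts of 15, 16, and 31 map to sizes 10, 17, and 24, and even a count of a million stays at 45, below the 48pt limit.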
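For step 2, a minimal sketch of a taggability check is shown below. The signature is_taggable(word, common_words, min_length) is hypothetical, as is the assumption that common.txt holds one word per line; the length and letters-only rules are just two examples of the "other ideas" the step asks you to discuss.

```python
def load_common_words(lines):
    """Build a set of lowercase common words from an iterable of lines."""
    return set(line.strip().lower() for line in lines if line.strip())


def is_taggable(word, common_words, min_length=3):
    """Return True if word seems worth showing in the cloud.

    Hypothetical signature; the lab's isTaggable may take different arguments.
    """
    word = word.lower()
    # Very short words ("a", "of") rarely carry content.
    if len(word) < min_length:
        return False
    # Skip tokens that contain digits or punctuation.
    if not word.isalpha():
        return False
    # Finally, reject anything on the common-word list.
    return word not in common_words


common = load_common_words(["the\n", "it\n", "and\n"])
words = ["The", "cloud", "it", "2014", "speech"]
print([w for w in words if is_taggable(w, common)])
```

To use the real list, the same helper could be given an open file, for example load_common_words(open("common.txt")), assuming the file has one word per line. Storing the words in a set keeps each lookup fast even for a long list.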