During today's lab, we'll be creating word clouds.
A word cloud (or tag cloud) shows the most popular words in a body of
work and indicates which of them are more popular than others by
drawing each word at a size proportional to its importance
(e.g., its number of occurrences).
To control how each word is displayed, we'll create our output in HTML.
You won't write the HTML yourself, however: the code that generates it
is available through function calls in the module HTMLWriter.py.
The HTML is written to a separate file, which you can view in a browser
(or in Eclipse!).
To start out, snarf the code for today's lab. You can
also browse the code here.
It contains the HTMLWriter module, which provides functions that
write HTML for you.
Each function takes an open file as a parameter and writes its HTML to
that file. Some functions take additional parameters, as described below.
To write formatted HTML, you do three things in the code you write:
- Call start before generating any HTML. It writes the opening HTML
that begins a web page. Pass an open file to start.
- Call write_sized_word, which writes a word/string in
a specified size to the file that is its first parameter (see the code).
You will need to call this function once per word you want to write.
- Call finish to finish writing the formatted HTML.
This function takes the same open file as a parameter.
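The three calls above can be sketched as follows. This is only an illustration: the start, write_sized_word, and finish functions defined here are hypothetical stand-ins so that the example runs on its own, and the (file, word, size) parameter order is an assumption. Check HTMLWriter.py for the real signatures and the HTML it actually writes.

```python
import io

# Hypothetical stand-ins for the HTMLWriter functions. The real module's
# HTML output and exact parameter order may differ; check HTMLWriter.py.
def start(out):
    out.write("<html><body>\n")

def write_sized_word(out, word, size):
    out.write('<span style="font-size: %dpt">%s</span>\n' % (size, word))

def finish(out):
    out.write("</body></html>\n")

def write_words(out, sized_words):
    """Write each (word, size) pair between one start and one finish call."""
    start(out)
    for word, size in sized_words:
        write_sized_word(out, word, size)
    finish(out)

# An in-memory file keeps the sketch self-contained; in the lab you would
# pass a file opened for writing instead.
buffer = io.StringIO()
write_words(buffer, [("whale", 36), ("sea", 12)])
html = buffer.getvalue()
print(html)
```

The point to notice is the shape of the calls: start once, write_sized_word once per word, finish once, all with the same open file.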
- Write get_popular_words. It opens the given filename, counts the
frequency of each word in that file
(similar to something you saw in the FUN
in lecture this week), and returns a list of (count, word) pairs for the words
that occur most often. The list returned contains only the
number of pairs
specified by the function's parameter; these are the most frequently
occurring words.
When counting words, be sure to convert them to lower case
so that capitalization isn't taken into account when
differentiating words. You'll need to use a dictionary for
this task. The code below builds a list of every word in
the file and can be a starting point, though you can also
loop over file.read().split() explicitly rather than
in a list comprehension:
[w.lower() for w in file.read().split()]
You'll want to first print the list returned and answer the questions
in the handin document related to the values printed.
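One possible shape for the counting step is sketched below. It is not the required solution: the parameter names filename and number are assumptions, ties between equally frequent words are broken arbitrarily (here by reverse alphabetical order), and punctuation attached to words is left in place.

```python
import os
import tempfile

def get_popular_words(filename, number):
    """Return up to `number` (count, word) pairs, most frequent first."""
    counts = {}
    with open(filename) as source:
        for word in source.read().split():
            word = word.lower()                     # ignore capitalization
            counts[word] = counts.get(word, 0) + 1  # dictionary of counts
    pairs = [(count, word) for word, count in counts.items()]
    pairs.sort(reverse=True)                        # largest counts first
    return pairs[:number]

# Small demonstration on a temporary file.
with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False) as tmp:
    tmp.write("The cat and the dog and THE bird")
    path = tmp.name
popular = get_popular_words(path, 2)
os.remove(path)
print(popular)
```

Putting the count first in each pair means that sorting the list of pairs orders them by frequency without any extra work.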
- Write the function make_word_cloud, which generates a word
cloud. It should take an input file (from which words are read), an output
file (to which the word cloud will be written), and a number denoting how
many of the most frequent words should be written.
Your code will write the most popular words to the file,
in random order, with sizes proportional to how often each word
occurs. Remember to call the helper functions in the
HTMLWriter.py module when they would be
useful. This means calling start, finish, and write_sized_word,
with a size that's larger for more
frequently occurring words. You can use the number of
occurrences as the size, though that will likely be too large
(see the next question). Don't forget to complete the handin pages.
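The "random order" part of this step could be handled as in the sketch below, which shuffles a copy of the (count, word) pairs before writing them. The write_sized_word defined here is a hypothetical stand-in that writes plain text so the example runs by itself, and write_in_random_order is an illustrative helper name, not part of the lab code.

```python
import io
import random

def write_sized_word(out, word, size):
    # Hypothetical stand-in; the real function in HTMLWriter.py writes HTML.
    out.write("%s:%d\n" % (word, size))

def write_in_random_order(out, pairs):
    """Write (count, word) pairs in random order, using the count as size."""
    shuffled = list(pairs)    # copy, so the caller's list keeps its order
    random.shuffle(shuffled)  # shuffles the copy in place
    for count, word in shuffled:
        write_sized_word(out, word, count)

pairs = [(12, "whale"), (9, "ship"), (5, "captain")]
buffer = io.StringIO()
write_in_random_order(buffer, pairs)
lines = buffer.getvalue().splitlines()
print(lines)
```

In a full make_word_cloud you would call start before the loop and finish after it, both with the same output file.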
- If you use the number of occurrences as the size
of each word written in HTML, many of the words are too big. Rather than
using the number of occurrences, you'll modify the code you wrote
in the previous step to convert the number of occurrences to a font size
in the range 6pt to 36pt. You'll use only even font sizes, so
you'll map occurrences to a font size that's one of
6, 8, 10, 12, ..., 32, 34, 36.
To do this you'll bucket, or group, the occurrences:
if words occur a "similar" number of times (the numbers are "close"),
give them the same
font size in the word cloud. Use a bucket size of 15, so any
word occurring up to 14 times will have a font size of
6. Words occurring 15-29 times have a font size of 8, words
occurring 30-44 times have a font size of 10, and so
on. Effectively you'll use integer division to see how
many times 15 goes into the number of occurrences and use that
number to determine the font size. For example,
if the number of occurrences is 46, you can calculate 46 // 15 = 3 and use
the 3 to index [6,8,10,12,14,...,36] to get a font size of
12. Be sure that the most frequently occurring word doesn't get a
font size greater than 36, even if it occurs a million times.
To make this change, write the function clean_up_size,
which converts a number of occurrences to a
font size as described above.
Call this function to get the font size you pass to write_sized_word.
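The bucketing arithmetic can be sketched like this. It is one way to do it, not the only one; the bucket_size parameter with a default of 15 is an assumption added for illustration.

```python
def clean_up_size(occurrences, bucket_size=15):
    """Map a number of occurrences to an even font size from 6 to 36."""
    font_sizes = list(range(6, 37, 2))           # [6, 8, 10, ..., 34, 36]
    index = occurrences // bucket_size           # how many times 15 goes in
    index = min(index, len(font_sizes) - 1)      # never go past 36
    return font_sizes[index]

for n in (14, 15, 44, 46, 1000000):
    print(n, clean_up_size(n))
```

The min call is what keeps a word that occurs a million times at 36 instead of indexing past the end of the list.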
- Finally, let's try getting rid of the unimportant words in the word cloud
(such as "the" and "it"). Add a new function is_common(word)
that returns True if word is a common word (a word that could
have a high occurrence frequency but is not relevant to the content).
You should come up with a few ideas and implement at least one based
on the number of characters in a word; for example, any word with
fewer than four characters is too common. Come up with some
other ideas as well and document them on the
handin pages.
Use is_common(word) in get_popular_words
to throw out any common words. Remember to make sure you still write the correct number of words to the file!
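Here is a minimal sketch of the length-based rule, together with one way to keep the number of words correct: filter out common words before taking the top entries, not after. The helper top_uncommon and the sample counts are made up for illustration; in the lab the filtering would live inside get_popular_words.

```python
def is_common(word):
    """Length-based rule from the text: fewer than four characters."""
    return len(word) < 4

def top_uncommon(counts, number):
    """Hypothetical helper: filter first, then take the top `number` pairs."""
    pairs = [(count, word) for word, count in counts.items()
             if not is_common(word)]
    pairs.sort(reverse=True)
    return pairs[:number]

# Made-up counts, only to show the effect of filtering.
counts = {"the": 40, "it": 25, "whale": 12, "ship": 9, "sea": 7, "captain": 5}
result = top_uncommon(counts, 2)
print(result)
```

If you sliced the list first and filtered afterwards, "the" and "it" would use up both slots and nothing would be written. Note also that the length rule throws out "sea", which is arguably relevant to the content, so this is a good place to think about your other ideas.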