Due: Tuesday 1/21, 11:59pm
All files should be submitted through WebSubmit. Only one of your team members needs to submit on behalf of the team. On the WebSubmit interface, make sure you select compsci290
and the appropriate lab number. You can submit multiple times, but please have the same team member resubmit all required files each time. To earn class participation credit, submit a text file team.txt
listing members of your team who are present at the lab. To earn extra credit for the lab challenge, submit the required files for challenge problems (see WHAT TO SUBMIT below).
In Homework 2, you performed tokenization, counted words, and possibly calculated tf-idf scores. In Python, two libraries greatly simplify this process: NLTK (the Natural Language Toolkit) and scikit-learn. NLTK provides support for a wide variety of text processing tasks: tokenization, stemming, proper name identification, part-of-speech tagging, and so on. Scikit-learn, generally speaking, provides the more advanced analytic tasks: tf-idf, clustering, classification, etc.
In class, we did a basic word count of Shakespeare using the command line. Let's use NLTK for the same task.
import nltk
import string
from collections import Counter

def get_tokens():
    with open('/opt/datacourse/data/parts/shakes-1.txt', 'r') as shakes:
        text = shakes.read()
    lowers = text.lower()
    # remove the punctuation using the character deletion step of translate
    no_punctuation = lowers.translate(None, string.punctuation)
    tokens = nltk.word_tokenize(no_punctuation)
    return tokens

tokens = get_tokens()
count = Counter(tokens)
print count.most_common(10)
The most common tokens are uninformative (they are mostly stop words), so let's remove the stop words.
from nltk.corpus import stopwords

tokens = get_tokens()
filtered = [w for w in tokens if w not in stopwords.words('english')]
count = Counter(filtered)
print count.most_common(100)
We can also do stemming with NLTK, using a Porter stemmer. But will this work well on Shakespeare's writing?
from nltk.stem.porter import PorterStemmer

def stem_tokens(tokens, stemmer):
    stemmed = []
    for item in tokens:
        stemmed.append(stemmer.stem(item))
    return stemmed

stemmer = PorterStemmer()
stemmed = stem_tokens(filtered, stemmer)
count = Counter(stemmed)
print count.most_common(100)
The Porter stemmer ends up stemming a few words here (parolles, tis, nature, marry). What is more interesting is that the counts are different, so much so that the ordering of the most common words has been affected. Compare the two lists, especially the bottom of them, and you'll notice substantial differences.
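To see the differences concretely, here is a small comparison sketch. It assumes the filtered and stemmed lists (and the stemmer) from the snippets above are still defined; it prints each top word whose count changes once its stem is merged with other words' stems.
from collections import Counter

filtered_count = Counter(filtered)
stemmed_count = Counter(stemmed)

# words among the top 100 whose counts change after stemming
for word, n in filtered_count.most_common(100):
    stem = stemmer.stem(word)
    if stemmed_count[stem] != n:
        print word, n, '->', stem, stemmed_count[stem]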
With our cleaned-up text, we can now use it for searching, document similarity, or other tasks (clustering, classification) that we'll learn about later on. Unfortunately, NLTK does not provide a tf-idf implementation, so we'll use another data analysis library, scikit-learn. Scikit-learn has a built-in tf-idf implementation, but we can still use NLTK's tokenizer and stemmer to preprocess the text.
import nltk
import string
import os

from sklearn.feature_extraction.text import TfidfVectorizer
from nltk.stem.porter import PorterStemmer

path = '/opt/datacourse/data/parts'
token_dict = {}
stemmer = PorterStemmer()

def stem_tokens(tokens, stemmer):
    stemmed = []
    for item in tokens:
        stemmed.append(stemmer.stem(item))
    return stemmed

def tokenize(text):
    tokens = nltk.word_tokenize(text)
    stems = stem_tokens(tokens, stemmer)
    return stems

# read every file, lowercase the text, and strip punctuation
for subdir, dirs, files in os.walk(path):
    for file in files:
        file_path = os.path.join(subdir, file)
        with open(file_path, 'r') as shakes:
            text = shakes.read()
        lowers = text.lower()
        no_punctuation = lowers.translate(None, string.punctuation)
        token_dict[file] = no_punctuation

# this can take some time
tfidf = TfidfVectorizer(tokenizer=tokenize, stop_words='english')
tfs = tfidf.fit_transform(token_dict.values())
First, we iterate through every file in the Shakespeare collection, converting the text to lowercase and removing punctuation. Next, we initialize a TfidfVectorizer. In particular, we pass the TfidfVectorizer our own function that performs custom tokenization and stemming, but we use scikit-learn's built-in stop word remover rather than NLTK's. Then we call fit_transform, which does a few things: first, it builds a vocabulary of 'known' terms from the input texts; then it calculates the tf-idf score for each term in each document.
This results in a matrix, where the rows are the individual Shakespeare files and the columns are the terms. Thus, every cell represents the tf-idf score of a term in a file.
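A quick way to sanity-check the result, assuming the tfidf and tfs variables from the snippet above:
print tfs.shape                  # (number of files, number of distinct terms)
print len(tfidf.vocabulary_)     # same as the number of columns
# vocabulary_ maps each term to its column index in the matrix
print sorted(tfidf.vocabulary_.items())[:5]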
A typical tf-idf entry from the Shakespeare files is a small number (or 0 if the term isn't present in that file).
So now that we have a collection of tf-idf scores, what can we do with them? We can, for example, answer search queries or measure how similar documents are to each other. To do any of these, we have to feed a new (or existing) document into the model and get its tf-idf vector back. We do this with the transform function, which first runs our NLTK preprocessor on the text.
query = 'this sentence has unseen text such as computer but also king lord juliet'
response = tfidf.transform([query])
print response
Terms the model hasn't seen before are simply ignored: transform only scores a document against the existing vocabulary, whereas fit_transform rebuilds the vocabulary from the documents you give it.
We can also get the specific terms and their tf-idf scores (though it's not completely straightforward):
feature_names = tfidf.get_feature_names()
for col in response.nonzero()[1]:
print feature_names[col], ' - ', response[0, col]
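If you want the highest-scoring terms first, you can sort the nonzero entries by score. This is a small sketch building on the loop above, assuming response and feature_names are still defined:
# pair each nonzero column with its score, then sort by score (descending)
scores = [(feature_names[col], response[0, col]) for col in response.nonzero()[1]]
for term, score in sorted(scores, key=lambda x: x[1], reverse=True):
    print term, ' - ', score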
In this part of the lab, we will continue our exploration of the Reuters data set, this time using the libraries introduced above together with cosine similarity. First, let's install NLTK and scikit-learn.
We'll install both NLTK and Scikit-learn on our VM using pip, which is already installed.
First: run the sync.sh script in your VM; this should install everything required.
If for some reason that didn't work, run the following two commands from a terminal in the VM:
pip install nltk
pip install scikit-learn
We'll also need to download some data packages from NLTK. Open up a Python shell (or Enthought Canopy), and type:
import nltk
nltk.download()
This should bring up a window showing the available packages to download. Under the 'Models' tab, select the 'punkt' package, and under the 'Corpora' tab, select the 'stopwords' package. You should then have everything you need for the exercises.
If that hung: You can download the data manually from https://s3.amazonaws.com/textblob/nltk_data.tar.gz. Extract this file to ~/nltk_data.
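If you'd rather skip the GUI, the same two packages can also be downloaded directly from a Python shell:
import nltk
nltk.download('punkt')      # tokenizer models used by nltk.word_tokenize
nltk.download('stopwords')  # the English stop word list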
Please answer the following questions.
At the end of the class, each group will be asked to give their top 10 sentences for a randomly chosen organization.
A file 'orgsSim.txt' that contains one line per organization. Output the organization string, followed by a list of the 5 most similar document ids (use the NEWID attribute of the REUTERS xml element); a sketch of how cosine similarity can produce such a ranking appears after the example below. For instance,
opec:::(aa, bb, cc, dd, ee)
ecafe:::(aa, bb, cc, dd, ee)
fao:::(aa, bb, cc, dd, ee)
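Here is a minimal sketch of the idea, assuming you have fitted a TfidfVectorizer (called tfidf below, with tf-idf matrix tfs) on the Reuters documents; mapping row indices back to NEWID values, and looping over all organizations, is left to you.
from sklearn.metrics.pairwise import cosine_similarity

# treat the organization string as a tiny query document
org = 'opec'
org_vector = tfidf.transform([org])

# cosine similarity of the query against every document row in tfs
sims = cosine_similarity(org_vector, tfs)[0]

# indices of the 5 most similar documents, highest similarity first
top5 = sims.argsort()[::-1][:5]
print org, top5, sims[top5]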