Compsci 101, Fall 2017, Lab 7

The lab

In this lab you will:

use dictionaries to answer questions about data from CSV files

There is code to snarf for lab 7; that code is also here.

Getting Credit for Lab 7

To get credit for this lab, you will need to do the following by Sunday night.

Complete and submit this Google form to answer questions.

Submit any code you created to the lab07 folder using Eclipse/Ambient or Websubmit.

Part 1: Rock and Roll Songs

For this problem, you are given a data file of the top 1000 rock and roll songs, e.g., from this source among others: http://www.rocknrollamerica.net/Top1000.html. You can see the same data here as a Google spreadsheet.

The data file you're given is a file in CSV format (comma separated values). Each row of the CSV-table includes the rank, the song, and the artist. The first seven lines are reproduced below. As with many CSV files, there's a header as the first row of the file and also a header in the spreadsheet the file models.

Rank	Song	Artist
1	Stairway to Heaven	Led Zeppelin
2	Hey Jude	Beatles
3	All Along the Watchtower	Hendrix, Jimi
4	Satisfaction	Rolling Stones
5	Like A Rolling Stone	Dylan, Bob
6	Another Brick In The Wall	Pink Floyd
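As a rough illustration of reading data in this shape, the sketch below parses a few of the rows above with Python's csv module and skips the header row. The SAMPLE string, the read_rows helper, and the comma-delimited layout with a quoted artist name are assumptions made for the example, not the contents of the actual lab file. The sketch is written to run under Python 3.

```python
import csv
import io

# Assumed stand-in for the data file: a few of the rows shown above in CSV form.
# A field that itself contains a comma, such as "Hendrix, Jimi", is quoted.
SAMPLE = '''Rank,Song,Artist
1,Stairway to Heaven,Led Zeppelin
2,Hey Jude,Beatles
3,All Along the Watchtower,"Hendrix, Jimi"
'''


def read_rows(source):
    """Return the data rows of a CSV source as lists, skipping the header."""
    reader = csv.reader(source)
    next(reader)  # discard the header row: Rank, Song, Artist
    return [row for row in reader]


rows = read_rows(io.StringIO(SAMPLE))
for row in rows:
    # row[1] is the song title and row[2] is the artist
    print(row[1] + " by " + row[2])
```

Because csv.reader understands quoting, the artist in the third row comes back as the single string "Hendrix, Jimi" rather than being split at the comma.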

You'll answer a few questions about this data by modifying the Python code in module SongReader.py.

Which artist/group has the most songs in the top 1,000?
What are the top 5 artists in terms of number of songs in the top 1,000?
What word longer than four letters appears in more song titles than any other word?


If you've had experience with spreadsheets, it's possible you could answer some of these questions using functions in the spreadsheet, but for many problems using Python will let you solve problems more easily.

To answer these questions you'll modify the module SongReader.py and use dictionaries. Additional problems with a different data set follow the questions below. These questions are meant to guide you toward a solution to the three questions above.

Documentation for the Python csv library is here: https://docs.python.org/2/library/csv.html, though you likely will not need to read this to complete the lab.

In the program, each row read stores the artist/group in row[2] and the name of the song in row[1]. To find the top artists, use the dictionary in the variable datasg inside the loop. Use the artist as the key mapped to the associated value, which is a list of that artist's songs. Add each song read to the list of songs for the artist. Here's the code you'll need (if variables artist and song have been set appropriately). This is typical code that initializes the dictionary value associated with a key the first time the key is seen, and modifies the value every time after that.

 if artist not in datasg: 
     datasg[artist] = [song]
 else:
     datasg[artist].append(song)
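To see the pattern above in isolation, here is a small sketch that applies it to a few made-up rows. The rows, the placeholder names such as Artist X, and the helper group_songs are hypothetical; only the if/else logic comes from the lab. Each print call takes a single argument, so the sketch should run under either Python 2 or Python 3.

```python
# Hypothetical rows in the shape [rank, song, artist]; not real lab data.
rows = [
    ["1", "Song A", "Artist X"],
    ["2", "Song B", "Artist Y"],
    ["3", "Song C", "Artist X"],
]


def group_songs(rows):
    """Map each artist to the list of that artist's songs."""
    datasg = {}
    for row in rows:
        song = row[1]
        artist = row[2]
        if artist not in datasg:
            # first time this artist is seen: start a new list
            datasg[artist] = [song]
        else:
            # artist already present: extend the existing list
            datasg[artist].append(song)
    return datasg


datasg = group_songs(rows)
print(datasg["Artist X"])  # ['Song A', 'Song C']
print(datasg["Artist Y"])  # ['Song B']
```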

Once the loop has built the dictionary, you should be able to print the dictionary keys and values after the loop to eyeball the top artists. Try this code to check that yours works:

 for artist in datasg: 
     print artist, datasg[artist]

Replace the statements you just entered with the code below, which uses a list comprehension. This code sorts the data and prints a slice of the sorted data rather than all of it:

  info = datasg.items() 
  tosort = [(len(t[1]),t[0]) for t in info] 
  info = sorted(tosort) 
  print info[-30:]
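The sketch below runs the same sort-and-slice idea on a tiny hypothetical dictionary so the intermediate values are easy to inspect. The dictionary contents and the name ranked are made up for illustration, and the slice [-2:] stands in for [-30:] only because the example has just three artists.

```python
# Hypothetical artist-to-songs dictionary; not real lab data.
datasg = {
    "Artist X": ["Song A", "Song C", "Song D"],
    "Artist Y": ["Song B"],
    "Artist Z": ["Song E", "Song F"],
}

# Build (number of songs, artist) tuples so that sorting orders by count first.
tosort = [(len(songs), artist) for artist, songs in datasg.items()]

# sorted() compares tuples element by element, in ascending order.
ranked = sorted(tosort)

print(ranked)       # [(1, 'Artist Y'), (2, 'Artist Z'), (3, 'Artist X')]
print(ranked[-2:])  # [(2, 'Artist Z'), (3, 'Artist X')]
```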

Then explain the answers to the following questions in the online form for lab.

Describe the value and type of the variable info.

The variable tosort is a list of tuples. Describe the first element of each tuple: what is its type, and conceptually what does it represent?

Why is the slice [-30:] used in the print statement? Describe what it means in Python and why it's used here.

Problem 1: Who is the top artist?

Problem 2: Who are the top five artists?

Problem 3: What word longer than four letters appears in more titles than any other word?

For problem 3, how do you handle the word Sunday in the title Sunday Bloody Sunday so it is not counted more than once?

How did you solve problem 3? Explain in words and paste the code that shows how you answered the question.

Part 2 - Movies

You're also given a file 9600movies.csv that has information on 9,600 movies including director, year, title, genre, country, length, and whether the movie is black-and-white or color. You can also see the data here in this online spreadsheet.

You should create a module MovieReader.py and write code to answer the following questions. You can use the code from SongReader.py as a model. In some problems you may need to use a dictionary, but not in all problems. Answer the following questions on the online form.

You'll answer several questions and pose your own question. The answers and code you use to answer them should be added to the form you complete for lab.

How many movies were created in the United States (US)?

What are the five most popular lengths for a movie?

How many movies were created between 1980 and 1989, inclusive?

Either answer this question with code, or describe how you would: How many movies were created in each of the decades 40's, 50's, 60's, 70's, 80's, 90's, 00's, and 10's?

Describe how you'd write code to answer the question: Which director has films in years that are farther apart than those of any other director? For example, Woody Allen has movies in 1969 and 2007 in this data; that's 38 years apart.

What question do you think might be interesting to answer with this data?

After submitting the lab form, submit the code you wrote for both parts of the lab with Ambient/Websubmit. Use lab07 as the submission folder.