CompSci 307 : Spring 2019

Data: Baby Names

Data structures are abstracted information; algorithms are abstracted recipes regarding how to proceed and what to do, step-by-step, to achieve some very particular result or results. —David Carew

Data is everywhere, especially given the increasing commitment by governments and corporations to making their processes transparent. The flip side is that we are overwhelmed by the sheer amount of data available and often unable to turn it into useful information. However, many studies have shown that analyzing data can yield a deeper understanding of a domain or even drive transformative changes based on the patterns found. Thus the main challenge we face is how to programmatically turn that data into information we can use.

Choosing a name for a baby can be a daunting task, and many people are turning to data to help them, from trending names to a name's meaning to what their name might be today to studies trying to determine how much a name affects a person's opportunities. Of course, there are numerous websites devoted to helping you: NameTrends offers a wealth of visualizations; BabyNameWizard tries to make the choice more automatic; BabyNames is loaded with more traditional content; and the government even has an interactive tool. Working with baby names has been used as a data challenge, as a programming assignment in CompSci 101, and more generally as an assignment that is highly regarded by educators as "nifty".

This variety is all made possible because the Social Security Administration published data on boy and girl names for children born in the US for all years after 1879.

Specification

Write a Java program from scratch to read data about baby names from multiple files into a data structure of your choice that can be used to get answers to basic questions about baby names. Example questions will be given in expected deliverables for the project.
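As one possible starting point, the sketch below reads name data into a nested map keyed by year and answers a single basic question. It is only an illustration under stated assumptions: the class name NameData, its methods, and the line format "Name,Gender,Count" (one file per year) are all hypothetical, so check the actual data files before relying on any of it.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.HashMap;
import java.util.Map;
import java.util.TreeMap;

// Illustrative sketch only: the class, its methods, and the line format are assumptions.
class NameData {
    // year -> ("gender:name" -> count). TreeMap keeps years sorted and
    // makes no assumption about how many years of data exist.
    private final Map<Integer, Map<String, Integer>> countsByYear = new TreeMap<>();

    private static String key(String name, String gender) {
        return gender + ":" + name;
    }

    // Reads lines of the assumed form "Name,Gender,Count" and files them under the given year.
    void load(int year, Reader source) {
        Map<String, Integer> counts = countsByYear.computeIfAbsent(year, y -> new HashMap<>());
        try (BufferedReader in = new BufferedReader(source)) {
            String line;
            while ((line = in.readLine()) != null) {
                if (line.trim().isEmpty()) {
                    continue; // skip blank lines
                }
                String[] parts = line.trim().split(",");
                counts.put(key(parts[0], parts[1]), Integer.parseInt(parts[2]));
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Example question: how many babies of this gender received this name in this year?
    int countFor(int year, String name, String gender) {
        Map<String, Integer> counts = countsByYear.get(year);
        if (counts == null) {
            return 0; // no data loaded for that year
        }
        return counts.getOrDefault(key(name, gender), 0);
    }

    public static void main(String[] args) {
        NameData data = new NameData();
        // In-memory data stands in for a real file here.
        data.load(2000, new StringReader("Fred,M,1000\nMary,F,2500\n"));
        System.out.println(data.countFor(2000, "Fred", "M")); // prints 1000
    }
}
```

Because load accepts any Reader, the same method can be fed a file, a web stream, or a short in-memory string for testing. Other data structures may suit other questions better; this is just one reasonable choice.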

You do not have to worry about visualizing the data into charts, graphs, or maps; instead, think of your program as the "backend" or model that might support any of the example web sites given above. However, you should provide output that demonstrates that your program works as you intend (i.e., automated tests that report pass or fail). This means you do not need to create an interactive input system that lets users provide values that would need to be error checked, because your tests provide the inputs directly.

Your tests do not have to be comprehensive, but they do need to demonstrate that you have thought about important kinds of values to test and understand the expected values. Note that this likely means you will need to make your own smaller data files with known, simple values rather than always relying on the complete data set. For example, creating a subset of the data where the number of boys named "Fred" is set to an easy-to-calculate number of births like 1000 instead of its actual value lets you work out the expected answer by hand and check that it is correct, so you have more confidence when running your program on the complete data set.

Your tests can be home grown or use a tool like JUnit, so long as the results are clear (what is being tested and whether the expected results were returned). We do not recommend trying to learn JUnit right now if you have never used it before; we are just making it an option for those who are familiar with it.
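A home-grown test can be as small as a helper that prints what was tested and whether the expected value came back. The sketch below shows the idea on a tiny hand-made data set with round numbers. The names HomeGrownTests, totalBirths, and check are hypothetical, and the "Name,Gender,Count" line format is an assumption.

```java
import java.util.Arrays;
import java.util.List;

// Illustrative sketch only: all names and the line format here are assumptions.
class HomeGrownTests {
    // Hypothetical function under test: sums the counts for one gender.
    static int totalBirths(List<String> lines, String gender) {
        int total = 0;
        for (String line : lines) {
            String[] parts = line.split(",");
            if (parts[1].equals(gender)) {
                total += Integer.parseInt(parts[2]);
            }
        }
        return total;
    }

    // Reports what was tested and whether the expected value was returned.
    static boolean check(String description, Object expected, Object actual) {
        boolean passed = expected.equals(actual);
        System.out.println((passed ? "PASS" : "FAIL") + ": " + description
                + " (expected " + expected + ", got " + actual + ")");
        return passed;
    }

    public static void main(String[] args) {
        // Round numbers make the expected answers easy to work out by hand.
        List<String> tiny = Arrays.asList("Fred,M,1000", "Sam,M,500", "Mary,F,2000");
        check("total male births", 1500, totalBirths(tiny, "M"));
        check("total female births", 2000, totalBirths(tiny, "F"));
        check("gender with no entries", 0, totalBirths(tiny, "X"));
    }
}
```

Each output line names the case and shows both values, so a failure is easy to diagnose without opening a debugger.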

We have collected the data here to make it easier for you to download (or you can browse it here), but your program should support easily changing the location of the source data (e.g., getting from your local file system or the web, getting the full set or your smaller test set). You should comment both in your code and in your README how to change the data source so that we know how to run your program on our own subsets of the data that may have different expected values.
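One way to keep the data source easy to change is to funnel every read through a single method that takes a location string, so switching between the full set, a test subset, or the web means changing one value. The sketch below assumes that web locations start with "http://" or "https://" and that anything else is a local path. The class name DataSource and the default path are hypothetical.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.UncheckedIOException;
import java.net.URI;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.List;
import java.util.stream.Collectors;

// Illustrative sketch only: the class name and default location are assumptions.
class DataSource {
    // Hypothetical default; change this constant (or pass a location as args[0]).
    static final String DEFAULT_LOCATION = "data/test_small/yob2000.txt";

    static boolean isWebLocation(String location) {
        return location.startsWith("http://") || location.startsWith("https://");
    }

    // Opens either a web resource or a local file behind the same Reader type.
    static BufferedReader open(String location) throws IOException {
        if (isWebLocation(location)) {
            return new BufferedReader(new InputStreamReader(
                    URI.create(location).toURL().openStream(), StandardCharsets.UTF_8));
        }
        return Files.newBufferedReader(Paths.get(location), StandardCharsets.UTF_8);
    }

    // Reads all lines from the location, wrapping I/O failures in an unchecked exception.
    static List<String> readLines(String location) {
        try (BufferedReader in = open(location)) {
            return in.lines().collect(Collectors.toList());
        } catch (IOException e) {
            throw new UncheckedIOException("Could not read " + location, e);
        }
    }

    public static void main(String[] args) {
        String location = args.length > 0 ? args[0] : DEFAULT_LOCATION;
        try {
            System.out.println(readLines(location).size() + " lines read from " + location);
        } catch (UncheckedIOException e) {
            System.err.println(e.getMessage());
        }
    }
}
```

With this shape, the README only needs to document one constant or one command-line argument.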

Design

A typical structure for programs that deal with data divides a program's execution into three stages: input, in which data is provided to the program; process, in which that data is processed; and output, in which the program displays/tests the results of processing that data. This input/process/output (IPO) model of programming is used in simple programs like this one as well as in million-line programs that forecast the weather or predict stock market fluctuations. Your program should be clearly separated into these three conceptual stages (by methods, classes, or packages, whichever you feel most comfortable with) in such a way that you reduce the amount of duplicated code in your project (for example, by writing general utility methods or accepting more general Java interfaces as parameters).
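The three stages might be separated by methods along the following lines. This is a minimal sketch, not a required design: the names IpoSketch, input, mostPopular, and output are hypothetical, and the "Name,Gender,Count" line format is an assumption. Note that each stage accepts a general type (Reader, Map, Object) rather than a specific file or collection class.

```java
import java.io.BufferedReader;
import java.io.Reader;
import java.io.StringReader;
import java.util.LinkedHashMap;
import java.util.Map;

// Illustrative sketch only: all names and the line format here are assumptions.
class IpoSketch {
    // INPUT: accepts any Reader (file, web stream, or in-memory string).
    static Map<String, Integer> input(Reader source, String gender) {
        Map<String, Integer> counts = new LinkedHashMap<>();
        new BufferedReader(source).lines()
                .map(line -> line.trim().split(","))
                .filter(parts -> parts.length == 3 && parts[1].equals(gender))
                .forEach(parts -> counts.merge(parts[0], Integer.parseInt(parts[2]), Integer::sum));
        return counts;
    }

    // PROCESS: works on any Map, regardless of where the data came from.
    // Returns null for an empty map; on ties the first name encountered wins.
    static String mostPopular(Map<String, Integer> counts) {
        String best = null;
        for (Map.Entry<String, Integer> entry : counts.entrySet()) {
            if (best == null || entry.getValue() > counts.get(best)) {
                best = entry.getKey();
            }
        }
        return best;
    }

    // OUTPUT: the only stage that prints.
    static void output(String label, Object result) {
        System.out.println(label + ": " + result);
    }

    public static void main(String[] args) {
        Reader source = new StringReader("Mary,F,2000\nAnna,F,1500\nFred,M,1000\n");
        Map<String, Integer> girls = input(source, "F");
        output("Most popular girl name", mostPopular(girls)); // prints "Most popular girl name: Mary"
    }
}
```

Keeping printing out of the process stage means the same method can be called from a test that compares its return value to an expected answer.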

Your program should not assume anything about the content of the data other than its format (e.g., do not assume that there will be exactly 13 decades).

Project Goals

This project is intended as a warm-up to get you started coding a project from scratch: so you can determine if the course is at the right level for your abilities and so we can determine a baseline for your sense of design. It is also intended to introduce you to using the tools for the course: IntelliJ and Git Version Control.

Deliverables

You will submit this project in stages to introduce you to the course's basic workflow.

Basic: implement a basic initial part of the project
Complete: implement a complete version of the project
Analysis: reflect on your experience
