  • Team formation: Monday 2/23 11:59pm
  • In-class proposal presentation: Monday 3/2
  • Proposal writeup: Monday 3/2 11:59pm
  • Mid-term report: Monday 3/30 11:59pm
  • Mini-conference: Wednesday 4/22 in class, and Thursday 4/30 2-5pm (final exam slot)
  • Final report: Thursday 4/30 11:59pm

HOW TO SUBMIT: Submit the required files for all milestones (see the description of milestones below) through WebSubmit. On the WebSubmit interface, make sure you select compsci216 and the appropriate project milestone. Only one person needs to submit on behalf of the entire team. You can submit multiple times for each milestone, but please resubmit all files for that milestone each time.

General Guidelines

Your project team should ideally consist of four members. If you want a group of a different size, please talk to the instructors and obtain approval in advance.

Your project should deal with data from the "real" world. We highly recommend the three domains below; if you pick one of them, you are more likely to get discussion with and help from the course staff and your fellow classmates. Alternatively, you can work on a domain of your own choice by obtaining data from other public sources, or even by collecting your own data (see below for possible datasets and ideas).

We have discussed (or will be discussing) various aspects of working with data, including but not limited to:

  • Data collection and wrangling: e.g., cleaning, record linkage;
  • Basic querying and data processing: e.g., using SQL or MapReduce;
  • Statistical analysis and machine learning: e.g., testing, predicting, clustering;
  • Visualizing results.

Your project should have a bit of most of the components above, but it can focus on a subset of these aspects. Also, while data collection can be a component of the project, it cannot be the only focus of the project.

Recommended Domains/Datasets

We highly recommend these three domains for your project:

Yelp Dataset Challenge: The Yelp Dataset is provided by Yelp as a research challenge. The massive dataset consists of millions of reviews by hundreds of thousands of users for tens of thousands of businesses. It also contains a rich set of attributes for each business, as well as the social network of users. Yelp's website has a wealth of information, including suggested ideas for analysis, as well as reports by previous winners of the challenge, which you can read for more inspiration. The deadline for this round of the challenge is June 30; if you get good results from the course project, you should consider entering the challenge for a $5000 cash prize!

If you find the size of this dataset intimidating, consider the Yelp Academic Dataset instead (or as a starting point).

Sports Data: There are a number of sports datasets of reasonable quality. For example, in the basketball domain, the following two sites provide downloadable data for recent years: 2005-2012 and 1990-2009. To obtain the complete data for the entire history of the NBA, one would need to do some scraping, e.g., from ESPN or Basketball Reference. How much you can scrape depends on what these sites allow.

For the baseball domain, you can try the Lahman dataset; however, it has only season-by-season data, not game-by-game data. RetroSheet is another possibility; it provides its own tools for converting its data to text format.

What you do with this type of data is up to you. You may be able to find other information about players and/or teams that will be interesting to look at in conjunction with the game/season data above.

Price Tag of Your Favorite Band: Priceonomics published how much it would cost to hire your favorite band. It's up to you what you'd like to do with this data. (Before you start, you will need to figure out how to OCR the data, which currently exists only as bitmap images!) Can you predict the bands' price tags based on their popularity, style, age, location, etc.? Which bands are charging too much or too little? As you can see, this project is completely open-ended. For example, you have the target of prediction but no features; you will need to be creative about where and how to extract features of the bands that could be useful to your analysis.

Other Possibilities

  • has a huge compilation of data sets produced by the US government.
  • The Supreme Court Database tracks all cases decided by the US Supreme Court.
  • US government spending data has information about government contracts and awards.
  • Federal Election Commission has campaign finance data to download; their "disclosure portal" also provides nice interfaces for exploring the data.
  • tracks all bills through the Congress and all votes cast by its members. We have worked a fair amount with this data in this class. The Washington Post has a nice website for exploring this type of data (in predefined ways), but you can be creative with additional and/or more flexible exploration and analysis options.
  • provides all data related to the Recovery Act of 2009.
  • The Washington Post maintains a list of datasets that have been used to generate investigative news pieces. Most of these datasets hide behind some interface and may need to be scraped. Use this list for examples of what datasets are "interesting" and how to present data to the public effectively.
  • National Institute for Computer-Assisted Reporting maintains a list of datasets of public interest. Use this list for examples of what datasets are "interesting"---they are generally not available to the public, but there may be alternative ways to obtain them.
  • Google Fusion Table hosts quite a number of datasets of public interest. It is a good place to find datasets or data sources to work on, and you can consider using it as a method of hosting your data for public access.
  • Citi Bikes publishes its bike usage data. Some example analysis can be found here.
  • This blog has a long, interesting list of datasets and possible analysis ideas. In fact, you will find pointers to Yelp, Priceonomics band prices, and various sports datasets here!


Team Formation: Submit a text file team.txt with the following information:

  • Team name: your choice; any short name is fine.
  • Team members, along with Duke netids.
  • Topic: If you know what problem/dataset you want to work on, describe it very briefly here, in a sentence or two. If you only know the general application domain you are interested in (e.g., Yelp, sports, civics), state it. Note that you will need to decide what to work on in another week.
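As an illustration of the team.txt contents described above, here is a minimal Python sketch that assembles the three required pieces of information. The course does not prescribe an exact layout, so the line format, the function name format_team_file, and every name and netid below are hypothetical placeholders.

```python
def format_team_file(team_name, members, topic):
    """Return the text for team.txt: team name, members with netids, topic.

    The layout is an assumption; the handout only lists what to include.
    """
    lines = [f"Team name: {team_name}", "Team members:"]
    for full_name, netid in members:
        lines.append(f"  {full_name} ({netid})")
    lines.append(f"Topic: {topic}")
    return "\n".join(lines) + "\n"


# Placeholder values only; replace them with your own team's details.
text = format_team_file(
    "Example Team",
    [("Member One", "netid1"), ("Member Two", "netid2")],
    "Yelp (general domain only; specific problem still to be decided).",
)
print(text, end="")
```

To produce the file itself, the returned string can be saved with, for example, `pathlib.Path("team.txt").write_text(text)`.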

In-Class Proposal Presentation: Each team will have a maximum of 4 minutes in the lab to present (including any setup time). In this presentation:

  • Briefly introduce your team members;
  • Describe your problem statement, dataset, and how you will quantify success.

Optionally, you may use 1-2 slides (without animation) for your presentation. If you choose to do so, submit your slides (in PDF format) before the lab.

Proposal Writeup: Your project proposal should be 1-3 pages long, with 11- or 12-point font. In your proposal:

  • Define the problem you are going to solve.
  • Describe the datasets that you will use, and how you plan to acquire them.
  • Propose how you plan to evaluate the project's success. For example, how do you measure the accuracy of your analysis or the effectiveness of your visualization? How do you obtain the "true" answers if you are doing some prediction?
  • Discuss related work or projects, if any.
  • [Optional] Briefly describe how you plan to solve your problem.
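One common way to answer the evaluation question for a prediction task is to hold out part of the labeled data, treat its labels as the "true" answers, and compare any model against a trivial baseline. The sketch below illustrates that idea using only the Python standard library. It is not a required method: the records are synthetic placeholders generated in the code, and the function names, split fraction, and choice of mean absolute error are all assumptions made for illustration.

```python
import random


def train_test_split(records, test_fraction=0.25, seed=0):
    """Shuffle a copy of the records and return (train, test) lists."""
    rng = random.Random(seed)
    shuffled = list(records)
    rng.shuffle(shuffled)
    n_test = max(1, int(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]


def mean_absolute_error(y_true, y_pred):
    """Average absolute difference between true and predicted values."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


# Synthetic placeholder data: (feature, label) pairs, NOT real measurements.
records = [(x, 2.0 * x + 1.0) for x in range(20)]
train, test = train_test_split(records)

# Baseline: always predict the mean label seen in the training set.
baseline = sum(label for _, label in train) / len(train)

# The held-out labels play the role of the "true" answers.
y_true = [label for _, label in test]
baseline_mae = mean_absolute_error(y_true, [baseline] * len(test))
print(f"held-out examples: {len(test)}, baseline MAE: {baseline_mae:.2f}")
```

A proposal could then state its success criterion relative to such a baseline, e.g., that the final model's held-out error should be clearly lower than the baseline's.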

Submit your proposal as a PDF file.

Mid-Term Report: By now, you should have obtained some data and started to wrangle and play with them. You should know whether you will have real data in enough quantity and quality to support your analysis. If not, you need to come up with alternative plans. Whether or not you have some preliminary analysis results by now, you should have a reasonably good picture of what types of wrangling, analyses, and/or visualizations to try, and what tools you will use to run them (e.g., psql, Python sklearn, MapReduce, etc.).

Your mid-term report should be 3 pages long, with 11- or 12-point font. In your report, please:

  • Summarize progress that you've made so far, as well as any changes you made to your project goals/plan since the proposal.
  • Address any questions/issues raised in our feedback to your proposal.
  • Describe the remaining steps that need to be taken to complete the project.

Submit your report as a PDF file. Also, submit a .zip or .tar.gz ball of any code you have written. Do not submit large datasets; instead, submit a URL to the raw data (and describe the tool/code used for extraction/wrangling).
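The packaging step above can be sketched with Python's standard tarfile module: collect the files under a project directory, skip anything above a size cutoff so that large datasets stay out, and write a .tar.gz. The 1 MB cutoff, the function names, and the directory and archive names in the usage comment are assumptions for illustration, not course requirements.

```python
import tarfile
from pathlib import Path

MAX_BYTES = 1_000_000  # assumed cutoff for "large" files; adjust as needed


def select_files(root, max_bytes=MAX_BYTES):
    """Return the files under root that are no larger than max_bytes."""
    root = Path(root)
    return sorted(
        p for p in root.rglob("*")
        if p.is_file() and p.stat().st_size <= max_bytes
    )


def make_archive(root, archive_path, max_bytes=MAX_BYTES):
    """Write a gzip-compressed tar of the selected files; return its path."""
    root = Path(root)
    archive_path = Path(archive_path)
    with tarfile.open(archive_path, "w:gz") as tar:
        for path in select_files(root, max_bytes):
            # Never add the archive to itself if it lives inside root.
            if path.resolve() == archive_path.resolve():
                continue
            tar.add(path, arcname=path.relative_to(root).as_posix())
    return archive_path


# Hypothetical usage (directory and archive names are placeholders):
# make_archive("my_project", "code.tar.gz")
```

Files skipped by the size cutoff are exactly the ones to describe by URL in the report instead of uploading.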

Mini-Conference: Details to be decided later.

Final Report: The final report should be a self-contained document, clearly motivating and defining the problem, describing your approaches as well as alternatives and related work (if applicable), presenting results and evaluation, and drawing your conclusions. Do not assume that the reader has read your project proposal and mid-term report. If there are unresolved problems/issues with your code or results, be sure to explain them in a section of the report. If you want to respond explicitly to specific issues raised in early feedback, do so in an appendix.

We are not asking for an elaborate term paper, and won't mind a concise document. The report should not exceed 8 pages with single-spaced text and 11- or 12-point font. A short report is fine as long as it's clear and complete.

Submit your final report (as a PDF file), your presentation slides (as a PDF or PowerPoint file), and your results and code, together with sufficient instructions for replicating your results. You can submit a single .zip or .tar.gz ball for your results and code, but please submit your report and slides as separate files.