CPS 216: Data-intensive Computing Systems, Fall 2011

Announcements

Exercise 8 has been posted. This exercise will not be graded. The solutions will be posted by Dec 8.
We will have two talks from invited speakers in the week of Nov 14. On Monday, Nov 14, 6.00-7.00 PM, Jeffrey Krone will give a talk titled "Entering the Zettabyte Age". A reception (with pizza) will start at 5.45 PM. On Tuesday, Nov 15, 4.15-5.15 PM, Alan Gates will give a talk titled "Pig Optimization and Execution". Alan will be available to talk with students until 5.45 PM.
We thank Amazon Web Services (AWS) for giving an educational grant for our students to do programming projects on the Amazon Cloud!

Course Description

Database systems are going through very interesting and chaotic times. Popular relational database systems like IBM DB2, Microsoft SQL Server, Oracle, and Sybase are struggling to handle the massive scale of data introduced by the Web. Today, companies have to deal with extremely large datasets. Facebook absorbs 15 terabytes of data each day into its 2.5 petabyte Hadoop-powered data warehouse. eBay maintains a 6.5 petabyte (i.e., 6.5 x 1,000,000,000,000,000 bytes!) data warehouse.

A new breed of database systems is emerging to handle data at massive scale. These systems take some of the successful features of conventional relational databases (such as run-time query optimization, automated crash recovery, and self-tuning) and make them work at the scale of hundreds to thousands of processors and disks. As we move into the world of "big data", many traditional assumptions break, new query and programming interfaces are required, and new computing models will emerge. Did you know that each Google search query can touch up to 2,000 servers that must all execute that query and respond in less than a third of a second?

This course covers a spectrum of topics from core techniques in relational data management to highly scalable data processing using parallel database systems and MapReduce. The course material will be drawn from textbooks as well as recent research literature. The following topics will be covered this year. The percentage in parentheses after each topic indicates the share of total course time devoted to it.

Principles of query processing (35%)
- Indexes
- Query execution plans and operators
- Query optimization
Data storage (15%)
- Databases vs. file systems (Google File System, Hadoop Distributed File System)
- Data layouts (row-stores, column-stores, partitioning, compression)
Scalable data processing (40%)
- Parallel query plans and operators
- Systems based on MapReduce (Hadoop, Pig, Hive)
- Scalable key-value stores (Amazon Dynamo, Cassandra, Google Bigtable, HBase)
- Processing rapid, high-speed data streams
Concurrency control and recovery (10%)
- Consistency models for data (ACID, Serializability)
- Write-ahead logging

Prerequisites: An introductory database course will be helpful, but it is not required. If you have not taken an introductory database course before, please talk to the instructor first. A lot of the material that we cover cannot be found in textbooks. Be prepared to do a fair amount of reading.

Time and Place

2:50 PM-4:05 PM on Mondays and Wednesdays; D243 LSRC

Books and References

(Highly recommended) Hadoop: The Definitive Guide, by Tom White. O'Reilly Media. October 2010. (Second edition of the book at Amazon.com)

Cassandra: The Definitive Guide, by Eben Hewitt. O'Reilly Media. November 2010. (First edition of the book at Amazon.com)

Database Systems: The Complete Book, by Hector Garcia-Molina, Jeffrey D. Ullman, and Jennifer Widom. Prentice Hall. 2002. (The second edition is available now.)

Readings will be posted on the readings page.

Staff

Instructor: Shivnath Babu
Office: D338 LSRC, Phone: 919-660-6579 (email is recommended)
Office hours: By appointment, to make the best use of everyone's time. Email the instructor to arrange a meeting time. Office hours will be held in the instructor's office.

TA: Rozemary Scarlat
Office: D325 LSRC, Phone: 919-660-6586
Office hours: 11.00 AM-12.00 PM on Tuesdays and 4.30-5.30 PM on Thursdays, or by appointment.
The TA's office hours will be held in LSRC room D344.

Grading

Homeworks (written, programming)	25%
Project	25%
Midterm	25%
Final	25%

There is a semester-long course project (done in groups of two or three). Details will be presented in class.

Both the midterm and final exams are open-book and open-notes. Laptops and other electronic devices are not allowed. Late homeworks will not be accepted without a documented excuse from a physician or dean.

Honor Code

Under the Duke Honor Code, you are expected to submit your own work in this course, including homeworks, projects, and exams. When working on homeworks and projects, it is often useful to ask others (the instructor or other students) for hints or debugging help, or to talk generally about the written problems or programming strategies. Such activity is both acceptable and encouraged, but you must indicate in your submission any assistance you received. Any assistance received that is not properly cited will be considered a violation of the Honor Code. In any event, you are responsible for understanding, and being able to explain on your own, all written and programming solutions that you submit. The course staff will aggressively pursue all suspected cases of Honor Code violations, and they will be handled through official University channels.