
CPS 216: Advanced Database Systems
(Data-Intensive Computing Systems, Fall 2009)

Project
This course has a semester-long programming project, done preferably in groups of two.

Candidate Systems for the Project

Here are some candidate systems that can be used in your project. All of these systems are only a few years old, so there are many interesting and novel features that can be added to them. Some high-level project ideas are given below. Shivnath will be happy to discuss these ideas with you as well as help you refine your own ideas.

Efficient Join Processing over MapReduce

This project topic will explore efficient techniques to process binary and multiway joins using MapReduce. Here is a proposed agenda:
  • Study various alternatives to process a binary equi-join (e.g., R.A = S.A) as one or more MapReduce jobs.
  • Study various alternatives to process n-way joins (e.g., R.A = S.A and S.B = T.B and T.C = U.C) as one or more MapReduce jobs.
  • If you feel that you have a better way to process joins than what systems like Hive and Pig support, then you have an interesting project right here. Implement and compare your technique against what is supported today. There is some useful reading material here.
  • Another project idea is to design techniques to choose the best alternative automatically for a given join. This step involves studying what statistics need to be maintained, how to estimate the cost of each alternative, and how to choose the best alternative. Implement your design in Hive, Pig, or HadoopDB.
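To make the first bullet concrete, here is a minimal single-process sketch of the standard repartition (reduce-side) join, one of the alternatives worth studying for a binary equi-join R.A = S.A. The map phase tags each tuple with its source relation and keys it on the join attribute, the shuffle groups tagged tuples by key, and the reduce phase pairs tuples across relations. Function names and the tuple layout are illustrative, not from any particular system.

```python
from collections import defaultdict

def map_phase(records):
    """Map: emit (join key, (relation tag, tuple)) for each record.
    Each record is (relation_tag, tuple); the join attribute is
    assumed to be the first field of every tuple."""
    for tag, tup in records:
        yield tup[0], (tag, tup)

def shuffle(pairs):
    """Shuffle: group tagged tuples by join key, as the MapReduce
    framework does between the map and reduce phases."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    """Reduce: per join key, pair every R tuple with every S tuple."""
    for key, values in sorted(groups.items()):
        r_tuples = [t for tag, t in values if tag == 'R']
        s_tuples = [t for tag, t in values if tag == 'S']
        for r in r_tuples:
            for s in s_tuples:
                yield r + s

# R(A, B) joined with S(A, C) on R.A = S.A
R = [('R', (1, 'x')), ('R', (2, 'y'))]
S = [('S', (1, 'u')), ('S', (2, 'v')), ('S', (3, 'w'))]
result = list(reduce_phase(shuffle(map_phase(R + S))))
# result == [(1, 'x', 1, 'u'), (2, 'y', 2, 'v')]
```

Note that every tuple of R and S is shuffled across the network, which is exactly the cost that map-side alternatives (e.g., broadcasting the smaller relation) try to avoid; comparing such alternatives is the point of the project.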

Efficient Computation of Aggregates over MapReduce

Many of the questions we asked about join processing apply to computing aggregates as well (e.g., SELECT A, SUM(B) FROM R WHERE R.A > 50 AND R.A < 100 GROUP BY A). At the same time, aggregates open up some interesting opportunities of their own. For example, add support for computing distributive and algebraic aggregates efficiently in MapReduce.
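As a sketch of why distributive aggregates matter here, consider SUM: because partial sums can themselves be summed, a combiner can pre-aggregate each map task's output before the shuffle, drastically cutting network traffic. The code below simulates this for the example query; the function names and input splits are illustrative.

```python
from collections import defaultdict

def mapper(split):
    """Map: emit (A, B) for tuples passing the filter 50 < A < 100."""
    for a, b in split:
        if 50 < a < 100:
            yield a, b

def combiner(pairs):
    """Combiner: pre-aggregate map output locally. Because SUM is
    distributive, partial sums can safely be summed again later."""
    partial = defaultdict(int)
    for a, b in pairs:
        partial[a] += b
    return list(partial.items())

def reducer(all_partials):
    """Reduce: merge the partial sums from every map task."""
    total = defaultdict(int)
    for partials in all_partials:
        for a, s in partials:
            total[a] += s
    return dict(total)

# Two input splits, each processed by its own map task.
split1 = [(60, 5), (60, 7), (200, 1)]   # (A, B) tuples; 200 is filtered out
split2 = [(60, 1), (80, 2)]
result = reducer([combiner(mapper(s)) for s in (split1, split2)])
# result == {60: 13, 80: 2}
```

An algebraic aggregate such as AVG works the same way except that each partial must carry a (sum, count) pair rather than a single value; a holistic aggregate such as MEDIAN cannot be combined this way at all, which is what makes the distributive/algebraic distinction interesting for MapReduce.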

Parallel Data Layouts, Partitioning, and Rebalancing

Finding good ways to partition and lay out data is crucial for the efficiency of MapReduce jobs and query processing in parallel databases. This is a very fertile area for interesting projects. For example:
  • Build a tool that can take as input a representative workload W of queries in HiveQL (or SQL), and return the best way to partition and lay out data for W.
  • The HDFS rebalancer currently can rebalance the data layout in HDFS based on storage utilization only. Add support to rebalance based on how frequently the files are accessed. There is more information on the rebalancer on the readings page.
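A tool like the one in the first bullet needs a heuristic for scoring candidate layouts. One very simple starting point, sketched below under assumptions of my own (the workload representation and function name are hypothetical, and a real tool would parse HiveQL and estimate costs rather than just count predicate occurrences), is to hash-partition each table on the attribute that appears most often in its join and equality predicates, weighted by query frequency.

```python
from collections import Counter

def choose_partition_attribute(workload):
    """Pick a hash-partitioning attribute for one table.

    workload: list of (attrs, freq) pairs, where attrs is the set of
    this table's attributes used in join/equality predicates of a query
    and freq is how often that query runs. Returns the attribute with
    the highest frequency-weighted vote."""
    votes = Counter()
    for attrs, freq in workload:
        for a in attrs:
            votes[a] += freq
    attr, _ = votes.most_common(1)[0]
    return attr

# Hypothetical workload W: 10 runs of a query joining on A, 3 runs of a
# query touching both A and B, and 5 runs of a query filtering on B.
W = [({'A'}, 10), ({'A', 'B'}, 3), ({'B'}, 5)]
best = choose_partition_attribute(W)   # 'A' (13 votes vs. 8 for 'B')
```

Turning this crude vote count into a cost-based search over layouts (co-partitioning joined tables, choosing sort orders, etc.) is where the project gets interesting.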

Scheduling of Hadoop Jobs

Efficient scheduling of Hadoop jobs is a critical need. While there has been some recent work in this area (as we saw in the "Big Data" talks), there are many more opportunities to contribute. There is more information on Hadoop scheduling on the readings page. The LATE paper describes an elegant mechanism for dealing with "straggler tasks" in Hadoop. Can you improve on this work, e.g., for MapReduce jobs that are generated by Hive, Pig, or HadoopDB?
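The core of LATE (Longest Approximate Time to End) is a simple estimate: a task's progress rate is its progress divided by its elapsed running time, and its remaining time is the remaining progress divided by that rate; the scheduler speculatively re-executes the task with the longest estimated time to end. The sketch below shows just that estimate, with illustrative task records; the real scheduler also caps the number of speculative copies and only launches them on fast nodes.

```python
def estimated_time_left(progress, elapsed):
    """LATE estimate: rate = progress / elapsed (progress in [0, 1]);
    estimated remaining time = (1 - progress) / rate."""
    rate = progress / elapsed
    return (1.0 - progress) / rate

def pick_speculation_candidate(tasks):
    """Speculate the running task with the longest estimated time
    to end, per the LATE heuristic."""
    return max(tasks,
               key=lambda t: estimated_time_left(t['progress'],
                                                 t['elapsed']))

tasks = [
    {'id': 't1', 'progress': 0.9, 'elapsed': 90.0},  # ~10s left
    {'id': 't2', 'progress': 0.2, 'elapsed': 80.0},  # ~320s left: straggler
]
candidate = pick_speculation_candidate(tasks)   # picks 't2'
```

One concrete project angle: for jobs generated by Hive or Pig, the compiler knows the operator running inside each task, so progress rates could be estimated per operator type rather than assuming uniform progress.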

Elastic Hadoop

Ever heard of Amazon Elastic MapReduce? Two Duke PhD students have put together the beginnings of a similar system which they call Elastic Hadoop. They are looking for contributors. Contact Shivnath for more details.

Project Resources