This course has a semester-long programming
project (done preferably in groups of two).
Candidate Systems for the Project
Here are some candidate systems that can be used in your project. All of these
systems are only a few years old, so there are tons of interesting and novel
features that can be added to them.
Some high-level project ideas are given below. Shivnath will be happy to discuss
these ideas with you as well as help you refine your own ideas.
Efficient Join Processing over MapReduce
This project topic will explore efficient techniques to process binary and multiway
joins using MapReduce. Here is a proposed agenda:
- Study various alternatives to process a binary equi-join (e.g., R.A = S.A) as one or more
MapReduce jobs.
- Study various alternatives to process n-way joins (e.g., R.A = S.A and S.B = T.B and T.C = U.C) as one or more
MapReduce jobs.
- If you feel that you have a better way to process joins than what
systems like Hive and Pig support, then you have an interesting project right here. Implement
and compare your technique against what is supported today. There is some
useful reading material here.
- Another project idea is to
design techniques to choose the best alternative automatically for a
given join. This step involves studying what statistics need to be maintained,
how to estimate the cost of each alternative, and how to choose the best alternative.
Implement your design in Hive, Pig, or HadoopDB.
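As a starting point for the first agenda item, the standard baseline is the reduce-side ("repartition") join, which processes a binary equi-join in one MapReduce job. The sketch below simulates that job in plain Python; all function and field names are illustrative, not part of Hive, Pig, or Hadoop's API.

```python
from collections import defaultdict

# Simulate a reduce-side (repartition) equi-join R.A = S.A as a single
# MapReduce job: mappers tag each record with its source relation, the
# shuffle groups records by join key, and reducers pair the two sides.

def map_phase(R, S):
    # Emit (join_key, (tag, record)) pairs, tagging the source relation.
    for r in R:
        yield r["A"], ("R", r)
    for s in S:
        yield s["A"], ("S", s)

def shuffle(pairs):
    # Group map output by key, as the MapReduce framework would.
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    # For each join key, cross the R-tagged and S-tagged records.
    for key, values in groups.items():
        r_side = [rec for tag, rec in values if tag == "R"]
        s_side = [rec for tag, rec in values if tag == "S"]
        for r in r_side:
            for s in s_side:
                yield {**r, **s}

R = [{"A": 1, "x": "r1"}, {"A": 2, "x": "r2"}]
S = [{"A": 1, "y": "s1"}, {"A": 1, "y": "s2"}]
result = list(reduce_phase(shuffle(map_phase(R, S))))
# Only key A=1 matches on both sides: r1 joins with s1 and s2.
```

A natural alternative to compare against is the map-side (broadcast) join, where the smaller relation is loaded in memory by every mapper and no shuffle is needed; its cost trade-off versus the repartition join is exactly the kind of choice the optimizer project above would automate.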
Efficient Computation of Aggregates over MapReduce
Some of the same questions that we asked for join processing can be asked for computing
aggregates (e.g., Select A, sum(B) from R where R.A > 50 and R.A < 100 group by A) as well.
At the same time, some other interesting opportunities can be explored
for computing aggregates. For example, add support for computing distributive
and algebraic aggregates efficiently in MapReduce.
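The key property that makes this efficient is map-side partial aggregation: a distributive aggregate like sum can be combined directly, while an algebraic aggregate like avg carries a small partial state (sum, count) that is merged and finalized at the reducer. A minimal Python sketch of the avg case, using the example query above (names are illustrative):

```python
from collections import defaultdict

# Compute avg(B) grouped by A with map-side partial aggregation, as a
# MapReduce combiner would. avg is algebraic: its partial state is the
# pair (sum, count), which can be merged across mappers and finalized
# once at the reducer.

def map_with_combiner(partition):
    # Combine locally: one (sum, count) partial per group per mapper.
    partials = defaultdict(lambda: [0, 0])
    for row in partition:
        if 50 < row["A"] < 100:            # the WHERE predicate
            partials[row["A"]][0] += row["B"]
            partials[row["A"]][1] += 1
    return dict(partials)

def reduce_partials(list_of_partials):
    # Merge the partial states, then finalize avg = sum / count.
    merged = defaultdict(lambda: [0, 0])
    for partials in list_of_partials:
        for key, (s, c) in partials.items():
            merged[key][0] += s
            merged[key][1] += c
    return {key: s / c for key, (s, c) in merged.items()}

part1 = [{"A": 60, "B": 10}, {"A": 60, "B": 20}, {"A": 10, "B": 5}]
part2 = [{"A": 60, "B": 30}, {"A": 70, "B": 7}]
result = reduce_partials([map_with_combiner(part1), map_with_combiner(part2)])
# Group 60 averages (10+20+30)/3 = 20.0; the A=10 row fails the predicate.
```

The benefit is that only one small partial per group crosses the shuffle from each mapper, instead of every matching input row. A holistic aggregate like median has no such small partial state, which is what makes it the hard case.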
Parallel Data Layouts, Partitioning, and Rebalancing
Finding good ways to partition and lay out data is crucial for the efficiency
of MapReduce jobs and query processing in parallel databases. This is a very fertile
area for interesting projects. For example:
- Build a tool that can take as input a representative workload W of queries
in HiveQL (or SQL), and return the best way to partition and lay out data for W.
- The HDFS rebalancer can currently rebalance the data layout in HDFS based on storage
utilization only. Add support to rebalance based on how frequently
the files are accessed. There is more information on the rebalancer on the
readings page.
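To make the workload-driven tool idea concrete, here is a deliberately toy sketch: score each column of a table by how often the workload joins or filters on it, and propose the highest-scoring column as the hash-partitioning key. A real tool would replace the raw frequency count with a cost model over candidate layouts; everything here, including the input representation, is an assumption for illustration.

```python
from collections import Counter

# Toy partitioning advisor: given a representative workload W, where each
# query is summarized as the set of columns it joins or filters on, count
# column usage and recommend the most frequently used column as the
# hash-partitioning key. Illustrative only; a real advisor would weigh
# queries by cost and consider multi-column and co-partitioned layouts.

def recommend_partition_key(workload):
    usage = Counter()
    for columns in workload:
        usage.update(columns)
    key, _ = usage.most_common(1)[0]
    return key

W = [{"A"}, {"A", "B"}, {"B"}, {"A"}]
# Column A is used by 3 of the 4 queries, so it is recommended.
```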
Scheduling of Hadoop Jobs
Efficient scheduling of Hadoop jobs is a critical need. While there has been
some recent work in this area (as we saw in the "Big Data" Talks),
there are many more opportunities to contribute.
There is more information on Hadoop scheduling on the
readings page. The
LATE paper describes an elegant mechanism for dealing with
"straggler tasks" in Hadoop. Can you improve on this work, e.g.,
for MapReduce jobs that are generated by Hive, Pig, or HadoopDB?
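The core heuristic in the LATE (Longest Approximate Time to End) paper estimates each running task's remaining time from its progress score and speculatively re-executes the task expected to finish last. A minimal sketch of that estimate (the task representation and field names here are assumptions for illustration):

```python
# Sketch of the LATE heuristic: estimate a running task's time left as
# (1 - progress) / progress_rate, where progress_rate is the task's
# progress score divided by its elapsed running time, then speculate on
# the task with the largest estimate (the worst straggler).

def time_left(task, now):
    elapsed = now - task["start"]
    rate = task["progress"] / elapsed        # progress per second
    return (1.0 - task["progress"]) / rate

def pick_speculation_candidate(tasks, now):
    # Choose the task with the longest approximate time to end.
    return max(tasks, key=lambda t: time_left(t, now))

now = 100.0
tasks = [
    {"id": "t1", "start": 0.0,  "progress": 0.9},   # ~11 s left
    {"id": "t2", "start": 0.0,  "progress": 0.2},   # ~400 s left
    {"id": "t3", "start": 50.0, "progress": 0.5},   # ~50 s left
]
straggler = pick_speculation_candidate(tasks, now)
# t2 has the longest estimated time to end, so it is re-executed.
```

One project angle: for jobs generated by Hive or Pig, the compiler knows the operator running inside each task, so progress rates could be modeled per operator rather than assumed uniform across the task, which is where the simple estimate above tends to go wrong.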
Elastic Hadoop
Ever heard of
Amazon Elastic MapReduce? Two Duke PhD students
have put together the beginnings of a similar system which they
call Elastic Hadoop. They are looking for contributors. Contact
Shivnath for more details.
Project Resources