CPS 512 (Duke University) Distributed * Systems

Here is a general outline of topics and readings for the course. We may make minor changes and adjustments as the semester progresses. Each section corresponds to a course unit and each bullet corresponds roughly to one class.

Coordination and failures

Structure of distributed systems: clients and servers, multi-tier services, geo-distributed mega-services. Messaging, failure models, and the problem of network partitions. State and the data storage tier: “SQL, NoSQL, and NewSQL”. The consensus problem: consistency, availability, and the CAP theorem. Consistency models: ACID transactions, BASE and eventual consistency. Introduction to the Scala/Akka actor model.

Elastically scalable key-value service

The challenges of scale: partitioning/sharding and replication in the data tier. We use Amazon’s Dynamo as an example, although many systems use similar techniques. Riak is an open-source key-value store based on Dynamo. Reading: Dynamo [7]. Optional reading: A Little Riak Book.

Time, clocks, and causality

How to order concurrent events in a distributed systems with no centralized primary? Logical clocks, version vectors/vector clocks, causal ordering and causal communication. Asynchronous replication, anti-entropy, and update conflicts in Bayou.

Consensus

We spend a few days discussing consensus in theory and in practice. A safe, live consensus algorithm is the cornerstone of reliable distributed systems. But it is impossible to build one! So this is a study of what works in practice.

Trust in networked systems

References

[1] P. Bailis and A. Ghodsi. Eventual consistency today: Limitations, extensions, and beyond. Commun. ACM, 56(5):55–63, May 2013. [local pdf].

[2] P. Bailis, S. Venkataraman, M. J. Franklin, J. M. Hellerstein, and I. Stoica. Quantifying eventual consistency with PBS. Communications of the ACM, 57(8):93–102, Aug. 2014. [local pdf].

[3] E. Brewer. CAP twelve years later: How the “rules” have changed. Computer, 45(2):23–29, February 2012. [local pdf].

[4] M. Burrows. The Chubby lock service for loosely-coupled distributed systems. In Proceedings of the 7th Symposium on Operating Systems Design and Implementation (OSDI), pages 335–350. USENIX Association, May 2006. [local pdf].

[5] M. Burrows, M. Abadi, and R. Needham. A logic of authentication. ACM Transactions on Computing Systems (TOCS), 8(1):18–36, 1990. [local pdf].

[6] J. C. Corbett, J. Dean, M. Epstein, A. Fikes, C. Frost, J. J. Furman, S. Ghemawat, A. Gubarev, C. Heiser, P. Hochschild, et al. Spanner: Google’s globally distributed database. ACM Transactions on Computer Systems (TOCS), 31(3):8, 2013. [local pdf].

[7] G. DeCandia, D. Hastorun, M. Jampani, G. Kakulapati, A. Lakshman, A. Pilchin, S. Sivasubramanian, P. Vosshall, and W. Vogels. Dynamo: Amazon’s highly available key-value store. In SOSP ’07: Proceedings of 21st ACM SIGOPS Symposium on Operating systems Principles, pages 205–220, New York, NY, USA, 2007. ACM. [local pdf].

[8] M. J. Franklin. Concurrency control and recovery, 1997. [local pdf].

[9] G. Hohpe. Your coffee shop doesn’t use two-phase commit. IEEE Software, 22(2):64–66, Mar. 2005. [local pdf].

[10] L. Lamport. Time, clocks, and the ordering of events in a distributed system. Communications of the ACM, 21(7):558–565, 1978. [local pdf].

[11] B. W. Lampson. How to build a highly available system using consensus. In Distributed Algorithms, pages 1–17. Springer, 1996. [local pdf].

[12] B. Liskov. Practical uses of synchronized clocks in distributed systems. In Proceedings of the Tenth Annual ACM Symposium on Principles of Distributed Computing, PODC ’91, pages 1–9, New York, NY, USA, 1991. ACM. [local pdf].

[13] B. Liskov. From viewstamped replication to byzantine fault tolerance. In Replication, pages 121–149. Springer, 2010. [local pdf].

[14] D. Ongaro and J. Ousterhout. In search of an understandable consensus algorithm. In Proceedings of the USENIX Annual Technical Conference, pages 305–320, June 2014. [local pdf].

[15] K. Petersen, M. Spreitzer, D. Terry, and M. Theimer. Bayou: Replicated database services for world-wide applications. In Proceedings of the 7th ACM SIGOPS European Workshop: Systems Support for Worldwide Applications, pages 275–280. ACM, September 1996. [local pdf].

[16] K. Petersen, M. J. Spreitzer, D. B. Terry, M. M. Theimer, and A. J. Demers. Flexible update propagation for weakly consistent replication. In SOSP ’97: Proceedings of the Sixteenth ACM Symposium on Operating systems Principles, pages 288–301, New York, NY, USA, October 1997. ACM. [local pdf].

[17] R. Van Renesse and D. Altinbuken. Paxos made moderately complex. ACM Computing Surveys, 47(3):42:1–42:36, Feb. 2015. [local pdf].