Data Science
Jian Pei
Office hours: 2:30-3:30 pm, Tuesdays/Thursdays
Office: LSRC D112A
TAs
Qirui Cao: qirui.cao@duke.edu
Hao-Lun Hsu: hao-lun.hsu@duke.edu
Srikar Katta: srikar.katta@duke.edu
Xiaonan Wang: xiaonan.wang631@duke.edu
Haibo Xiu: haibo.xiu@duke.edu
Dingyan Zhong: dingyan.zhong@duke.edu
Rui Zhang: r.zhang@duke.edu
TA Office hours (LSRC D215):
Monday: 15:30 - 17:00
Wednesday: 17:30 - 19:00
Friday: 13:30 - 15:00
Office hours Zoom link: https://duke.zoom.us/j/99187763884
Course Description
This course provides a general and systematic introduction to the concepts, ideas, tools, and example applications of data science. We focus on data-driven ideas, the interactions among applications, modeling, and data processing, and the essential algorithms and tools. By completing this course, students will learn the methodology of, and gain hands-on experience in, collecting and analyzing data, extracting insights from data, and transforming knowledge from data into business decisions and actions.
This offering has a programming/experiment component and a substantial independent project, making it suitable for a broad audience, particularly for students from computer science, healthcare, computer engineering, business administration, and statistics.
Prerequisites
Introductory courses in data structures, discrete mathematics, algorithms, statistics, and databases.
Fluency in programming in Python, R, or Java.
Objectives
As a result of this course, you will be able to:
Understand the end-to-end data science process.
Grasp the essential concepts and techniques in the data science process, such as sampling, data cleaning and integration, pattern mining, processing data using machine learning methods, and producing interpretations of data analytic results.
Practice the basics of telling stories and communicating with business stakeholders using data and analytics.
Get to know how to become a data scientist, how to work with data scientists, and how not to be fooled by data abusers.
Identify interesting research and development problems in data science.
Required Text
J. Han, J. Pei, and H. Tong: Data Mining: Concepts and Techniques, 4th Edition, Elsevier, 2022
References
Many valuable references on the general subject of (methodology in) data science are now available. Only a small and admittedly biased selection is listed here, in no particular order.
Gareth James, Daniela Witten, Trevor Hastie, and Robert Tibshirani. 2014. An Introduction to Statistical Learning: with Applications in R. Springer Publishing Company, Incorporated.
Matt Taddy, Leslie Hendrix, and Matthew Harding. 2023. Modern business analytics. McGraw Hill.
Gilbert Strang. 2019. Linear Algebra and Learning from Data. Wellesley Cambridge Press.
Larry Wasserman. 2010. All of Statistics: A Concise Course in Statistical Inference. Springer Publishing Company, Incorporated.
Judea Pearl, Madelyn Glymour, and Nicholas P. Jewell. 2016. Causal Inference in Statistics: A Primer. Wiley.
Course Requirements (subject to change)
5 assignments (40%). By default, assignments will be released on Tuesdays and due by 11:59 pm the following Thursday. There will be 6-8 assignments; the best 5 will be used in the final grade calculation.
1 exam (30%).
1 group project (3-4 students per group, 30%)
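As a rough illustration of how the weights above combine, here is a minimal sketch in Python. It is not an official grading formula. It assumes that every score is on a 0-100 scale and that the five best assignments are weighted equally, neither of which is stated above; the function name final_grade is hypothetical.

```python
def final_grade(assignment_scores, exam_score, project_score):
    """Combine component scores using the 40/30/30 weights listed above.

    Assumptions (not stated in the syllabus): every score is a percentage
    in [0, 100], and the five best assignments count equally.
    """
    if len(assignment_scores) < 5:
        raise ValueError("need at least 5 assignment scores")

    # Keep only the five highest assignment scores.
    best_five = sorted(assignment_scores, reverse=True)[:5]
    assignment_average = sum(best_five) / 5

    return 0.40 * assignment_average + 0.30 * exam_score + 0.30 * project_score


# Example: seven assignments submitted, the two lowest are dropped.
grade = final_grade([100, 90, 80, 70, 60, 50, 40], exam_score=80, project_score=90)
print(round(grade, 2))
```

In this example the assignment average is taken over 100, 90, 80, 70, and 60 only, so the two lowest scores have no effect on the result.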
Late submission policy
Every student has 3 late submission tokens. Each token defers an assignment submission deadline by 24 hours.
A project submission up to 72 hours late will still be accepted, with 5% deducted for every 24 hours or part thereof.
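The project late policy above can be sketched as a small calculation. This is only one reading of the rule: it assumes the deduction is expressed in percentage points, that each started 24-hour period counts in full, and that submissions more than 72 hours late are not accepted. The function name project_late_deduction is hypothetical.

```python
import math


def project_late_deduction(hours_late):
    """Return the deduction (in percentage points) for a late project.

    Assumptions (one reading of the policy): each started 24-hour period
    costs 5 points, and a submission more than 72 hours late is rejected,
    which is signalled here by returning None.
    """
    if hours_late <= 0:
        return 0  # on time, no deduction
    if hours_late > 72:
        return None  # outside the accepted window
    return 5 * math.ceil(hours_late / 24)


for hours in (0, 1, 24, 24.5, 72, 73):
    print(hours, project_late_deduction(hours))
```

Under these assumptions a project submitted one hour late and one submitted exactly 24 hours late both lose 5 points, while one submitted 24.5 hours late loses 10.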
Course Calendar (Dates and topics are subject to change at the discretion of the instructor)
January 12: Introduction
January 17: Knowing your data (Assignment 1)
January 19 and 24: Data quality, data cleaning, and data integration (Assignment 2)
January 24, 26, and 31: Sampling (Assignment 3)
February 2 and 7: Data storage: data warehouses, data lakes, and data lakehouses (Assignment 4)
February 9: Regression
February 14: Uncertainty quantification (Assignment 5)
February 16: Regularization
February 21, 23, and 28: Frequent pattern mining (Assignment 6)
March 2, 7, and 9: Classification (Assignment 7)
March 14 and 16: Spring recess, no classes
March 21 and 23: Causal inference
March 28 and 30: Clustering and factor models (Assignment 8)
April 4: Outlier detection
April 6 and 11: Causal analysis: from association to intervention and counterfactuals (Assignment 9)
April 13: Data markets
April 18: Data science stewardship: findable, accessible, interoperable, and reusable (FAIR)