COMPSCI 561/CBB 561 - Computational Sequence Biology

Spring 2021

Tue & Thu 8:30am - 9:45am, virtual


Course Description:

Algorithmic and computational issues in analysis of biological sequences: DNA, RNA, and protein. Emphasizes probabilistic approaches and machine learning methods, e.g. Hidden Markov models. Explores applications in analysis of high-throughput sequencing data, protein and DNA homology detection, gene finding, motif discovery, comparative genomics and phylogenetics, genome segmentation, DNA/RNA/protein structure prediction, with a strong focus on algorithmic aspects. Prerequisites: basic knowledge of algorithmic design (COMPSCI 330 or equivalent), probability and statistics (STA 611 or equivalent), molecular biology (BIO 201L or equivalent), basic computer programming skills (preferred programming languages: Python, Java, C/C++, Perl, R, or Matlab).

Course materials, homeworks, and quizzes are available through Sakai.

Raluca Gordan
Office hours: Tue 9:45am-10:45am (right after class)
Zoom link: same as the class meeting for that day
Email: raluca.gordan at duke dot edu

Harshit Sahay
Office hours: TBD
Zoom link: TBD
Email: harshit.sahay at duke dot edu

Course grade is based on homeworks (70%), pre-class quizzes (15%), and class participation (15%). Homeworks and quizzes will be distributed through Sakai.
You will have 2 weeks to complete each homework. Late homeworks will not be accepted, with one exception: you may submit one homework late during the course, by at most 1 week.
Pre-class quizzes will be due 1 hour before class. The quizzes will test either your background on a subject (to make sure you will be able to follow and participate in the lecture) or your understanding of a subject or paper presented in a previous lecture. You can take each quiz twice; only the highest grade will be considered.
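
As a rough illustration of how the weights above combine, here is a minimal sketch in Python. The function name, the 0-100 score scale, and the equal weighting of individual homeworks and quizzes within their shares are assumptions made for this example; they are not stated in the syllabus.

```python
def course_grade(homework_scores, quiz_attempts, participation):
    """Return a weighted course grade on a 0-100 scale (illustrative only).

    homework_scores: list of homework percentages (0-100).
    quiz_attempts: one list of attempt scores per quiz, e.g. [[60, 85], [90]].
    participation: class participation percentage (0-100).
    """
    # Assumption: homeworks count equally within the 70% share.
    homework_avg = sum(homework_scores) / len(homework_scores)
    # Each quiz can be taken twice; only the highest attempt counts.
    best_per_quiz = [max(attempts) for attempts in quiz_attempts]
    # Assumption: quizzes count equally within the 15% share.
    quiz_avg = sum(best_per_quiz) / len(best_per_quiz)
    # Weights from the syllabus: 70% homeworks, 15% quizzes, 15% participation.
    return 0.70 * homework_avg + 0.15 * quiz_avg + 0.15 * participation


if __name__ == "__main__":
    # Two homeworks, two quizzes (the first taken twice), full participation.
    print(course_grade([80, 90], [[60, 100], [80]], 100))
```

With these hypothetical scores, the homework average is 85, the quiz average (best attempts 100 and 80) is 90, and the weighted total comes to about 88.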

Collaboration policy:
All homeworks and pre-class quizzes should be completed individually, unless otherwise stated. However, if you have worked on a particular problem for a while and are still stuck, you should consult others to make progress; that is better than giving up entirely. Your first course of action is to speak to the instructor or TA. If for any reason you consult your peers, the interaction must be one of consultation, not collaboration: hints rather than answers. After consultation, you should still have some thinking to do (otherwise this course will not be very useful for you!). In addition, if you consult with another student, both of you must cite this.

We will have readings for the course (which will be available on Sakai), but there is no formal textbook. Useful resources include:

•    Durbin, Eddy, Krogh, Mitchison, Biological Sequence Analysis: Probabilistic Models of Proteins and Nucleic Acids
•    Cristianini and Hahn, Introduction to Computational Genomics: A Case Studies Approach
•    Jones and Pevzner, An Introduction to Bioinformatics Algorithms
•    Majoros, Methods for Computational Gene Prediction
•    Alberts, Johnson, Lewis, Raff, Roberts, Walter, Molecular Biology of the Cell
•    Cormen, Leiserson, Rivest, Stein, Introduction to Algorithms


This syllabus is tentative and may change (slightly) during the semester. Please check Sakai for the latest version.

1 Jan-21 Introduction; DNA sequencing

2 Jan-26 Global sequence alignment; Needleman-Wunsch
3 Jan-28 Local sequence alignment; Smith-Waterman

4 Feb-2 Heuristic search; FASTA; BLAST
5 Feb-4 String matching; suffix arrays

6 Feb-9 Short read alignment; BWA; Bowtie
7 Feb-11 Probabilistic models for biological sequences

8 Feb-16 HMM parsing; Viterbi
9 Feb-18 HMM training; Baum-Welch

10 Feb-23 HMM applications
11 Feb-25 Profile HMMs; PSIBLAST

12 Mar-2 Phylogenetic trees: UPGMA; NJ
13 Mar-4 Unsupervised learning

14 Mar-11 Clustering; non-negative matrix factorization

15 Mar-16 Algorithms in single-cell data analysis
16 Mar-18 Supervised learning; classification and regression

17 Mar-23 SVM; string kernels
18 Mar-25 Naive Bayes; logistic regression

19 Mar-30 Deep neural networks
20 Apr-1 Motif finding: EM and Gibbs sampling

21 Apr-6 Motif finding: Bayesian networks
Apr-8 to Apr-22 Student presentations
