A computational perspective on the exploration and analysis of genomic and genome-scale information. Provides an integrated introduction to genome biology, algorithm design and analysis, and probabilistic and statistical modeling. Topics include genome sequencing, genome sequence assembly, read mapping, local and global sequence alignment, sequence database search, gene finding, phylogenetic tree construction, and elementary gene expression analysis. Methods include dynamic programming, indexing, hidden Markov models, and elementary supervised machine learning. Focuses on foundational algorithmic principles. Develops practical experience in handling, analyzing, and visualizing genomic data using the Python programming language.
The course requires students to program often in Python. Students coming into the course must already know how to program in some computer language, but it need not be Python. If it is not Python, students will be expected to come up to speed quickly in Python on their own. Additionally, students should be comfortable with mathematical thinking and formulas, and should have had some exposure to basic probability as well as molecular or cellular biology; however, the course has no formal course prerequisites, and quick refreshers of relevant background will be provided. Please speak to the instructor if you are unsure about your background. This course is a valid elective in both biology and computer science.
Professor Alex Hartemink
Kyle Pinheiro, TA | Email: kyle.pinheiro at duke.edu |
Lara Breithaupt, TA | Email: lara.breithaupt at duke.edu |
Nhat Duong, TA | Email: nhat.duong at duke.edu |
Gene Yang, UTA | Email: gene.yang at duke.edu |
Helen Xu, UTA | Email: helen.z.xu at duke.edu |
Henry Gussis, UTA | Email: henry.gussis at duke.edu |
Holly Zhuang, UTA | Email: mingming.zhuang at duke.edu |
Joshua Tennyson, UTA | Email: joshua.tennyson at duke.edu |
Kash Sreeram, UTA | Email: kashyap.sreeram at duke.edu |
Lola Maglione Silva, UTA | Email: carola.maglione.silva at duke.edu |
Michelle Kwan, UTA | Email: michelle.kwan at duke.edu |
Naomie Gao, UTA | Email: naomie.gao at duke.edu |
Sierra Seifert, UTA | Email: sierra.seifert at duke.edu |
Zach Pracher, UTA | Email: zachary.pracher at duke.edu |
Office hours with TAs and UTAs will be held at the following times, starting on Tuesday 16 January; all sessions are in-person. Most sessions will be held in Reuben-Cooke 133 (go through the front door, turn right at the hallway, and it's the room immediately on your right), but Monday sessions from 2:30–5:30pm will instead be held in LSRC D344 (D-wing entrance is at the far end, closest to Research Drive: climb the stairs to the third floor, and D344 will be straight ahead). Remember that if your question is pretty simple, you are likely to get help quickest by asking on Ed (in fact, your question may already be answered there).
Day | Time | Staff | Location |
---|---|---|---|
Sunday | 5–6pm | Naomie Gao | Reuben-Cooke 133 |
Sunday | 8–10pm | Helen Xu | Reuben-Cooke 133 |
Monday | 10am–12pm | Sierra Seifert | Reuben-Cooke 133 |
Monday | 12:30–2:30pm | Lara Breithaupt | Reuben-Cooke 133 |
Monday | 2:30–3:30pm | Naomie Gao | LSRC D344 |
Monday | 3:30–5:30pm | Michelle Kwan | LSRC D344 |
Tuesday | 2–4pm | Gene Yang | Reuben-Cooke 133 |
Tuesday | 4–6pm | Holly Zhuang | Reuben-Cooke 133 |
Tuesday | 6–8pm | Zach Pracher | Reuben-Cooke 133 |
Tuesday | 7–9pm | Kash Sreeram | Reuben-Cooke 133 |
Wednesday | 1–3pm | Kyle Pinheiro | Reuben-Cooke 133 |
Wednesday | 3–5pm | Nhat Duong | Reuben-Cooke 133 |
Wednesday | 5–7pm | Henry Gussis | Reuben-Cooke 133 |
Wednesday | 7–9pm | Joshua Tennyson | Reuben-Cooke 133 |
Thursday | 3–5pm | Lola Maglione Silva | Reuben-Cooke 133 |
If you would like to speak with the instructor about anything, you are welcome to stick around after lecture to chat, or you can send an email to schedule a meeting at a time that is convenient for you.
The class meets 10:05–11:20am on Tuesdays and Thursdays in 111 BioSci.
Note: The course schedule may change subtly from time to time. Always check this page for the most up-to-date schedule.
Session | Date | Topic | Assignment (out and due on Fridays except where indicated) |
---|---|---|---|
1 | Thu 11 Jan | Course introduction; SARS-CoV-2 genome introduction | PS1a out |
2 | Tue 16 Jan | Molecular biology primer: DNA, RNA, and protein | PS1b out by Wed |
3 | Thu 18 Jan | Gene/genome organization; SARS-CoV-2 genome revisited | PS1a due 5pm Fri |
4 | Tue 23 Jan | Algorithm introduction; Time and space resources | |
5 | Thu 25 Jan | Analyzing algorithms; Designing efficient algorithms | PS1b due 5pm Fri; PS2 out |
6 | Tue 30 Jan | Divide-and-conquer recursion | |
7 | Thu 01 Feb | Divide-and-conquer recursion fails; Memoization | |
8 | Tue 06 Feb | Dynamic programming; Greedy algorithms; Short-read mapping | |
9 | Thu 08 Feb | Prefix trees; Suffix trees and arrays; BWT | PS2 due 5pm Fri; PS3 out |
10 | Tue 13 Feb | BWT; FM-index introduction | |
11 | Thu 15 Feb | FM-index continued; DNA sequencing | |
12 | Tue 20 Feb | Challenge of genome assembly; Human Genome Project (HGP) | |
13 | Thu 22 Feb | HGP and Celera | PS3 due 5pm Fri; PS4 out |
14 | Tue 27 Feb | Sequence variation; Global alignment | |
15 | Thu 29 Feb | Global alignment traceback; Affine gap penalties and alignment | |
16 | Tue 05 Mar | Affine gap alignment traceback; Local alignment; FASTA and BLAST heuristics | |
17 | Thu 07 Mar | Phylogenetic trees; Time and distance | PS4 due 5pm Fri; PS5 out |
| | Tue 12 Mar | SPRING BREAK: enjoy! | |
| | Thu 14 Mar | SPRING BREAK: enjoy! | |
18 | Tue 19 Mar | Building phylogenetic trees with UPGMA | |
19 | Thu 21 Mar | Building phylogenetic trees with NJ | |
20 | Tue 26 Mar | Random variables; Probability; Discrete and continuous | |
21 | Thu 28 Mar | Joint, marginal, conditional; Bayes rule | PS5 due 5pm Fri; PS6 out |
22 | Tue 02 Apr | Models; Parameter estimation: ML, MAP, PME; Factoring | |
23 | Thu 04 Apr | Graphical models; Markov models | |
24 | Tue 09 Apr | Hidden Markov models (HMMs); HMM decoding | |
25 | Thu 11 Apr | Viterbi decoding and traceback | PS6 due 5pm Fri; PS7 out |
26 | Tue 16 Apr | Posterior decoding; Estimating HMM parameters; Baum-Welch | |
27 | Thu 18 Apr | GENSCAN; Profile HMMs and other HMM extensions | |
28 | Tue 23 Apr | Course summary; Open time for questions | PS7 due 5pm Fri |
Here is a proof of the claim in the neighbor-joining (NJ) algorithm that selecting the pair of nodes whose adjusted distance is least (using the specific definition of adjusted distance outlined in class) will indeed reliably reveal true neighbors in the tree.
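To see why an adjusted distance is needed at all, here is a minimal sketch on a made-up four-leaf tree. It assumes the commonly published form of the NJ criterion, Q(i, j) = (n - 2) d(i, j) - r(i) - r(j), where r(x) is the sum of distances from x to all other leaves; the definition outlined in class may be scaled or written differently. The tree, its branch lengths, and the function name are hypothetical.

```python
from itertools import combinations

def adjusted_distances(dist):
    """Return {(i, j): Q(i, j)} for a symmetric table of pairwise distances.

    Assumed criterion: Q(i, j) = (n - 2) * d(i, j) - r(i) - r(j),
    where r(x) is the sum of distances from leaf x to every other leaf.
    """
    leaves = sorted(dist)
    n = len(leaves)
    row_sum = {x: sum(dist[x][y] for y in leaves if y != x) for x in leaves}
    return {(i, j): (n - 2) * dist[i][j] - row_sum[i] - row_sum[j]
            for i, j in combinations(leaves, 2)}

# Hypothetical tree: A (branch 1) and B (branch 4) are neighbors,
# C (branch 1) and D (branch 4) are neighbors, internal edge of length 1.
pairs = {("A", "B"): 5, ("A", "C"): 3, ("A", "D"): 6,
         ("B", "C"): 6, ("B", "D"): 9, ("C", "D"): 5}
dist = {x: {} for x in "ABCD"}
for (i, j), d in pairs.items():
    dist[i][j] = d
    dist[j][i] = d

closest_raw = min(pairs, key=pairs.get)   # smallest raw distance
q = adjusted_distances(dist)
best_adjusted = min(q, key=q.get)         # smallest adjusted distance
```

In this example the smallest raw distance joins A and C, which are not neighbors, while the smallest adjusted distance is achieved by the true neighbor pairs (A, B) and (C, D), which tie.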
An overview of the DFS and BFS algorithms for visiting the nodes of a graph.
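For a quick reminder of how the two traversals differ, here is a minimal sketch, assuming the graph is stored as a dictionary mapping each node to a list of its neighbors (a representation chosen for this example, not necessarily the one used in the overview). The only real difference is the container: DFS uses a stack, BFS uses a queue.

```python
from collections import deque

def dfs(graph, start):
    """Return nodes in the order a depth-first search first visits them."""
    visited, order = set(), []
    stack = [start]
    while stack:
        node = stack.pop()
        if node in visited:
            continue
        visited.add(node)
        order.append(node)
        # Push neighbors in reverse so the first-listed neighbor is explored first.
        stack.extend(reversed(graph.get(node, [])))
    return order

def bfs(graph, start):
    """Return nodes in the order a breadth-first search first visits them."""
    visited, order = {start}, [start]
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for neighbor in graph.get(node, []):
            if neighbor not in visited:
                visited.add(neighbor)
                order.append(neighbor)
                queue.append(neighbor)
    return order

# Hypothetical example graph.
graph = {"A": ["B", "C"], "B": ["D"], "C": ["D"], "D": []}
dfs_order = dfs(graph, "A")
bfs_order = bfs(graph, "A")
```

On this graph, DFS reaches D through B before it ever looks at C, whereas BFS visits both of A's neighbors before moving on to D.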
The slide presented in class showing a portion of an alignment between the spike protein (GenBank ID: YP_009724390.1) from SARS-CoV-2 (the virus causing COVID-19) and the spike protein (GenBank ID: NP_828851.1) from SARS-CoV-1 (the virus causing the original SARS outbreak of 2003).
The example presented in class illustrating the Burrows-Wheeler Transform (BWT) of a short genomic text, and relating that to the simplest version of a suffix array, as well as to the beginnings of the FM-index data structure.
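If you want to experiment with the relationship between the suffix array and the BWT yourself, here is a minimal sketch. It builds the suffix array naively by sorting suffixes (fine for short texts, far too slow for real genomes) and assumes a `$` sentinel that sorts before every nucleotide. The text `ACAACG` is a made-up example, not necessarily the one shown in class.

```python
def suffix_array(text):
    """Naive suffix array: start positions of all suffixes, in sorted order."""
    return sorted(range(len(text)), key=lambda i: text[i:])

def bwt(text, sentinel="$"):
    """Return (BWT string, suffix array) of text with a sentinel appended."""
    if sentinel in text:
        raise ValueError("text must not already contain the sentinel")
    t = text + sentinel
    sa = suffix_array(t)
    # Each BWT character is the one just before the corresponding sorted
    # suffix; for the suffix starting at 0, t[-1] wraps around to the sentinel.
    return "".join(t[i - 1] for i in sa), sa

transformed, sa = bwt("ACAACG")
```

Here `sa` is `[6, 2, 0, 3, 1, 4, 5]` and `transformed` is `GC$AAAC`: reading the character that precedes each sorted suffix gives the last column of the sorted rotations.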
A careful description of the algorithm for finding the closest pair of points in O(n log n) time. This is from the 2nd edition of “Introduction to Algorithms” (fondly known as CLRS).
The recurrence relations that arise in analyzing divide-and-conquer algorithms commonly take on a certain form in which the running time for a problem of size n can be expressed in terms of the running time of some number a of subproblems, each b times smaller (i.e., of size n/b), plus some extra work (which might depend on n). In such cases, a powerful theorem can help you solve just such a recurrence.
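As a rough illustration, here is a sketch of a simplified version of that theorem, assuming the extra work is polynomial, so that the recurrence is T(n) = a T(n/b) + Θ(n^d). The general theorem covers more cases than this; the function name and return convention are made up for the example. The answer depends on how d compares with the critical exponent log_b(a).

```python
import math

def master_theorem(a, b, d):
    """Solve T(n) = a*T(n/b) + Theta(n**d) (simplified master theorem).

    Returns (exponent, has_log_factor): (e, False) means Theta(n**e),
    and (e, True) means Theta(n**e * log n).
    """
    if a < 1 or b <= 1 or d < 0:
        raise ValueError("need a >= 1, b > 1, d >= 0")
    critical = math.log(a) / math.log(b)  # log base b of a
    if math.isclose(d, critical, abs_tol=1e-12):
        return (d, True)          # work is balanced across all levels
    if d > critical:
        return (d, False)         # work at the root dominates
    return (critical, False)      # work at the leaves dominates

merge_sort = master_theorem(2, 2, 1)      # T(n) = 2T(n/2) + n
binary_search = master_theorem(1, 2, 0)   # T(n) = T(n/2) + 1
karatsuba = master_theorem(3, 2, 1)       # T(n) = 3T(n/2) + n
```

These three calls give n log n for merge sort, log n for binary search, and roughly n^1.585 for Karatsuba multiplication.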
If you’d like to learn a bit more about sorting—how different algorithms work and how they compare in practical terms for specific kinds of inputs—check out this cool demo site. Also, here are some fun videos: quickly visualizing the execution of 15 sorting algorithms and a Hungarian folk dance version of bubble sort (if you can’t get enough of that, there are many more).
Here are the two videos I showed in class illustrating replication, transcription, and translation if you want to rewatch them:
All the PDF slides, Python exercises, and solutions we provided during the Python tutorials, Parts 1 and 2, are available for download on Canvas.
Here are a few different kinds of resources for those with less biology background, ranging from the comprehensive to a basic overview:
Please note that none of these books is required for the class (or even used in the class!), but I include this list because some students may benefit from having one or more of them to which to refer, whether during the semester or in the future. Each of the books summarized here is linked to Amazon where you can read more (these are not affiliate links). As for books that can serve as Python references, many resources can be found for free online, even complete textbooks downloadable as PDFs (you’ll save trees that way (unless you print them)).
In this class, we will write all our code using Python 3.11, the latest version of which can be downloaded free for any OS from Anaconda. Anaconda (or its minimalist cousin Miniconda) sets up Python in a special environment to prevent it from conflicting with other versions of Python you may have installed. It includes a command-line tool called conda for managing Python environments and adding new packages (though we will not be using any extra packages in this course).
We encourage everyone to develop their Python code using the PyCharm IDE from JetBrains, which is free for educational use (even the Professional edition). We provide clear directions for setting up the PyCharm IDE, and can offer assistance to students who use it, but if you prefer to develop your code using another IDE, you are free to go your own way.
Complete directions for setting all this up can be found here.
Once you have set up a Python 3.11 environment, and have the PyCharm IDE configured to use it, you are ready to start on the first half of Problem Set 1, which we call PS1a. This will familiarize you with how course problem sets are structured, and will confirm that you are ready to download problem sets, write Python code to work with strings representing genomic sequences, and submit your work to Gradescope.
All problem sets will be available within the Files section of Canvas and their release will be announced on Ed, so to work on PS1a, be sure you can access Ed and then look for the post announcing its release.
When you are ready to submit the problem set, you will do so directly to our course Gradescope site.
All students are expected to abide by standards of academic integrity. This includes all the various aspects of Duke’s Community Standard. In particular, be reminded that it is not acceptable to take the ideas/work of another (including the output of a large language model like ChatGPT) and represent it as one’s own, even if paraphrased. Ideas/work taken from others—including Internet sources, peers in the class, peers from outside the class—must always be appropriately cited.
Violations of academic integrity will be taken very seriously. At a minimum, assignments in which a student either receives inappropriate input from others or provides inappropriate input to others will be graded as 0. In addition, all violations will be communicated to the Dean who directs the Office of Student Conduct.
Unless expressly granted otherwise, the entirety of every problem set should be completed individually; no collaboration is permitted, nor is access to solutions or another student’s work; to be explicit, you may also never use a large language model like ChatGPT or Copilot in this class. If you have worked for a while on a particular problem and have encountered a mental wall, and if you have banged your head against said wall for a while, please post a question on Ed, or speak to the instructor or TAs.
If for any reason you consult your peers outside of Ed, such an interaction must be one of consultation and not collaboration, and should never involve looking at a peer’s screen or other work: clarification of general concepts is fine, but nothing specific about the answer should be communicated, and afterward, it is expected that you should still have plenty of thinking to do. In addition, if you do happen to consult with another student, both of you must cite this.
Note that posting your work or our course materials, especially our solutions, onto a repository accessible to other students—whether a publicly accessible one like GitHub or a less public one nevertheless accessible to other students, now or in the future—is a violation of the collaboration policy (and copyright law). Be considerate to other students: don’t post materials that might tempt others to violate the course collaboration policy and thereby their academic integrity.
You will generally have two weeks to work on each problem set. All problem sets will be due at 5pm on Friday evening. If you turn in work after the respective 5pm deadline, a late penalty will be applied amounting to 10% of the total number of available points for every 12 hours late, rounded up: 10% if you are 0–12 hours late, 20% if you are 12–24 hours late, 30% if you are 24–36 hours late, and 40% if you are 36–48 hours late. No work will be graded if it is turned in more than 48 hours late.
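The penalty schedule above can be written as a short function. This is only an illustrative sketch: the function name is made up, and it assumes that a submission exactly on a 12-hour boundary falls in the lower bracket (for example, exactly 12 hours late counts as 10%).

```python
import math

def late_penalty_percent(hours_late):
    """Percent of available points deducted, or None if too late to be graded."""
    if hours_late <= 0:
        return 0                      # on time
    if hours_late > 48:
        return None                   # more than 48 hours late: not graded
    # 10% for every 12 hours late, rounded up (boundary handling is an assumption).
    return 10 * math.ceil(hours_late / 12)

examples = {h: late_penalty_percent(h) for h in (0, 0.5, 12, 12.01, 30, 48, 49)}
```

For instance, 30 minutes late costs 10%, 30 hours late costs 30%, and 49 hours late means the work is not graded.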
Students generally have two weeks to work on problem sets—not because two weeks are generally required to finish, but 1) to allow students who start early sufficient time to ruminate on problems if they reach an impasse (solving a complex problem often requires some gestation period before things become clear) and 2) to provide flexibility as to when students complete their work while they juggle other requirements and commitments during the semester.
Given this latter point, students should not request extensions for turning in their work beyond the two weeks already allotted. However, everyone invariably has especially tough intervals during the semester. In acknowledgement of that, we will waive each student’s two largest late penalties, up to a maximum total of 40% in waived penalties. For example, this means you may turn in work up to 48 hours late once without penalty, or you may turn in work up to 24 hours late twice without penalty.
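To make the waiver arithmetic concrete, here is a small sketch. It assumes the 40% cap applies to the sum of the two largest penalties, and it reports any remainder as a single total in percent units; how a remainder would actually be distributed across problem sets is not specified above, so treat that part as an assumption. The function name is hypothetical.

```python
def apply_waiver(penalties, max_waived=40):
    """Return (waived, remaining) given one late penalty (in percent) per problem set."""
    two_largest = sorted(penalties, reverse=True)[:2]
    # Waive the two largest penalties, but never more than max_waived in total.
    waived = min(sum(two_largest), max_waived)
    return waived, sum(penalties) - waived

once_48h_late = apply_waiver([40])           # one problem set 48 hours late
twice_24h_late = apply_waiver([20, 20])      # two problem sets 24 hours late
three_late = apply_waiver([10, 30, 20])      # two largest total 50, cap is 40
```

The first two cases reproduce the examples in the text (everything is waived); in the third, 40% is waived and 20% in penalties remains.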
We have designed problem sets in the class to permit you to explore the material, and to develop deeper understanding of the material through that exploration. I ask you to focus on the ideas and the learning rather than on the points and the credit; put another way, please adopt a perspective of working to satisfy your own expectations rather than working to satisfy the instructor’s expectations.
That said, we still need to assign points and credit when evaluating work: this is unfortunately unavoidable. However, we have designed our approach to evaluating work to be consistent with the perspective of the previous paragraph; the approach is perhaps a little different from what you may be familiar with in other classes. Specifically, I have asked the graders to frame their grading in terms of ‘positive earning’ rather than ‘negative error’.
What do I mean by this? Well, a ‘negative error’ approach is one in which one assumes one’s work will earn full credit unless there are mistakes present. Under such an approach, graders are negatively tasked with finding mistakes and errors, and taking away points for any they find.
I have inverted this by choosing to adopt a ‘positive earning’ approach in which an empty problem set earns no points, and students earn more points as they demonstrate deeper levels of mastery of the material and challenge. Under such an approach, graders are instead positively tasked with finding ways that students should earn credit for deeply engaging the material.
A corollary of the ‘negative error’ approach is that unless a student makes a mistake, they are entitled to the highest number of points possible. Conversely, a corollary of the ‘positive earning’ approach is that it is possible for a student to not make any mistakes yet still not earn the highest number of points possible. For example, this can happen if a student minimally engages the material, and while not making any mistakes, never demonstrates mastery or depth of understanding. Our ‘positive earning’ approach not only focuses on the positive instead of the negative, but it also leaves room to grant more credit to students who engage the material more deeply.
I write all this because if you find that you did a problem without making a mistake, but got only +3 when some other student may have gotten +4, it doesn’t necessarily mean that something is wrong (though it might be). It could mean that there were some interesting ways to engage the problem you didn’t explore that the other student did. An analogy might be from a video game like Mario Brothers: you can successfully rescue the princess but still not end up with the highest score because someone can score higher if they take the time to explore a pipe that leads in a new direction. Analogously, earning the highest number of points possible usually requires more than just ‘no mistakes’; it also requires demonstration of mastery and engagement. We use rubrics to apply these judgments consistently across the class, and the rubrics are not pre-determined: our rubrics adapt to give credit for new ways we see students engaging a problem.
Students will submit their problem set work directly to our course Gradescope site. After each assignment is graded, scores will be available to students within Gradescope. Once those are finalized in Gradescope, they will move into the gradebook on Canvas, where they will accumulate throughout the semester. Importantly, note that scores in Gradescope and Canvas do not take into account late penalties. Those are assessed at the end of the semester, after the largest two late penalties are waived (see above).
This class uses Ed for course announcements, communication, and discussion. The Ed platform is designed to get you help quickly and efficiently from classmates, the TAs, and the instructor. Rather than emailing questions to the teaching staff, please post your questions on Ed so everyone can provide responses and benefit from those responses.
Posting a question to Ed is the fastest way to get help, and it’s also the most efficient way for us to provide help, because if two people have the same question, we only need to answer it once. On that note, don’t forget to do a quick keyword search to see if your question has already been answered before posting it: the fastest answer is the one that’s already there!
To enroll yourself in the Ed site for this class, you will first need to log in to the COMPSCI 260 site on Canvas. Once in Canvas, select Ed from the menu on the left; you will be taken to Ed in a new tab and prompted to log in there (or create a new Ed account if you do not yet have one).
Once you have enrolled yourself in the Ed site by accessing it via Canvas, you no longer need to go through Canvas to access it in the future. You can instead visit our Ed class discussion board directly.
IMPORTANT DETAILS:
This is an optional challenge for students interested in applying what we have learned in class to a real computational genomics research problem; practicing the skills of using Python or R (or any other tool you wish) to visualize, analyze, model, and interpret real genomic data; and exploring the science linking chromatin structure and transcriptional regulation. Since this problem represents an open challenge for the genomics community, you are free to choose the approaches you use to analyze the data, as well as the questions you explore. Creative projects are highly encouraged. You may work in small teams (2-3 is ideal). For all submissions we receive by the deadline of 15 Dec 2024, we will provide feedback, and will also designate a best project as well as a most creative project. There will be (simple) prizes!
In this data expedition challenge, we will explore next-generation sequencing reads from MNase-seq experiments in yeast. The data was generated to detect genome-wide binding locations of various kinds of DNA-binding proteins. The MNase-seq data sets were collected at Duke as part of our ongoing computational genomic research collaboration with the lab of Prof. David MacAlpine in the Department of Pharmacology and Cancer Biology.
DNA-binding proteins, including nucleosomes and transcription factors (TFs), play essential roles in gene regulation, and their locations along the genome help give us clues about how genes are regulated. Recently, a new MNase-seq protocol was developed by the MacAlpine group at Duke in conjunction with the Henikoff group at the University of Washington. [1] The basic idea is that genomic locations not bound by proteins are accessible to micrococcal nuclease (MNase) and are therefore more sensitive to MNase digestion. Conversely, genomic locations bound by proteins are less sensitive to MNase digestion.
Consequently, if we sequence the ends of the fragments that remain after MNase digestion, and map the paired sequencing reads that arise, we should be able to see where MNase was able to digest/cut the genome, revealing something about the binding locations of DNA-binding proteins along the genome. It is important to note that the genome of each individual cell in a population may be in a slightly different occupancy/protection state. We collect data from a population of cells so this experiment is sampling the different protection states present in the cell population.
Complicating the issue further, MNase is also known to have a nucleotide-specific bias as it digests DNA, meaning that it tends to cleave/digest certain sequences more than others. For example, it prefers to digest A/T nucleotides compared to G/C (its bias is actually a bit more subtle/complex than that, which is a nice model selection challenge you can explore: what is the simplest model that captures well this bias?). To give you further information about this sequence bias, we are also providing MNase digestion data of naked (deproteinized) DNA in vitro which will allow for the development of models to quantify such bias (because with this data, the variation in cutting that you see is only the result of the MNase interacting with the naked DNA and is not influenced by protein protection).
Usually, sequencing reads are stored in FASTQ files. In this case, we downloaded two large sets of yeast MNase-seq reads in FASTQ format: in vivo yeast MNase-seq reads generated by Henikoff et al. [1], and in vitro yeast MNase-seq reads generated by Deniz et al. [2] for use in quantifying MNase digestion bias. The two data sets contain short sequencing reads of length 25 and 54 base pairs, respectively. The total number of reads in each is on the order of 100 million. For reference, the yeast genome contains 16 chromosomes whose total size is approximately 12.5 million base pairs.
To analyze those sequencing reads, you would typically first need to map the reads to a reference genome, using a tool like Bowtie. However, to simplify this challenge, we have already performed this mapping step for you. We are thus providing you one tab-delimited text file for each of the first 12 yeast chromosomes, named ChrI through ChrXII (yeast geneticists like Roman numerals); we will reserve the remaining 4 yeast chromosomes to evaluate your submitted results. Each file contains the start and end genome coordinates of all the reads mapped to that chromosome, one read per line. You may notice that the distances between the start and end coordinates are larger than 25 or 54 base pairs. That is because the MNase-seq experiments produce paired-end reads and we are indicating the coordinates of the spanned fragment from which the two reads come. So the provided coordinates are the start coordinate of one read, along with the start coordinate of its mated read on the opposite strand; or, put another way, the first and last nucleotide of the fragment.
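As a possible first step, here is a minimal sketch for reading fragment coordinates and tabulating fragment lengths. It assumes each line holds exactly a start and an end coordinate separated by a tab, with no header line, and that both coordinates are inclusive (consistent with "the first and last nucleotide of the fragment"). Check the actual files before relying on these assumptions. The function names and the example coordinates are made up.

```python
from collections import Counter

def parse_fragments(lines):
    """Yield (start, end) integer pairs from tab-delimited lines, one fragment per line."""
    for line in lines:
        line = line.strip()
        if not line:
            continue                      # skip blank lines
        start, end = line.split("\t")[:2]
        yield int(start), int(end)

def fragment_length_counts(fragments):
    """Count fragments by length, treating both coordinates as inclusive."""
    return Counter(end - start + 1 for start, end in fragments)

# Made-up lines standing in for the contents of one chromosome file.
example_lines = ["1000\t1146", "1005\t1151", "", "2000\t2039"]
fragments = list(parse_fragments(example_lines))
lengths = fragment_length_counts(fragments)
```

Because `parse_fragments` accepts any iterable of lines, you can pass it an open file object for one of the chromosome files instead of the example list. A histogram of the resulting lengths is a natural first plot.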
It is reasonable to think of the start and end coordinates as nucleotides just beyond which MNase cleaved the DNA, while the sequence between the start and end coordinates was not digested by MNase. We also provide the whole yeast genome sequence (sacCer2 2008 version, in separate FASTA files) if you wish to extract the actual sequence around the cleavage sites based on the provided coordinates.
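One very simple way to start looking at sequence bias is to compare how often each nucleotide appears at fragment ends with how often it appears in the chromosome overall. The sketch below does this under several stated assumptions: coordinates are 1-based and inclusive, the chromosome sequence is already loaded as a plain string, and strand is ignored. All of these should be checked or refined for a real analysis. The function name and toy data are hypothetical.

```python
from collections import Counter

def end_base_enrichment(chrom_seq, fragments):
    """Ratio of each base's frequency at fragment ends to its overall frequency.

    A ratio above 1 means the base is over-represented at fragment ends.
    """
    fragments = list(fragments)
    if not fragments:
        raise ValueError("need at least one fragment")
    seq = chrom_seq.upper()
    background = Counter(seq)
    ends = Counter()
    for start, end in fragments:
        # Assumes 1-based inclusive coordinates; adjust if the files are 0-based.
        ends[seq[start - 1]] += 1
        ends[seq[end - 1]] += 1
    total_ends = sum(ends.values())
    return {base: (ends[base] / total_ends) / (background[base] / len(seq))
            for base in "ACGT" if background[base]}

# Toy sequence with equal base composition, and fragments ending only on A or T.
toy_seq = "ATATGCGC"
toy_fragments = [(1, 4), (2, 3)]
enrichment = end_base_enrichment(toy_seq, toy_fragments)
```

In the toy example, A and T each get a ratio of 2.0 and G and C get 0.0. Running the same comparison on the naked-DNA data would give a baseline that reflects MNase preference alone, and extending the window beyond a single base is one way to explore richer bias models.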
All data files for this challenge are available from:
You will need to do some independent exploration to figure out what to do next. You may want to read more about the MNase enzyme and how it works, or what is known about it. You probably want to get more info about the MNase-seq protocol, as described in the original paper. [1] Then you can start exploring one or multiple of the following, depending on what suits your fancy, or you may have other ideas of your own:
Good luck, and have fun on this expedition!