COMPSCI 260, Spring 2026

Overview

A computational perspective on the exploration and analysis of genomic and genome-scale information. Provides an integrated introduction to genome biology, algorithm design and analysis, and probabilistic and statistical modeling. Topics include genome sequencing, genome sequence assembly, read mapping, local and global sequence alignment, sequence database search, gene finding, phylogenetic tree construction, and elementary gene expression analysis. Methods include dynamic programming, indexing, hidden Markov models, and elementary supervised machine learning. Focuses on foundational algorithmic principles. Development of practical experience with handling, analyzing, and visualizing genomic data using the computer language Python.

The course requires students to program often in Python. Students coming in to the course must already know how to program in some computer language, but it need not be Python. If it is not Python, students will be expected to come quickly up to speed in Python on their own. Additionally, students should be comfortable with mathematical thinking and formulas, and should have had some exposure to basic probability as well as molecular or cellular biology; however, the course has no formal course prerequisites, and quick refreshers of relevant background will be provided. Please speak to the instructor if you are unsure about your background. This course is a valid elective in both biology and computer science.

Staff

Professor Alex Hartemink

Webpage: https://users.cs.duke.edu/~amink
Email: amink at cs.duke.edu
Office Location: LSRC D239
Office Phone: (919) 660-6514

Mun Hong Fong, TA	Email: munhong.fong at duke.edu

Bo Chi, UTA	Email: bo.chi at duke.edu
Elise Nackley, UTA	Email: elise.nackley at duke.edu
Kunling Tong, UTA	Email: kunling.tong at duke.edu
Maria Hromcenco, UTA	Email: maria.hromcenco at duke.edu
Nikita Rao, UTA	Email: nikita.rao at duke.edu
Palak Jolly, UTA	Email: palak.jolly at duke.edu
Sushrit Pasumarthy, UTA	Email: sushrit.pasumarthy at duke.edu
Will Harris, UTA	Email: will.harris at duke.edu

Office hours

In-person office hours with UTAs will be held at the following times, starting on Tuesday 13 January. To find LSRC B102, where all sessions will be held, enter the LSRC Hall of Science just beyond the DIBS Cube and the Pratt POD—as if you're going to Love Auditorium—and it's the room on your left near the base of the stairs. Remember that if your question is pretty simple, you are likely to get help quickest by asking on Ed (in fact, your question may already be answered there).

Sunday	2–4pm	Will Harris	LSRC B102
Tuesday	6:30–8:30pm	Bo Chi	LSRC B102
Wednesday	2:45–4:45pm	Nikita Rao	LSRC B102
	4:45–6:45pm	Elise Nackley	LSRC B102
Thursday	4:30–6:30pm	Maria Hromcenco	LSRC B102
	6:30–8:30pm	Kunling Tong	LSRC B102
Friday	11am–1pm	Sushrit Pasumarthy	LSRC B102
	3–5pm	Palak Jolly	LSRC B102

If you would like to speak with the instructor about anything, you are welcome to stick around after lecture to chat, or you can send an email to schedule a meeting at a time that is convenient for you.

Logistics

The class meets 1:25–2:40pm on Tuesdays and Thursdays in 111 BioSci.

Course schedule

Note: The course schedule may change subtly from time to time. Always check this page for the most up-to-date schedule.

Session	Date	Topic	Assignment (always due 5pm following Monday unless indicated)
1	Thu 08 Jan	Course introduction; SARS-CoV-2 genome introduction
2	Tue 13 Jan	Molecular biology primer: DNA, RNA, and protein	PS1 out
3	Thu 15 Jan	Gene/genome organization; SARS-CoV-2 genome revisited
4	Tue 20 Jan	Algorithm introduction; Time and space resources	PS2 out
5	Thu 22 Jan	Analyzing algorithms; Designing efficient algorithms
6	Tue 27 Jan	Divide-and-conquer recursion	PS3 out
7	Thu 29 Jan	Divide-and-conquer recursion fails; Memoization
8	Tue 03 Feb	Dynamic programming; Greedy algorithms
9	Thu 05 Feb	UNIT 1 MIDTERM
10	Tue 10 Feb	Short-read mapping; Prefix trees; Suffix trees and arrays	PS4 out
11	Thu 12 Feb	BWT; Introduction to FM-index
12	Tue 17 Feb	FM-index continued; DNA sequencing	PS5 out
13	Thu 19 Feb	Genome assembly: Human Genome Project (HGP) and Celera
14	Tue 24 Feb	Sequence variation; Global alignment	PS6 out
15	Thu 26 Feb	Global alignment traceback; Affine gap penalties and alignment
16	Tue 03 Mar	Affine gap alignment traceback; Local alignment; FASTA and BLAST heuristics
17	Thu 05 Mar	UNIT 2 MIDTERM
	Tue 10 Mar	SPRING BREAK — enjoy!
	Thu 12 Mar	SPRING BREAK — enjoy!
18	Tue 17 Mar	Phylogenetic trees; Time and distance	PS7 out
19	Thu 19 Mar	Building phylogenetic trees (UPGMA and NJ)
20	Tue 24 Mar	Random variables; Probability; Discrete and continuous	PS8 out
21	Thu 26 Mar	Joint, marginal, conditional; Bayes rule
22	Tue 31 Mar	Models; Parameter estimation: ML, MAP, PME; Factoring	PS9 out
23	Thu 02 Apr	Graphical models; Markov models
24	Tue 07 Apr	Hidden Markov models (HMMs); HMM decoding	PS10 out
25	Thu 09 Apr	Viterbi decoding and traceback
26	Tue 14 Apr	Posterior decoding; Estimating HMM parameters; Baum-Welch	PS11 out
27	Thu 16 Apr	GENSCAN: the HMM-based gene finder
28	Tue 21 Apr	Course summary; Open time for questions
	Sat 02 May	FINAL EXAM	2pm–5pm

Closest pair of points

A careful description of the algorithm for finding the closest pair of points in O(n log n) time. This is from the 2nd edition of “Introduction to Algorithms” (fondly known as CLRS).

Theorem for solving certain recurrence relations

The recurrence relations that arise in analyzing divide-and-conquer algorithms commonly take on a certain form in which the running time for a problem of size n can be expressed in terms of the running time of a copies of a problem that is b times smaller (i.e., size n/b), plus some extra work (which might depend on n). In such cases, a powerful theorem can help you solve just such a recurrence.

Learn more about sorting

If you’d like to learn a bit more about sorting—how different algorithms work and how they compare in practical terms for specific kinds of inputs—check out this cool demo site. Also, here are some fun videos: quickly visualizing the execution of 15 sorting algorithms and a Hungarian folk dance version of bubble sort (if you can’t get enough of that, there are many more).

Videos illustrating replication, transcription, and translation

Here are the two videos I show in class during Lecture 3 illustrating replication, transcription, and translation if you want to rewatch them:

and here are a few useful ones that I didn't show in class, with slightly more molecular visualizations:

Python tutorial materials

All the PDF slides, Python exercises, and solutions from our Python tutorial materials, Parts 1 and 2, are available for download on Canvas.

For shoring up your biology background

Here are a few different kinds of resources for those with less biology background, ranging from the comprehensive to a basic overview:

Molecular Biology for Computer Scientists, a short primer by Larry Hunter (from the early 1990’s but still an excellent brief distillation).
The Chemistry of Life, a short primer by Michael Behe.
Molecular Biology of the Cell, a huge classic textbook by Bruce Alberts, et al., is available for free online at NCBI. It is a good resource for biological questions and background, but can only by accessed via a search interface. If you don’t understand a term or concept, you can try searching here.

Optional reference textbooks

Please note that none of these books is required for the class (or even used in the class!), but I include this list because some students may benefit from having one or more of them to which to refer, whether during the semester or in the future. Each of the books summarized here is linked to Amazon where you can read more (these are not affiliate links). As for books that can serve as Python references, many resources can be found for free online, even complete textbooks downloadable as PDFs (you’ll save trees that way (unless you print them)).

Biological Sequence Analysis: Probabilistic Models of Proteins and Nucleic Acids: This book covers a good chunk of the material that will be covered in the course with a distinctly probabilistic focus. It is very well written and at a fairly advanced level. It is an excellent reference for folks continuing on in this area.
https://www.amazon.com/exec/obidos/ASIN/0521629713/
An Introduction to Bioinformatics Algorithms: This book by Jones and Pevzner covers many of the topics that will be covered in class, but organizes the material around algorithmic methods rather than biological problems. So one chapter presents many problems across bioinformatics that exploit string matching, etc. While not complete for this course, it is a nice, readable, undergraduate textbook.
https://www.amazon.com/exec/obidos/ASIN/0262101068/
Bioinformatics Algorithms: An Active Learning Approach: A more recent book by Pevzner and Compeau, this time an attempt to capture/complement the teaching that they did together as part of a Coursera course. I have not yet had a chance to examine it too closely, but the course received high praise.
https://www.amazon.com/exec/obidos/ASIN/0990374637/
Introduction to Computational Genomics: A Case Studies Approach: Hahn was a grad student at Duke in the lab of Greg Wray and took this very class once upon a time. I love the case study approach: it makes for very interesting reading. At the end of the day, I decided not to require the book because it doesn’t go into much depth, but in terms of overlap with course content, it comes pretty close so it may be useful for some.
https://www.amazon.com/exec/obidos/ASIN/0521671914/

Directions for setting up Anaconda Python and PyCharm IDE

In this class, we will write all our code using Python 3.13, the latest version of which can be downloaded free for any OS from Anaconda. Anaconda (or its minimalist cousin Miniconda) sets up Python in a special environment to prevent it from conflicting with other versions of Python you may have installed. It includes a command-line tool called conda for managing Python environments and adding new packages (though we will not be using any extra packages in this course).

We encourage everyone to develop their Python code using the PyCharm IDE from JetBrains, which is free (and the Pro features are even free for educational use). We provide clear directions for setting up the PyCharm IDE, and can offer assistance for students that use it, but if you prefer to develop your code using another IDE, you are free to go your own way.

Complete directions for setting all this up can be found here.

Directions for accessing Problem Sets

Once you have set up a Python 3.13 environment, and have the PyCharm IDE configured to use it, you will be ready to start on Problem Set 1 when it is released. Be careful to read the introduction to this problem set closely as it will familiarize you with how course problem sets are structured.

All problem sets will be made available within the Files section of Canvas and their release will be announced on Ed, so be sure you can access Ed and then look for the post announcing its release.

When you are ready to submit the problem set, you will do so directly to our course Gradescope site.

Academic integrity

All students are expected to abide by standards of academic integrity. This includes all the various aspects of Duke’s Community Standard. In particular, be reminded that it is not acceptable to take the ideas/work of another (including the output of a large language model) and represent it as one’s own, even if paraphrased. Ideas or work taken from others—including large language models (LLMs), online resources, peers in the class, or peers from outside the class—must always be appropriately cited. (You need not cite course staff providing information via lecture or during office hours.)

Violations of academic integrity will be taken very seriously. At a minimum, problem sets or exams in which a student violates an academic integrity policy will be graded as 0, and severe violations may result in a failing (F) final grade. In addition, all violations will be communicated to the Dean who directs the Office of Student Conduct and Community Standards.

Grade components

Problem Sets 1–11: 25%
Unit 1 Midterm: 15%
Unit 2 Midterm: 20%
Final Exam: 35%
Engaged attendance: 3%: Students are expected to attend class regularly, to be on time for class, and to be engaged and fully present during class.
Course evaluation surveys: 2%: At the end of the semester, you will be asked to complete two course evaluation surveys, one the official university evaluation and one a customized Google survey, each contributing equally to this component of the grade.

More details about the problem sets and midterms are in the following sections.

Problem set resource and collaboration policy

On the problem sets, the most empowering way to complete a problem may be by yourself, but you are permitted to partner with a classmate when working your way through a problem (not in a group of more than 2: partners only, please). If you elect to partner with a classmate, you must each write up your own work for that problem independently and in your own words, and you must provide the name of the partner with whom you worked (not providing the name of the partner with whom you worked would constitute a violation of academic integrity: see above). The idea behind this collaboration policy is to permit you to take a journey through a problem with a partner; the policy does not permit simply splitting up the problem into two portions, each doing one portion, and then swapping results.

As for other resources beyond partnering with a classmate, you are always permitted and strongly encouraged to post questions on Ed, come to office hours (there are 8 two-hour sessions each week), or speak to the instructor. You may use online resources when instructed to do so (naturally), to learn anything about Python syntax or error messages, or to learn more about a topic discussed in class. You may not look for or make use of course materials from previous semesters, whether found online or from peers who may have taken the class before. With respect to LLMs, they may not be used to answer any question asked on a problem set where you are expected to write a response in your own words. As for using an LLM while coding, this is permitted if you cite it (you must cite it, and explain how it was used; not citing it would constitute a violation of academic integrity: see above), but in my experience, LLMs often introduce small errors in code that can be subtle or hard to debug. Moreover, you will be expected to know the material on the exams, so please proceed with caution. My advice is to avoid using LLMs for coding if you want to maximally develop your coding and problem solving skills, and instead make use of Ed, office hours, or reasoning things through with a partner. And if you do use an LLM, there are degrees: you'll benefit more by using it less.

Note that posting your work or our course materials (especially our solutions) onto a repository accessible to other students—whether a publicly accessible one like GitHub or a less public one nevertheless accessible to other students, now or in the future—is a violation of the collaboration policy (and copyright law). Be considerate to other students: don’t post materials that might tempt others to violate the course collaboration policy and thereby their academic integrity.

Problem set grading

The problem sets in this class have been designed to permit you to explore the material, and to develop deeper understanding of the material through that exploration. I encourage you to focus on the ideas and the learning that happens along the journey rather than on the points and the credit; put another way, please adopt a perspective of working to satisfy your own expectations rather than working to satisfy the instructor’s expectations. In support of this design, each problem will be graded on a very coarse 3-point scale:

3 points: Students have correctly completed all or almost all of the problem. You can think of this as achieving 90% or more of what was expected.
2 points: Students have correctly completed more than half but well less than almost all of the problem. You can think of this as achieving 50–90% of what was expected.
1 point: Students have correctly completed less than half of the problem, but have at least made some progress on it. You can think of this as achieving 10–50% of what was expected.
0 points: Students have correctly completed almost none of the problem. You can think of this as achieving 10% or less of what was expected.

There are 11 problem sets, each with two problems on them, each worth up to 3 points, so the maximum number of achievable points is 66. Scores will be capped at 60, so anyone earning 60 or more points will get full credit for the problem sets, and anyone earning fewer than 60 points will get credit proportional to the number of points they earned divided by 60. This means that a student can skip a full problem set, or can miss a few points here and there over the course of the semester, and still achieve the maximum possible score.

Problem set deadlines and late policy

You will have roughly one week to work on each problem set: All problem sets will go out on a Tuesday morning and be due at 5pm on the following Monday evening. The server is configured to accept late work until 11:59PM, but will not accept any work after. If you turn in work after the 5pm deadline but before 11:59PM, a late penalty will be applied, amounting to 1/3 of the number of points available on the problem set. In acknowledgement of the fact that sometimes things go wrong or take longer to complete than expected, the first two penalties will be waived. Thus, penalties only apply starting with the third late submission.

Problem set grade distribution

Students will submit their problem set work directly to our course Gradescope site. After each assignment is graded, scores will be available to students within Gradescope. Once those are finalized in Gradescope, they will move into the gradebook on Canvas, where they will accumulate throughout the semester. Importantly, note that scores in Gradescope and Canvas do not take into account late penalties. Those are assessed at the end of the semester, after the first two late penalties are waived (see above).

Midterm score improvement (and make-up) policy

The final exam will be cumulative, and as such, will include material corresponding to the two unit midterms. We will offer a mechanism for students to improve their scores on their two unit midterms if they happen to do better on the corresponding material on the final (a midterm score will never decrease under this mechanism, even if they happen to do worse on the final). There will be no make-up midterm exams: If a student misses a unit midterm for any reason, their score on that midterm will be calculated based on the corresponding material on the final exam. Put another way, corresponding material on the final exam will serve as the make-up midterm, and will contribute both to the midterm score and to the final score.

Ed discussion board

This class uses Ed for course announcements, communication, and discussion. The Ed platform is designed to getting you help quickly and efficiently from classmates, the TAs, and the instructor. Rather than emailing questions to the teaching staff, please post your questions on Ed so everyone can provide responses, and benefit from those responses.

Posting a question to Ed is the fastest way to get help, and it’s also the most efficient way for us to provide help, because if two people have the same question, we only need to answer it once. On that note, don’t forget to do a quick keyword search to see if your question has already been answered before posting it: the fastest answer is the one that’s already there!

To enroll yourself in the Ed site for this class, you will first need to log in to the COMPSCI 260 site on Canvas. Once in Canvas, you should select Ed from the menu on the left: You will be taken to Ed in a new tab and prompted to log in there (or create a new Ed account if you do not yet have one).

Once you have enrolled yourself in the Ed site by accessing it via Canvas, you no longer need to go through Canvas to access it in the future. You can instead visit our Ed class discussion board directly.

IMPORTANT DETAILS:

Adding your picture to Ed is super helpful to put faces with names as we answer your questions, so please add a photo of yourself if you’re comfortable doing so.
Many of you may be in the situation that you’ve enrolled (or been asked to enroll) in Ed using more than one email address (e.g., netid@duke.edu and first.last@duke.edu). This can lead to multiple accounts, or mismatches between Ed and Canvas, etc. The best way to handle this is to decide on whatever will be your primary email address, log in to your Ed account with that address, and then under Account/Email Settings (look for the gear icon in the upper right), add the other email address(es) to your account. This will merge your multiple accounts into one, and make sure that you’re able to be properly enrolled in a Ed course no matter what email address is used to invite you.

Data expedition challenge: Signal, noise, and bias in yeast MNase-seq data

This is an optional challenge for students interested in applying what we have learned in class to a real computational genomics research problem; practicing the skills of using Python or R (or any other tool you wish) to visualize, analyze, model, and interpret real genomic data; and exploring the science linking chromatin structure and transcriptional regulation. Since this problem represents an open challenge for the genomics community, you are free to choose the approaches you use to analyze the data, as well as the questions you explore. Creative projects are highly encouraged. You may work in small teams (2-3 is ideal). For all submissions we receive by the deadline of 15 Dec 2026, we will provide feedback, and will also designate a best project as well as a most creative project. There will be (simple) prizes!

Data description

In this data expedition challenge, we will explore next-generation sequencing reads from MNase-seq experiments in yeast. The data was generated to detect genome-wide binding locations of various kinds of DNA-binding proteins. The MNase-seq data sets were collected at Duke as part of our ongoing computational genomic research collaboration with the lab of Prof. David MacAlpine in the Department of Pharmacology and Cancer Biology.

Biological background

DNA-binding proteins, including nucleosomes and transcription factors (TFs), play essential roles in gene regulation, and their locations along the genome help give us clues about how genes are regulated. Recently, a new MNase-seq protocol was developed by the MacAlpine group at Duke in conjunction with the Henikoff group at the University of Washington. [1] The basic idea is that genomic locations not bound by proteins are accessible to micrococcal nuclease (MNase) and are therefore more sensitive to MNase digestion. Conversely, genomic locations bound by proteins are less sensitive to MNase digestion.

Consequently, if we sequence the ends of the fragments that remain after MNase digestion, and map the paired sequencing reads that arise, we should be able to see where MNase was able to digest/cut the genome, revealing something about the binding locations of DNA-binding proteins along the genome. It is important to note that the genome of each individual cell in a population may be in a slightly different occupancy/protection state. We collect data from a population of cells so this experiment is sampling the different protection states present in the cell population.

Complicating the issue further, MNase is also known to have a nucleotide-specific bias as it digests DNA, meaning that it tends to cleave/digest certain sequences more than others. For example, it prefers to digest A/T nucleotides compared to G/C (its bias is actually a bit more subtle/complex than that, which is a nice model selection challenge you can explore: what is the simplest model that captures well this bias?). To give you further information about this sequence bias, we are also providing MNase digestion data of naked (deproteinized) DNA in vitro which will allow for the development of models to quantify such bias (because with this data, the variation in cutting that you see is only the result of the MNase interacting with the naked DNA and is not influenced by protein protection).

Data sets

Usually, sequencing reads are stored in files of fastq format. In this case, we downloaded two large yeast MNase-seq read fastq files: in vivo yeast MNase-seq read files generated by Henikoff et al. [1], and in vitro yeast MNase-seq read files generated by Deniz et al. [2] for use in quantifying MNase digestion bias. Both files contain short sequencing reads, of length 25 and 54 base pairs, respectively. The total number of reads in each file is on the order of 100 million. For reference, the yeast genome contains 16 chromosomes whose total size is approximately 12.5 million base pairs.

To analyze those sequencing reads, you would typically first need to map the reads to a reference genome, using tools like BOWTIE. However, to simplify this challenge, we have already performed this mapping step for you. We are thus providing you one tab-delimited text file for each of the first 12 yeast chromosomes, named with ChrI to ChrXII (yeast geneticists like Roman numerals); we will reserve the remaining 4 yeast chromosomes to evaluate your submitted results. Each file contains the start and end genome coordinates of all the reads mapped to that chromosome, one read per line. You may notice that the distances between the start and end coordinates are larger than 25 or 54 base pairs. That is because the MNase-seq experiments produce paired-end reads and we are indicating the coordinates of the spanned fragment from which the two reads come. So the provided coordinates are the start coordinate of one read, along with the start coordinate of its mated read on the opposite strand; or, put another way, the first and last nucleotide of the fragment.

It is reasonable to think of the start and end coordinates as nucleotides just beyond which MNase cleaved the DNA, while the sequence between the start and end coordinates was not digested by MNase. We also provide the whole yeast genome sequence (sacCer2 2008 version, in separate fasta files) if you wish to extract the actual sequence around the cleavage sites based on the provided coordinates.

All data files for this challenge are available from:

Potential research questions and challenges you can explore

You will need to do some independent exploration to figure out what to do next. You may want to read more about the MNase enzyme and how it works, or what is known about it. You probably want to get more info about the MNase-seq protocol, as described in the original paper. [1] Then you can start exploring one or multiple of the following, depending on what suits your fancy, or you may have other ideas of your own:

Investigate the bias of the MNase enzyme in digesting different nucleotides. (You can use the in vitro data to visualize and then model the distributions of read counts conditioned on the immediate nucleotides at the cutting site. However, because MNase is a protein that contacts more than one nucleotide of DNA, the interaction between MNase and DNA sequence perhaps spans more than one nucleotide, and more sophisticated models based on multiple nucleotides around the cutting site can also be explored.)
Study the distribution of MNase-seq read counts around transcription factor binding motifs to predict transcription factor binding sites: Are there specific signals (also called "footprints") left by bound transcription factors? Are these signals specific to each TF, or are there a few canonical footprint types (clusters of footprints shared by sets of TFs)? Or is there a single kind of footprint that just looks different for different TFs because of the role of MNase bias?
Study the distribution of MNase-seq read counts around nucleosomes: Can we define a nucleosome footprint to predict nucleosomal binding locations?
Visualize and compare the in vivo and in vitro MNase digestion profiles for a few transcription factors to explore the contribution of bias to the in vivo signal.
Anything else you find particularly intriguing or feel compelled to discover.

Potential tasks and approaches for data analysis

Visualize MNase digestion signals along the genome in a UCSC genome browser. This will allow you to see how they are distributed and how they compare to other tracks in the genome browser (like locations of coding regions, TSSs, nucleosomes, TFs, etc.).
Build PSSM (position-specific scoring matrix) or other more sophisticated statistical models to quantify the MNase digestion bias.
Visualize the digestion profiles (footprints) around TF motifs and nucleosomes using scientific plot-generating code (Python or R).
Anything else you find helpful in your explorations.

Good luck, and have fun on this expedition!

References (original sources of data)

Henikoff, J.G. et al. (2011) Epigenome characterization at single base-pair resolution. Proc. Natl. Acad. Sci. U.S.A. 108, 18318—18323.
Deniz, O. et al. (2011) Physical properties of naked DNA influence nucleosome positioning and correlate with transcription start and termination sites in yeast. BMC Genomics 12, 489.

COMPSCI 260: Introduction to Computational Genomics