COMPSCI 260, Fall 2016

Overview

A computational perspective on the exploration and analysis of genomic and genome-scale information. Provides an integrated introduction to genome biology, algorithm design and analysis, and probabilistic and statistical modeling. Topics include genome sequencing, genome sequence assembly, local and global sequence alignment, sequence database search, gene and motif finding, phylogenetic tree building, and gene expression analysis. Methods include dynamic programming, indexing and hashing, hidden Markov models, and elementary supervised and unsupervised machine learning. Development of practical experience with handling, analyzing, and visualizing genomic data using the computer language Python.

The course will require students to program often in Python. Students coming into the course must already know how to program in some computer language, but it need not be Python. If it is not Python, students will be expected to come up to speed quickly in Python on their own. Additionally, students should be comfortable with mathematical thinking and formulas, and should have had some exposure to basic probability as well as molecular or cellular biology; however, the course has no formal course prerequisites, and quick refreshers of relevant background will be provided. Please speak to the instructor if you are unsure about your background. This course is a valid elective in both biology and computer science.

Staff

Professor Alex Hartemink

Webpage: http://www.cs.duke.edu/~amink
Email: amink at cs.duke.edu
Office Location: LSRC D239
Office Phone: (919) 660-6514

Kyle Moran, TA	Email: kyle.moran at duke.edu
Firas Midani, TA	Email: firas.midani at duke.edu
Brian Benesch, UTA	Email: brian.benesch at duke.edu
Jack Michuda, UTA	Email: jackson.michuda at duke.edu
Lydia Xu, UTA	Email: qiongjia.xu at duke.edu
Michael Lee, UTA	Email: guang.yi.lee at duke.edu
Michael Mendelsohn, UTA	Email: michael.mendelsohn at duke.edu
Shariq Iqbal, UTA	Email: shariq.iqbal at duke.edu
Yasminye Pettway, UTA	Email: yasminye.pettway at duke.edu

Office hours

All office hours with TAs and UTAs will be held in the Link (map).

Office hours are not held during the first week of class. We are working to finalize the schedule below, so it is subject to change over the next few days, after which time it will be set for the semester.

Sun 5:00-7:00pm: Brian Benesch
Sun 7:00-9:00pm: Yasminye Pettway
Mon 3:00-4:00pm: Michael Lee
Mon 7:00-9:00pm: Firas Midani
Tue 4:30-6:30pm: Jack Michuda
Tue 6:30-8:30pm: Lydia Xu
Wed 1:00-2:00pm: Michael Mendelsohn
Wed 5:30-7:30pm: Shariq Iqbal
Thu 3:00-5:00pm: Kyle Moran

If these office hours do not work for you, please post questions via Piazza, or send any of us an email to schedule an alternate time. In particular, if you need to speak with the instructor about anything, do not hesitate to send an email to set up a meeting.

Logistics

The class meets on Tuesdays and Thursdays 10:05–11:20AM in 111 Biological Sciences.

Course schedule

Note: The course schedule may change subtly from time to time. Always check the web page for the most up-to-date schedule.

Session	Date	Instructor	Topic	Assignment (out/due on Fridays)
1	Tue 30 Aug	AH	Course introduction; Viral genome introduction
2	Thu 01 Sep	AH	Molecular biology primer: DNA, RNA, and protein	PS1 out
3	Tue 06 Sep	AH	Gene/genome organization; Viral genome revisited
4	Thu 08 Sep	AH	Algorithm introduction; Time and space resources
5	Tue 13 Sep	AH	Analyzing algorithms; Designing efficient algorithms
6	Thu 15 Sep	AH	Divide-and-conquer introduction and exploration	PS1 due; PS2 out
7	Tue 20 Sep	AH	Divide-and-conquer fails; Memoization
8	Thu 22 Sep	AH	Dynamic programming; Heuristics and greedy algorithms
9	Tue 27 Sep	AH	DNA sequencing and the genome assembly problem
10	Thu 29 Sep	AH	Human Genome Project and Celera; Next-generation sequencing	PS2 due; PS3 out
11	Tue 04 Oct	AH	Short-read mapping; Suffix arrays; BWT; FM-index introduction
12	Thu 06 Oct	AH	All the gory details of FM-index
	Tue 11 Oct		FALL BREAK — enjoy!
13	Thu 13 Oct	AH	Sequence variation and alignment; Global alignment	PS3 due; PS4 out
14	Tue 18 Oct	AH	Traceback; Aligning sequences with affine gap scores
15	Thu 20 Oct	AH	Affine gap alignment traceback; Local alignment
16	Tue 25 Oct	AH	Local alignment traceback; FASTA and BLAST heuristics
17	Thu 27 Oct	AH	Phylogenetic trees; Time and distance	PS4 due; PS5 out
18	Tue 01 Nov	AH	Building phylogenetic trees (UPGMA and NJ)
19	Thu 03 Nov	AH	Probability; Discrete and continuous random variables
20	Tue 08 Nov	AH	Infinity; Joint, marginal, and conditional
21	Thu 10 Nov	AH	Bayes rule; Models; Parameter estimation	PS5 due; PS6 out
22	Tue 15 Nov	AH	ML, MAP, PME; Factoring; Graphical models
23	Thu 17 Nov	AH	Markov models; Hidden Markov models
24	Tue 22 Nov	AH	Viterbi decoding and traceback; Posterior decoding
	Thu 24 Nov		THANKSGIVING BREAK — give thanks!	PS6 due; PS7 out
25	Tue 29 Nov	AH	Estimating HMM parameters; Baum-Welch
26	Thu 01 Dec	AH	Advanced HMM applications
27	Tue 06 Dec	AH	HMMs for finding spliced genes; Brief machine learning intro
28	Thu 08 Dec	AH	Course summary; Course evaluations	PS7 due

AH: Alex Hartemink

Advanced HMM applications papers

GENSCAN: Burge and Karlin 1997. This paper reports the key result of Burge's Ph.D. dissertation; as an aside, the senior author, Samuel Karlin, is the same fellow that helped develop the significance statistics for BLAST database searching.
The original profile-HMM paper (1994!), which led to software tools like SAM and HMMer, protein domain databases like Pfam and InterPro, and eventually inspired the PSI-BLAST heuristic.
A review of the impact of profile-HMMs in 1996 and again in 1998. As an aside, the latter paper mentions a new database called BLOCKS, which was later used to estimate substitution frequencies in aligned protein domains, leading to BLOCKS-derived SUbstitution Matrices, which we now know as BLOSUM matrices.
TMHMM, an elegant HMM model for finding and understanding the structure of transmembrane proteins (those that are not soluble in the cytoplasm but are embedded in a lipid bilayer membrane like the cell membrane, Golgi, vacuole, ER, etc.).

Tree of life papers

Seminal papers developing sequence alignment and database search

Graph search

An overview of the DFS and BFS algorithms for visiting the nodes of a graph.
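
As a rough illustration of the two traversals (a minimal sketch, not the linked overview itself), the code below assumes the graph is stored as a dictionary mapping each node to a list of its neighbors. The function names and the toy graph are made up for this example.

```python
from collections import deque


def bfs(graph, start):
    """Return nodes in the order breadth-first search first reaches them."""
    order = [start]
    seen = {start}
    queue = deque([start])
    while queue:
        node = queue.popleft()          # oldest discovered node first (FIFO)
        for neighbor in graph[node]:
            if neighbor not in seen:
                seen.add(neighbor)
                order.append(neighbor)
                queue.append(neighbor)
    return order


def dfs(graph, start):
    """Return nodes in the order depth-first search first visits them."""
    order = []
    seen = set()
    stack = [start]
    while stack:
        node = stack.pop()              # newest discovered node first (LIFO)
        if node in seen:
            continue
        seen.add(node)
        order.append(node)
        # Push neighbors in reverse so the first-listed neighbor is explored first.
        for neighbor in reversed(graph[node]):
            if neighbor not in seen:
                stack.append(neighbor)
    return order


# Hypothetical toy graph: A points to B and C, which both point to D.
toy_graph = {"A": ["B", "C"], "B": ["D"], "C": ["D"], "D": []}
print(bfs(toy_graph, "A"))  # ['A', 'B', 'C', 'D']
print(dfs(toy_graph, "A"))  # ['A', 'B', 'D', 'C']
```

The only structural difference is the container: a queue gives breadth-first order, a stack gives depth-first order.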

Papers debating the merits of shotgun sequencing the whole human genome

Papers reporting a newly sequenced genome

Genome sequencing technology papers

Closest pair of points

A careful description of the algorithm for finding the closest pair of points in O(n log n) time. This is from the 2nd edition of "Introduction to Algorithms" (fondly known as CLRS).

Learn more about sorting

If you'd like to learn a bit more about sorting—how different algorithms work and how they compare in practical terms for specific kinds of inputs—check out this cool demo site. Also, here are some fun videos: quickly visualizing the execution of 15 sorting algorithms and a dance version of bubble sort (if you can't get enough of that, there are many more).

Master theorem for solving certain recurrence relations

The recurrence relations that arise in analyzing divide-and-conquer algorithms commonly take a certain form: the running time T(n) for a problem of size n is expressed in terms of the running time of some number a of subproblems, each b times smaller (i.e., of size n/b), plus some extra work f(n) (which might be a function of n), giving T(n) = a T(n/b) + f(n). In such cases, a powerful master theorem can help you solve the recurrence.
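
To make this concrete, the sketch below applies the commonly taught simplified form of the master theorem to recurrences T(n) = a T(n/b) + Θ(n^d). Restricting the extra work to a polynomial n^d is an assumption of this example (the full theorem is more general), and the function name is made up.

```python
import math


def master_theorem(a, b, d):
    """Solve T(n) = a*T(n/b) + Theta(n^d) for a >= 1, b > 1, d >= 0.

    Returns (exponent, has_log_factor), meaning
    T(n) = Theta(n^exponent * log n) if has_log_factor, else Theta(n^exponent).
    """
    if a < 1 or b <= 1 or d < 0:
        raise ValueError("requires a >= 1, b > 1, d >= 0")
    work_ratio = b ** d                  # compare a against b^d
    if abs(a - work_ratio) < 1e-9:       # work is balanced across levels
        return d, True
    if a < work_ratio:                   # top level (the extra work) dominates
        return d, False
    return math.log(a) / math.log(b), False  # leaves dominate: n^(log_b a)


print(master_theorem(2, 2, 1))  # merge sort: (1, True), i.e. Theta(n log n)
print(master_theorem(1, 2, 0))  # binary search: (0, True), i.e. Theta(log n)
print(master_theorem(2, 2, 2))  # (2, False), i.e. Theta(n^2)
```

Comparing a with b^d rather than d with log_b a avoids floating-point trouble in the balanced case when the inputs are integers.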

Zika resources

NCBI Zika resource, including a link to the reference genome
An excellent New Yorker story about the hunt for a vaccine, describing a number of promising angles. This was written by Siddhartha Mukherjee, the author of The Emperor of All Maladies.
Nature paper cited in New Yorker, describing a DNA vaccine, which is basically a short segment of the Zika genome (but as DNA, so that the host can use it to make the viral proteins to be recognized by the immune system)
Science paper cited in New Yorker, describing different types of vaccines tested on rhesus monkeys
Nature paper on the structure of the virus
Nature paper on birth defects due to the virus
Nature Structural and Molecular Biology paper on the structural variation of a key protein
Science paper on the structure of the virus
Science paper on the structure of a key flavivirus protein
Science paper on evolution and epidemiology of the virus
Science paper on the global spread of the virus

Severe acute respiratory syndrome (SARS)

Here is the SARS genome handout from class. Here is a text file containing the SARS genome (Tor2 isolate, in its RNA form). Have fun parsing it! You can also find it, and a lot more information about it, in GenBank: visit the Genbank entry and see what else you can learn.
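
If you are unsure where to start with parsing, here is a minimal sketch. It assumes the text file holds the sequence as letters, possibly broken across lines and mixed with whitespace or position numbers, and possibly preceded by FASTA-style header lines starting with ">". The actual layout of the class file may differ, and the toy string below is made up rather than taken from the SARS genome.

```python
from collections import Counter


def clean_sequence(text):
    """Return the sequence letters from text, uppercased.

    Skips header lines beginning with '>' and drops whitespace and digits.
    """
    letters = []
    for line in text.splitlines():
        if line.startswith(">"):
            continue
        letters.extend(ch for ch in line if ch.isalpha())
    return "".join(letters).upper()


def base_counts(sequence):
    """Count how many times each base occurs."""
    return Counter(sequence)


def rna_to_dna(sequence):
    """Convert an RNA-form sequence (with U) to its DNA form (with T)."""
    return sequence.replace("U", "T")


# Made-up toy input imitating a numbered, line-wrapped RNA sequence.
toy_text = ">toy header\n  1 augcaugcau\n 11 ggcc\n"
toy_seq = clean_sequence(toy_text)
print(toy_seq)               # AUGCAUGCAUGGCC
print(len(toy_seq))          # 14
print(rna_to_dna(toy_seq))   # ATGCATGCATGGCC
```

To use it on a real file, read the file contents into a string first, for example with open("sars_genome.txt").read(), where the filename is a placeholder for wherever you saved the file.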

Python tutorial slides

Python tutorial 1 slides; the examples and their solutions can be snarfed directly into Eclipse.
Python tutorial 2 slides; the examples and their solutions can be snarfed directly into Eclipse.

Cool site for visualizing execution of Python code

If you want a little more clarity about how Python creates variables, populates them, and passes them around, or if you want to visualize your code in action for debugging purposes, check out Python Tutor. You can study their code examples as they execute, or paste in your own code.
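
As a small example worth pasting into such a visualizer, the sketch below contrasts mutating a list that was passed to a function with rebinding the parameter name to a new list. The function and variable names are invented for illustration.

```python
def append_in_place(items):
    # Mutates the very list object the caller passed in.
    items.append(99)


def rebind_locally(items):
    # Builds a new list and rebinds the local name; the caller's list is unchanged.
    items = items + [99]
    return items


original = [1, 2, 3]
alias = original              # a second name for the same list object

append_in_place(original)
print(original)               # [1, 2, 3, 99]
print(alias is original)      # True: both names refer to one object

result = rebind_locally(original)
print(original)               # [1, 2, 3, 99] (unchanged by the second call)
print(result)                 # [1, 2, 3, 99, 99]
```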

For shoring up your biology background

Here are a few different kinds of resources for those with less biology background, ranging from the comprehensive to a basic overview:

Molecular Biology for Computer Scientists, a short primer by Larry Hunter.
The Chemistry of Life, a short primer by Michael Behe.
Molecular Biology of the Cell, a huge classic textbook by Bruce Alberts, et al., is available for free online at NCBI. It is a good resource for biological questions and background, but can only be accessed via a search interface. If you don't understand a term or concept, you can try searching here.

Textbooks mentioned in class

The various books mentioned in class are summarized here; each is linked to Amazon where you can read more (these are not affiliate links). Note that none of these books is compulsory for the class, though you may benefit from one or more. As for the books on Python, many resources are now available free online, even complete textbooks downloadable as PDFs (you'll save trees (unless you print them)).

Biological Sequence Analysis: Probabilistic Models of Proteins and Nucleic Acids: This book covers a good chunk of the material that will be covered in the course with a distinctly probabilistic focus. It is very well written and at a fairly advanced level. It is an excellent reference for folks continuing on in this area.
http://www.amazon.com/exec/obidos/ASIN/0521629713/
An Introduction to Bioinformatics Algorithms: This book by Jones and Pevzner covers many of the topics that will be covered in class, but organizes the material around algorithmic methods rather than biological problems. So one chapter presents many problems across bioinformatics that exploit string matching, etc. While not complete for this course, it is a nice, readable, undergraduate textbook.
http://www.amazon.com/exec/obidos/ASIN/0262101068/
Bioinformatics Algorithms: An Active Learning Approach: A more recent book by Pevzner and Compeau, this time an attempt to capture/complement the teaching that they did together as part of a Coursera course. I have not yet had a chance to examine it too closely, but the course received high praise.
http://www.amazon.com/exec/obidos/ASIN/0990374602/
Introduction to Computational Genomics: A Case Studies Approach: Hahn was a grad student at Duke in the lab of Greg Wray and took this very class once upon a time. I love the case study approach: it makes for very interesting reading. At the end of the day, I decided not to require the book because it's not a perfect match for the course, but simply in terms of overlap with the course content, it comes pretty close so it may well be useful for some.
http://www.amazon.com/exec/obidos/ASIN/0521671914/

Directions for setting up Java, Eclipse, Python, and Eclipse plugins

ensure you have Java installed (so you can run Eclipse)
- as a general rule, you should be running the latest version of Java for security reasons; currently, this is Java 8, update 101 or 102 (either one is fine); you can check your current version from a terminal by typing java -version; it should report either build 1.8.0_101-b13 or build 1.8.0_102-b14
- if you don't have the latest version installed, download the latest JDK (Java SE Development Kit) by selecting the option "JDK" from here; you might be able to get away with downloading only the JRE (Java Runtime Environment, a subset of the JDK) if all you want to do is run Eclipse, but I'm not sure whether or not that works on a Mac, and the full JDK is nice to have in case you want to write Java programs later
- in either case, after downloading, double-click the resulting file and follow the default options to install it
- as an aside, you do not need to allow Java to run in your browser to use it for Eclipse; feel free to disable Java in the browser for security
- to summarize, by the end of this step, you should have a fully-updated version of Java 8 running, update 101 or 102; use java -version to confirm
install Eclipse (an environment for writing and running your Python programs)
- the latest version is 4.6 (nicknamed Neon), and an installer for Eclipse is available here
- download the installer, and follow the directions; when presented with the option, you should select the "Eclipse IDE for Java Developers" package (typically this is the first option); you may install it in a folder of your choice
- after it installs itself, you can choose the "LAUNCH" option to confirm that it runs and to select a location for your workspace (where your files for class will be stored)
install Python (so you can run the programs you write)
- IMPORTANT: though the latest stable version is 3.5, we will be using 2.7 in this class
- though Python 2.7 is already installed on Macs and most other Unix machines, we recommend everyone install a pre-compiled Python distribution for consistency; this further ensures that everyone has access to a similar set of Python packages
- download Enthought's "Canopy 1.7.4" for your operating system from here
- double-click the resulting file and follow the default options to install it
- after installing it, you will need to run the Canopy application once to set up Python properly on your machine; you won't need to run it again after this if you write your programs in Eclipse; full details for Macs here; NOTE: on Macs, you should get a warning the first time you run the application because it's unsigned; in this case, rather than double-clicking it to run, right-click and choose "Open" and you'll be asked if you really want to open the file; once you accept, you won't need to do this again: it will run by double-clicking from here on out
install the PyDev plugin from within Eclipse (so Eclipse can help you develop and run your Python programs)
- open Eclipse and access the Help menu
- select "Install New Software..."
- in the "Work with:" box, type http://pydev.org/updates and press Enter
- you may need to wait up to a minute until the "Pending..." is replaced by "PyDev" and "PyDev Mylyn Integration (Optional)"
- select only "PyDev" (by checking the box next to it) and click "Next >" down at the bottom
- follow the next steps to finish the installation using the defaults and agreeing to terms and conditions; IMPORTANT: you should see a window asking you to approve/trust a self-signed certificate from the PyDev publisher Brainwy; you will need to select the certificate's check-box and then approve it or Eclipse will not finish installing properly (though it might seem like it has)
- at the end, agree to restart Eclipse for changes to take effect
install the Ambient plugin from within Eclipse (so you can snarf and submit files for class)
- open Eclipse and access the Help menu
- select "Install New Software..."
- in the "Work with:" box, type http://www.cs.duke.edu/csed/ambient/update and press Enter
- you may need to wait a number of seconds until the "Pending..." is replaced by "Ambient"
- select "Ambient" (by checking the box next to it) and click "Next >" down at the bottom
- follow the next steps to finish the installation using the defaults and agreeing to terms and conditions; if you receive a warning about unsigned content, proceed anyway
- at the end, agree to restart Eclipse for changes to take effect
connect Eclipse to your version of Python (so you can run Python programs within Eclipse)
- open Eclipse and access the Preferences Box (under "Window > Preferences" on Windows or "Eclipse > Preferences..." on Mac)
- choose "PyDev", "Interpreters", and "Python Interpreters" from the sidebar
- press the "New ..." button to tell Eclipse about Python
- in the resulting dialog box, for the "Interpreter Name" type "Enthought", and for the "Interpreter Executable" type
  - for Windows: C:\Users\UUUU\AppData\Local\Enthought\Canopy\User\python.exe
  - for Mac: /Users/UUUU/Library/Enthought/Canopy_64bit/User/bin/python
  where UUUU is your user name on your machine
- choose "Select All" and then click "OK" at the bottom of the resulting dialog box
- click "OK" at the bottom of the Preferences Box and wait for the dialog box to close while the changes take effect (you do not need to restart Eclipse)
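
One optional way to confirm the connection worked is to run a tiny script from within Eclipse that reports which interpreter is executing it. This is only a suggested check using the standard sys module, and the function name is made up. With the setup above, the path would be expected to match the Canopy path you entered and the version to begin with 2.7.

```python
import sys


def interpreter_info():
    """Return (path to the running Python executable, version string)."""
    return sys.executable, sys.version


path, version = interpreter_info()
print(path)
print(version)
```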

Snarfing and running a sample Python program

Let's try snarfing and running your first Python program in Eclipse.

First, ensure that you have the right perspective in Eclipse. The Python perspective will give you a less cluttered set of windows with the smaller PyDev Package Explorer on the left and the main editor window on the right.

select "Window > Perspective > Open Perspective > Other..."
select "PyDev", then hit OK
you should now see a button with a Python icon that is highlighted in the upper right corner of your window (if you hover over it, it should say "PyDev"); if in the future, your screen setup looks odd, you can ensure you are in the Python perspective by clicking this button

The Ambient plug-in allows you to browse for and download code online using a tool called Snarf. For each problem set, we will provide you with some code as a framework and possibly some data files, and Snarf will allow you to import these files into your local copy of Eclipse. To snarf your first program, follow the directions below.

snarf in the Snarfing Sample project
- open Eclipse
- select "Ambient > Download (Snarf) a Project..."
- this should open a new tab at the bottom called "Snarfer Site Browser"; if it does not:
  - select "Window > Show View > Other..."
  - click "Ambient" then select "Snarfer Site Browser" and hit OK
- right-click within the "Snarfer Site Browser" window, and select "New Site"
- in the window, type http://www.cs.duke.edu/courses/fall16/compsci260/snarf/
- expand the project site "COMPSCI 260, Fall 2016" and then the "Samples" folder to find "Snarfing Sample (1.0)", and double click on it
- click the "Install Project" button, and in the window that pops up, check the "use default workspace location" box, and click "Finish"
- the "Import project" window will come up; leave the fields unchanged (in particular, leave "Use the downloaded .project file" selected) and click "Finish"
- expand the "Snarfing Sample" project in the "Project Explorer" pane on the left side, and double-click "python_intro.py" to open it up in the editor pane
try running the simple Python program that you snarfed
- click on the "Run" icon on the toolbar (the green circle with the white triangle pointing right) to run the program; this should create a "Console" tab in the bottom right pane and the results of the program should be printed in it; if the console does not appear:
  - select "Window > Show View > Other..."
  - click "General" then select "Console" and hit OK
- alternatively, select "Run" from the Run menu
- alternatively, right-click anywhere within the body of the program to see the context menu and then click on "Run As > Python Run"

Editing a sample Python program and submitting a project

Now modify the program and then submit the code from within Eclipse.

NOTE: simple Python documentation is available from within Eclipse—just hover over a Python keyword and a tooltip will pop up with a short description.

modify the file "python_intro.py" to change the output in some way
save and re-run the program to confirm the output changed as expected
now test submitting your new program along with the other files in the project:
- select "Ambient > Submit a Project for Grading..." to bring up the submit window
- first, you must choose the class and assignment folder you wish to submit to, so click on "compsci260" and select the "test" folder as your destination; then click "Next"
- select "Submit a single project", choose the project you wish to submit (in this case "Snarfing Sample"), and then hit "Next"; alternatively, if you do not want to submit the entire project, you can choose "Submit from the file system" and then you'll have an option to select and deselect the various files in the project (or elsewhere)
- once you've got the project and/or files that you want selected, choose "Finish"; you will be asked to enter your Duke NetID and password to authenticate
to confirm that your files were properly loaded onto our server, select "Ambient > Submit History..."; click on "compsci260", then the folder you've submitted to, and then hit the "Get History" button; enter your Duke NetID and password, and you'll see the number of times you've submitted, the time of your last submission, and a list of all the files in your last submission
congratulations, you have submitted your project!

You can submit as many times as you like, and everything will be stored on the server each time. Submitting partial solutions is fine -- it can even help us recover your work if anything happens on your computer. Even resubmitting a complete solution is fine: if you realize that you did something wrong at the last minute, you can simply resubmit a new version. In general, we will only look at your last submission, so when you resubmit a project, please resubmit all the relevant files, not just the ones you modified.

Academic integrity

All students are expected to abide by generally accepted standards of academic integrity. This includes all the various aspects of Duke's Community Standard. In particular, be reminded that it is not acceptable to take the ideas/work of another and represent it as one's own, even if paraphrased. Ideas/work taken from others—including Internet sources, peers in the class, peers from outside the class—must always be appropriately cited.

Violations of academic integrity will be taken very seriously. At a minimum, assignments in which a student either receives inappropriate input from others or provides inappropriate input to others will be graded as 0. In addition, violations will be discussed with the Dean who directs the Student Conduct Office.

Collaboration policy

Unless expressly granted in the problem set, all problems should be completed individually; no collaboration is permitted. However, if you have worked for a while on a particular problem and have encountered a mental wall, and if you have banged your head against said wall for a while, we provide mechanisms where you can consult others to make progress, rather than giving up entirely. Your first course of action is to post a question on Piazza, or to speak to the instructor or TAs.

If for any reason you consult your peers outside of Piazza, it should remain understood that such an interaction must be one of consultation and not collaboration: hints to help overcome a small obstacle rather than answers—after consultation, it is expected that you should still have plenty of thinking to do. In addition, if you do happen to consult with another student, both of you must cite this.

Extension policy

Students generally have two weeks to work on problem sets—not because two weeks are generally required to finish, but 1) to allow students who start early sufficient time to reflect/ruminate on problems where an impasse has been reached (the thought process through which students go while solving a problem often includes some gestation period before things become clear) and 2) to provide flexibility as to when students complete their work while they juggle other requirements and commitments during the semester.

Given this latter point, students should not request extensions for turning in their work beyond the two weeks already allotted. However, this rule has two exceptions:

Everyone invariably has some two-week interval that is especially tough, so students are allowed, once during the semester, to use an extra 48 hours to turn in their work. If you are exercising this option for a specific problem set, please indicate such when you turn in your work. It is entirely up to you when you want to use this one free extension; when you do, you are trusted to not consult the solutions if they happen to be posted before you turn in your work.
If you are quite ill for a non-trivial length of time, you may choose to submit a short-term illness notification to the deans; I am notified during this process, at which point we can work out a possible extension if one is necessary.

If you turn in work after the deadline but have already used your free extension, or if you are using your free extension but still turn in work after the extension deadline itself, we will take off 10 points for every 12 hours late (rounded up). So if you are 0–12 hours late, that would be –10; if you are 12–24 hours late, that would be –20, etc. Also, because we send work out to be graded using automated scripts, you must email the instructor any time you turn in late work, and work will not be graded if it is turned in after 5pm on Monday (more than 72 hours late without an extension, or more than 24 hours late with an extension).
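
The penalty arithmetic can be sketched as follows. This is an illustration of the rule as stated above, not an official grading script. It assumes hours are measured from the applicable deadline (after any extension), that exactly 12 hours late counts as the first block, and it does not model the Monday 5pm cutoff.

```python
import math


def late_penalty(hours_late):
    """Points deducted: 10 for every 12 hours late, rounded up."""
    if hours_late <= 0:
        return 0
    return 10 * int(math.ceil(hours_late / 12.0))


print(late_penalty(5))    # 10
print(late_penalty(13))   # 20
print(late_penalty(30))   # 30
```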

Grading policy

Problem sets: 84%: All students are expected to complete seven problem sets over the semester, each contributing about equally to this component of the grade.
Participation and engagement: 16%: All students are expected to attend class regularly and participate in discussions. They are also expected to be engaged via Piazza, posting questions or notes, as well as helping each other as questions arise, or raising interesting points for further conversation. The instructor encourages an interactive classroom, and the hope is that students feel comfortable asking questions at any point in class—whether the material is unclear, or simply if it leads you to wonder about a new connection. So if something is either troubling or exciting you, please do not hesitate to speak up about it.

Approach to grading

We have designed problem sets in the class to permit you to explore the material, and to develop deeper understanding of the material through that exploration. I ask you to focus on the ideas and the learning rather than on the points and the credit; put another way, consider adopting a perspective of how you can work to satisfy yourself rather than work to satisfy me.

All that said, when it comes to grading, we still need to assign points and credit: that is unfortunately unavoidable. However, we have designed our approach to assigning credit in an attempt to be consistent with the perspective of the previous paragraph, and the approach is perhaps a little different from what you may be familiar with in other classes. Specifically, I have asked the TAs to frame their grading in terms of ‘positive earning’ rather than ‘negative error’.

What do I mean by this? Well, a ‘negative error’ approach is one in which one assumes one's work will earn 100 points unless there are mistakes present. Under such an approach, graders are negatively tasked with finding mistakes and errors, and taking away points for any they find.

I have inverted this by choosing to adopt a ‘positive earning’ approach in which an empty problem set earns 0 points, and students earn more points as they demonstrate deeper levels of mastery of the material and challenge. Under such an approach, graders are instead positively tasked with finding ways that students should earn credit for deeply engaging the material.

A corollary of the ‘negative error’ approach is that unless a student makes a mistake, they are entitled to earn the full 100 points. Conversely, a corollary of the ‘positive earning’ approach is that it is possible for a student to not make any mistakes yet still not earn the full 100 points. For example, this can happen if a student does the bare minimum to engage the material, and while not making any mistakes, never demonstrates mastery or depth of understanding. Our ‘positive earning’ approach not only focuses on the positive instead of the negative, but it also leaves room to grant more credit to students who engage the material more deeply.

I write all this because if you find that you did a problem without making a mistake, but got only +16 when some other student may have gotten +18, it doesn't necessarily mean that something is wrong (though it might be). It could mean that there were some interesting ways to engage the problem you didn't explore that the other student did. An analogy might be from a video game like Mario Brothers: you can successfully rescue the princess but still not end up with the highest score because someone can score higher if they take the time to flip more turtles or punch more coins along the way. Analogously, earning the full 100 points usually requires more than just ‘no mistakes’; it also requires demonstration of mastery and engagement. We use rubrics to apply these judgments as consistently as we can across the class.

Distributing grades

Grades for all work will be recorded and available to students online.

Piazza

This term we will be using Piazza for course announcements, communication, and discussion. The Piazza system is designed to get you help quickly and efficiently from classmates, the TAs, and the instructor. Rather than emailing questions to the teaching staff, please post your questions on Piazza so everyone can benefit from the responses.

You can find our class posting page at: https://piazza.com/duke/fall2016/compsci260/home.

Data expedition challenge: Signal, noise, and bias in yeast MNase-seq data

This is an optional challenge for students interested in applying what we have learned in class to a real computational genomics research problem; practicing the skills of using Python or R (or any other tool you wish) to visualize, analyze, model, and interpret real genomic data; and exploring the science linking chromatin structure and transcriptional regulation. Since this problem represents an open challenge for the genomics community, you are free to choose the approaches you use to analyze the data, as well as the questions you explore. Creative projects are highly encouraged. You may work in small teams (2-3 is ideal). For all submissions we receive by the deadline of 15 Dec 16, we will provide feedback, and will also designate a best project as well as a most creative project. There will be (simple) prizes!

Data description

In this data expedition challenge, we will explore next-generation sequencing reads from MNase-seq experiments in yeast. The data was generated to detect genome-wide binding locations of various kinds of DNA-binding proteins. The MNase-seq data sets were collected at Duke as part of our ongoing computational genomic research collaboration with the lab of Prof. David MacAlpine in the Department of Pharmacology and Cancer Biology.

Biological background

DNA-binding proteins, including nucleosomes and transcription factors (TFs), play essential roles in gene regulation, and their locations along the genome help give us clues about how genes are regulated. Recently, a new MNase-seq protocol was developed by the MacAlpine group at Duke in conjunction with the Henikoff group at the University of Washington. [1] The basic idea is that genomic locations not bound by proteins are accessible to micrococcal nuclease (MNase) and are therefore more sensitive to MNase digestion. Conversely, genomic locations bound by proteins are less sensitive to MNase digestion.

Consequently, if we sequence the ends of the fragments that remain after MNase digestion, and map the paired sequencing reads that arise, we should be able to see where MNase was able to digest/cut the genome, revealing something about the binding locations of DNA-binding proteins along the genome. It is important to note that the genome of each individual cell in a population may be in a slightly different occupancy/protection state. We collect data from a population of cells so this experiment is sampling the different protection states present in the cell population.

Complicating the issue further, MNase is also known to have a nucleotide-specific bias as it digests DNA, meaning that it tends to cleave/digest certain sequences more than others. For example, it prefers to digest at A/T nucleotides rather than G/C (its bias is actually a bit more subtle/complex than that, which is a nice model selection challenge you can explore: what is the simplest model that captures this bias well?). To give you further information about this sequence bias, we are also providing MNase digestion data of naked (deproteinized) DNA in vitro, which will allow for the development of models to quantify such bias (because with this data, the variation in cutting that you see is only the result of the MNase interacting with the naked DNA and is not influenced by protein protection).

Data sets

Usually, sequencing reads are stored in files of fastq format. In this case, we downloaded two large yeast MNase-seq read fastq files: in vivo yeast MNase-seq read files generated by Henikoff et al. [1], and in vitro yeast MNase-seq read files generated by Deniz et al. [2] for use in quantifying MNase digestion bias. Both files contain short sequencing reads, of length 25 and 54 base pairs, respectively. The total number of reads in each file is on the order of 100 million. For reference, the yeast genome contains 16 chromosomes whose total size is approximately 12.5 million base pairs.

To analyze those sequencing reads, you would typically first need to map the reads to a reference genome, using tools like Bowtie. However, to simplify this challenge, we have already performed this mapping step for you. We are thus providing you one tab-delimited text file for each of the first 12 yeast chromosomes, named ChrI through ChrXII (yeast geneticists like Roman numerals); we will reserve the remaining 4 yeast chromosomes to evaluate your submitted results. Each file contains the start and end genome coordinates of all the reads mapped to that chromosome, one read per line. You may notice that the distances between the start and end coordinates are larger than 25 or 54 base pairs. That is because the MNase-seq experiments produce paired-end reads and we are indicating the coordinates of the spanned fragment from which the two reads come. So the provided coordinates are the start coordinate of one read, along with the start coordinate of its mated read on the opposite strand; or, put another way, the first and last nucleotide of the fragment.
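
As a starting point, here is a minimal sketch of how you might load one of these files and look at the distribution of fragment lengths. It assumes each line holds exactly a start and an end coordinate separated by a tab, and that both coordinates are inclusive; check the actual files before relying on either assumption. The name "ChrI.txt" in the comment is hypothetical, and the three fragments are made up for illustration.

```python
import csv
import io
from collections import Counter


def read_fragments(handle):
    """Yield (start, end) integer pairs from a tab-delimited handle, one fragment per line."""
    for row in csv.reader(handle, delimiter="\t"):
        if len(row) < 2:
            continue  # skip blank or malformed lines
        yield int(row[0]), int(row[1])


def fragment_length_counts(fragments):
    """Count fragment lengths, treating both coordinates as inclusive (an assumption)."""
    return Counter(end - start + 1 for start, end in fragments)


# Made-up data standing in for a real file; with a real file you would instead
# write something like: with open("ChrI.txt") as handle: ...  (hypothetical name)
example = io.StringIO("100\t246\n150\t296\n300\t340\n")
lengths = fragment_length_counts(read_fragments(example))

for length, count in sorted(lengths.items()):
    print(length, count)
```

A histogram of these lengths is often a useful first plot: nucleosome-protected fragments and fragments protected by smaller proteins would be expected to fall in different length ranges.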

It is reasonable to think of the start and end coordinates as nucleotides just beyond which MNase cleaved the DNA, while the sequence between the start and end coordinates was not digested by MNase. We also provide the whole yeast genome sequence (sacCer2 2008 version, in separate FASTA files) if you wish to extract the actual sequence around the cleavage sites based on the provided coordinates.
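
One way to extract the sequence around each cleavage site is sketched below. It assumes 1-based inclusive coordinates and one chromosome per FASTA file, and it places the cut just outside each end of the fragment, as described above. For simplicity it reads every window from the forward strand; you may prefer to reverse-complement the window at the right-hand end so that both cuts are viewed in the same orientation. The toy sequence and the single fragment are made up.

```python
import io


def read_fasta_sequence(handle):
    """Return the sequence of a single-record FASTA handle as one uppercase string."""
    return "".join(line.strip() for line in handle if not line.startswith(">")).upper()


def cut_site_windows(sequence, fragments, flank=2):
    """Yield the 2*flank nucleotides centered on the cut at each end of each fragment.

    Coordinates are assumed to be 1-based and inclusive. In 0-based terms the left
    cut lies just before index start - 1 and the right cut just after index end - 1.
    """
    for start, end in fragments:
        for boundary in (start - 1, end):
            lo, hi = boundary - flank, boundary + flank
            if lo >= 0 and hi <= len(sequence):  # skip windows that run off the chromosome
                yield sequence[lo:hi]


# Toy chromosome and fragment for illustration only
fasta = io.StringIO(">toy\nAATTC\nCGGAT\n")
genome = read_fasta_sequence(fasta)
windows = list(cut_site_windows(genome, [(3, 6)], flank=2))
print(windows)
```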

All data files for this challenge are available from:

Potential research questions and challenges you can explore

You will need to do some independent exploration to figure out what to do next. You may want to read more about the MNase enzyme and how it works, or what is known about it. You will probably want to learn more about the MNase-seq protocol, as described in the original paper. [1] Then you can start exploring one or more of the following, depending on what suits your fancy, or you may have other ideas of your own:

Investigate the bias of the MNase enzyme in digesting different nucleotides. (You can use the in vitro data to visualize and then model the distributions of read counts conditioned on the immediate nucleotides at the cutting site. However, because MNase is a protein that contacts more than one nucleotide of DNA, the interaction between MNase and DNA sequence perhaps spans more than one nucleotide, and more sophisticated models based on multiple nucleotides around the cutting site can also be explored.)
Study the distribution of MNase-seq read counts around transcription factor binding motifs to predict transcription factor binding sites: Are there specific signals (also called "footprints") left by bound transcription factors? Are these signals specific to each TF, or are there a few canonical footprint types (clusters of footprints shared by sets of TFs)? Or is there a single kind of footprint that just looks different for different TFs because of the role of MNase bias?
Study the distribution of MNase-seq read counts around nucleosomes: Can we define a nucleosome footprint to predict nucleosomal binding locations?
Visualize and compare the in vivo and in vitro MNase digestion profiles for a few transcription factors to explore the contribution of bias to the in vivo signal.
Anything else you find particularly intriguing or feel compelled to discover.
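
For the footprint questions above, a common first step is an aggregate profile: count how many cuts fall at each offset from a set of sites of interest (for example, motif matches or nucleosome centers) and sum over all sites. The sketch below is one simple way to do this, not a prescribed method. It assumes you already have a list of site centers from some other source, and the fragments and centers shown are made up.

```python
from collections import Counter


def cut_positions(fragments):
    """Return a Counter of cut coordinates, counting both ends of every fragment."""
    counts = Counter()
    for start, end in fragments:
        counts[start] += 1
        counts[end] += 1
    return counts


def aggregate_profile(cut_counts, centers, half_width):
    """Sum cut counts at each offset in [-half_width, half_width] over all centers."""
    offsets = range(-half_width, half_width + 1)
    return [sum(cut_counts[center + offset] for center in centers) for offset in offsets]


# Made-up fragments and site centers for illustration only
fragments = [(90, 110), (95, 240), (110, 250)]
centers = [100, 245]
profile = aggregate_profile(cut_positions(fragments), centers, half_width=10)
print(profile)
```

Computing the same profile from the in vitro data at the same sites gives a baseline that can help separate protein protection from sequence bias.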

Potential tasks and approaches for data analysis

Visualize MNase digestion signals along the genome in the UCSC Genome Browser. This will allow you to see how they are distributed and how they compare to other tracks in the genome browser (like locations of coding regions, TSSs, nucleosomes, TFs, etc.).
Build a PSSM (position-specific scoring matrix) or another, more sophisticated statistical model to quantify the MNase digestion bias.
Visualize the digestion profiles (footprints) around TF motifs and nucleosomes using scientific plot-generating code (Python or R).
Anything else you find helpful in your explorations.
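
To make the PSSM idea concrete, here is a minimal sketch that builds a log-odds matrix from equal-length sequence windows around cut sites. The uniform background and the pseudocount of 1 are assumptions chosen for simplicity; a background estimated from the yeast genome itself would likely be more appropriate. The four windows are made up.

```python
import math

ALPHABET = "ACGT"


def build_pssm(windows, background=None, pseudocount=1.0):
    """Return one {nucleotide: log2-odds} dict per position of the input windows."""
    if background is None:
        background = {nt: 0.25 for nt in ALPHABET}  # uniform background (an assumption)
    pssm = []
    for i in range(len(windows[0])):
        counts = {nt: pseudocount for nt in ALPHABET}
        for window in windows:
            if window[i] in counts:  # ignore N and other ambiguous symbols
                counts[window[i]] += 1
        total = sum(counts.values())
        pssm.append({nt: math.log2(counts[nt] / total / background[nt]) for nt in ALPHABET})
    return pssm


def score(pssm, window):
    """Sum the log-odds of a window; ambiguous symbols contribute zero."""
    return sum(column.get(nt, 0.0) for column, nt in zip(pssm, window))


# Made-up windows for illustration only
windows = ["AATT", "ATTA", "TATA", "AACG"]
pssm = build_pssm(windows)
print(score(pssm, "AATT"), score(pssm, "GGCC"))
```

A PSSM treats positions as independent; comparing it against models that use dinucleotides or longer k-mers is one way to approach the model selection question raised earlier.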

Good luck, and have fun on this expedition!

References (original sources of data)

[1] Henikoff, J.G. et al. (2011) Epigenome characterization at single base-pair resolution. Proc. Natl. Acad. Sci. U.S.A. 108, 18318–18323.
[2] Deniz, O. et al. (2011) Physical properties of naked DNA influence nucleosome positioning and correlate with transcription start and termination sites in yeast. BMC Genomics 12, 489.

COMPSCI 260: Introduction to Computational Genomics