Duke Computer Science Shield

Assignment 3: DNA

Due October 26th

Computer Science 201

The Big Picture

See The Medium Pictures and The Little Pictures for the specifics of what you're going to do and how to proceed. The Big Picture is to give you a bird's-eye view of what's going to happen. I recommend you read this page through, start to finish, before you snarf any code!

Background
In this assignment you'll experiment with different implementations of a simulated restriction enzyme cutting (or cleaving) a DNA molecule. Three scientists shared the Nobel Prize in 1978 for the discover of restriction enzymes. They're also an essential part of the process called PCR ("Polymerase Chain Reaction") which is one of the most significant discoveries/inventions in chemistry and for which Kary Mullis won the Nobel Prize in 1993. As cool as it is, this has nothing to do with the Nobel Prize that a Duke researcher just won.

Manipulations of this sort are exactly what linked lists are best at. Specifically, you'll be working with a chunk list, which fills a niche halfway between array-based lists (ArrayList) and linked lists. There's some extra credit available for doing more things with chunks. As you might have guessed, this assignment is a simplification of the chemical process, but still gets at something biologically sound.

DNA has two goals:

  1. Building familiarity with linked lists and how to use them. You'll do this by actually writing a special-purpose linked list (the chunk list).
  2. Seeing the effects, both in Big-O and real time, of various ways of implementing algorithms on long sequences of data. (In other words: what sort of sequences favor what sort of data structures?)

The Medium Pictures

DNA
In this assignment, you'll be given an interface (IDnaStrand) that represents a strand of DNA (see below, and the class notes, for more on Java interfaces). You're also given a simple implementation (SimpleStrand) of that interface that relies on the built-in Java classes String and StringBuilder. You'll be writing a new implementation of that interface that uses a linked list, rather than String and StringBuilder.

The interface, and implementations, simulate two ideas from genomics: reversing a strand of DNA, and recombination. The details of recombination are explained in the little pictures; the code implements it. You'll experiment with the provided implementation, and then implement a new version that's more efficient. Your experiments will evaluate the tradeoffs in doing things in these two different ways.

Interfaces
In this class, we've seen many built-in Java classes: ArrayList, LinkedList (briefly before now), HashSet, HashMap, TreeSet, TreeMap (both briefly), and so on. Let's consider ArrayList and LinkedList. Both provide .size(), .get(), .add(), and so on. In short, they can do the same things, although (as we've mentioned, and as you'll see in this assignment) they do them differently. Because their methods are the same, the two should be interchangeable, in some sense.

This leads to an interesting idea: an abstract idea of a "list" that's independent of the details of its implementation. We call this an abstract data type, often abbreviated to ADT. ADTs define a set of methods according to their names, arguments, and return types, and say nothing about how those methods are actually implemented. "List" is an ADT, as is "Set", and "Map" (and there are others).

In Java, ADTs are called "Interfaces." A class is said to "implement" an Interface if it provides all the required methods. Java has a great many built-in Interfaces: List, Set, and Map are examples you've seen before. Take a look at IDnaStrand.java to see how an interface is defined. (By convention, not-built-in Interfaces start with a capital I). The code in IDnaStrand.java looks very much like a class, but with the body of each method missing. It looks that way because that's exactly what it is. (See the October 17th class notes for more.)

The other side of the coin is a "Concrete Data Type" (almost always just called an "implementation"). Map and Set are Interfaces; HashMap and TreeSet are implementations. In this assignment, IDnaStrand is an interface, SimpleStrand is one implementation, and you'll be writing another implementation (called LinkStrand).

The distinction between interface and implementation, believe it or not, will make your life easier. Fundamentally, it makes your code more general. For example, the testing code we provide (more on that later) doesn't care about whether you give it a SimpleStrand or a LinkStrand; it only cares that you provide something that implements IDnaStrand. This generality is powerful.

Vitally, the same way writing a class defines a new datatype, writing an interface defines a new datatype: one that can take on a value from any class that implements the interface. All of the following are completely legal:

List<Integer> a = new ArrayList<Integer>();
List<Integer> b = new LinkedList<Integer>();
Set<String> s = new TreeSet<String>();
Set<String> t = new HashSet<String>();
Map<String, Integer> m = new TreeMap<String, Integer>();
Map<String, Integer> n = new HashMap<String, Integer>();

Restriction Enzyme Cleaving
For the coding in this assignment to make sense, you need to know a bit about restriction enzyme cleaving. Restriction enzymes cut a strand of DNA into two pieces at a specific location (the "binding site"). In chemical processes, a strand can be split in several places by the same enzyme; we'll be doing that too.

Conisder the DNA strand "aatccgaattcgtatc", and the restriction enzyme "gaattc". The restriction enzyme specifices where in the DNA sequence the split should occur; you'll also be told where in the restriction enzyme the split should occur. Consider the example below: Here, "gaattcc" (in red) matched the DNA sequence, and cut it with one nucleotide on the left, and the rest on the right.

In the code you'll be running, another strand of DNA will be spliced into the gap. Note how the splice also matches the restriction enzyme in the following picture: After the join, we have this sequence: The heavily-shadowed areas show the original restriction enzyme. Your code is a software simulation of this process: the restriction enzyme will cut a strand of DNA and new material will be spliced in to create a recombinant strand. In the provided code, this is implemented in the cutAndSplice method of SimpleStrand.

LinkStrand
The fundamental coding exercise in this assignment is the implementation of the LinkStrand class, which implements the IDnaStrand interface. LinkStrand implements the cutAndSplice using a linked list; as a result, it can splice new DNA into the original DNA in constant time, rather than depending on the length of the splicee. The details of how this will be done are in the little pictures.

The Little Pictures

At this point, you should have a pretty good bird's-eye view of Assignment 3. The Little Pictures explain the steps you're going to take to complete it and how it will be graded.

You'll be submitting three things: your implementation of LinkStrand, a detailed experimental analysis, and a README. All of these files should be submitted via Ambient or websubmit. Your analysis should be in .pdf format, and your README in .txt format.

Step 0:
Pause to collect your thoughts. Snarf the code. Read through IDnaStrand.java and SimpleStrand.java.

Step 1: Analysis of SimpleStrand
Take a close look at cutAndSplice in SimpleStrand. With that method, generating a new recombinant strand is O(n), where n is the length of the resulting recombinant strand. (This ignores the cost of finding the breaks, which is O(t) for a strand with t breaks). The DNABenchmark program does timing benchmarking for you; give it a run. (If it runs out of memory; don't worry. That's expected.) The timing data it provides probably aren't enough to complete this part of the assignment, though: you'll need to add some code to it to take more. (However, you'll need to use the original behavior in the next section; therefore, don't remove anything.)

In your assignment writeup, rigorously justify (using both Big-O analysis and empirical benchmarking) that generating recombinant DNA using SimpleStrand is O(n), where n is the length of the resulting strand.

Some things to think about for this analysis:

  1. For purposes of computing timings, it may be helpful to generate smaller DNA sequences. The two provided sequences (ecoli.txt and ecoli_small.txt) could be truncated, for example. If you generate a new sequence for your benchmarking, be sure to include your new file in your submission.
  2. The discussion of increasing Java's heap space (Step 2) may come in handy for dealing with large sequences, as well. If you do this, be sure to document it in your analysis.
  3. Plots (especially with functions fit to them) are a very compelling way to demonstrate how code's runtime changes as a function of n. Note that plots are not (at all) the way to make a compelling argument about Big-O behavior.

Step 2: Memory, and the lack of it.
As you noticed above, the benchmarking program ran out of memory. DNA sequences are (potentially) gigantic, so this isn't too surprising. The next step is to figure out (on your computer) what the longest value of splicee is that doesn't generate a Java Out of Memory exception, and how that length changes as you allow Java more memory. The benchmarking program works with Strings whose lengths are a power of two; start by reporting the largest value that doesn't run out of memory.

Next, increase the amount of heap space that Java is allowed. Roughly, heap space corresponds to the total amount of memory available to your program; more precisely, the heap is the part of memory where anything created using new is stored. (Other variables are on the stack.)

To do this, open Eclipse's Run menu, and select "Run Configurations", and then the "Arguments" tab. Then, add the line "-Xmx1024M" to the "VM arguments", as seen below: This changes the maximum heap space to 1024 megabytes (1 gigabyte). Report what this does to the maximum-size benchmark you can run. Then, try multiplying the maximum heap size by 2 (to 2048M), and report that. Keep increasing by powers of two until your computer can't handle it anymore. You should indicate when your computer runs out of memory, and what behavior the program exhibited as you increased the maximum heap size.

Step 3: LinkStrand
This step is full of very small pictures indeed. It should give you a fairly precise view of what, exactly, you'll need to code.

This assignment is a microcosm of a very common programming scenario: code that already works, but isn't really efficient enough. This is a convenient place to be, because the (correct, if too slow) implementation gives you something to test against. In DNA, you're given SimpleStrand, which implements the IDnaStrand interface. However, you've benchmarked SimpleStrand, determined its efficiency, and you're confident that you can do it in better than O(n). (You're confident because...we told you so!)

You'll be implementing a class called LinkStrand that also implements the IDnaStrand interface. Rather than the complete copy made by SimpleStrand, you'll be using the power of linked lists to splice pieces into the strand.

One might ask what is being stored in a single node of the linked list stored by LinkStrand. Many choices are possible: one nucleotide, two, ten, some arbitrary k, or something else. We're going to store variable-length chunks in each node. The size of a particular chunk isn't fixed: it depends on what splices have taken place already. In short, the algorithm is this: find a place a splice needs to take place. Break that node into two pieces: one before, and one after. Splice a new node (with the enzyme) into the middle. This operation takes O(1) time, because the necessary String operations are O(1), and linked-list splicing is O(1).

Begin by creating a class LinkStrand that implements IDnaStrand. You'll need to store a linked list of Strings, so your class is going to start out like this:

class LinkStrand implements IDnaStrand {
  // Inner class representing a single node in our linked list.
  private class Node {
    // Because the entire Node class is private, we can reasonably make its
    // member variables public.
    public Node myNext;
    public String myData;

    // You'll want a constructor for Node, too.
  }

  // Note that we're now in LinkStrand, but no longer in Node. We need to 
  // store pointers to the first and last nodes in your list:
  private Node myHead;
  private Node myTail;
  // And this is handy, too. 
  private long mySize;  // How many nucleotides does this strand represent?

  // You'll want (at least one) constructor, plus all of the methods required
  // by IDnaStrand.
}

LinkStrand Implementation Details
When you build a LinkStrand (either using the constructor, or the initializeFrom method required by IDnaStrand), it should start out as a one-node linked list, where the node's myData is the entire String. When you call cutAndSplice, you'll go from one node to (at least) three: the data before the spliced-in text, the spliced-in text, and the data after the spliced-in text. See the image below for an example of what the list might look like after a cutAndSplice.

In the figure above, you ended up with more than three nodes, because the enzyme occured in multiple places. The image below explains why doing it this way saves memory: because of the way Strings work, their data is shared; therefore, the total memory requirements of LinkStrand should be lower than the memory requirements of SimpleStrand.

Note: Java's .substring is O(1), and doesn't copy the data. That's why this works.

Some things to keep in mind:

  • Your goal is to match, exactly, the behavior of SimpleStrand, but with better performance characteristics. When in doubt about how something should work, match how SimpleStrand does it.
  • Both your constructor and initializeFrom methods should use the same code. The best way to do this is to write a third, helper, method, and then have both the constructor and initializeFrom call the helper. Since the helper isn't part of the interface, it should be private.
  • Implementing append should not concatenate Strings; it should create a new node. Note that (because append takes in an IDnaStrand), it might get a SimpleStrand or a LinkStrand as a parameter. If you're appending a LinkStrand, you should create new nodes; if you're appending a SimpleStrand, convert it to a String, and then use the append that takes a String. Look at SimpleStrand to see how to figure out the type of the argument using Java's instanceof operator.
  • When implementing cutAndSplice, you may assume that your LinkStrand contains exactly one node. If it doesn't, you should throw an exception:
    if (...LinkStrand has multiple nodes...) {
      throw new RuntimeException("LinkStrand has multiple nodes!");
    }
      
    We won't be getting into more detail on exceptions; suffice to say that this will crash your program with the provided message. We permit this only-one-node simplification because finding the breakpoints without it is much trickier.
  • The code for cutAndSplice will be almost completely identical to the code in SimpleStrand. However, because LinkStrand is implemented differently, you'll have better performance characteristics.
  • We provide JUnit tests for implementations of IDnaStrand. SimpleStrand passes those tests; your LinkStrand should too. Note that passing all of these tests does not guarantee full credit, as they measure correctness, not efficiency.

Step 4: LinkStrand Analysis
A correctly-implemented LinkStrand will run in O(b) time, where b is the number of breaks being introduced into the strand. Repeat your analysis from Part 1, but replace SimpleStrand with LinkStrand in the main method of DNABenchmark.java.

Note that here (and in Part 1) we're assuming that the cost of finding the breaking points is negligible, and can be ignored. Another way to think about it is that it's equal for the two implementations; we're speeding up the other part of the process.

Reversing and Extra Credit
In the lab, DNA doesn't have a fixed direction; reversing it is necessary. SimpleStrand relies on a built-in method of StringBuilder to do this. For full credit in LinkStrand, an N-node list, when reversed, should still have N nodes. (That is, you should reverse each node individually). For extra credit, save some reversals: reverse each String only once (rather than each node). (Recall from the picture above that String data is shared.)

Grading
The entire assignment is graded on a ten-point scale. This assignment is weighted by a factor of 2 (Markov was weighted by 1.5, and the previous assignments were weighted by 1.0). Your analysis (both parts), which should be extensive, is worth five points. Your implementation of LinkStrand is worth four points, and you get one point for a complete README and good code style.

Submission
Submit your code, analysis, and README using Ambient or websubmit. Please make sure you submit the entire assignment each time you submit, including any code we provided, even if you haven't changed it. You can submit as many times as you'd like; we'll take the last one.

The README
Like every assignment (but not APTs), you should include a README text file in your submission. It should include:

  • Your name.
  • When you started and finished the assignment, and your best estimate of how long it took.
  • The credits: who helped you. Recalling the collaboration policy from the syllabus, be sure to give credit where credit is due. Also, take credit where it's due: if you helped someone else, say so! If you didn't talk to anybody about this assignment, say that.
  • Anything we've specifically said should be included in the README.
  • Any feedback you have about the assignment itself; we're always trying to make these assignments better. If you'd rather that feedback be anonymous, there's a link for that on the course page.
You can add a text file to your code using File -> New -> Untitled Text File in Eclipse.

Code Style
Fact: code is hard to read.
How you lay out your code, and your program, make a big difference in how easy it is to understand. There are (at least!) four people who need to understand your code: you (while writing it), anybody you ask for help, your grader, and you (six months from now, when you look at it to figure out how you did something). While the following rules are not set in stone, they provide a good place to start:

  • Think about how your code is going to be organized before you've written it. Good design matters!
  • Give your classes, variables, and methods descriptive names. Naming your variables foo and bar doesn't say much; xPosition and yPosition are much better.
  • Use Java's conventions on naming. ClassesAreNamedLikeThis: words are run together, and each word is capitalized. Similarly, methodsAreNamedLikeThis: words run together, and all but the first word are capitalized. Variables are named like methods.
  • Indent your code properly. Roughly speaking, this means "everything inside a curly brace gets indented one extra level." See the code we provide for how this should look. To make this easy, Eclipse can do it for you: select your code with the mouse, and then do Source -> Correct Indentation.
  • Don't let your lines get too long. Although Java doesn't care how long your lines are, long lines are much harder to read. The closest thing there is to a universal standard is 80 characters. Eclipse makes this easy, too: in Preferences -> Text Editors, turn on "Print Margin" and set it to 80. That will add a vertical line at the 80-character mark.

Remember: easy-to-understand code makes your grader happy. Happy graders are friendly graders!