Compsci 4g, Fall 2005, Simple Shotgun

In this assignment you'll write a simple version of a program that reconstructs the DNA/Genome from the sequenced strands that result from shotgun sequencing. In the description below, you can assume the original DNA to be sequenced was cloned several times and that all of the clones were broken into sequenceable-sized chunks via various techniques. Essentially the program you'll work with first receives a list of different-sized sequences that are the results of the laboratory shotgun process. Your program must construct the original sequence from the pieces.

You should snarf the shotgunI project using Eclipse. You're to complete the method containReduce that performs the containing match part of the basic operations (see below) you'll need to do to reconstruct the genome. The class you'll check out is ShotgunReconstructor.java and accessible here.

The basic idea is to remove any string that is completely contained in another string, leaving the other strings in the ArrayList passed to the method. You'll want to look at the ArrayList API to see if there are useful methods to help in this process, and you'll need methods from the String class as well.

Basic Operation

The basic operation that's essential in the computational side of shotgun sequencing involves matching two strands from a collection and merging them into a new strand (added to the collection). This process decreases the number of strands by one since two strands are merged into one strand. The match/merge operation is repeated until there is only one strand left in the collection---this strand will be the original DNA; or until no matches are possible. Every program that generates a single DNA strand should generate the same strand. When no single strand can be produced, e.g., because the matching threshold (defined below) is too low, programs may generate different results depending on the order in which matches are attempted. So, either one strand is left, which will be the same in every program, or several strands are left and the leftover strands may be different.

Match and Merge

Two strands X and Y can match by overlapping, or Y can be completely contained in X (or vice versa, X can be contained in Y).

Overlapped Match

For example, if X is the strand acggtcac and Y is gtcacatta then X and Y overlap as diagrammed below.

0 1 2 3 4 5 6 7
a c g g t c a c
      g t c a c a t t a
      0 1 2 3 4 5 6 7 8

This example shows how a match between X (called the first strand of the match) and Y (called the second strand) is formed by finding an index in X (3 in the example above) so that all the bases from that index to the end of X match the corresponding bases at the beginning of Y. The number of bases forming the match, five in the example above, is the size of the match. The size of a match must exceed some threshold to count as a match. For example, if the threshold were six, the match between X and Y above would not be considered a match.

When two strands match, a merged strand is formed by copying the first strand, then appending the bases of the second strand that come after the matched bases. In the example above, the merged strand is formed by first copying X to get acggtcac then appending atta (the bases in the second strand Y from indexes 5--8) to yield the merged strand acggtcacatta. A merged strand also joins the two labels associated with a strand, see the section on file formats below for label information.

Containing Match

If strand X acggtcac completely contains a strand Y, for example cggt as shown below, then a containing match is found.

0 1 2 3 4 5 6 7
a c g g t c a c
  c g g t
  0 1 2 3

The merge of this match is simply the strand X. For a containing match no threshold is involved, all containing matches are good matches and decrease the number of strands by one when the merge is formed.

Finding Matches

For most strand pairs, a match won't be found, or if a match is found it will not exceed the desired threshold. The program you write will begin with all the strands (ultimately read from a file, for now stored in an ArrayList) then try to find matches by considering every possible pair of strands in the collection. Whenever a match is found, the strands are removed from future consideration in matching, but their merge is inserted into the collection of strands being considered. As noted previously, this process continues until there is a single strand or until no more matches can be found.


Owen L. Astrachan
Last modified: Mon Nov 14 13:19:17 EST 2005