CPS 006G, Fall 2004, Shotgun Operation

Basic Operation

The basic operation that's essential in the computational side of shotgun sequencing involves matching two strands from a collection and merging them into a new strand (added to the collection). This process decreases the number of strands by one since two strands are merged into one strand. The match/merge operation is repeated until there is only one strand left in the collection---this strand will be the original DNA; or until no matches are possible. Every program that generates a single DNA strand should generate the same strand. When no single strand can be produced, e.g., because the matching threshold (defined below) is too low, programs may generate different results depending on the order in which matches are attempted. So, either one strand is left, which will be the same in every program, or several strands are left and the leftover strands may be different.

Match and Merge

Two strands X and Y can match by overlapping, or Y can be completely contained in X (or vice versa, X can be contained in Y).

Overlapped Match

For example, if X is the strand acggtcac and Y is gtcacatta then X and Y overlap as diagrammed below.

0 1 2 3 4 5 6 7
a c g g t c a c
      g t c a c a t t a
      0 1 2 3 4 5 6 7 8

This example shows how a match between X (called the first strand of the match) and Y (called the second strand) is formed by finding an index in X (3 in the example above) so that all the bases from that index to the end of X match the corresponding bases at the beginning of Y. The number of bases forming the match, five in the example above, is the size of the match. The size of a match must exceed some threshold to count as a match. For example, if the threshold were six, the match between X and Y above would not be considered a match.

When two strands match, a merged strand is formed by copying the first strand, then appending the bases of the second strand that come after the matched bases. In the example above, the merged strand is formed by first copying X to get acggtcac then appending atta (the bases in the second strand Y from indexes 5--8) to yield the merged strand acggtcacatta. A merged strand also joins the two labels associated with a strand, see the section on file formats below for label information.

Containing Match

If strand X acggtcac completely contains a strand Y, for example cggt as shown below, then a containing match is found.

0 1 2 3 4 5 6 7
a c g g t c a c
  c g g t
  0 1 2 3

The merge of this match is simply the strand X. For a containing match no threshold is involved, all containing matches are good matches and decrease the number of strands by one when the merge is formed.

Finding Matches

For most strand pairs, a match won't be found, or if a match is found it will not exceed the desired threshold. The program you write will begin with all the strands (ultimately read from a file, for now stored in an ArrayList) then try to find matches by considering every possible pair of strands in the collection. Whenever a match is found, the strands are removed from future consideration in matching, but their merge is inserted into the collection of strands being considered. As noted previously, this process continues until there is a single strand or until no more matches can be found.


Owen L. Astrachan
Last modified: Wed Oct 13 09:18:35 EDT 2004