CPS 006G, Fall 2004, Shotgun Operation
Basic Operation
The basic operation that's essential in the computational side of
shotgun sequencing involves matching two strands from a
collection and merging them into a new strand (added to the
collection). This process decreases the number of strands by one since
two strands are merged into one strand. The match/merge operation is
repeated until there is only one strand left in the collection---this
strand will be the original DNA; or until no matches are possible.
Every program that generates a single DNA strand should generate the
same strand. When no single strand can be produced, e.g., because the
matching threshold (defined below) is too low, programs may
generate different results depending on the order in which matches are
attempted. So, either one strand is left, which will be the same in
every program, or several strands are left and the leftover strands may
be different.
Match and Merge
Two strands X and Y can match by overlapping, or Y can be completely
contained in X (or vice versa, X can be contained in Y).
Overlapped Match
For example, if X is the strand acggtcac and Y is
gtcacatta then X and Y overlap as diagrammed below.
0
| 1
| 2
| 3
| 4
| 5
| 6
| 7
|
a
| c
| g
| g
| t
| c
| a
| c
|
|
|
| g
| t
| c
| a
| c
| a
| t
| t
| a
|
|
|
| 0
| 1
| 2
| 3
| 4
| 5
| 6
| 7
| 8
|
This example shows how a match between X (called the first strand
of the match) and Y (called the second strand)
is formed by finding an index in X (3 in
the example above) so that all the bases from that index to the end of X
match the corresponding bases at the beginning of Y. The number of bases
forming the match, five in the example above, is the
size of the match. The size of a match must exceed
some threshold to count as a match. For example, if the threshold were
six, the match between X and Y above would not be considered a
match.
When two strands match, a merged strand is formed
by copying the first strand, then appending
the bases of the second strand that come after the matched bases. In
the example above, the merged strand is formed by first copying
X to get acggtcac then appending atta (the bases in
the second strand Y from indexes 5--8) to yield the merged strand
acggtcacatta. A merged strand also joins the two labels
associated with a strand, see the section on file formats below for
label information.
Containing Match
If strand X acggtcac completely contains a strand Y, for
example cggt as shown below, then a containing match is found.
0
| 1
| 2
| 3
| 4
| 5
| 6
| 7
|
a
| c
| g
| g
| t
| c
| a
| c
|
| c
| g
| g
| t
|
| 0
| 1
| 2
| 3
|
The merge of this match is simply the strand X. For a containing match
no threshold is involved, all containing matches are good matches and
decrease the number of strands by one when the merge is formed.
Finding Matches
For most strand pairs, a match won't be found, or if a match is found it
will not exceed the desired threshold. The program you write will begin
with all the strands (ultimately read from a file, for now stored
in an ArrayList)
then try to
find matches by considering every possible pair of strands in the collection.
Whenever a match is found, the strands are removed from future
consideration in matching, but their merge is inserted into the
collection of strands being considered. As noted previously, this
process continues until there is a single strand or until no more
matches can be found.
Owen L. Astrachan
Last modified: Wed Oct 13 09:18:35 EDT 2004