CPS 006G, Fall 2004, Shotgun IV

See the FAQ

Part IV

Please check out gunner. You'll need to add code in the class ShotgunReconstructor in the merge method so that strands are correctly removed and added to the strand collection when merging occurs.

After you've added this code correctly, you can test your code with the simplefasta.txt file which should create this strand:

    tgaaaattcctttctattttaggcccatgcaatggcattagggcggttaa
You can also test with the largefasta.txt file which should run quickly. However, if you test with the hugefasta.txt file, you'll find that your program takes a long time to run. If you look at the code in ShotgunReconstructor you'll see that the threshold is hard-wired to be 12. Eventually you'll need to change this so that the value can be set in the main method from Shotgun.

You'll need to make two changes in this version of the program.

  1. Change the classes as needed so that the threshold for determining matches is not hard-wired to 12, but is determined in the main method that runs the shotgun simulation. You'll need to make changes in ShotgunReconstructor for this to happen.

  2. Develop one (or two, see below) new IStrand implementations that are faster. The first should use regionMatches and/or other string methods to determine if two strands overlap. Your goal is to make to develop an implementation that is correct, but which yields much faster shotgun reconstruction. The points/score you earn on this part are based 80% on correctness and 20% on the efficiency of your code (it should be faster).

    You'll need to create a new IStrandFactory implementation as well, and configure the reading class appropriately.

  3. You should experiment with your new implementation by making the contains method always return false and moving its functionality into the merge method. (This is a code-refactoring). In this case you'll want to change code in ShotgunReconstructor too so that the reduce method doesn't iterator over all strands, but simply returns false.

    This should save lots of time since you'll avoid iterating over the strings twice: once for contains and once for merge. Since the merge method already looks for matches, you could change it to look for containing matches too. In this case you can return the containing strand rather than creating a new one.

    In your final README/write-up you should discuss if your refactoring was successful, and whether it reduced the time for completing a shotgun reconstruction.

  4. For A or A+ credit you should develop an implementation based on the biojava Sequence, SymbolList, and Edit classes. These can be found in the biojava API, in the org.biojava.bio.seq package. The idea is to create an IStrand implementation that is based on the Sequence returned when reading, this will avoid, perhaps, creating strings for storing the DNA strands.

    The new IStrand implementation will store the Sequence, and use it and other biojava objects rather than Strings to do the merging and containing. You'll need to examine base-pairs in a sequence one-at-a-time, you might want to develop a method similar to the String method regionMatches for this purpose.


Owen L. Astrachan
Last modified: Sun Nov 21 10:58:50 EST 2004