You'll need to create a CVS directory for each group (see the help/resources page). You'll also need to add the biojava jar file to your project's classpath.
You can test the program by running MainShotgun.java and reading the file simplefasta.txt, which should create this strand:
tgaaaattcctttctattttaggcccatgcaatggcattagggcggttaa
The strand has a label as well that is printed when you run the program.
You can also test with the largefasta.txt file which should run quickly. However, if you test with the hugefasta.txt file, you'll find that your program takes a long time to run. On my dual-processor G5 Mac, the current program constructs a single strand from this large data set in 637 seconds. You'll work to improve this time.
In general, you'll need to keep a careful log of the kind of machine you're running on and how long each run takes.
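One way to keep such a log is to time each run in code and print the machine details next to the result. The following is a minimal sketch only; the class name RunTimer and its methods are hypothetical and are not part of the provided code, and the loop in main is a placeholder for a real reconstruction run.

```java
class RunTimer {
    /** Describes the machine: OS, architecture, processor count, Java version. */
    static String machineInfo() {
        return System.getProperty("os.name") + " " + System.getProperty("os.arch")
                + ", " + Runtime.getRuntime().availableProcessors() + " processor(s)"
                + ", Java " + System.getProperty("java.version");
    }

    /** Runs the task once and returns the elapsed wall-clock time in seconds. */
    static double timeSeconds(Runnable task) {
        long start = System.nanoTime();
        task.run();
        return (System.nanoTime() - start) / 1e9;
    }

    public static void main(String[] args) {
        double secs = timeSeconds(() -> {
            // Placeholder workload; a real log entry would time a reconstruction run.
            long sum = 0;
            for (int i = 0; i < 1_000_000; i++) {
                sum += i;
            }
            if (sum < 0) {
                throw new IllegalStateException("unexpected overflow");
            }
        });
        System.out.println(machineInfo());
        System.out.printf("run took %.3f seconds%n", secs);
    }
}
```

Copying both printed lines into your log after each run keeps the timing and the machine description together.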
If you look at the code in ShotgunReconstructor you'll see that the merge threshold is hard-wired to be 12. You'll need to refactor code (introduce parameters, instance variables, etc.) so that the threshold value can be set in the main method from MainShotgun.java and then be passed appropriately when the genome is reconstructed.
Then you should make several runs using the largefasta.txt data file to determine the minimal and maximal thresholds that result in one strand being reconstructed. In the README/final report you produce, you should indicate how you determined these numbers. Do the same for the simplefasta.txt data file.
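The refactoring described above might look roughly like the sketch below: the threshold becomes an instance variable set through a constructor, and main decides its value. This is an illustration under assumptions, not the actual ShotgunReconstructor code; the class ThresholdSketch, its methods, and the idea of reading the threshold from the command line are all hypothetical.

```java
class ThresholdSketch {
    // 12 is the value the handout says is currently hard-wired.
    static final int DEFAULT_THRESHOLD = 12;

    // The threshold is now state supplied by the caller, not a constant in the logic.
    private final int threshold;

    ThresholdSketch(int threshold) {
        if (threshold < 1) {
            throw new IllegalArgumentException("threshold must be positive: " + threshold);
        }
        this.threshold = threshold;
    }

    int getThreshold() {
        return threshold;
    }

    /** Reads the threshold from the first argument, falling back to the default. */
    static int parseThreshold(String[] args) {
        return args.length > 0 ? Integer.parseInt(args[0]) : DEFAULT_THRESHOLD;
    }

    public static void main(String[] args) {
        // The main method picks the value and hands it to the reconstructor.
        ThresholdSketch reconstructor = new ThresholdSketch(parseThreshold(args));
        System.out.println("merge threshold = " + reconstructor.getThreshold());
    }
}
```

With the value set in one place, finding the minimal and maximal thresholds becomes a matter of re-running with different arguments and recording the number of strands produced.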
regionMatches and/or other string methods to determine if two strands overlap. Your goal is to develop an implementation that is correct, but which yields much faster shotgun reconstruction. The points/score you earn on this part are based 80% on correctness and 20% on the efficiency of your code (it should be much faster).
In this new version you will not use regular expressions to determine if
two strings overlap; instead, you'll check whether the end of one strand
matches the beginning of another strand by using the String
regionMatches method. You'll need to be careful to ensure
that you handle the threshold properly. You can use the results of your
minimal and maximal thresholds from the previous section in your new
implementation. On my dual-processor G5 Mac, which took 637 seconds for the
slow implementation, the region-matches implementation took 9.686
seconds. That's a lot faster!
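As a rough illustration of the end-matches-beginning check, here is a minimal sketch using String.regionMatches. The class OverlapSketch and its method names are hypothetical, and preferring the longest qualifying overlap is an assumption; the provided code may choose differently.

```java
class OverlapSketch {
    /**
     * Returns the length of the longest suffix of left that equals a prefix
     * of right, provided that length is at least threshold; otherwise -1.
     * Assumption: the longest qualifying overlap is the one wanted.
     */
    static int overlap(String left, String right, int threshold) {
        int max = Math.min(left.length(), right.length());
        for (int len = max; len >= threshold; len--) {
            // Compare the last len characters of left with the first len of right.
            if (left.regionMatches(left.length() - len, right, 0, len)) {
                return len;
            }
        }
        return -1;
    }

    /** Joins left and right across their overlap, or returns null if there is none. */
    static String merge(String left, String right, int threshold) {
        int len = overlap(left, right, threshold);
        if (len < 0) {
            return null;
        }
        return left + right.substring(len);
    }

    public static void main(String[] args) {
        System.out.println(merge("aaccgg", "cggtta", 3)); // prints aaccggtta
        System.out.println(merge("aaccgg", "cggtta", 4)); // prints null
    }
}
```

Note how the threshold bounds the loop: overlaps shorter than the threshold are never examined, which is one place where an off-by-one error would change your results.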
You'll need to create a new IStrandFactory implementation as well, and configure the reading class appropriately from the main launching program. You should re-run all your previous simulations to see that your new implementation is correct. Be sure to include in your README/final report all the results you obtain.
NonReducingReconstructor.java. In this class, the
doGun method will not call
reduce; it will only call
merge. You'll create a new implementation of
IStrand based on the
FastStrand class you created earlier. Call this new class
In this class, you'll move the
contains logic/code into the
merge method of the new
IStrand object you're
creating. The idea is that if the entire other strand matches
somewhere in this strand, you should not create a new strand, but should
return this, the strand containing the other
strand. You'll need to think carefully about what the preceding
All String matching should be done with
should not create substrings nor should you call
With the new
NonReducingReconstructor combined with the
NoReduceFastStrand class you'll see
lots of time saved, since you'll avoid iterating over
the strings twice: once for contains and once for merge.
In your final README/write-up you should discuss whether your refactoring was
successful, and whether it reduced the time for completing a shotgun
reconstruction. Be sure to rerun all experiments and report the
times you observe.
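To make the idea of folding the contains check into merge concrete, here is a minimal sketch that slides the other strand along this one in a single loop, using only regionMatches and no substrings. The class ContainMergeSketch is hypothetical and works on plain Strings rather than IStrand objects; it also assumes that only the "this, then other" orientation is tried and that a contained strand is absorbed regardless of the threshold. Your design may differ on both points.

```java
class ContainMergeSketch {
    /**
     * Returns self if other lies wholly inside it, the joined string if the end
     * of self overlaps the start of other by at least threshold characters,
     * or null if neither holds.
     */
    static String merge(String self, String other, int threshold) {
        for (int start = 0; start < self.length(); start++) {
            // Number of characters of other that fit in self from this position.
            int len = Math.min(other.length(), self.length() - start);
            if (len < other.length() && len < threshold) {
                break; // only overlaps shorter than the threshold remain
            }
            if (self.regionMatches(start, other, 0, len)) {
                if (len == other.length()) {
                    return self; // containment: no new strand is needed
                }
                // Overlap: append the unmatched tail of other without a substring.
                return new StringBuilder(self).append(other, len, other.length()).toString();
            }
        }
        return null;
    }

    public static void main(String[] args) {
        System.out.println(merge("aaccggtt", "ccgg", 3)); // prints aaccggtt
        System.out.println(merge("aaccgg", "cggtta", 3)); // prints aaccggtta
    }
}
```

Because the start position only moves rightward, the first match found is either a containment or the longest available overlap, so one loop covers both cases.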
IStrand implementation that is based on the Sequence returned when reading; this will avoid, perhaps, creating strings for storing the DNA strands.
IStrand implementation will store
the Sequence, and use it and other biojava objects rather than
Strings to do the merging and containing. You'll need to
examine base-pairs in a sequence one at a time; you might want
to develop a method similar to the String method
regionMatches for this purpose.
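A helper of this kind could be sketched as follows. Since the biojava API is not reproduced here, the sketch uses java.lang.CharSequence as a stand-in for a sequence of base-pairs; the class SequenceRegionSketch is hypothetical, and the real library's indexing conventions and element type may differ, so check the biojava documentation before adapting it.

```java
class SequenceRegionSketch {
    /**
     * Returns true if len elements of a starting at aStart equal len elements
     * of b starting at bStart, comparing one element at a time.
     * CharSequence is a stand-in for a biojava sequence (an assumption).
     */
    static boolean regionMatches(CharSequence a, int aStart,
                                 CharSequence b, int bStart, int len) {
        // Reject regions that fall outside either sequence.
        if (len < 0 || aStart < 0 || bStart < 0
                || aStart > a.length() - len || bStart > b.length() - len) {
            return false;
        }
        for (int i = 0; i < len; i++) {
            if (a.charAt(aStart + i) != b.charAt(bStart + i)) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        System.out.println(regionMatches("aaccgg", 3, "cggtta", 0, 3)); // prints true
        System.out.println(regionMatches("aaccgg", 3, "cggtta", 0, 4)); // prints false
    }
}
```

The comparison stops at the first mismatch, so no intermediate strings are built for regions that do not match.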
I'll help with this; I haven't done it (as of Nov 2005), so I don't know how hard it is.