CPS 006G, Fall 2004, DNA->Proteins

Groups for Part II

Generally, three base-pairs in DNA code for one amino acid, and a chain of amino acids forms a protein (the process is complex, but the general idea is simple). To help build your nascent Java programming skills, you'll work on the process of generating the linear protein structure from a digital representation of DNA. Your task is to do this in a straightforward way, without using complicated Java data structures or code, basically using the language tools we've talked about and seen in class. This will involve some code-drudgery, but we'll use this drudgery as a motivator for exploring better (and more exciting!) ways of solving the problem.

General Idea

Given a strand of DNA represented as a sequence of base-pairs (A, T, C, G), find the location of the first start codon ATG in the strand and code the protein that begins after the start codon and ends at a stop codon: one of TAA, TAG, or TGA. Your code will return a protein sequence, not including the start and stop codons, based on the codon-triples found. Use the standard letters for the amino acids found, e.g., in Lesk, pages 6-7, this Wikipedia definition (which uses mRNA, hence U rather than T), this Nova/PBS demo (explore a stretch of code), and this page of abbreviations.

For example, if your program processes this strand:

   CGATGCATCCCTTTAATTAA
it should return
   HPFN
which represents the protein sequence Histidine, Proline, Phenylalanine, Asparagine.

Programming Requirements

Please use BlueJ to write, compile, and test your code for this assignment. You'll submit using Eclipse, and you'll be coding in Eclipse soon, but you don't need it for coding in this assignment.

You'll code by yourself for one week, then have a partner for one week. You'll need to turn in Part 1 working by yourself.

You must write a class named DnaToProtein with a method named convert that converts a String representing a strand of DNA to a String representing the first-found protein as described above. In brief, find the first start codon, find a stop codon after this, and convert the codons between these. If no start codon or no stop codon is found, return an empty string: "".

In implementing DnaToProtein you must write and use another class named CodonToProtein, which is started here. You'll need to complete the method convert shown in the code. The comments are intended to be enough of a specification; if you have questions, use the class bulletin board.
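As a rough illustration of the kind of lookup this describes, here is a minimal sketch. The class name CodonToProteinSketch, the String-in/String-out signature, and the "?" result for unlisted codons are all assumptions made for illustration; the comments in the starter code are the actual specification. Only a few of the 64 codons are listed, so the full table still has to be filled in.

```java
// Hypothetical sketch: names and signature are assumptions, not the starter code.
class CodonToProteinSketch {
    /**
     * Returns the one-letter amino acid for a three-base DNA codon.
     * Only a few codons are listed; "?" marks anything unlisted.
     */
    String convert(String codon) {
        if (codon.equals("CAT") || codon.equals("CAC")) {
            return "H"; // Histidine
        } else if (codon.equals("CCC")) {
            return "P"; // Proline
        } else if (codon.equals("TTT") || codon.equals("TTC")) {
            return "F"; // Phenylalanine
        } else if (codon.equals("AAT") || codon.equals("AAC")) {
            return "N"; // Asparagine
        } else if (codon.equals("AAA") || codon.equals("AAG")) {
            return "K"; // Lysine
        }
        return "?"; // assumption: placeholder for codons not handled yet
    }

    public static void main(String[] args) {
        CodonToProteinSketch table = new CodonToProteinSketch();
        System.out.println(table.convert("CAT")); // prints H
    }
}
```

A chain of if/else tests is long-winded for 64 codons, but it stays within the plain language tools the assignment asks for.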

To find the start and stop codons you must use one of the two String indexOf methods that search for one String in another. The specification for these can be found in the online javadoc for java.lang.String, which is part of the Javadoc for all Java classes. You'll need the two-parameter indexOf method to find a stop codon that occurs after the start codon.
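One way the two-parameter indexOf might be used is sketched below. It assumes that a stop codon only counts when it lines up on a codon boundary after the start codon, which the example strand suggests: in CGATGCATCCCTTTAATTAA the first TAA that indexOf finds after the start codon is at index 13, straddling the codons TTT and AAT, while the stop codon that yields HPFN is at index 17. The class and method names here are hypothetical.

```java
// Hypothetical sketch: class and method names are assumptions for illustration.
class StopCodonFinderSketch {
    /**
     * Returns the index of the earliest stop codon at or after 'from' whose
     * distance from 'from' is a multiple of three, or -1 if there is none.
     */
    static int firstInFrameStop(String dna, int from) {
        String[] stops = {"TAA", "TAG", "TGA"};
        int best = -1;
        for (int k = 0; k < stops.length; k++) {
            int idx = dna.indexOf(stops[k], from);
            // Skip matches that straddle two codons (out of frame).
            while (idx != -1 && (idx - from) % 3 != 0) {
                idx = dna.indexOf(stops[k], idx + 1);
            }
            if (idx != -1 && (best == -1 || idx < best)) {
                best = idx;
            }
        }
        return best;
    }

    public static void main(String[] args) {
        String dna = "CGATGCATCCCTTTAATTAA";
        int start = dna.indexOf("ATG");              // one-parameter indexOf
        int stop = firstInFrameStop(dna, start + 3); // search begins after ATG
        System.out.println(start + " " + stop);      // prints 2 17
    }
}
```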

Testing Your Code

A testing class DnaToProteinTester has been started for you. You should fill in the main method with as many test cases as you think are needed to test your code (and to test anyone else's code). When testing the code you write to convert DNA to a protein, you can use the testing class as well as the class that actually does the conversion. Ultimately you'll want to run just the testing class since it will thoroughly test your code every time it's run.
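A small checking helper can keep the test cases in main short. The sketch below is only one possible shape; the class name TesterSketch and the methods check and failureCount are hypothetical and are not part of the provided DnaToProteinTester.

```java
// Hypothetical helper for a tester class; all names are assumptions.
class TesterSketch {
    private int failures = 0;

    /** Compares expected and actual protein strings and reports any mismatch. */
    boolean check(String dna, String expected, String actual) {
        if (expected.equals(actual)) {
            return true;
        }
        failures++;
        System.out.println("FAILED on " + dna + ": expected \"" + expected
                + "\" but got \"" + actual + "\"");
        return false;
    }

    /** Number of failed checks so far. */
    int failureCount() {
        return failures;
    }
}
```

In main, each case would pass the strand, the expected protein, and the result of calling your convert method on that strand. Cases worth considering include a strand with no ATG, a strand with ATG but no stop codon (both should give ""), and the example strand above, which should give HPFN.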

Refactor for Full Credit

Before starting this part of the program you should submit your working code. Use dnaprotein-1 as the assignment name. When you're confident of your code and your tests, you can begin to refactor your code for the next part of the assignment, described below. When submitting, you'll need to submit all your .java files and a README as described in the general instructions for the course (it includes the time you spent, the names of people you talked to, and your impressions of the program).

After you've tested your program and are confident that it works you should change the method signature of the method convert in CodonToProtein. Currently it takes a String parameter. This means you'll have used the substring method of the String class to find codon triplets. As we'll discuss, this isn't very efficient and for large strands of DNA this efficiency could be important. Instead, the code in CodonToProtein should be rewritten to use the String method regionMatches. You'll need to pass the entire String/DNA-strand to the convert method in CodonToProtein and an index at which the codon being checked starts. The general idea is to avoid creating substrings in the DnaToProtein code you write. Instead, pass an entire String and an index for each codon converted in the process of converting a region of DNA to protein. As your code in DnaToProtein loops over codons between the start and stop codon, the index passed will change, but the DNA strand passed remains the same.
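To make the refactoring concrete, here is a minimal sketch of an index-based convert that uses regionMatches. The class name, the translate helper, and the "?" placeholder are assumptions for illustration, and only the codons from the example strand are listed. The call dna.regionMatches(index, "CAT", 0, 3) compares three characters of dna starting at index with the three characters of "CAT", without building a new String; it returns false rather than throwing if the region runs past the end of dna.

```java
// Hypothetical sketch: names are assumptions; only a few codons are listed.
class CodonAtIndexSketch {
    /** One-letter amino acid for the codon that starts at 'index' in 'dna'. */
    String convert(String dna, int index) {
        if (dna.regionMatches(index, "CAT", 0, 3)) return "H"; // Histidine
        if (dna.regionMatches(index, "CCC", 0, 3)) return "P"; // Proline
        if (dna.regionMatches(index, "TTT", 0, 3)) return "F"; // Phenylalanine
        if (dna.regionMatches(index, "AAT", 0, 3)) return "N"; // Asparagine
        return "?"; // assumption: placeholder for codons not handled yet
    }

    /** Converts the codons from 'first' up to (not including) 'stop'. */
    String translate(String dna, int first, int stop) {
        String protein = "";
        for (int i = first; i < stop; i += 3) {
            protein += convert(dna, i); // same strand each time, new index
        }
        return protein;
    }

    public static void main(String[] args) {
        CodonAtIndexSketch c = new CodonAtIndexSketch();
        // Start codon at index 2, stop codon at index 17.
        System.out.println(c.translate("CGATGCATCCCTTTAATTAA", 5, 17)); // prints HPFN
    }
}
```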

Submit

When the refactored version works submit all your .java files and a new README using the assignment dnaprotein-2 as the assignment name.

A+/Extra Credit

There are two ways to earn A+/extra credit. If your testing code includes valid tests that cause other solutions to fail, you'll earn 2 points for each such test.

For more extra credit you'll write and test a new method in DnaToProtein that returns all the proteins found in a string of DNA. This means you'll need to find every start codon and its corresponding stop codon and convert the codon-regions to proteins. Your new method should return an array of String objects, where the String stored at index zero is the first protein found, the String stored at index one is the second protein found, etc. If no proteins are found, return an array with no elements.

Your new method should be named convertAll; its signature follows.

   String[] convertAll(String dna)

In writing this method, you'll want to call the convert method in DnaToProtein that you've already written and tested. In addition to writing the convertAll code you'll need to write testing code. You should add a new series of tests in the class DnaToProteinTester. To facilitate passing an array representing the correct proteins, see the valid Java code fragment that follows for testing a DNA strand with no proteins and one with two proteins.

   testAll("AGTA", new String[]{});
   testAll("ATGAAAGATCATTAAATCCATGTGTGAATAGCCT", new String[]{"KDH", "CE"});

For extra-extra credit your convertAll code should not create any substrings. Don't worry about this part at first. You should write and test convertAll and when you're confident it works properly, try to think of a refactoring that avoids substring creation.
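For reference, one possible shape for a substring-free convertAll is sketched below. It is not the only approach and makes several assumptions that the handout does not settle: the search for the next start codon resumes just past the previous stop codon, a start codon with no in-frame stop codon is skipped, and the helper names aminoAcid and inFrameStop are hypothetical. The codon table covers only the codons in the two-protein test strand.

```java
// Hypothetical sketch; helper names and edge-case choices are assumptions.
class ConvertAllSketch {

    /** One-letter amino acid for the codon starting at i; partial table only. */
    static String aminoAcid(String dna, int i) {
        if (dna.regionMatches(i, "AAA", 0, 3)) return "K"; // Lysine
        if (dna.regionMatches(i, "GAT", 0, 3)) return "D"; // Aspartic acid
        if (dna.regionMatches(i, "CAT", 0, 3)) return "H"; // Histidine
        if (dna.regionMatches(i, "TGT", 0, 3)) return "C"; // Cysteine
        if (dna.regionMatches(i, "GAA", 0, 3)) return "E"; // Glutamic acid
        return "?"; // assumption: placeholder for codons not listed above
    }

    /** Earliest stop codon in frame with 'from', or -1 if there is none. */
    static int inFrameStop(String dna, int from) {
        String[] stops = {"TAA", "TAG", "TGA"};
        int best = -1;
        for (int k = 0; k < stops.length; k++) {
            int idx = dna.indexOf(stops[k], from);
            while (idx != -1 && (idx - from) % 3 != 0) {
                idx = dna.indexOf(stops[k], idx + 1);
            }
            if (idx != -1 && (best == -1 || idx < best)) {
                best = idx;
            }
        }
        return best;
    }

    static String[] convertAll(String dna) {
        // Each protein uses at least a start and a stop codon (6 bases),
        // so length / 6 is a safe upper bound on the number found.
        String[] found = new String[dna.length() / 6];
        int count = 0;
        int start = dna.indexOf("ATG");
        while (start != -1) {
            int stop = inFrameStop(dna, start + 3);
            if (stop == -1) {
                // Assumption: skip a start codon that has no in-frame stop.
                start = dna.indexOf("ATG", start + 1);
            } else {
                String protein = "";
                for (int i = start + 3; i < stop; i += 3) {
                    protein += aminoAcid(dna, i);
                }
                found[count] = protein;
                count++;
                start = dna.indexOf("ATG", stop + 3); // resume after the stop
            }
        }
        // Copy into an array that holds exactly the proteins found.
        String[] result = new String[count];
        System.arraycopy(found, 0, result, 0, count);
        return result;
    }
}
```

The sketch sizes a scratch array up front and copies the filled part at the end, which avoids needing any collection class beyond a plain array.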

Submit

Submit all your .java files and a README using dnaprotein-xtra as the assignment name.

Grading

For each of the two required parts and the one optional part of this assignment you'll submit three .java files and a README. The Java files are DnaToProtein, CodonToProtein, and DnaToProteinTester. Grading criteria follow.

First Code
   DnaToProtein       10 points   Works correctly (7 points), well-structured and written (3 points)
   CodonToProtein      4 points   All cases covered, well-structured
   testing            10 points   Thorough and complete
   README              2 points   Exists, complete
Refactored Code
   DnaToProtein        6 points   Works correctly (4 points), well-structured and written (2 points)
   CodonToProtein      2 points   All cases covered, well-structured
   README              2 points   Exists and complete
Extra Credit
   DnaToProtein/all   10 points   Works correctly (7 points), well-structured and written (3 points)
   testing            10 points   Thorough and complete
   testing breaks other code   +2 per valid test   Awarded if a valid test causes other code to fail


Owen L. Astrachan
Last modified: Sat Sep 18 11:54:37 EDT 2004