Compsci 004G, Fall 2005, Codons

Part I: Coding for Codons

Generally, three consecutive base-pairs in DNA (a codon) code for one amino acid, and a sequence of amino acids makes up a protein (the process is complex, but the general idea is simple). To help build your nascent Java programming skills, you'll work on the process of generating the linear protein structure from a digital representation of DNA. Your task is to do this in a straightforward way, without using complicated Java data structures or code, basically using the language tools we've talked about and seen in class. This will involve some code drudgery, but we'll use this drudgery as a motivator for exploring better (and more exciting!) ways of solving the problem.

General Idea

Given a strand of DNA represented as a sequence of base-pairs (A, T, C, G), find the location of the first start codon ATG in the strand and code the protein that begins after the start codon and ends at a stop codon: one of TAA, TAG, or TGA. Your code will return a protein sequence, not including the start and stop codons, based on the codon triples found. Use the standard one-letter codes for the amino acids, found, e.g., in Lesk, pages 6-7; in the Wikipedia definition (which uses mRNA, hence U rather than T); in the Nova/PBS demo "explore a stretch of code"; and in the page of abbreviations.

For example, if your program processes this strand:

   CGATGCATCCCTTTAATTAA
it should return
   HPFN
which represents the protein sequence Histidine, Proline, Phenylalanine, Asparagine.

Coding Details

You must write a class named DnaToProtein with a method named convert that converts a String representing a strand of DNA to a String representing the first-found protein as described above. In brief, find the first start codon, find a stop codon after it, and convert the codons between them. If either the start codon or a stop codon after it is not found, return an empty string: "".

In implementing DnaToProtein you must write and use another class named CodonToProtein, for which starter code is provided. You'll need to complete the method getCodonLabel shown in that code. The comments are intended to be enough of a specification; if you have questions, use the class bulletin board.
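As a rough illustration of the kind of code getCodonLabel calls for, here is a partial sketch. The signature (a three-letter String in, a one-letter String out), the class name CodonLabelSketch, the "?" placeholder, and the translate helper are all assumptions for illustration; the comments in the starter code remain the actual specification. Only the four amino acids from the example strand are covered.

```java
// Hypothetical sketch; not the starter code. Covers only a few codons.
class CodonLabelSketch {

    // Assumed signature: a three-letter DNA codon in, a one-letter
    // amino-acid code out.
    static String getCodonLabel(String codon) {
        if (codon.equals("CAT") || codon.equals("CAC")) {
            return "H"; // Histidine
        } else if (codon.equals("CCT") || codon.equals("CCC")
                || codon.equals("CCA") || codon.equals("CCG")) {
            return "P"; // Proline
        } else if (codon.equals("TTT") || codon.equals("TTC")) {
            return "F"; // Phenylalanine
        } else if (codon.equals("AAT") || codon.equals("AAC")) {
            return "N"; // Asparagine
        }
        // The remaining codons are deliberately left out of this sketch.
        return "?"; // placeholder chosen here, not part of the assignment
    }

    // Hypothetical helper: walks a coding region three letters at a time
    // and concatenates the labels. A trailing partial codon is ignored.
    static String translate(String region) {
        String protein = "";
        for (int i = 0; i + 3 <= region.length(); i += 3) {
            protein += getCodonLabel(region.substring(i, i + 3));
        }
        return protein;
    }
}
```

With this sketch, the region between the start and stop codons of the example strand, CATCCCTTTAAT, translates to HPFN. The long if/else chain is exactly the drudgery mentioned in Part I.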

To find the start and stop codons you must use the String method indexOf, which comes in two versions that search for one String in another. The specification for these can be found in the online javadoc for java.lang.String, which is part of the Javadoc for all Java classes. You'll need the two-parameter indexOf method to find a stop codon that occurs after the start codon.
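The sketch below shows both versions of indexOf at work on the search described above. The class name CodingRegionSketch and its two methods are hypothetical. It also rests on an assumption: in the example strand the first TAA after the start codon sits at index 13, which is not a whole number of codons past the start, so the sketch skips such hits and stops at the first stop codon aligned with the reading frame. That choice is what yields HPFN for the example, but it is one reading of the assignment, not a stated requirement.

```java
// Hypothetical sketch; names are illustrative, not part of the starter code.
class CodingRegionSketch {
    static final String START = "ATG";
    static final String[] STOPS = {"TAA", "TAG", "TGA"};

    // Returns the index of the earliest stop codon at or after 'from' that
    // lies a whole number of codons past 'from', or -1 if there is none.
    static int firstInFrameStop(String dna, int from) {
        int best = -1;
        for (String stop : STOPS) {
            int pos = dna.indexOf(stop, from);        // two-parameter indexOf
            while (pos != -1 && (pos - from) % 3 != 0) {
                pos = dna.indexOf(stop, pos + 1);     // skip out-of-frame hits
            }
            if (pos != -1 && (best == -1 || pos < best)) {
                best = pos;
            }
        }
        return best;
    }

    // Returns the bases between the first start codon and the first in-frame
    // stop codon, or "" if either one is missing.
    static String codingRegion(String dna) {
        int start = dna.indexOf(START);               // one-parameter indexOf
        if (start == -1) {
            return "";
        }
        int first = start + START.length();
        int stop = firstInFrameStop(dna, first);
        if (stop == -1) {
            return "";
        }
        return dna.substring(first, stop);
    }
}
```

For the example strand CGATGCATCCCTTTAATTAA, the start codon is at index 2 and the in-frame TAA is at index 17, so codingRegion returns CATCCCTTTAAT.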

Testing Your Code

A testing class DnaToProteinTester has been started for you. You should fill in the main method with as many test cases as you think are needed to test your code (and to test anyone else's code). When testing the code you write to convert DNA to a protein, you can use the testing class as well as the class that actually does the conversion. Ultimately you'll want to run just the testing class since it will thoroughly test your code every time it's run.
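One possible shape for a test helper is sketched below. The class name TesterSketch, the check method, and the failures counter are hypothetical and are not taken from the DnaToProteinTester starter code; the idea is simply to compare an expected String with the actual result and report any mismatch.

```java
// Hypothetical sketch of a test helper; not the provided DnaToProteinTester.
class TesterSketch {
    static int failures = 0; // number of failed checks so far

    // Compares expected and actual, prints one line per check, and
    // returns true when they match.
    static boolean check(String label, String expected, String actual) {
        boolean ok = expected.equals(actual);
        String status = ok ? "pass" : "FAIL";
        System.out.println(status + ": " + label
                + " expected \"" + expected + "\" got \"" + actual + "\"");
        if (!ok) {
            failures++;
        }
        return ok;
    }
}
```

A main method could then pass the result of convert as the third argument, for instance expecting HPFN for the example strand and "" for a strand with no start codon, a strand with a start codon but no stop codon, and an empty strand.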

Part II: Forensic Discovery

You've discovered a torn page in a lab notebook containing the labels/identifiers NM_001618 and U87459. You're fairly confident the lab notebook belongs to someone doing cancer-related research, but you're not sure exactly what the research is or to what the identifiers refer.

You can paste either identifier into the NCBI ORF Finder, although this probably won't help you determine exactly what was being done in the lab whose notebook has been lost.

Interpret the results of the ORF Finder and make a best guess as to what research was being done by the lab whose notebook is lost. Justify your answer appropriately: your grade will be based on the quality of your justification (not on the correctness of your result, which can be argued).


Owen L. Astrachan
Last modified: Mon Sep 19 20:56:57 EDT 2005