Introduction to Computer Science
CompSci 101 : Spring 2014

HOWTO: Bioinformatics

Conceptually you will need to write code to do each of the following tasks:

  1. Given a string of DNA return the first valid protein-coding region.
  2. Given a codon (a three-character string), return the corresponding amino acid (a one-character string), using the two lists provided in the code.
  3. Given a substring of DNA that encodes a protein return the string that represents the protein, the letters corresponding to amino acids.
  4. Given a string of DNA return its reverse complement.
  5. Given a string of DNA return the longest protein it or its reverse complement encodes.

You may decide to write a separate function for each task or to write the code for a task as part of another function. Making each task a separate function lets you test and debug the tasks independently. The only function you are required to write is the following:

def translateDNAtoProtein (dna):
    """
     given a string composed only of the lowercase letters 'gcta',
     return a string of uppercase letters that represents the
     longer of the first proteins found in that string and in its
     reverse complement, or the empty string if no protein can be found
    """

This function can be found in the Python module Bioinformatics in the code given to start the project. Two other modules are given to help you test your project: InputGUI pops up a window to run your program, and InputConsole runs your program interactively from within the console. Both simply call the function above with the input you give. Additionally, several example files of DNA and their associated proteins are given in the examples folder.

Find Protein Coding Region

The first step is to find the first location of a start codon in a string of DNA. If there is no start codon, the corresponding protein coding region is the empty string, “”. If there is a start codon, then find the closest of the three stop codons, i.e., the one that has the lowest index after the start codon. If there are no stop codons, the corresponding protein coding region is the empty string, “”.

To find the closest stop codon, you cannot simply use the string function find because it may find a triplet of characters whose position is not a multiple of three away from the start codon, i.e., something that should not be interpreted as a stop codon because part of it is already included in a previous codon. Thus, you should look for the closest stop codon by directly checking each DNA triplet after the start codon until one matches a stop codon.

For example, in the DNA string

aattatgcccgggtttaaataaatagccctgattt

the start codon, 'atg', begins at index 4, and there are four possible stop codons: 'taa' at index 15, 'taa' at index 19, 'tag' at index 23, and 'tga' at index 29. The first possible stop codon, 'taa', begins 8 characters after the end of the start codon so, technically, its 't' is part of the previous codon, 'ttt', and it should not be considered a stop codon. The second possible stop codon, also 'taa', begins 12 characters after the end of the start codon, so it is a complete codon in this region and stops the process. No later stop codons are considered, and the substring found is 'cccgggtttaaa', the string between, but not including, the start and stop codons.
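The search described above can be sketched as follows. This is only one possible approach, not the required solution; the function name find_coding_region is hypothetical and is not part of the given code.

```python
START_CODON = 'atg'
STOP_CODONS = ['taa', 'tag', 'tga']

def find_coding_region(dna):
    """
     return the DNA between the first start codon and the closest
     stop codon that lines up on a triplet boundary, or the empty
     string if either one is missing
     (hypothetical helper name, not part of the given code)
    """
    start = dna.find(START_CODON)
    if start == -1:
        return ''                      # no start codon at all
    begin = start + len(START_CODON)   # first character after 'atg'
    # step by 3 so only complete triplets after the start codon are checked
    for i in range(begin, len(dna) - 2, 3):
        if dna[i:i + 3] in STOP_CODONS:
            return dna[begin:i]
    return ''                          # no stop codon on a triplet boundary

print(find_coding_region('aattatgcccgggtttaaataaatagccctgattt'))  # cccgggtttaaa
```

Stepping through the string three characters at a time is what skips the 'taa' that overlaps the codon 'ttt' in the example.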

In the given code, two variables are defined for you:

START_CODON = 'atg'
STOP_CODONS = [ 'taa', 'tag', 'tga' ]

These provide names for the start and stop codon values, making your code easier to read and easier to update if these values ever change, so we strongly suggest you use them in your program.

Codon to Amino Acid

In the given code, two list variables are defined: one of codon triplets and the other of the corresponding amino acids (single letters). Part of each list is shown here:

CODONS = [ 'gct', 'gcc', 'gca', 'gcg', …, 'tgg', 'tat', 'tac']
AMINO_ACIDS = ['A', 'A', 'A', 'A', …, 'W', 'Y', 'Y']
  

Both lists have 61 elements and are defined in parallel so that the three-letter codon in CODONS[k] corresponds to the amino acid whose letter is found in AMINO_ACIDS[k]. For example, CODONS[14] is ‘gga’ while AMINO_ACIDS[14] is ‘G’ since ‘gga’ codes for the amino acid Glycine, or ‘G’, as seen in these resources (as a table and as a wheel).

The main step is, given one codon (a DNA triplet), to use its index in CODONS to find the corresponding amino acid in AMINO_ACIDS. If a codon does not code for an amino acid, it will not be found in CODONS; return the string ‘X’ in this case so the error can be easily found later. Getting an 'X' indicates an error in your code or in the data, because all 64 possible combinations of 'a', 't', 'g', and 'c' are included in either the STOP_CODONS list or the CODONS list, so it should never happen on a valid region.
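A minimal sketch of the parallel-list lookup is shown here. It assumes two short stand-in lists, SAMPLE_CODONS and SAMPLE_AMINO_ACIDS, built only from the entries shown above; they are not the full 61-element lists from the given code, and the function name is hypothetical.

```python
# Stand-in lists for illustration only: just the entries shown in the
# handout, NOT the full 61-element CODONS and AMINO_ACIDS lists.
SAMPLE_CODONS = ['gct', 'gcc', 'gca', 'gcg', 'tgg', 'tat', 'tac']
SAMPLE_AMINO_ACIDS = ['A', 'A', 'A', 'A', 'W', 'Y', 'Y']

def codon_to_amino_acid(codon, codons=SAMPLE_CODONS, amino_acids=SAMPLE_AMINO_ACIDS):
    """
     return the amino acid letter stored at the same index as codon
     in the parallel lists, or 'X' if codon is not in the list
     (hypothetical helper name, not part of the given code)
    """
    if codon in codons:
        # the same index k is used in both lists
        return amino_acids[codons.index(codon)]
    return 'X'

print(codon_to_amino_acid('gcc'))  # A
print(codon_to_amino_acid('taa'))  # X, a stop codon is not in the codon list
```

In your own program you would pass, or directly use, the real CODONS and AMINO_ACIDS lists instead of the stand-ins.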

Region to Protein

Once you have a string representing a valid region, i.e., the one returned from the first task, repeatedly convert each triplet to an amino acid, i.e., use the code you wrote for the second task, to create a protein string: a sequence of amino acids.

For example, given the string returned from the example above (with spaces added for clarity)

ccc ggg ttt aaa

this function will return ‘PGFK’ since ‘ccc’ codes for ‘P’ (Proline), ‘ggg’ codes for ‘G’ (Glycine), ‘ttt’ codes for ‘F’ (Phenylalanine), and ‘aaa’ codes for ‘K’ (Lysine).
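The conversion for this example might look like the sketch below. To keep it self-contained, it assumes a four-entry stand-in table covering only the codons in the example, rather than the real CODONS and AMINO_ACIDS lists; both function names are hypothetical.

```python
# Stand-in lists covering only the four codons in the example above;
# the given code's CODONS and AMINO_ACIDS lists would be used instead.
EXAMPLE_CODONS = ['ccc', 'ggg', 'ttt', 'aaa']
EXAMPLE_AMINO_ACIDS = ['P', 'G', 'F', 'K']

def codon_to_amino_acid(codon):
    """ return the letter for codon, or 'X' if it is not in the list """
    if codon in EXAMPLE_CODONS:
        return EXAMPLE_AMINO_ACIDS[EXAMPLE_CODONS.index(codon)]
    return 'X'

def region_to_protein(region):
    """
     return the protein string for a coding region by converting
     each complete triplet, left to right, to its amino acid letter
    """
    protein = ''
    for i in range(0, len(region) - 2, 3):
        protein += codon_to_amino_acid(region[i:i + 3])
    return protein

print(region_to_protein('cccgggtttaaa'))  # PGFK
```

An empty region produces an empty protein, which matches the case where no protein-coding region was found.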

Reverse Complement

You wrote code in lab to solve this problem — you can reuse that code directly here.

In a genome, adenine, 'a', and thymine, 't', are complements and cytosine, 'c', and guanine, 'g', are complements. Thus you should build a new string where each character in the original DNA string is replaced by its complement. Since DNA is a structure in which two halves are bound together, where one half is the original string and the other is its complement, running in the opposite direction, the value returned should be reversed.
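If you no longer have your lab code, one possible sketch is shown below. It assumes a dictionary for the complement pairs, which may differ from the approach used in lab; the names COMPLEMENT and reverse_complement are hypothetical.

```python
# Each base maps to its complement (dictionary approach is an assumption;
# a chain of if statements or string replacement would work as well).
COMPLEMENT = {'a': 't', 't': 'a', 'c': 'g', 'g': 'c'}

def reverse_complement(dna):
    """
     return the reverse complement of dna: walk the string backwards
     and replace each base with its complement
    """
    return ''.join(COMPLEMENT[base] for base in reversed(dna))

print(reverse_complement('atgc'))  # gcat
```

Taking the reverse complement twice gives back the original string, which is a quick way to check your own version.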

Longest Protein

Finally, you need to repeat the first three steps of this process using the DNA's reverse complement string, i.e., the one returned from the fourth task, to find another protein. Your primary function should return the longer of these two protein strings. If both are the same length, return the one found in the original DNA string instead of the one found in its reverse complement.
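The tie-breaking rule is easy to get backwards, so here is a small sketch of just the final comparison. It assumes the two protein strings have already been computed; the function name pick_longer is hypothetical.

```python
def pick_longer(forward_protein, reverse_protein):
    """
     return the longer of the two proteins; on a tie, prefer the one
     found in the original DNA string (forward_protein)
    """
    # strictly greater, so equal lengths fall through to the original strand
    if len(reverse_protein) > len(forward_protein):
        return reverse_protein
    return forward_protein

print(pick_longer('PGFK', 'PG'))  # PGFK
print(pick_longer('PG', 'AW'))    # PG, tie goes to the original strand
```

Because the empty string has length 0, the same comparison also handles the cases where one or both strands contain no protein.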

Testing your Program

Included in the project files is a folder, examples, that contains ten test files starting with the prefix dna_. For each of these files, there is a corresponding file that starts with the prefix protein_ that includes the expected protein. For example, the solution for the input string in dna_handout.txt can be found in the file protein_handout.txt. All of these files are one line long.

Two modules are also provided to help you test your program: