Deciphering the Genetic Code
Software and algorithms are essential for analyzing and understanding genomic data. Genome scientists are often interested in proteins, which typically say more than raw DNA about how different genomes (human, fruit fly, etc.) function.
DNA is composed of four nucleotides, or bases: adenine, cytosine, guanine, and thymine, which are represented by the letters 'a', 'c', 'g', and 't', respectively (see Wikipedia for details). The double-helix structure of DNA was described by James Watson and Francis Crick in 1953; the genetic code, the mapping from nucleotide triplets to amino acids described below, was worked out during the 1960s. The string 'gtattacccggcca', like any string consisting of these four letters, might represent genomic data.
DNA (and RNA) are of interest to medical and scientific research because they provide a genetic blueprint. The blueprint might represent hair color, or cell width, or dancing ability (a gene that has yet to be discovered). Genes are the parts of DNA that code for proteins. Finding the protein that a gene in a strand of DNA codes for is what this assignment is about.
Proteins are constructed from amino acids, and each amino acid is encoded by three nucleotides. Three consecutive nucleotides are called a codon, and each codon represents an amino acid. For example, 'gca' codes for the amino acid Alanine. There are 4x4x4 = 64 different DNA triplets, but only 20 standard amino acids, so most amino acids are coded by more than one codon, which gives DNA some redundancy. For example, Alanine is coded by the codons 'gct', 'gcc', 'gca', and 'gcg'. Each amino acid has a one-letter abbreviation; Alanine is A.
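The codon count is easy to confirm by enumeration. The following is a minimal sketch; the names `BASES`, `codons`, and `alanine_codons` are illustrative and are not part of the code provided with the assignment.

```python
from itertools import product

BASES = "acgt"

# Every ordered triple of bases is a possible codon: 4 * 4 * 4 of them.
codons = ["".join(triple) for triple in product(BASES, repeat=3)]
print(len(codons))  # 64

# The four Alanine codons listed in the text all begin with 'gc';
# only the third letter varies, which is where the redundancy comes from.
alanine_codons = [codon for codon in codons if codon.startswith("gc")]
print(alanine_codons)  # ['gca', 'gcc', 'gcg', 'gct']
```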
A string of amino acids makes up a protein. Within DNA, proteins are delimited by two kinds of control codons: a single start codon, which marks the start of a protein, and three stop codons, any one of which marks the end of the protein. Given such a limited vocabulary to drive the wide variety of biological functions that proteins serve, most proteins are long, containing about 300 amino acids. A visual explanation of this process can be found here, which includes an interactive Flash version.
Review of terminology:
- DNA: string, possibly very long, composed of four nucleotides represented by letters ‘c’, ‘g’, ‘t’, ‘a’.
- Codon: string composed of three consecutive DNA nucleotides, such as 'tgc'
- Amino Acid: string of a single letter that corresponds to one or more codons, such as 'C'
- Protein: string composed of a sequence of amino acids, such as 'PPAMA'
Basic Specification
Write a program that takes a string representing a DNA strand and produces another string representing a protein. That is, science aside, given a string containing only the letters 'a', 'c', 'g', and 't', find a substring, if possible, that can be divided into groups of three letters, each of which is translated into a single letter. The translated string, one-third as long as the found substring, represents the resulting protein.
In order to produce a protein, you will need to find the region (substring) of a DNA string that could code for a protein. This region begins immediately after the first start codon found, the three-letter group 'atg', and ends just before the first later occurrence of one of the three stop codons, 'taa', 'tag', or 'tga'; neither the start codon nor the stop codon is part of the region. To be a valid protein region, the length of this substring must be a multiple of three. If no valid protein region is found, return the empty string, ''.
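One way to read the rule above is sketched below. It is an illustration under stated assumptions, not the assignment's reference solution: the function name `find_protein_region` is hypothetical, and the sketch assumes the stop codon is searched for at any position after the start codon (not only at multiples of three), with the length check applied afterwards.

```python
START = "atg"
STOPS = ("taa", "tag", "tga")

def find_protein_region(dna):
    """Return the letters between the first start codon and the first
    later stop codon, or '' if there is no valid region."""
    start = dna.find(START)
    if start == -1:
        return ""  # no start codon at all
    begin = start + len(START)
    # Positions of each stop codon's first occurrence after the start codon.
    stop_positions = [pos for pos in (dna.find(stop, begin) for stop in STOPS)
                      if pos != -1]
    if not stop_positions:
        return ""  # start codon found, but nothing ends the region
    region = dna[begin:min(stop_positions)]
    # A valid region must split evenly into three-letter codons.
    return region if len(region) % 3 == 0 else ""

print(find_protein_region("aaatggtttatggtctctagcctga"))  # gtttatggtctc
```

Calling `dna.find(stop, begin)` once per stop codon and taking the minimum keeps the "whichever stop comes first" rule explicit.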
Since proteins are often long, the longest one is usually the most important one to look for. Thus your function should search for a protein region in both the given DNA string and its reverse complement (the string in reverse order with each letter replaced by its complement: 'a' pairs with 't', and 'c' pairs with 'g').
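The reverse complement can be computed in one line with Python's built-in string translation. This is a minimal sketch that assumes the standard base pairing ('a' with 't', 'c' with 'g'); the names `COMPLEMENT` and `reverse_complement` are illustrative.

```python
# Translation table mapping each base to its complement.
COMPLEMENT = str.maketrans("acgt", "tgca")

def reverse_complement(dna):
    """Complement every base, then reverse the whole string."""
    return dna.translate(COMPLEMENT)[::-1]

print(reverse_complement("aaatgg"))  # ccattt
```

Applying the function twice returns the original string, which makes a convenient sanity check.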
Once the longest valid substring is found, it can be translated by replacing each group of three letters, in order, with the one-letter abbreviation of the amino acid it codes for, as given in this table.
For example, the DNA strand
aaatggtttatggtctctagcctga
includes both a start codon, 'atg', and, after it, a first stop codon, 'tag'. The substring between them is read as the following sequence of codons (spaces added for clarity):
gtt tat ggt ctc
which is valid because its length is 12 (counting only letters), a multiple of three. Thus, it will be translated into the following protein:
VYGL
because 'gtt' codes for Valine (V), 'tat' for Tyrosine (Y), 'ggt' for Glycine (G), and 'ctc' for Leucine (L).
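The translation step of this example can be checked with a short sketch. The table below is the standard genetic code written compactly, which is an assumption: the table linked from this assignment is the authority and should be used if it differs. The names `CODON_TABLE` and `translate` are illustrative and not part of the provided code.

```python
from itertools import product

# Standard genetic code, listed in the conventional 't', 'c', 'a', 'g' order:
# the i-th triple produced by product() pairs with the i-th letter of AMINO.
# '*' marks the three stop codons, which never appear inside a valid region.
BASES = "tcag"
AMINO = ("FFLLSSSSYY**CC*W"
         "LLLLPPPPHHQQRRRR"
         "IIIMTTTTNNKKSSRR"
         "VVVVAAAADDEEGGGG")
CODON_TABLE = {"".join(triple): letter
               for triple, letter in zip(product(BASES, repeat=3), AMINO)}

def translate(region):
    """Translate a region whose length is a multiple of three."""
    return "".join(CODON_TABLE[region[i:i + 3]]
                   for i in range(0, len(region), 3))

print(translate("gtttatggtctc"))  # VYGL
```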
For more details, see this assignment's HOWTO. To get started, download this code using Ambient's snarf tool.
Bonus Specification
Often in a research setting, many DNA strings are produced for experimentation. Additionally, DNA strings are often extremely long and code for many different proteins. The program's basic specification only finds one protein region within one DNA string.
For bonus credit, write two different versions of your functions in a separate module called BioinformaticsBonus:
- given a string representing a single DNA strand, returns the longest protein contained anywhere in the original string or its reverse complement, not just the first one found
- given a string, containing multiple DNA strings each separated by a newline, '\n', returns a list of the longest protein found in each one. The empty string, '', should not be included in the protein list, i.e., if no protein is found in a DNA string, that result should not be included in the returned list.
Note that the original module Bioinformatics should still satisfy the basic specification.
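The two bonus functions could be sketched as follows. This is one possible reading rather than a reference solution: it assumes "anywhere" means that every occurrence of 'atg' is tried as a start codon, it assumes the standard genetic code (the assignment's linked table is the authority), and all names (`regions`, `longest_protein`, `longest_proteins`) are hypothetical.

```python
import re
from itertools import product

COMPLEMENT = str.maketrans("acgt", "tgca")
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(t): a for t, a in zip(product("tcag", repeat=3), AMINO)}

def translate(region):
    """Translate a region whose length is a multiple of three."""
    return "".join(CODON_TABLE[region[i:i + 3]]
                   for i in range(0, len(region), 3))

def regions(dna):
    """Yield the valid region following each start codon in dna."""
    # The lookahead finds every 'atg', including overlapping candidates.
    for match in re.finditer("(?=atg)", dna):
        begin = match.start() + 3
        stop = re.search("taa|tag|tga", dna[begin:])  # first later stop codon
        if stop and stop.start() % 3 == 0:
            yield dna[begin:begin + stop.start()]

def longest_protein(dna):
    """Longest protein in dna or its reverse complement, '' if none."""
    strands = (dna, dna.translate(COMPLEMENT)[::-1])
    best = max((r for strand in strands for r in regions(strand)),
               key=len, default="")
    return translate(best)

def longest_proteins(text):
    """One longest protein per newline-separated DNA string, skipping ''."""
    proteins = (longest_protein(line.strip()) for line in text.split("\n"))
    return [protein for protein in proteins if protein]

print(longest_protein("atgtaacccatgaaaccctag"))                  # KP
print(longest_proteins("aaatggtttatggtctctagcctga\ncccc\n"))      # ['VYGL']
```

In the first call the region after the first 'atg' is empty, so the longer region after the second 'atg' wins, which is the case the basic specification would miss.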
Submission
Submit your entire PyDev project electronically from within Eclipse or on the web to the assignment name 02_bioinformatics.