Introduction to Computer Science
CompSci 101 : Spring 2014

Bioinformatics: Deciphering the Genetic Code

Software and algorithms are extremely important in analyzing and understanding genomic data. Genome scientists are often interested in proteins, which are typically more important than raw DNA in understanding the function of different genomes (human, fruit-fly, etc.).

DNA is comprised of four nucleotides, or bases, adenine, cytosine, guanine, and thymine, which represented by the letters: ‘a’, ‘c’,’g’,’t’respectively (see Wikipedia for details). This encoding of DNA, called the genetic code, was discovered by James Watson and Francis Crick in 1953. The string ‘gtattacccggcca’ and any string consisting of these four letters might represent genomic data.

DNA (and RNA) are interesting for medical and scientific research and understanding because they provide a genetic blueprint. The blueprint might represent hair-color, or cell-width, or dancing ability (so far an as yet undiscovered gene). Genes are parts of DNA that code for proteins, finding the proteins that represent a gene in a strand of DNA is what this assignment is about.

Proteins are constructed from amino acids and each amino acid is constructed from three nucleotides. Three consecutive nucleotides are called a codon, each codon represents an amino acid. For example ‘gca’ codes for the amino acid Alanine. Each codon (triple of nucleotides) codes for an amino acid. There are 4x4x4 = 64 different DNA triplets, but only about 24 amino acids — so some of the amino acids are coded by more than one codon, providing DNA with redundancy. For example, Alanine is coded by the codons 'gct', 'gcc', 'gca', and 'gcg'. Each amino acid has a one-letter abbreviation, Alanine is A.

A string of amino acids makes up a protein. Within DNA, proteins are delimited by two control codons: a start codon that marks the start of a protein; and three stop codons, which mark the end of the protein whenever one of the three is found. Given such a limited vocabulary to drive the wide variety of biological functions that proteins serve, most proteins are long, containing about 300 amino acids. A visual explanation of this process can be found here, which includes an interactive flash version.

Review of terminology:

To get started, download this code using Ambient's snarf tool.

Basic Specification

Write a program that takes a string, representing a DNA strand, and produces another string, representing a protein. That is, science aside, given a string containing only the letters, 'a', 'c', 'g', and 't', find a substring, if possible, that can be divided into groups of three letters, each of which is translated into a single letter. This string, one-third as long as the found substring, represents the resulting protein.

In order to produce a protein, you will need to find the region, substring, of a string of DNA that could code for a protein. This substring is marked by the first found start codon, the three letter group 'atg' and ends with the first later occurrence of one of the three stop codons, 'taa', 'tag', or 'tga'. In order to be a valid protein region, the length of this substring must be a multiple of three. If no valid protein region is found, return the empty string, “”.

Since proteins are often long, the longest one is usually the more important one to look for. Thus your function should search for a protein region in both the given DNA string as well as its reverse complement (the string in reverse order with its letters replaced by their complement letters) and return the longer one.

To simplify the problem, your program only needs to look for a region that starts from the first occurrence of the start codon in the given strand of DNA (either in the forward or reverse direction), even if it does not lead to valid region. Other occurrences of the start codon can be ignored, even if they lead to a valid region.

Once a protein region, substring, is found it can be translated by sequentially substituting each group of three letters for its appropriate one-letter protein abbreviation, as given in this table.

For example, the DNA strand

aaatggtttatggtctctagcctga

includes both a start codon, 'atg' and a stop codon, 'tag'. The substring between them is read as the following sequence of codons (spaces added for clarity)

gtt tat ggt ctc

which is valid because its length is 12 (counting only letters), a multiple of three. Thus, it will be translated into the following protein:

VYGL

because 'gtt' codes for Valine (V), 'tat', for Tyrosine (Y), and 'ggt', for Glycine (G), and 'ctc', for Leucine (L).

Your program must be written in a module, named Bioinformatics, that includes a function, named translateDNAtoProtein, that takes a string and returns a string. For more details, see this assignment's HOWTO.

Bonus Specification

DNA strings are often extremely long and code for many different proteins. The program's basic specification finds only one protein region, starting from the first occurrence of the start codon, within a DNA string.

For bonus credit, write a version of the function described above that, when given a string representing a single DNA strand, returns the longest protein contained anywhere in the original string or its reverse complement.

This version of the function should be in a separate module, BioinformaticsBonus, and the original module Bioinformatics should still complete the basic specifications. You will not get extra credit unless the basic specification also works (i.e., you should complete it first before moving on to the bonus specification).

Submission

Submit your entire PyDev project and a plain text README file electronically from within Eclipse or on the web to the assignment name assign03_bioinformatics.

Please double check that you submitted the correct files. Within Eclipse, this can be done using the Ambient menu item Submit History... and, on the web, the files submitted are printed at the bottom of the page after a successful submission.