CompSci 18S – Disease, DNA, and Proteins – FALL 2009

Classwork 12: 10 pts

Disease Hunting

You are a computational disease detective. At your disposal are sequencers, alignment tools (BLAST), and a knowledge of sickle-cell anemia. You want to figure out whether your patient has the mutation for sickle cell.

PART ONE: Hunting Down the Gene--starting at 3 billion base pairs

From what you have learned in class, a change in a single nucleotide can change everything. The codon that usually codes for glutamate instead codes for valine. The result is a drastic change from a normal-looking red blood cell to a sickle-shaped red blood cell. Let's try to find the gene.

You, the investigator, currently have only two pieces of data. You know that (a) sickle cell is a point mutation that affects red blood cells, and that (b) the following sequence:

tgggggatat tatgaagggc cttgagcatt tggattctgc

is located within the gene that sickle cell affects. You have no idea where the sickle-cell mutation occurs (e.g. in which protein), except that it is found in the red blood cell.

To figure out which gene you need to target, BLAST the sequence by going to:

http://www.ncbi.nlm.nih.gov/BLAST

Click on “nucleotide blast”, then click on the blastn tab, and under Database choose Reference mRNA sequences. Then click BLAST! It may take a while, since many requests are being made at any given time. Look at the top hit that is human (Homo sapiens).

What does this gene code for? _______________________________________________________________________

Notice that this sequence is mRNA, which is ultimately translated into protein. Notice also that the BLAST result supplies you with the amino acid sequence. Write down the first 15 amino acids from your search.

First 15 amino acids: ____________________________________

PART TWO: Hunting Down the Mutation--searching strings by hand

In sickle cell, a glutamate (E) is changed to a valine (V). Notice that the first string below is a substring of the 15 amino acids you wrote above; the second string is the same stretch with the mutation.

TPEEK
TPVEK

Your job is to hunt down whether or not a valine is encoded within your patient's DNA. Say your patient has the following sequence:

CAC CTG ACT CCT GTG GAG AAG TCT GCC

Your goal is simply to see whether a valine is encoded within this sequence. Using the table below, write down the amino acid for each codon in the given DNA sequence.

The Genetic Code (DNA)

TTT	Phe	TCT	Ser	TAT	Tyr	TGT	Cys
TTC	Phe	TCC	Ser	TAC	Tyr	TGC	Cys
TTA	Leu	TCA	Ser	TAA	STOP	TGA	STOP
TTG	Leu	TCG	Ser	TAG	STOP	TGG	Trp
CTT	Leu	CCT	Pro	CAT	His	CGT	Arg
CTC	Leu	CCC	Pro	CAC	His	CGC	Arg
CTA	Leu	CCA	Pro	CAA	Gln	CGA	Arg
CTG	Leu	CCG	Pro	CAG	Gln	CGG	Arg
ATT	Ile	ACT	Thr	AAT	Asn	AGT	Ser
ATC	Ile	ACC	Thr	AAC	Asn	AGC	Ser
ATA	Ile	ACA	Thr	AAA	Lys	AGA	Arg
ATG	Met*	ACG	Thr	AAG	Lys	AGG	Arg
GTT	Val	GCT	Ala	GAT	Asp	GGT	Gly
GTC	Val	GCC	Ala	GAC	Asp	GGC	Gly
GTA	Val	GCA	Ala	GAA	Glu	GGA	Gly
GTG	Val	GCG	Ala	GAG	Glu	GGG	Gly

*When within gene; at beginning of gene, ATG signals start of translation.

amino acid sequence for codons above: ______________________________________

Does your patient's sequence contain a valine (and thus the mutation for sickle cell)? YES / NO

What is the index of this codon within your DNA sequence? __________
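
The by-hand translation in Part Two can also be sketched in code. The following is a minimal sketch, not part of the assignment: the class name CodonTranslator and the method translate are hypothetical, the output uses the one-letter amino-acid symbols from Part Three rather than the three-letter names in the table, and the input is assumed to be a non-empty string of upper-case codons separated by spaces. The example input is chosen so that it translates to the normal string TPEEK shown above.

```java
public class CodonTranslator {
    // Bases in the order the genetic code table uses them (T, C, A, G).
    private static final String BASES = "TCAG";

    // One-letter amino-acid symbols for all 64 codons in table order
    // (TTT, TTC, TTA, TTG, TCT, ...). '*' marks a STOP codon.
    private static final String AMINO_ACIDS =
        "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG";

    // Translates space-separated DNA codons into a string of amino-acid symbols.
    public static String translate(String dna) {
        StringBuilder protein = new StringBuilder();
        for (String codon : dna.trim().split("\\s+")) {
            // Position of the codon in table order: first base picks a block
            // of 16, second base a group of 4, third base the entry itself.
            int index = 16 * BASES.indexOf(codon.charAt(0))
                      + 4 * BASES.indexOf(codon.charAt(1))
                      + BASES.indexOf(codon.charAt(2));
            protein.append(AMINO_ACIDS.charAt(index));
        }
        return protein.toString();
    }

    public static void main(String[] args) {
        String protein = translate("ACT CCT GAG GAG AAG");
        System.out.println(protein);              // prints TPEEK
        System.out.println(protein.indexOf("V")); // prints -1: no valine here
    }
}
```

Running the same method on your patient's sequence is one way to check the answer you wrote down by hand.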

PART THREE: Writing Code to Search for Amino Acids

Searching a sequence for one type of amino acid can be tedious by hand. Let's write a program to do the same thing.

Given a strand of DNA and a query amino-acid symbol, return the index of the first location of a codon in the strand that codes for the amino-acid. Return -1 if there is no codon that codes for the amino acid. Take a look at the assignment here.

1. You are going to store the codons and their corresponding amino acids in two String arrays.

List of codons:
	String[] codons = {"TTT", "TTC", "TTA", "TTG", "TCT", "TCC", "TCA", "TCG",
	"TAT", "TAC", "TGT", "TGC", "TGG", "CTT", "CTC", "CTA",
	"CTG", "CCT", "CCC", "CCA", "CCG", "CAT", "CAC", "CAA",
	"CAG", "CGT", "CGC", "CGA", "CGG", "ATT", "ATC", "ATA",
	"ATG", "ACT", "ACC", "ACA", "ACG", "AAT", "AAC", "AAA",
	"AAG", "AGT", "AGC", "AGA", "AGG", "GTT", "GTC", "GTA",
	"GTG", "GCT", "GCC", "GCA", "GCG", "GAT", "GAC", "GAA",
	"GAG", "GGT", "GGC", "GGA", "GGG"};
    
Corresponding amino-acids for each codon above:

	String[] aas = {"F", "F", "L", "L", "S", "S", "S", "S",
	"Y", "Y", "C", "C", "W", "L", "L", "L",
	"L", "P", "P", "P", "P", "H", "H", "Q",
	"Q", "R", "R", "R", "R", "I", "I", "I",
	"M", "T", "T", "T", "T", "N", "N", "K",
	"K", "S", "S", "R", "R", "V", "V", "V",
	"V", "A", "A", "A", "A", "D", "D", "E",
	"E", "G", "G", "G", "G"};

Thus, the two arrays will look like this:

index	0	1	2	3	4	. . .
codons	"TTT"	"TTC"	"TTA"	"TTG"	"TCT"	. . .
a.a.s	"F"	"F"	"L"	"L"	"S"	. . .
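
To see how the two arrays work together, here is a minimal sketch that uses only the first five entries shown in the table above. The class name ParallelArrayDemo and the helper aminoAcidFor are hypothetical names for illustration. The point is that the same index selects a codon in one array and its amino acid in the other.

```java
public class ParallelArrayDemo {
    // Only the first five entries from the table above; the full arrays are longer.
    static final String[] CODONS = {"TTT", "TTC", "TTA", "TTG", "TCT"};
    static final String[] AAS    = {"F",   "F",   "L",   "L",   "S"};

    // Returns the amino acid stored at the same index as the given codon,
    // or null if the codon is not in this shortened array.
    static String aminoAcidFor(String codon) {
        for (int i = 0; i < CODONS.length; i++) {
            if (CODONS[i].equals(codon)) {
                return AAS[i];
            }
        }
        return null;
    }

    public static void main(String[] args) {
        System.out.println(aminoAcidFor("TTA")); // prints L
        System.out.println(aminoAcidFor("TCT")); // prints S
    }
}
```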

2. The general body of code will be as follows:

Class: ProteinLocater
Method: find
Parameters: String, String
Returns: int

Method signature:

   public int find(String strand, String aa)

(be sure your method is public)

public class ProteinLocater {
	public int find(String strand, String aa) {
		// fill in code here
		return -1; // placeholder so the skeleton compiles; replace with your result
	}
}

Your job is to fill in find. Make sure you keep the following constraints in mind:

Constraints

String strand will consist of at most 50 characters, each of which is "A", "G", "T", or "C".
String aa will be a single-character string representing an amino-acid symbol.

3. What you need to do:

For each entry in the amino-acid array (indicated by a.a.s above) that matches the parameter (String) aa,
- get the codon at the same index in your List of codons
- search your parameter (String) strand for that codon
- if the codon is found (the search result is not -1), keep its index if it is smaller than any index found so far
Do this for all codons that correspond with the amino acid. Return the minimum index if you found a codon for the amino acid, -1 otherwise.

4. Tips and tricks:

To compare two strings, use the boolean method string1.equals(string2). It returns true if the two strings are the same, and false otherwise.

There is an int method called indexOf. Say we have a string toSearch = "ABCDEFGHIJK" and we want to find the index where the string smallString = "BCD" starts. Calling toSearch.indexOf(smallString) returns 1, since BCD starts at index 1 of toSearch:

0	1	2	3	4	5	6	7	8	9	10
A	B	C	D	E	F	G	H	I	J	K

NOTE: CALLING INDEXOF TO FIND A STRING THAT IS NOT FOUND WILL RETURN -1.
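
The two methods above can be tried out in a small program. This is a minimal sketch; the class name IndexOfDemo and the strings "XYZ" and "bcd" are made up for illustration.

```java
public class IndexOfDemo {
    public static void main(String[] args) {
        String toSearch = "ABCDEFGHIJK";
        String smallString = "BCD";

        // indexOf returns where the match starts, or -1 if there is no match.
        System.out.println(toSearch.indexOf(smallString)); // prints 1
        System.out.println(toSearch.indexOf("XYZ"));       // prints -1

        // equals compares the contents of two strings and is case-sensitive.
        System.out.println(smallString.equals("BCD"));     // prints true
        System.out.println(smallString.equals("bcd"));     // prints false
    }
}
```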

The javadoc for these functions can be found here.
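
Putting the steps from "What you need to do" together with equals and indexOf, one possible shape for find is sketched below. Treat it as an illustration under stated assumptions rather than the reference solution: the constant names CODONS and AAS are my own, and, following the steps as written, a codon is accepted wherever indexOf finds it in strand, even if that position is not a multiple of three. Check the APT statement for whether that matches what is expected.

```java
public class ProteinLocater {
    // Codons and amino acids in the same order as the handout's two arrays.
    private static final String[] CODONS = {
        "TTT", "TTC", "TTA", "TTG", "TCT", "TCC", "TCA", "TCG",
        "TAT", "TAC", "TGT", "TGC", "TGG", "CTT", "CTC", "CTA",
        "CTG", "CCT", "CCC", "CCA", "CCG", "CAT", "CAC", "CAA",
        "CAG", "CGT", "CGC", "CGA", "CGG", "ATT", "ATC", "ATA",
        "ATG", "ACT", "ACC", "ACA", "ACG", "AAT", "AAC", "AAA",
        "AAG", "AGT", "AGC", "AGA", "AGG", "GTT", "GTC", "GTA",
        "GTG", "GCT", "GCC", "GCA", "GCG", "GAT", "GAC", "GAA",
        "GAG", "GGT", "GGC", "GGA", "GGG"};

    private static final String[] AAS = {
        "F", "F", "L", "L", "S", "S", "S", "S",
        "Y", "Y", "C", "C", "W", "L", "L", "L",
        "L", "P", "P", "P", "P", "H", "H", "Q",
        "Q", "R", "R", "R", "R", "I", "I", "I",
        "M", "T", "T", "T", "T", "N", "N", "K",
        "K", "S", "S", "R", "R", "V", "V", "V",
        "V", "A", "A", "A", "A", "D", "D", "E",
        "E", "G", "G", "G", "G"};

    public int find(String strand, String aa) {
        int best = -1; // -1 means no codon for aa has been found yet
        for (int i = 0; i < AAS.length; i++) {
            if (!AAS[i].equals(aa)) {
                continue; // this codon codes for a different amino acid
            }
            int index = strand.indexOf(CODONS[i]);
            // Ignore misses (-1); otherwise keep the smallest index seen so far.
            if (index != -1 && (best == -1 || index < best)) {
                best = index;
            }
        }
        return best;
    }
}
```

For example, new ProteinLocater().find("ATGAAA", "K") returns 3, because the lysine codon AAA starts at index 3, and find("ATGAAA", "W") returns -1, because TGG does not occur in the strand.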

TEST APT

Tiffany Chen
APT taken from here

information about sickle cell:

http://www.carnegieinstitution.org/first_light_case/horn/lessons/sickle.html

http://users.rcn.com/jkimball.ma.ultranet/BiologyPages/C/Codons.html
Last Revised: 26 Nov 2007