Except where otherwise noted, this and all course materials for CS 112 are licensed under Attribution-NonCommercial-ShareAlike CC BY-NC-SA held by the Trustees of the University of Illinois (University of Illinois at Chicago).
Learning objectives:
- Working with strings.
- Slicing strings.
- Basic functions.
- Working with GenBank.
- Understanding the connection between DNA, mRNA, and proteins.
Sequences in GenBank
On your computer, use a web browser to access GenBank: http://www.ncbi.nlm.nih.gov/genbank. Once there, find a nucleotide sequence for the human coagulation factor IX (F9) gene; factor IX is sometimes called the "Christmas factor." In other words, find a DNA sequence for the gene that encodes the coagulation factor IX protein. Use the search area at the top of the GenBank web page. You are looking for a specific "accession" (a sequence submission record) with the accession ID NG_007994.
To summarize, we need to find the nucleotide sequence for the human coagulation factor IX (F9) gene. To do this we:
- Use a web browser to access GenBank: http://www.ncbi.nlm.nih.gov/genbank
- Use the search area at the top of the GenBank web page
- Make sure we are searching for a Nucleotide (select Nucleotide using the drop down menu).
- Enter the accession ID NG_007994 in the search field.
- Click search.
- Verify the page we go to specifies NCBI Reference Sequence: NG_007994.1 just under the main title.
Structure of Eukaryotic Genes
Eukaryotic genes (like F9) are composed of messenger RNA (mRNA)-coding sequences called exons (expressed portions of DNA sequence) and intervening sequences called introns (the name emphasizes their intervening role). The whole gene, introns included, is first transcribed into pre-mRNA. Intron sequences in pre-mRNA are non-coding and are removed before the mRNA is translated into protein. The exons are then joined together (concatenated) to form mature mRNA. The process of removing introns and reconnecting exons is called 'splicing.' Mature mRNA is composed of a coding sequence (CDS) and untranslated regions (UTRs) at the 5' and 3' ends. The coding sequence is the portion of mRNA that codes for amino acids; it is made up of codons, each three nucleotide bases long.
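To make the exon, intron, and codon vocabulary concrete, here is a minimal Python sketch on a made-up sequence (not F9). The toy sequence, the exon and intron boundaries, and the helper name split_into_codons are all assumptions for illustration. The sketch also assumes the DNA string is the coding strand, so the mRNA is the same sequence with T replaced by U.

```python
# Toy gene (made up for illustration, not F9): two exons around one intron.
exon1 = "ATGGCC"
intron = "GTAAGTTTTCAG"
exon2 = "AAATGA"
gene = exon1 + intron + exon2

# Splicing: drop the intron and concatenate the exons.
spliced = gene[:len(exon1)] + gene[len(exon1) + len(intron):]

# Assumption: `gene` is the coding strand, so the mRNA reads the same
# with U in place of T.
mrna = spliced.replace("T", "U")


def split_into_codons(seq):
    """Return the complete 3-base codons of seq, ignoring leftover bases."""
    usable = len(seq) - len(seq) % 3
    return [seq[i:i + 3] for i in range(0, usable, 3)]


codons = split_into_codons(mrna)
print(spliced)
print(mrna)
print(codons)
```

The intron contributes nothing to the final list of codons: only the concatenated exons are read three bases at a time.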
The amino acid coding portions (CDS), along with other gene features, are annotated on the left side of the description in GenBank records. For example, you will see something similar to this in the annotations for the F9 gene:
CDS join(5030..5117,11275..11438)
The actual line on the GenBank page will be much longer (i.e. containing more than just the ranges for two exons) but the first two ranges match exactly what is given above.
The word join in a GenBank record is analogous to a function in Python. It is an instruction to slice out and join (concatenate) the segments separated by commas within parentheses. The resulting string represents the amino acid coding sequence (CDS). Assuming we have the entire F9 gene sequence stored in a variable F9, the example above could be written in Python as:
cds = F9[5029:5117] + F9[11274:11438]
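Caution: Python indexes start at 0, but GenBank positions start at 1. In addition, a GenBank range such as 5030..5117 includes both endpoints, while a Python slice excludes its end index. As a result, only the start coordinate is reduced by one: 5030..5117 becomes [5029:5117]. Notice how the coordinates differ between the GenBank record example and the Python code above. Failure to adjust indexes correctly is a common mistake in computer science, and the resulting bugs are known as off-by-one errors. While seemingly trivial, these errors may have serious consequences.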
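The coordinate adjustment can be checked on a short toy string where every position is easy to count by hand. The sketch below is only an illustration: genbank_join is a hypothetical helper name (it is not part of GenBank or of the lab template), and the toy sequence is made up.

```python
def genbank_join(sequence, ranges):
    """Concatenate 1-based, inclusive (start, end) ranges taken from sequence.

    Hypothetical helper for illustration only.
    """
    pieces = []
    for start, end in ranges:
        # 1-based inclusive start..end  ->  0-based half-open [start - 1:end]
        pieces.append(sequence[start - 1:end])
    return "".join(pieces)


toy = "AACCGGTTAC"  # GenBank-style positions 1..10

# join(3..4,7..9) in GenBank notation
correct = genbank_join(toy, [(3, 4), (7, 9)])
same_with_slices = toy[2:4] + toy[6:9]

# Forgetting the adjustment silently drops the first base of each segment.
off_by_one = toy[3:4] + toy[7:9]

print(correct)
print(same_with_slices)
print(off_by_one)
```

A quick sanity check is the length: a range start..end contains end - start + 1 bases, so each slice should have exactly that many characters.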
Assignment Description
- Write a function named extract_f9_cds which has one parameter, F9, the F9 gene sequence. The goal of this function is to extract the coding regions from the F9 gene sequence (provided in the template), concatenate them, and return the resulting string. Hint: You can confirm your program is functioning correctly by clicking on the CDS annotation in GenBank. This highlights the relevant parts of the sequence, which should match your output.
- Write a function named get_max_possible_codons which has one parameter, seq, and returns the maximum number of codons this DNA sequence would contain if it were composed entirely of coding regions. Remember that each codon is made up of 3 nucleotide bases.
- Write a function named get_gc_percent which has one parameter, seq. The goal of this function is to compute the proportion of G and C bases (characters) in seq to the total number of bases (characters) in seq. The returned value should be of type float in the range between 0.0 and 100.0 (as a percentage, not a fraction). To do this, use the string method count() to determine the number of 'G' bases and the number of 'C' bases.
- Write a function named get_coding_ratio which has two parameters, seq and cds. The goal of this function is to calculate the proportion of coding nucleotides to total nucleotides in the entire sequence. In other words: of the total number of nucleotides in the gene (seq), what is the proportion that codes for amino acids (cds)? Remember that a ratio will be a value of type float in the range between 0.0 and 1.0.
- Write a function named print_seq_info which has two parameters, seq and cds. This function should use the functions you wrote for problems 1 through 4 and print a correctly formatted summary:
Sequence length: ...
Coding sequence length: ...
Number of possible codons: ...
Number of actual codons: ...
First 4 codons of the coding sequence: ...
Ratio of Coding NT to Total NT: ...
GC percent of the entire sequence: ...
GC percent of the coding sequence: ...
- The Sequence length: output should use the built-in len() function with the 'seq' parameter.
- The Coding sequence length: output should use the built-in len() function with the 'cds' parameter.
- The Number of possible codons: output should use your get_max_possible_codons() function with the 'seq' parameter.
- The Number of actual codons: output should use your get_max_possible_codons() function with the 'cds' parameter.
- The First 4 codons of the coding sequence: output should use slicing with the 'cds' parameter.
- The Ratio of Coding NT to Total NT: output should use the get_coding_ratio() function with both the 'seq' and 'cds' parameters.
- The GC percent of the entire sequence: output should use the get_gc_percent() function with the 'seq' parameter.
- The GC percent of the coding sequence: output should use the get_gc_percent() function with the 'cds' parameter.
- Write a few sentences explaining what this gene is and what its protein does, state the name of a disease caused by a variant (mutation) at the F9 gene, and describe one such disease-causing variant. Hint: look in the right panel on GenBank, or use the web. (You can write your answer in the same file as your Python code by commenting out the text. The starter code for Lab 3 already has a place for this near the top of the file.)
- Make sure you are writing your code using Good Programming Style. Aspects of Good Programming Style include (but are not limited to):
- File Header Comment/docstring at the beginning of the file to describe the purpose of the program
- File Header Comment/docstring at the beginning of the file to give information about the programmer/author of the program
- Function Comments/docstrings to describe the purpose of EACH function
- Using meaningful variable names
- In-line comments/docstrings where needed
- Blank lines to separate sections of your code
- Proper use of indentation and consistent depth of indentation
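The string operations that the function descriptions above rely on (len(), count(), integer division, and slicing) can be tried out on a toy sequence before applying them to F9. The sketch below uses made-up data and hypothetical variable names. It illustrates the individual techniques under those assumptions and is not the expected structure of the assignment.

```python
seq = "ATGCGCTTAGGC"  # made-up 12-base sequence, not F9
cds = seq[0:6]         # assumption: pretend the first 6 bases are coding

# Integer division (//) discards the remainder, so leftover bases
# that do not fill a whole codon are not counted.
whole_codons = len(seq) // 3

# count() returns how many times a character occurs in the string.
g_and_c = seq.count("G") + seq.count("C")
gc_percent = 100.0 * g_and_c / len(seq)

# Dividing two lengths with / always produces a float in Python 3.
coding_ratio = len(cds) / len(seq)

# Slicing out the first 2 codons means taking 2 * 3 = 6 bases.
first_two_codons = cds[0:6]

print("Whole codons:", whole_codons)
print("GC percent:", gc_percent)
print("Coding ratio:", coding_ratio)
print("First 2 codons:", first_two_codons)
```

Note that the percentage is the fraction multiplied by 100, while the coding ratio is left as a plain fraction between 0.0 and 1.0.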