
(Solved): Help would be greatly appreciated Building Python Programs Lab 6: DNA Deliverables: Turn ...



Help would be greatly appreciated


Building Python Programs
Lab 6: DNA

Deliverables:
- Turn in a well-documented file named dna.py. (18 points)
- Pseudocode for any two functions, such as is_protein and mass_calculation. (2 points)

Description:
This assignment focuses on lists and file/text processing. You will also need the two input files dna.txt and ecoli.txt from the course web site. Save these files in the same folder as your program.

The assignment involves processing data from genome files. Your program should work with the two given input files. If you are curious (this is not required), the National Center for Biotechnology Information publishes many other bacteria genome files. The last page tells you how to use your program to process other published genome files.

Background Information About DNA:
Note: This section explains some information from the field of biology that is related to this assignment. It is for your information only; you do not need to fully understand it to complete the assignment.

Deoxyribonucleic acid (DNA) is a complex biochemical macromolecule that carries genetic information for cellular life forms and some viruses. DNA is also the mechanism through which genetic information from parents is passed on during reproduction.

DNA consists of long chains of chemical compounds called nucleotides. Four nucleotides are present in DNA: Adenine (A), Cytosine (C), Guanine (G), and Thymine (T). DNA has a double-helix structure (see diagram below) containing complementary chains of these four nucleotides connected by hydrogen bonds.

Certain regions of the DNA are called genes. Most genes encode instructions for building proteins (they're called "protein-coding" genes). These proteins are responsible for carrying out most of the life processes of the organism. Nucleotides in a gene are organized into codons. Codons are groups of three nucleotides and are written as the first letters of their nucleotides (e.g., TAC or GGA).
Each codon uniquely encodes a single amino acid, a building block of proteins. The process of building proteins from DNA has two major phases called transcription and translation, in which a gene is replicated into an intermediate form called mRNA, which is then processed by a structure called a ribosome to build the chain of amino acids encoded by the codons of the gene.

The sequences of DNA that encode proteins occur between a start codon (which we will assume to be ATG) and a stop codon (which is any of TAA, TAG, or TGA). Not all regions of DNA are genes; large portions that do not lie between a valid start and stop codon are called intergenic DNA and have other (possibly unknown) functions.

Computational biologists examine large DNA data files to find patterns and important information, such as which regions are genes. Sometimes they are interested in the percentages of mass accounted for by each of the four nucleotide types. Often high percentages of Cytosine (C) and Guanine (G) are indicators of important genetic data.

For more information, visit the Wikipedia page about DNA.

In this assignment you read an input file containing named sequences of nucleotides and produce information about them. For each nucleotide sequence, your program counts the occurrences of each of the four nucleotides (A, C, G, and T). The program also computes the mass percentage occupied by each nucleotide type, rounded to one digit past the decimal point. Next the program reports the codons (trios of nucleotides) present in each sequence and predicts whether or not the sequence is a protein-coding gene.
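The codon step described above can be sketched as follows. This is only a minimal illustration, not the required solution: the function name split_codons and the constant name NUCLEOTIDES_PER_CODON are hypothetical, and the sketch assumes that dashes in a sequence are junk characters that belong to no codon and that letter case should be ignored.

```python
NUCLEOTIDES_PER_CODON = 3  # hypothetical constant name: nucleotides in one codon


def split_codons(sequence):
    """Return the codons of a sequence as a list of uppercase 3-letter strings."""
    # Assumption: dashes mark junk regions and are dropped before grouping.
    cleaned = sequence.upper().replace("-", "")
    codons = []
    for start in range(0, len(cleaned), NUCLEOTIDES_PER_CODON):
        codons.append(cleaned[start:start + NUCLEOTIDES_PER_CODON])
    return codons


print(split_codons("ATGG-AC"))  # ['ATG', 'GAC']
```

Returning a list keeps the codons available for later checks, such as comparing the first element against the start codon.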
For us, a protein-coding gene is a string that matches all of the following constraints*:
- begins with a valid start codon (ATG)
- ends with a valid stop codon (one of the following: TAA, TAG, or TGA)
- contains at least 5 total codons (including its initial start codon and final stop codon)
- Cytosine (C) and Guanine (G) combined account for at least 30% of its total mass

(* These are approximations for our assignment, not exact constraints used in computational biology to identify proteins.)

The DNA input data consists of line pairs. The first line has the name of the nucleotide sequence, and the second is the nucleotide sequence itself. Each character in a sequence of nucleotides will be A, C, G, T, or a dash character, "-". The nucleotides in the input can be either upper or lowercase.

Input file dna.txt (partial): [sample not reproduced; the only surviving output line is "Is protein?: No"]

Implementation Guidelines, Hints, and Development Strategy:
The main purpose of this assignment is to demonstrate your understanding of lists and list traversals with for loops. Therefore, you should use lists to store the various data for each sequence. In particular, your nucleotide counts, mass percentages, and codons should all be stored using lists. Additionally, you should use lists and for loops to transform the data from one form to another as follows:
- from the original nucleotide sequence string to nucleotide counts;
- from nucleotide counts to mass percentages; and
- from the original nucleotide sequence string to codon triplets.

These transformations are summarized by the following diagram using the "cure for cancer" protein data:

Recall that you can print any list using str(). For example, print("my data is " + str(numbers)) prints "my data is " followed by the contents of the list numbers.

To compute mass percentages, use the following as the mass of each nucleotide (grams/mol). The dashes representing "junk" regions are excluded from many parts of your computations, but they do contribute mass to the total.
- Adenine (A): 135.128
- Cytosine (C): 111.103
- Guanine (G): 151.128
- Thymine (T): 125.107
- Junk (-): 100.000

For example, the mass of the sequence ATGG-AC is 135.128 + 125.107 + 151.128 + 151.128 + 100.000 + 135.128 + 111.103, or 908.722. Of this, 270.256 is from the two Adenines; 111.103 is from the Cytosine; 302.256 is from the two Guanines; 125.107 is from the Thymine; and 100.000 is from the "junk" dash.

We suggest that you start this program by writing the code to read the input file. Try writing code to simply read each protein's name and sequence of nucleotides and print them. Next, write code to pass over a nucleotide sequence and count the number of As, Cs, Gs, and Ts. Put your counts into a list of size 4. To map between nucleotides and list indexes, you may want to write a function that converts a single character (i.e. A, C, G, or T) into an index (i.e. 0 to 3).

Once you have the counts working correctly, you can convert your counts into a new list of percentages of mass for each nucleotide using the preceding nucleotide mass values. If you've written code to map between nucleotide letters and list indexes, it may also help you to look up mass values in a list such as masses, holding the four nucleotide masses above in index order. You may store your mass percentages already rounded to one digit past the decimal, or you can round when printing the mass percentages list. Remember that the "junk" dashes do contribute mass to the total. For other parts of your program you may want to remove dashes from the input.

After computing mass percentages, you must break apart the sequence into codons and examine each codon. You may wish to review string functions as presented in lecture 14, such as [ ], upper, and lower.

We also suggest that you first get your program working correctly printing its output to the console before you save the output to a file. You may assume that the input file exists, is readable, and contains valid input. (In other words, you should not re-prompt for input or output file names.)
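The count and mass steps above can be sketched as follows, using the ATGG-AC example. This is a rough illustration under stated assumptions rather than the expected solution: the names NUCLEOTIDES, MASSES, JUNK_MASS, count_nucleotides, and mass_percentages are hypothetical, and the sketch chooses to round when building the list (rounding at print time would be equally valid).

```python
NUCLEOTIDES = "ACGT"                           # index 0..3 stands for A, C, G, T
MASSES = [135.128, 111.103, 151.128, 125.107]  # same order as NUCLEOTIDES
JUNK_MASS = 100.000                            # mass of one "-" character


def count_nucleotides(sequence):
    """Return the [A, C, G, T] counts, ignoring case and dashes."""
    counts = [0] * len(NUCLEOTIDES)
    for ch in sequence.upper():
        index = NUCLEOTIDES.find(ch)  # -1 for a dash, so dashes are skipped
        if index >= 0:
            counts[index] += 1
    return counts


def mass_percentages(counts, junk_count):
    """Return each nucleotide's share of the total mass, rounded to 1 decimal."""
    # Junk dashes are not counted as nucleotides but still add to the total.
    total = junk_count * JUNK_MASS
    for i in range(len(counts)):
        total += counts[i] * MASSES[i]
    percentages = []
    for i in range(len(counts)):
        percentages.append(round(counts[i] * MASSES[i] / total * 100, 1))
    return percentages


sequence = "ATGG-AC"
counts = count_nucleotides(sequence)
percentages = mass_percentages(counts, sequence.count("-"))
print(counts)       # [2, 1, 2, 1]
print(percentages)  # [29.7, 12.2, 33.3, 13.8]
```

The four percentages sum to less than 100 because the remaining share of the 908.722 total belongs to the junk dash.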
You may assume that each sequence's number of nucleotides (without dashes) will be a multiple of 3, although the nucleotides on a line might be in either uppercase or lowercase or a combination. Your program should overwrite any existing data in the output file.

Style Guidelines:
For this assignment you are required to have the following four constants:
- one for the minimum number of codons a valid protein must have, as an integer (default of 5)
- a second for the percentage of mass from C and G in order for a protein to be valid, as an integer (default of 30)
- a third for the number of unique nucleotides (4, representing A, C, G, and T)
- a fourth for the number of nucleotides per codon (3)

For full credit it should be possible to change the first two constant values (minimum codons and minimum mass percentage) and cause your program to change its behavior for evaluating protein validity. The other two constants won't ever be changed but are still useful to make your program more readable. Refer to these constants in your code and do not refer to the bare number such as 4 or 3 directly. You may use additional constants if they make your code clearer.

We will grade your function structure strictly on this assignment. Use at least four nontrivial functions besides main. These functions should use parameters and returns, including lists, as appropriate. The functions should be well-structured and avoid redundancy. No one function should do too large a share of the overall task. You may not nest these functions inside each other or in main.

In particular, we require that you have the following particular function in your program:
- A function to print all file output for a given potential protein (nucleotides, counts, %, is it a protein, etc.)

In other words, all output to the file should be done through one function called on each nucleotide sequence from the input. Your other functions should do the computations to gather information to be passed to this output function.
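One way the first two constants could drive the protein check is sketched below. It is only an illustration under assumptions: every name here (MIN_CODONS, MIN_CG_PERCENT, START_CODON, STOP_CODONS, is_protein) is hypothetical, the function expects an already computed codon list and an [A, C, G, T] mass percentage list, and it sums the rounded percentages, which is a simplification.

```python
MIN_CODONS = 5        # minimum number of codons in a valid protein
MIN_CG_PERCENT = 30   # minimum combined mass percentage of C and G
START_CODON = "ATG"
STOP_CODONS = ["TAA", "TAG", "TGA"]


def is_protein(codons, percentages):
    """Return True if the codons and [A, C, G, T] percentages meet all constraints."""
    # Guard first so that indexing codons[0] and codons[-1] is always safe.
    if not codons or len(codons) < MIN_CODONS:
        return False
    cg_percent = percentages[1] + percentages[2]  # indexes 1 and 2 are C and G
    return (codons[0] == START_CODON
            and codons[-1] in STOP_CODONS
            and cg_percent >= MIN_CG_PERCENT)


# Illustrative values only, not taken from dna.txt.
codons = ["ATG", "CCA", "CTA", "TGG", "TAG"]
percentages = [20.0, 25.0, 25.0, 30.0]
print(is_protein(codons, percentages))  # True
```

Because the thresholds live in named constants, changing MIN_CODONS or MIN_CG_PERCENT changes the verdict without touching the function body.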
Your main function should be a concise summary of the overall program. It is okay for main to contain some code such as print statements. But main should not perform too large a share of the overall work itself, such as examining each character of an input line. Also avoid "chaining," when many functions call each other without ever returning to main.

We will also check strictly for redundancy on this assignment. If you have a very similar piece of code that is repeated several times in your program, eliminate the redundancy, such as by creating a function, by using for loops over the elements of lists, and/or by factoring if/else code.

Since lists are a key component of this assignment, part of your grade comes from using lists properly. For example, you should reduce redundancy as appropriate by using traversals over lists (for loops over the list's elements). This is preferable to writing out a separate statement for each list element (a statement for element [0], then another for [1], then for [2], etc.). Also carefully consider how lists should be passed as parameters and/or returned from functions as you are decomposing your program. Recall that lists are mutable when passed as parameters, meaning that a list passed to a function can be modified by that function and the changes will be seen by the caller.

You are limited to features covered in the lectures. Follow past style guidelines such as indentation, names, variables, line lengths, and comments (at the beginning of your program, on each function, and on complex sections of code). You may not have any global variables.

Additional Input Files (Optional):
If you would like to generate additional input files to test your program, you can create them from actual NCBI genetic data. The following web site has many data files that contain complete genomes for bacterial organisms:

To connect to the website, an FTP client (like FileZilla) is suggested. The site contains many directories with names of organisms.
After entering a directory, you can find and save a folder containing assembled genome data, including a genome file (a file whose name ends with .fna) and a feature table. Gene sequences can be extracted from the genome sequence using start and end information in the feature table.



Expert Answer


Here is pseudocode for the two functions:

is_protein(sequence):
    # Check if sequence is a protein-coding gene
    if sequence starts with "ATG" and ends with