Do all humans have an identical nucleotide sequence for certain proteins, e.g. haemoglobin?

All humans have the same sort of proteins in our bodies. Take haemoglobin for example.

Is the gene coding for haemoglobin in my body identical to everyone else's gene, or are there slight variations in the nucleotide sequence?

Are there examples of proteins that are always completely conserved at the population level?

Humans have many variants

There is variation. The project I use to help understand this natural variation is gnomAD. Using VarMap and a slightly out of date gnomAD file, I counted 16007805 protein-coding variants across the human genome. This number will only go up over time.

Indeed, the 1000 Genomes Project found that each person carries approximately 250 to 300 loss-of-function variants in annotated genes (The 1000 Genomes Project Consortium, 2010).

This is an important concept for human health. ClinVar is one of many projects that aim to catalogue and study when these variations lead to disease in a clinical context. Another is the 100,000 Genomes Project by Genomics England, which studies NHS patient data in cases of rare disease and cancer.

Haemoglobin has variants, including disease variants

At the time of writing, HBA1 (haemoglobin alpha subunit gene) has 183 gnomAD variants and 17 pathogenic variants in ClinVar (sourced from gnomAD). Again, both of these numbers are likely to increase because the data will cover more people.

Constraints on highly important proteins

But the underlying question is, I think, "are there some proteins that are so important that life keeps them highly constrained?", i.e. any variation will lead to a non-viable cell or a disease phenotype. gnomAD attempts to add constraint metrics to each protein record, and some proteins are more constrained than others.

For example:

  • Haemoglobin scores a pLI of 0.01 (higher scores are more intolerant to variation, specifically loss of function variation).

  • p53 is a gatekeeper of the cell cycle, mutants of which are common in cancer cells. It has a pLI score of 0.53, which means it is considerably less tolerant of loss-of-function variation than haemoglobin.

  • Ribosomal protein L5 has a pLI of 0.998 implying it can tolerate little if any variation. The ribosome is critical in protein production, hence altering it may cause a complete breakdown of cellular life.

Variation and Evolution

There is an almost philosophical difference between human variation and human evolution. Variation is a static snapshot of our protein sequence from individual to individual. Evolution in the sense of a Dayhoff Matrix requires looking back millions or billions of years by comparing similar protein sequences across many species.

It is highly unlikely that there exists any protein that is made from completely identical nucleotide sequences across the entire human population. There will certainly be regions within a gene that are highly conserved, but there is little evolutionary pressure to conserve an entire gene's nucleotide sequence across the population.

This is in part due to the fact that different codons can translate into the same amino acid. So, even if a protein is (phenotypically) identical across the human population, the genes making that protein can use different sequences to code for the same amino acids. This alone can allow for an incredibly large number of possible gene variations that can code for an identical protein.
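To make the scale of this concrete, here is a minimal sketch that counts how many distinct coding sequences could encode the same peptide. It assumes the standard genetic code, ignores stop codons, and uses an arbitrary eight-residue peptide chosen only for illustration; the function name is hypothetical.

```python
from math import prod

# Number of codons per amino acid in the standard genetic code
# (one-letter amino acid codes; the three stop codons are excluded).
CODON_COUNTS = {
    "A": 4, "R": 6, "N": 2, "D": 2, "C": 2, "Q": 2, "E": 2, "G": 4,
    "H": 2, "I": 3, "L": 6, "K": 2, "M": 1, "F": 2, "P": 4, "S": 6,
    "T": 4, "W": 1, "Y": 2, "V": 4,
}


def synonymous_sequence_count(peptide: str) -> int:
    """Count the distinct nucleotide sequences that encode `peptide`."""
    return prod(CODON_COUNTS[aa] for aa in peptide.upper())


if __name__ == "__main__":
    # Arbitrary illustrative peptide, not taken from the text above.
    print(synonymous_sequence_count("MVLSPADK"))  # 9216
```

Even for eight residues there are thousands of synonymous sequences; the count grows multiplicatively with every additional residue, so a full-length protein has an astronomically large number of possible encodings.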

Related resources: Neutral mutation (wikipedia) and the neutral theory of molecular evolution (wikipedia)

At the whole-gene level, there is likely no absolute conservation of any human protein-coding gene at the population level, though there might be complete conservation between individuals. Keep in mind that most human genes are on the order of tens of thousands of base pairs long, and only a portion of that length encodes functional motifs. There are, however, ultra-conserved elements on the order of hundreds of base pairs that are identical between humans, rats, and mice. Most of these elements are non-coding, and some are transcribed as functional ncRNAs.

No gene and no base pair is immune to mutation. It is natural selection pressure that keeps some genes relatively constant. Neutral mutations are not subject to that pressure, so they accumulate freely. And then, there are "almost neutral" mutations that also slip through.


Proteins are large molecules (macromolecules), polymers of structural units called amino acids. A total of 20 different amino acids exist in proteins, and hundreds to thousands of these amino acids are attached to each other in long chains to form a protein. Amino acids can be released from proteins by hydrolysis. (Hydrolysis is the cleavage of a covalent bond by the addition of water under suitable conditions.)

Due to their large size, proteins form colloids when they are dispersed in a suitable solvent. This property distinguishes protein dispersions from solutions of small molecules.

Since amino acids are the “building blocks” for proteins, their structure and properties will be considered first.

B. BLAST Search Parameters

Limit by Organism

A BLAST search may be limited by organism. The entry field will suggest completions once a user starts typing. A checkbox will exclude rather than include the organism in the search.

Limit by Entrez Query

A BLAST search can be limited to the result of an Entrez query against the database chosen. This restricts the search to a subset of entries from that database fitting the requirement of the Entrez query. Terms normally accepted by Entrez nucleotide or protein searches are accepted here. Examples are given below.

    protease NOT hiv1[organism]

This will limit a BLAST search to all proteases, except those in HIV 1.

This limits the search to entries with lengths from 1000 to 2000 bases for nucleotide entries, or 1000 to 2000 residues for protein entries.

This limits the search to mouse mRNA entries in the database. For common organisms, one can also select from the pulldown menu.

This is yet another example usage, which limits the search to protein sequences with a calculated molecular weight from 10 kD to 100 kD.

This limits the search to entries that are annotated with a /specimen_voucher qualifier on the source feature.

For help in constructing Entrez queries please see the "Writing Advanced Search Statements" section of the Entrez Help document. Knowing the content of a database and applying the Entrez terms accordingly are important. For example, biomol_mrna[prop] should not be applied to the htgs or chromosome databases since they do not contain mRNA entries!

Compositional adjustments

Amino acid substitution matrices may be adjusted in various ways to compensate for the amino acid compositions of the sequences being compared. The simplest adjustment is to scale all substitution scores by an analytically determined constant, while leaving the gap scores fixed; this procedure is called "composition-based statistics" (Schaffer et al., 2001). The resulting scaled scores yield more accurate E-values than standard, unscaled scores. A more sophisticated approach adjusts each score in a standard substitution matrix separately to compensate for the compositions of the two sequences being compared (Yu et al., 2003; Yu and Altschul, 2005; Altschul et al., 2005). Such "compositional score matrix adjustment" may be invoked only under certain specific conditions for which it has been empirically determined to be beneficial (Altschul et al., 2005); under all other conditions, composition-based statistics are used. Alternatively, compositional adjustment may be invoked universally.

[1] Schaffer, A.A., Aravind, L., Madden, T.L., Shavirin, S., Spouge, J.L., Wolf, Y.I., Koonin, E.V. and Altschul, S.F. (2001) "Improving the accuracy of PSI-BLAST protein database searches with composition-based statistics and other refinements," Nucleic Acids Res. 29:2994-3005.
[2] Yu, Y.-K., Wootton, J.C. and Altschul, S.F. (2003) "The compositional adjustment of amino acid substitution matrices," Proc. Natl. Acad. Sci. USA 100:15688-15693.
[3] Yu, Y.-K. and Altschul, S.F. (2005) "The construction of amino acid substitution matrices for the comparison of proteins with non-standard compositions," Bioinformatics 21:902-911.
[4] Altschul, S.F., Wootton, J.C., Gertz, E.M., Agarwala, R., Morgulis, A., Schaffer, A.A. and Yu, Y.-K. (2005) "Protein database searches using compositionally adjusted substitution matrices," FEBS J 272(20):5101-9.


This function masks off segments of the query sequence that have low compositional complexity, as determined by the SEG program of Wootton and Federhen (Computers and Chemistry, 1993) or, for BLASTN, by the DUST program of Tatusov and Lipman. Filtering can eliminate statistically significant but biologically uninteresting reports from the BLAST output (e.g., hits against common acidic-, basic- or proline-rich regions), leaving the more biologically interesting regions of the query sequence available for specific matching against database sequences.

Filtering is only applied to the query sequence (or its translation products), not to database sequences. Default filtering is DUST for BLASTN, SEG for other programs.

It is not unusual for nothing at all to be masked by SEG when it is applied to sequences in SWISS-PROT or RefSeq, so filtering should not be expected to always yield an effect. Furthermore, in some cases, sequences are masked in their entirety, indicating that the statistical significance of any matches reported against the unfiltered query sequence should be suspect. A fully masked query will also lead to a search error when the default settings are used.

This option masks human repeats (LINEs, SINEs, plus retroviral repeats) and is useful for human sequences that may contain these repeats. Filtering for repeats can increase the speed of a search, especially with very long sequences (>100 kb) and against databases which contain a large number of repeats (htgs). This filter should be checked for genomic queries to prevent potential problems that may arise from the numerous and often spurious matches to those repeat elements.

For more information please see "Why does my search timeout on the BLAST servers?" in the BLAST Frequently Asked Questions.

Filter (Mask for lookup table only)

BLAST searches consist of two phases, finding hits based upon a lookup table and then extending them. This option masks only for purposes of constructing the lookup table used by BLAST so that no hits are found based upon low-complexity sequence or repeats (if repeat filter is checked). The BLAST extensions are performed without masking and so they can be extended through low-complexity sequence.

With this option selected you can cut and paste a FASTA sequence in upper case characters and denote areas you would like filtered with lower case. This allows you to customize what is filtered from the sequence during the comparison to the BLAST databases.

One can use different combinations of the above filter options to achieve optimal search results.


BLAST is a heuristic that works by finding word-matches between the query and database sequences. One may think of this process as finding "hot-spots" that BLAST can then use to initiate extensions that might eventually lead to full-blown alignments. For nucleotide-nucleotide searches (i.e., "blastn") an exact match of the entire word is required before an extension is initiated, so that one normally regulates the sensitivity and speed of the search by increasing or decreasing the word-size. For other BLAST searches non-exact word matches are taken into account based upon the similarity between words. The amount of similarity can be varied. The webpage allows the word-sizes 2, 3, and 6.
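The exact word-match seeding step described for nucleotide searches can be sketched as follows. This is a simplified illustration, not BLAST's actual implementation: the function names are hypothetical, and extension, scoring, and masking are all omitted.

```python
from collections import defaultdict


def build_lookup(query: str, word_size: int) -> dict[str, list[int]]:
    """Map every word of length `word_size` in the query to its start positions."""
    table = defaultdict(list)
    for i in range(len(query) - word_size + 1):
        table[query[i:i + word_size]].append(i)
    return dict(table)


def find_seeds(query: str, subject: str, word_size: int) -> list[tuple[int, int]]:
    """Return (query_pos, subject_pos) pairs where an exact word match occurs."""
    table = build_lookup(query, word_size)
    seeds = []
    for j in range(len(subject) - word_size + 1):
        for i in table.get(subject[j:j + word_size], []):
            seeds.append((i, j))
    return seeds


if __name__ == "__main__":
    query, subject = "ACGTACGT", "TTACGTAA"
    print(find_seeds(query, subject, 4))  # [(3, 1), (0, 2), (4, 2), (1, 3)]
    print(find_seeds(query, subject, 6))  # []
```

With these toy sequences, raising the word size from 4 to 6 removes every seed, which illustrates the trade-off: larger words mean fewer hot-spots to extend (faster, less sensitive), smaller words mean more (slower, more sensitive).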


This setting specifies the statistical significance threshold for reporting matches against database sequences. The default value (10) means that 10 such matches are expected to be found merely by chance, according to the stochastic model of Karlin and Altschul (1990). If the statistical significance ascribed to a match is greater than the EXPECT threshold, the match will not be reported. Lower EXPECT thresholds are more stringent, leading to fewer chance matches being reported.
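As a rough illustration of how the threshold acts, the Karlin-Altschul model gives the expected number of chance matches with raw score at least S as E = K * m * n * exp(-lambda * S), where m and n are the query and database lengths. The sketch below is a simplification: the function names are hypothetical, the K and lambda values passed in are placeholders rather than the parameters of any particular scoring system, and the effective-length corrections that BLAST applies are ignored.

```python
import math


def expect_value(raw_score: float, query_len: int, db_len: int,
                 lam: float, k: float) -> float:
    """Karlin-Altschul expected number of chance hits scoring >= raw_score."""
    return k * query_len * db_len * math.exp(-lam * raw_score)


def is_reported(e_value: float, threshold: float = 10.0) -> bool:
    """A match is reported only if its E-value does not exceed the threshold."""
    return e_value <= threshold


if __name__ == "__main__":
    # Placeholder statistical parameters, chosen only for illustration.
    lam, k = 1.0, 0.5
    for score in (10, 20, 30):
        e = expect_value(score, query_len=500, db_len=1_000_000, lam=lam, k=k)
        print(score, e, is_reported(e))
```

Because the score enters through an exponential, each additional unit of score shrinks E by a constant factor, while a larger database raises E in proportion to its size.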

Reward and Penalty for Nucleotide Programs

Many nucleotide searches use a simple scoring system that consists of a "reward" for a match and a "penalty" for a mismatch. The (absolute) reward/penalty ratio should be increased as one looks at more divergent sequences. A ratio of 0.33 (1/-3) is appropriate for sequences that are about 99% conserved; a ratio of 0.5 (1/-2) is best for sequences that are 95% conserved; a ratio of about one (1/-1) is best for sequences that are 75% conserved [1].

[1] States DJ, Gish W, and Altschul SF (1991) METHODS: A companion to Methods in Enzymology 3:66-70.
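The reward/penalty scheme above can be sketched for an ungapped alignment as follows. The function name is hypothetical and the sequences are toy data; gap costs are deliberately left out.

```python
def ungapped_score(a: str, b: str, reward: int = 1, penalty: int = -3) -> int:
    """Score two equal-length aligned sequences with a match reward and mismatch penalty."""
    if len(a) != len(b):
        raise ValueError("aligned sequences must have equal length")
    return sum(reward if x == y else penalty for x, y in zip(a, b))


if __name__ == "__main__":
    s1, s2 = "ACGTACGTAC", "ACGTTCGTAC"  # nine matches, one mismatch
    print(ungapped_score(s1, s2, 1, -3))  # 6
    print(ungapped_score(s1, s2, 1, -1))  # 8
```

The same single mismatch costs far more under 1/-3 than under 1/-1, which is why the stricter ratio suits nearly identical sequences and the more lenient ratio suits divergent ones.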

Matrix and Gap Costs

A key element in evaluating the quality of a pairwise sequence alignment is the "substitution matrix", which assigns a score for aligning any possible pair of residues. The matrix used in a BLAST search can be changed depending on the type of sequences you are searching with (see the BLAST Frequently Asked Questions). See more information on BLAST substitution matrices.

The pull-down menu shows the Gap Costs for the chosen Matrix. Only a limited number of options are available for these parameters. Increasing the Gap Costs will result in alignments with fewer gaps.

PSI-BLAST can save the Position Specific Score Matrix constructed through iterations. The PSSM thus constructed can be used in searches against other databases with the same query by copying and pasting the encoded text into the PSSM field.

  1. Run a protein BLAST search.
  2. Check the PSI-BLAST box on the formatting page.
  3. Click the "Format" Button.
  4. On the PSI-BLAST results page, click the "Run PSI-BLAST Iteration 2" button.
  5. Select the Download link at the top of the page and download the PSSM to your computer.

To use the PSSM in a new protein BLAST search against other databases:

  1. Open a new protein BLAST page.
  2. Select PSI-BLAST as the Algorithm under "Program Selection" (this may already be set).
  3. Select the "+" next to "Algorithm parameters" at the bottom of the search page.
  4. Scroll to the "PSI/PHI/DELTA BLAST" section and use the "Choose File" button to upload the PSSM that you saved in step 5 above.
  5. Select a different target database.
  6. Click the "BLAST" button to start the search.


PHI-BLAST (Pattern-Hit Initiated BLAST) is a search program that combines matching of regular expressions with local alignments surrounding the match. Given a protein sequence S and a regular expression pattern P occurring in S, PHI-BLAST helps answer the question:

DNA and Proteins

What is DNA?
DNA stands for deoxyribonucleic acid, and it is the carrier of genetic information within a cell. A molecule of DNA consists of two chains that are wrapped around each other. The chains twist to form a double helix in shape. Each chain is made up of repeating subunits called nucleotides that are held together by chemical bonds. There are four different types of nucleotides in DNA, and they differ from one another by the type of base that is present: adenine (A), thymine (T), guanine (G), and cytosine (C). A base on one of the chains that makes up DNA is chemically bonded to a base on the other chain. This bonding holds the two chains together. Additionally, there are base pairing rules that determine which bases can bond with each other. Adenine and thymine form base pairs that are held together by two bonds, while cytosine and guanine form base pairs that are held together by three bonds. Bases that bond together are known as complementary.

How DNA Encodes for Proteins:

1. Transcription: DNA to mRNA

During transcription, the information in DNA is copied into messenger RNA (mRNA) by an enzyme called RNA polymerase. RNA is a molecule that is chemically similar to DNA, and also contains repeating nucleotide subunits. However, the “bases” of RNA differ from those of DNA in that thymine (T) is replaced by uracil (U) in RNA. DNA and RNA bases are also held together by chemical bonds and have specific base pairing rules. In DNA/RNA base pairing, adenine (A) in DNA pairs with uracil (U) in RNA, thymine (T) in DNA pairs with adenine (A) in RNA, and cytosine (C) pairs with guanine (G). The conversion of DNA to mRNA occurs when an RNA polymerase makes a complementary mRNA copy of a DNA “template” sequence. Once the mRNA molecule has been synthesized, specific chemical modifications must be made that enable the mRNA to be translated into protein.
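The base pairing rules for transcription can be sketched in a few lines. This is a minimal illustration: the function name is hypothetical, the template strand is assumed to be written 3' to 5' so that the mRNA reads 5' to 3' left to right, and the later chemical modifications of the mRNA are not modeled.

```python
# DNA template base -> complementary RNA base.
RNA_PARTNER = {"A": "U", "T": "A", "G": "C", "C": "G"}


def transcribe(template_3_to_5: str) -> str:
    """Return the mRNA (5'->3') complementary to a DNA template written 3'->5'."""
    return "".join(RNA_PARTNER[base] for base in template_3_to_5.upper())


if __name__ == "__main__":
    print(transcribe("TACGGT"))  # AUGCCA
```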

2. Translation: mRNA to protein

During translation, the mRNA sequence is decoded into protein. A group of three mRNA nucleotides encodes a specific amino acid and is called a codon. Each mRNA therefore specifies one amino acid sequence, which forms the resultant protein. Special codons, called start and stop codons, signal the beginning and end of translation. The final protein product is formed after a stop codon has been reached. A table called the genetic code can be referred to in order to see which codons encode which amino acids. Several codons can encode the same amino acid, a property that is referred to as redundancy in the genetic code.
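A minimal sketch of this decoding step is shown below. It assumes the standard genetic code, starts at the first AUG it finds, and stops at the first in-frame stop codon; the function name is hypothetical and the input sequences are toy examples.

```python
from itertools import product

# Standard genetic code, with codons enumerated in U, C, A, G order.
# "*" marks the three stop codons (UAA, UAG, UGA).
BASES = "UCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def translate(mrna: str) -> str:
    """Translate from the first AUG up to (not including) the first in-frame stop codon."""
    mrna = mrna.upper()
    start = mrna.find("AUG")
    if start == -1:
        return ""
    protein = []
    for i in range(start, len(mrna) - 2, 3):
        amino_acid = CODON_TABLE[mrna[i:i + 3]]
        if amino_acid == "*":
            break
        protein.append(amino_acid)
    return "".join(protein)


if __name__ == "__main__":
    print(translate("GGAUGGUGCUGUAAUU"))  # MVL
    print(translate("AUGGUUCUCUGA"))      # MVL (different codons, same peptide)
```

The two inputs differ in their codons for valine and leucine yet yield the same peptide, which is the redundancy described above.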

14.2 DNA Structure and Sequencing

By the end of this section, you will be able to do the following:

  • Describe the structure of DNA
  • Explain the Sanger method of DNA sequencing
  • Discuss the similarities and differences between eukaryotic and prokaryotic DNA

The building blocks of DNA are nucleotides. The important components of the nucleotide are a nitrogenous (nitrogen-bearing) base, a 5-carbon sugar (pentose), and a phosphate group (Figure 14.5). The nucleotide is named depending on the nitrogenous base. The nitrogenous base can be a purine such as adenine (A) and guanine (G), or a pyrimidine such as cytosine (C) and thymine (T).

Visual Connection

The images above illustrate the five bases of DNA and RNA. Examine the images and explain why these are called “nitrogenous bases.” How are the purines different from the pyrimidines? How is one purine or pyrimidine different from another, e.g., adenine from guanine? How is a nucleoside different from a nucleotide?

The purines have a double ring structure with a six-membered ring fused to a five-membered ring. Pyrimidines are smaller in size; they have a single six-membered ring structure.

The sugar is deoxyribose in DNA and ribose in RNA. The carbon atoms of the five-carbon sugar are numbered 1', 2', 3', 4', and 5' (1' is read as “one prime”). The phosphate, which makes DNA and RNA acidic, is connected to the 5' carbon of the sugar by the formation of an ester linkage between phosphoric acid and the 5'-OH group (an ester is an acid + an alcohol). In DNA nucleotides, the 3' carbon of the sugar deoxyribose is attached to a hydroxyl (OH) group. In RNA nucleotides, the 2' carbon of the sugar ribose also contains a hydroxyl group. The base is attached to the 1' carbon of the sugar.

The nucleotides combine with each other to produce phosphodiester bonds. The phosphate residue attached to the 5' carbon of the sugar of one nucleotide forms a second ester linkage with the hydroxyl group of the 3' carbon of the sugar of the next nucleotide, thereby forming a 5'-3' phosphodiester bond. In a polynucleotide, one end of the chain has a free 5' phosphate, and the other end has a free 3'-OH. These are called the 5' and 3' ends of the chain.

In the 1950s, Francis Crick and James Watson worked together to determine the structure of DNA at the University of Cambridge, England. Other scientists like Linus Pauling and Maurice Wilkins were also actively exploring this field. Pauling previously had discovered the secondary structure of proteins using X-ray crystallography. In Wilkins’ lab, researcher Rosalind Franklin was using X-ray diffraction methods to understand the structure of DNA. Watson and Crick were able to piece together the puzzle of the DNA molecule on the basis of Franklin's data because Crick had also studied X-ray diffraction (Figure 14.6). In 1962, James Watson, Francis Crick, and Maurice Wilkins were awarded the Nobel Prize in Physiology or Medicine. Unfortunately, by then Franklin had died, and Nobel prizes are not awarded posthumously.

Watson and Crick proposed that DNA is made up of two strands that are twisted around each other to form a right-handed helix. Base pairing takes place between a purine and pyrimidine on opposite strands, so that A pairs with T, and G pairs with C (suggested by Chargaff's Rules). Thus, adenine and thymine are complementary base pairs, and cytosine and guanine are also complementary base pairs. The base pairs are stabilized by hydrogen bonds: adenine and thymine form two hydrogen bonds and cytosine and guanine form three hydrogen bonds. The two strands are anti-parallel in nature; that is, the 3' end of one strand faces the 5' end of the other strand. The sugar and phosphate of the nucleotides form the backbone of the structure, whereas the nitrogenous bases are stacked inside, like the rungs of a ladder. Each base pair is separated from the next base pair by a distance of 0.34 nm, and each turn of the helix measures 3.4 nm. Therefore, 10 base pairs are present per turn of the helix. The diameter of the DNA double-helix is 2 nm, and it is uniform throughout. Only the pairing between a purine and pyrimidine and the antiparallel orientation of the two DNA strands can explain the uniform diameter. The twisting of the two strands around each other results in the formation of uniformly spaced major and minor grooves (Figure 14.7).
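The helix dimensions quoted above lend themselves to a small worked calculation. The sketch below uses only the two figures from the text (0.34 nm per base pair, 10 base pairs per turn); the function and constant names are hypothetical.

```python
RISE_PER_BP_NM = 0.34  # distance between adjacent base pairs, from the text
BP_PER_TURN = 10       # base pairs per helical turn, from the text


def helix_length_nm(n_bp: int) -> float:
    """Contour length in nanometres of a relaxed double helix with n_bp base pairs."""
    return n_bp * RISE_PER_BP_NM


def helix_turns(n_bp: int) -> float:
    """Number of helical turns spanned by n_bp base pairs."""
    return n_bp / BP_PER_TURN


if __name__ == "__main__":
    for n in (10, 1_000, 1_000_000):
        print(n, helix_length_nm(n), helix_turns(n))
```

Ten base pairs give one turn of about 3.4 nm, matching the pitch stated in the text; the length scales linearly with the number of base pairs.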

DNA Sequencing Techniques

Until the 1990s, the sequencing of DNA (reading the sequence of DNA) was a relatively expensive and long process. Using radiolabeled nucleotides also compounded the problem through safety concerns. With currently available technology and automated machines, the process is cheaper, safer, and can be completed in a matter of hours. Fred Sanger developed the sequencing method used for the human genome sequencing project, which is widely used today (Figure 14.8).

The sequencing method is known as the dideoxy chain termination method. The method is based on the use of chain terminators, the dideoxynucleotides (ddNTPs). The ddNTPs differ from the deoxynucleotides by the lack of a free 3' OH group on the five-carbon sugar. If a ddNTP is added to a growing DNA strand, the chain cannot be extended any further because the free 3' OH group needed to add another nucleotide is not available. By using a predetermined ratio of deoxynucleotides to dideoxynucleotides, it is possible to generate DNA fragments of different sizes.

The DNA sample to be sequenced is denatured (separated into two strands by heating it to high temperatures). The DNA is divided into four tubes, to which a primer, DNA polymerase, and all four deoxynucleoside triphosphates (A, T, G, and C) are added. In addition, a limited quantity of one of the four dideoxynucleoside triphosphates (ddCTP, ddATP, ddGTP, and ddTTP) is added to each tube. The tubes are labeled A, T, G, and C according to the ddNTP added. For detection purposes, each of the four dideoxynucleotides carries a different fluorescent label. Chain elongation continues until a fluorescent dideoxynucleotide is incorporated, after which no further elongation takes place. After the reaction is over, electrophoresis is performed. Even a difference in length of a single base can be detected. The sequence is read by a laser scanner that detects the fluorescent marker of each fragment. For his work on DNA sequencing, Sanger received a Nobel Prize in Chemistry in 1980.
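The logic of reading a sequence from terminated fragments can be sketched with an idealized model. It assumes that every possible terminated fragment is produced in the appropriate tube, measures fragment length from the first newly added base (ignoring the primer), and treats electrophoresis as perfect sorting by length; all function names are hypothetical.

```python
def termination_lengths(new_strand: str) -> dict[str, list[int]]:
    """For each ddNTP tube, list the fragment lengths at which chains terminate."""
    tubes = {base: [] for base in "ATGC"}
    for position, base in enumerate(new_strand.upper(), start=1):
        # A chain ending at this position carries the dideoxy form of `base`.
        tubes[base].append(position)
    return tubes


def read_sequence(tubes: dict[str, list[int]]) -> str:
    """Order fragments by length (as electrophoresis does) and read off the labels."""
    label_by_length = {
        length: base for base, lengths in tubes.items() for length in lengths
    }
    return "".join(label_by_length[n] for n in sorted(label_by_length))


if __name__ == "__main__":
    tubes = termination_lengths("GATTACA")
    print(tubes)                 # {'A': [2, 5, 7], 'T': [3, 4], 'G': [1], 'C': [6]}
    print(read_sequence(tubes))  # GATTACA
```

Each fragment length appears in exactly one tube, so sorting the fragments from shortest to longest and noting which label each carries recovers the synthesized strand.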

Sanger’s genome sequencing has led to a race to sequence human genomes at rapid speed and low cost.

Gel electrophoresis is a technique used to separate DNA fragments of different sizes. Usually the gel is made of a chemical called agarose (a polysaccharide polymer extracted from seaweed that is high in galactose residues). Agarose powder is added to a buffer and heated. After cooling, the gel solution is poured into a casting tray. Once the gel has solidified, the DNA is loaded on the gel and electric current is applied. The DNA has a net negative charge and moves from the negative electrode toward the positive electrode. The electric current is applied for sufficient time to let the DNA separate according to size; the smallest fragments will be farthest from the well (where the DNA was loaded), and the heavier molecular weight fragments will be closest to the well. Once the DNA is separated, the gel is stained with a DNA-specific dye for viewing it (Figure 14.9).

Evolution Connection

Neanderthal Genome: How Are We Related?

The first draft sequence of the Neanderthal genome was published by Richard E. Green et al. in 2010. Neanderthals are the closest extinct relatives of present-day humans. They were known to have lived in Europe and Western Asia (and now, perhaps, in Northern Africa) before they disappeared from fossil records approximately 30,000 years ago. Green’s team studied almost 40,000-year-old fossil remains that were selected from sites across the world. Extremely sophisticated means of sample preparation and DNA sequencing were employed because of the fragile nature of the bones and heavy microbial contamination. In their study, the scientists were able to sequence some four billion base pairs. The Neanderthal sequence was compared with that of present-day humans from across the world. After comparing the sequences, the researchers found that the Neanderthal genome had 2 to 3 percent greater similarity to people living outside Africa than to people in Africa. While current theories have suggested that all present-day humans can be traced to a small ancestral population in Africa, the data from the Neanderthal genome suggest some interbreeding between Neanderthals and early modern humans.

Green and his colleagues also discovered DNA segments among people in Europe and Asia that are more similar to Neanderthal sequences than to other contemporary human sequences. Another interesting observation was that Neanderthals are as closely related to people from Papua New Guinea as to those from China or France. This is surprising because Neanderthal fossil remains have been located only in Europe and West Asia. Most likely, genetic exchange took place between Neanderthals and modern humans as modern humans emerged out of Africa, before the divergence of Europeans, East Asians, and Papua New Guineans.

Several genes seem to have undergone changes from Neanderthals during the evolution of present-day humans. These genes are involved in cranial structure, metabolism, skin morphology, and cognitive development. One of the genes that is of particular interest is RUNX2, which is different in modern day humans and Neanderthals. This gene is responsible for the prominent frontal bone, bell-shaped rib cage, and dental differences seen in Neanderthals. It is speculated that an evolutionary change in RUNX2 was important in the origin of modern-day humans, and this affected the cranium and the upper body.

Link to Learning

Watch Svante Pääbo’s talk explaining the Neanderthal genome research at the 2011 annual TED (Technology, Entertainment, Design) conference.

DNA Packaging in Cells

Prokaryotes are much simpler than eukaryotes in many of their features (Figure 14.10). Most prokaryotes contain a single, circular chromosome that is found in an area of the cytoplasm called the nucleoid region.

Visual Connection

In eukaryotic cells, DNA and RNA synthesis occur in a separate compartment from protein synthesis. In prokaryotic cells, both processes occur together. What advantages might there be to separating the processes? What advantages might there be to having them occur together?

The size of the genome in one of the most well-studied prokaryotes, E. coli, is 4.6 million base pairs (approximately 1.6 mm at 0.34 nm per base pair, if cut and stretched out). So how does this fit inside a small bacterial cell? The DNA is twisted by what is known as supercoiling. Supercoiling means that DNA is either “under-wound” (less than one turn of the helix per 10 base pairs) or “over-wound” (more than one turn per 10 base pairs) relative to its normal relaxed state. Some proteins are known to be involved in the supercoiling; other proteins and enzymes such as DNA gyrase help in maintaining the supercoiled structure.

Eukaryotes, whose chromosomes each consist of a linear DNA molecule, employ a different type of packing strategy to fit their DNA inside the nucleus (Figure 14.11). At the most basic level, DNA is wrapped around proteins known as histones to form structures called nucleosomes. The histones are evolutionarily conserved proteins that are rich in basic amino acids and form an octamer composed of two molecules of each of four different histones. The DNA (remember, it is negatively charged because of the phosphate groups) is wrapped tightly around the histone core. This nucleosome is linked to the next one with the help of a linker DNA. This is also known as the “beads on a string” structure. With the help of a fifth histone, a string of nucleosomes is further compacted into a 30-nm fiber, which is the diameter of the structure. Metaphase chromosomes are even further condensed by association with scaffolding proteins. At the metaphase stage, the chromosomes are at their most compact, approximately 700 nm in width.

In interphase, eukaryotic chromosomes have two distinct regions that can be distinguished by staining. The tightly packaged region is known as heterochromatin, and the less dense region is known as euchromatin. Heterochromatin usually contains genes that are not expressed, and is found in the regions of the centromere and telomeres. The euchromatin usually contains genes that are transcribed, with DNA packaged around nucleosomes but not further compacted.


This book is licensed under a Creative Commons Attribution License 4.0 and you must attribute OpenStax.

    • Authors: Mary Ann Clark, Matthew Douglas, Jung Choi
    • Publisher/website: OpenStax
    • Book title: Biology 2e
    • Publication date: Mar 28, 2018
    • Location: Houston, Texas

    © Jan 7, 2021 OpenStax. Textbook content produced by OpenStax is licensed under a Creative Commons Attribution License 4.0. The OpenStax name, OpenStax logo, OpenStax book covers, OpenStax CNX name, and OpenStax CNX logo are not subject to the Creative Commons license and may not be reproduced without the prior and express written consent of Rice University.


    The DNA in a cell is not a single long molecule. It is divided into a number of segments of uneven lengths. At certain points in the life cycle of a cell, those segments can be tightly packed bundles known as chromosomes. During one stage, the chromosomes appear to be X-shaped.

    Every fungus, plant, and animal has a set number of chromosomes. For example, humans have 46 chromosomes (23 pairs), rice plants have 24 chromosomes, and dogs have 78 chromosomes.

    The Mechanism of Protein Synthesis

    Translation is similar in prokaryotes and eukaryotes. Here we will explore how translation occurs in E. coli, a representative prokaryote, and specify any differences between bacterial and eukaryotic translation.


    The initiation of protein synthesis begins with the formation of an initiation complex. In E. coli, this complex involves the small 30S ribosome, the mRNA template, three initiation factors that help the ribosome assemble correctly, guanosine triphosphate (GTP) that acts as an energy source, and a special initiator tRNA carrying N-formyl-methionine (fMet-tRNAfMet) (Figure 4). The initiator tRNA interacts with the start codon AUG of the mRNA and carries a formylated methionine (fMet). Because of its involvement in initiation, fMet is inserted at the beginning (N terminus) of every polypeptide chain synthesized by E. coli. In E. coli mRNA, a leader sequence upstream of the first AUG codon, called the Shine-Dalgarno sequence (also known as the ribosomal binding site AGGAGG), interacts through complementary base pairing with the rRNA molecules that compose the ribosome. This interaction anchors the 30S ribosomal subunit at the correct location on the mRNA template. At this point, the 50S ribosomal subunit then binds to the initiation complex, forming an intact ribosome.

    In eukaryotes, initiation complex formation is similar, with the following differences:

    • The initiator tRNA is a different specialized tRNA carrying methionine, called Met-tRNAi
    • Instead of binding to the mRNA at the Shine-Dalgarno sequence, the eukaryotic initiation complex recognizes the 5′ cap of the eukaryotic mRNA, then tracks along the mRNA in the 5′ to 3′ direction until the AUG start codon is recognized. At this point, the 60S subunit binds to the complex of Met-tRNAi, mRNA, and the 40S subunit.

    Figure 4. Translation in bacteria begins with the formation of the initiation complex, which includes the small ribosomal subunit, the mRNA, the initiator tRNA carrying N-formyl-methionine, and initiation factors. Then the 50S subunit binds, forming an intact ribosome.
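
The bacterial start-site rule described above (a Shine-Dalgarno site a short distance upstream of the AUG) can be sketched in a few lines of Python. This is only an illustration: the function name, the `max_spacing` cutoff, and the example sequence are assumptions made for the sketch, not values from the text, and real ribosome binding sites tolerate mismatches that this exact-match search ignores.

```python
SHINE_DALGARNO = "AGGAGG"  # consensus ribosomal binding site named in the text
START_CODON = "AUG"


def find_start_codon(mrna, max_spacing=15):
    """Return the index of the first AUG that follows a Shine-Dalgarno site.

    max_spacing is a hypothetical cutoff on the gap (in bases) between the
    end of the Shine-Dalgarno site and the AUG. Returns None if no such
    pair is found.
    """
    sd_pos = mrna.find(SHINE_DALGARNO)
    while sd_pos != -1:
        window_start = sd_pos + len(SHINE_DALGARNO)
        window_end = window_start + max_spacing + len(START_CODON)
        aug_pos = mrna.find(START_CODON, window_start, window_end)
        if aug_pos != -1:
            return aug_pos
        # No AUG close enough; look for another Shine-Dalgarno site.
        sd_pos = mrna.find(SHINE_DALGARNO, sd_pos + 1)
    return None


# Made-up leader sequence, not a real E. coli transcript.
example_mrna = "GGCUAGGAGGUAACAUAUGGCUAAAUAA"
start = find_start_codon(example_mrna)
print(start, example_mrna[start:start + 3])
```

An AUG with no Shine-Dalgarno site upstream returns `None`, which mirrors the point that the anchoring sequence, not the AUG alone, positions the 30S subunit.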


    In prokaryotes and eukaryotes, the basics of elongation of translation are the same. In E. coli, the binding of the 50S ribosomal subunit to produce the intact ribosome forms three functionally important ribosomal sites: The A (aminoacyl) site binds incoming charged aminoacyl tRNAs. The P (peptidyl) site binds charged tRNAs carrying amino acids that have formed peptide bonds with the growing polypeptide chain but have not yet dissociated from their corresponding tRNA. The E (exit) site releases dissociated tRNAs so that they can be recharged with free amino acids. There is one notable exception to this assembly line of tRNAs: During initiation complex formation, bacterial fMet-tRNAfMet or eukaryotic Met-tRNAi enters the P site directly without first entering the A site, providing a free A site ready to accept the tRNA corresponding to the first codon after the AUG.

    Elongation proceeds with single-codon movements of the ribosome each called a translocation event. During each translocation event, the charged tRNAs enter at the A site, then shift to the P site, and then finally to the E site for removal. Ribosomal movements, or steps, are induced by conformational changes that advance the ribosome by three bases in the 3′ direction. Peptide bonds form between the amino group of the amino acid attached to the A-site tRNA and the carboxyl group of the amino acid attached to the P-site tRNA. The formation of each peptide bond is catalyzed by peptidyl transferase, an RNA-based ribozyme that is integrated into the 50S ribosomal subunit. The amino acid bound to the P-site tRNA is also linked to the growing polypeptide chain. As the ribosome steps across the mRNA, the former P-site tRNA enters the E site, detaches from the amino acid, and is expelled. Several of the steps during elongation, including binding of a charged aminoacyl tRNA to the A site and translocation, require energy derived from GTP hydrolysis, which is catalyzed by specific elongation factors. Amazingly, the E. coli translation apparatus takes only 0.05 seconds to add each amino acid, meaning that a 200 amino-acid protein can be translated in just 10 seconds.
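
The arithmetic behind the 10-second figure can be written out as a small Python sketch. The function names are chosen for illustration, and the estimate covers elongation only, ignoring the time taken by initiation and termination.

```python
SECONDS_PER_AMINO_ACID = 0.05  # E. coli figure quoted in the text


def translation_time(n_amino_acids, seconds_per_residue=SECONDS_PER_AMINO_ACID):
    """Estimated elongation time in seconds for a chain of n_amino_acids."""
    return n_amino_acids * seconds_per_residue


def nucleotides_read(n_amino_acids):
    """The ribosome advances three bases for every amino acid added."""
    return 3 * n_amino_acids


# A 200 amino-acid protein: about 10 seconds, reading 600 bases of mRNA.
print(round(translation_time(200), 2), "seconds")
print(nucleotides_read(200), "bases")
```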


    The termination of translation occurs when a nonsense codon (UAA, UAG, or UGA) is encountered for which there is no complementary tRNA. On aligning with the A site, these nonsense codons are recognized by release factors in prokaryotes and eukaryotes that result in the P-site amino acid detaching from its tRNA, releasing the newly made polypeptide. The small and large ribosomal subunits dissociate from the mRNA and from each other; they are recruited almost immediately into another translation initiation complex.
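
Reading codons from the start codon until one of the three nonsense codons can be sketched as follows. The codon table is the standard genetic code written compactly in UCAG order; the function name and the test sequence are illustrative assumptions, and the sketch ignores everything the ribosome actually does beyond codon lookup.

```python
from itertools import product

BASES = "UCAG"
# Standard genetic code, one letter per codon, in UCAG order for each
# codon position ("*" marks a stop codon).
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}
STOP_CODONS = {"UAA", "UAG", "UGA"}


def translate(mrna):
    """Translate from the first AUG until a stop codon (one-letter codes)."""
    start = mrna.find("AUG")
    if start == -1:
        return ""
    peptide = []
    for i in range(start, len(mrna) - 2, 3):
        codon = mrna[i:i + 3]
        if codon in STOP_CODONS:
            break  # no tRNA matches; release factors would act here
        peptide.append(CODON_TABLE[codon])
    return "".join(peptide)


# Made-up transcript: AUG GCU AAA UAA
print(translate("GGAUGGCUAAAUAAGGG"))
```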

    In summary, there are several key features that distinguish prokaryotic gene expression from that seen in eukaryotes. These are illustrated in Figure 6 and listed in Table 1.

    Figure 6. (a) In prokaryotes, the processes of transcription and translation occur simultaneously in the cytoplasm, allowing for a rapid cellular response to an environmental cue. (b) In eukaryotes, transcription is localized to the nucleus and translation is localized to the cytoplasm, separating these processes and necessitating RNA processing for stability.

    Table 1 (surviving fragment). Small ribosomal subunit: 30S with 16S rRNA in prokaryotes; 40S with 18S rRNA in eukaryotes.

    Protein Structure

    Each successive level of protein folding ultimately contributes to its shape and therefore its function.

    Learning Objectives

    Summarize the four levels of protein structure

    Key Takeaways

    Key Points

    • Protein structure depends on its amino acid sequence and local, low-energy chemical bonds between atoms in both the polypeptide backbone and in amino acid side chains.
    • Protein structure plays a key role in its function; if a protein loses its shape at any structural level, it may no longer be functional.
    • Primary structure is the amino acid sequence.
    • Secondary structure is local interactions between stretches of a polypeptide chain and includes α-helix and β-pleated sheet structures.
    • Tertiary structure is the overall three-dimensional folding driven largely by interactions between R groups.
    • Quaternary structure is the orientation and arrangement of subunits in a multi-subunit protein.

    Key Terms

    • antiparallel: The nature of the opposite orientations of the two strands of DNA or two beta strands that comprise a protein’s secondary structure
    • disulfide bond: A bond, consisting of a covalent bond between two sulfur atoms, formed by the reaction of two thiol groups, especially between the thiol groups of two proteins
    • β-pleated sheet: secondary structure of proteins where N-H groups in the backbone of one fully-extended strand establish hydrogen bonds with C=O groups in the backbone of an adjacent fully-extended strand
    • α-helix: secondary structure of proteins where every backbone N-H creates a hydrogen bond with the C=O group of the amino acid four residues earlier in the same helix.

    The shape of a protein is critical to its function because it determines whether the protein can interact with other molecules. Protein structures are very complex, and researchers have only very recently been able to easily and quickly determine the structure of complete proteins down to the atomic level. (The techniques used date back to the 1950s, but until recently they were very slow and laborious to use, so complete protein structures were very slow to be solved.) Early structural biochemists conceptually divided protein structures into four “levels” to make it easier to get a handle on the complexity of the overall structures. To determine how the protein gets its final shape or conformation, we need to understand these four levels of protein structure: primary, secondary, tertiary, and quaternary.

    Primary Structure

    A protein’s primary structure is the unique sequence of amino acids in each polypeptide chain that makes up the protein. Really, this is just a list of which amino acids appear in which order in a polypeptide chain, not really a structure. But, because the final protein structure ultimately depends on this sequence, this was called the primary structure of the polypeptide chain. For example, the pancreatic hormone insulin has two polypeptide chains, A and B.

    Primary structure: The A chain of insulin is 21 amino acids long and the B chain is 30 amino acids long, and each sequence is unique to the insulin protein.

    The gene, or sequence of DNA, ultimately determines the unique sequence of amino acids in each peptide chain. A change in nucleotide sequence of the gene’s coding region may lead to a different amino acid being added to the growing polypeptide chain, causing a change in protein structure and therefore function.

    The oxygen-transport protein hemoglobin consists of four polypeptide chains, two identical α chains and two identical β chains. In sickle cell anemia, a single amino acid substitution in the hemoglobin β chain causes a change in the structure of the entire protein. When the amino acid glutamic acid is replaced by valine in the β chain, the polypeptide folds into a slightly different shape that creates a dysfunctional hemoglobin protein. So, just one amino acid substitution can cause dramatic changes. Under low-oxygen conditions, these dysfunctional hemoglobin proteins start associating with one another, forming long fibers made from millions of aggregated hemoglobins that distort the red blood cells into crescent or “sickle” shapes, which can clog small blood vessels. People affected by the disease often experience breathlessness, dizziness, headaches, and abdominal pain.
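
A single-base change producing a single amino acid change can be illustrated with a short Python sketch. Two details here go beyond the text and should be read as assumptions of the example: the substitution is placed at the sixth residue of the mature β chain as a GAG to GTG change on the coding strand, and the DNA fragment is an illustrative stand-in modeled on the start of the β-globin coding sequence rather than a verified reference sequence.

```python
from itertools import product

BASES = "TCAG"
# Standard genetic code for coding-strand DNA, in TCAG order ("*" = stop).
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def translate_dna(coding_dna):
    """Translate a coding-strand DNA sequence codon by codon."""
    return "".join(
        CODON_TABLE[coding_dna[i:i + 3]] for i in range(0, len(coding_dna) - 2, 3)
    )


def substitute(sequence, index, new_base):
    """Return a copy of sequence with one base replaced (a point mutation)."""
    return sequence[:index] + new_base + sequence[index + 1:]


# Illustrative fragment (assumption): eight codons, initiator ATG omitted.
normal = "GTGCATCTGACTCCTGAGGAGAAG"
# Change the middle base of the sixth codon: GAG (Glu, E) -> GTG (Val, V).
sickle = substitute(normal, 16, "T")

print(translate_dna(normal))
print(translate_dna(sickle))
```

The two translated fragments differ at exactly one position, which is the point of the paragraph: one nucleotide, one residue, and a very different protein behavior.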

    Sickle cell disease: Sickle cells are crescent shaped, while normal cells are disc-shaped.

    Secondary Structure

    A protein’s secondary structure is whatever regular structures arise from interactions between neighboring or nearby amino acids as the polypeptide starts to fold into its functional three-dimensional form. Secondary structures arise as H bonds form between local groups of amino acids in a region of the polypeptide chain. A single secondary structure rarely extends throughout the polypeptide chain; it is usually found in just one section of the chain. The most common forms of secondary structure are the α-helix and β-pleated sheet structures, and they play an important structural role in most globular and fibrous proteins.

    Secondary structure: The α-helix and β-pleated sheet form because of hydrogen bonding between carbonyl and amino groups in the peptide backbone. Certain amino acids have a propensity to form an α-helix, while others have a propensity to form a β-pleated sheet.

    In the α-helix chain, the hydrogen bond forms between the oxygen atom in the polypeptide backbone carbonyl group in one amino acid and the hydrogen atom in the polypeptide backbone amino group of another amino acid that is four amino acids farther along the chain. This holds the stretch of amino acids in a right-handed coil. Every helical turn in an alpha helix has 3.6 amino acid residues. The R groups (the side chains) of the polypeptide protrude out from the α-helix chain and are not involved in the H bonds that maintain the α-helix structure.
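
The two numbers in this paragraph (3.6 residues per turn, and a hydrogen bond from each backbone carbonyl to the amide four residues further along) can be turned into a small Python sketch. It describes an idealized helix only; the function names are illustrative, and real helices have frayed ends where some of these bonds are missing.

```python
RESIDUES_PER_TURN = 3.6  # figure quoted in the text


def helix_turns(n_residues):
    """Number of helical turns spanned by n_residues in an ideal alpha helix."""
    return n_residues / RESIDUES_PER_TURN


def backbone_hbonds(n_residues):
    """List (carbonyl residue, amide residue) pairs, numbered from 1.

    The C=O of residue i bonds to the N-H of residue i + 4, so a helix of
    n residues can hold at most n - 4 such bonds.
    """
    return [(i, i + 4) for i in range(1, n_residues - 3)]


print(round(helix_turns(18), 2), "turns")
print(backbone_hbonds(10))
```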

    In β-pleated sheets, stretches of amino acids (β-strands) are held in an almost fully extended conformation that “pleats” or zig-zags due to the non-linear nature of single C-C and C-N covalent bonds. A single β-strand never occurs alone; it has to be held in place by other β-strands. The strands are held in their pleated sheet structure because hydrogen bonds form between the oxygen atom in a polypeptide backbone carbonyl group of one β-strand and the hydrogen atom in a polypeptide backbone amino group of an adjacent β-strand. The strands that hold each other together align parallel or antiparallel to each other. The R groups of the amino acids in a β-pleated sheet point out perpendicular to the hydrogen bonds holding the strands together, and are not involved in maintaining the β-pleated sheet structure.

    Tertiary Structure

    The tertiary structure of a polypeptide chain is its overall three-dimensional shape, once all the secondary structure elements have folded together. Interactions between polar, nonpolar, acidic, and basic R groups within the polypeptide chain create the complex three-dimensional tertiary structure of a protein. When protein folding takes place in the aqueous environment of the body, the hydrophobic R groups of nonpolar amino acids mostly lie in the interior of the protein, while the hydrophilic R groups lie mostly on the outside. Cysteine side chains form disulfide linkages in the presence of oxygen, the only covalent bonds that form during protein folding. All of these interactions, weak and strong, determine the final three-dimensional shape of the protein. When a protein loses its three-dimensional shape, it may no longer be functional.

    Tertiary structure: The tertiary structure of proteins is determined by hydrophobic interactions, ionic bonding, hydrogen bonding, and disulfide linkages.

    Quaternary Structure

    The quaternary structure of a protein is how its subunits are oriented and arranged with respect to one another. As a result, quaternary structure only applies to multi-subunit proteins, that is, proteins made from more than one polypeptide chain. Proteins made from a single polypeptide will not have a quaternary structure.

    In proteins with more than one subunit, weak interactions between the subunits help to stabilize the overall structure. Enzymes often play key roles in bonding subunits to form the final, functioning protein.

    For example, insulin is a ball-shaped, globular protein that contains both hydrogen bonds and disulfide bonds that hold its two polypeptide chains together. Silk is a fibrous protein that results from hydrogen bonding between different β-pleated chains.

    Four levels of protein structure: The four levels of protein structure can be observed in these illustrations.

    Elongation and Termination in Prokaryotes

    Transcription elongation begins with the release of the polymerase σ subunit and terminates via the rho protein or via a stable hairpin.

    Learning Objectives

    Explain the process of elongation and termination in prokaryotes

    Key Takeaways

    Key Points

    • The transcription elongation phase begins with the dissociation of the σ subunit, which allows the core RNA polymerase enzyme to proceed along the DNA template.
    • Rho-dependent termination is caused by the rho protein colliding with the stalled polymerase at a stretch of G nucleotides on the DNA template near the end of the gene.
    • Rho-independent termination is caused by the polymerase stalling at a stable hairpin formed by a region of complementary C–G nucleotides at the end of the mRNA.

    Key Terms

    • elongation: the addition of nucleotides to the 3′-end of a growing RNA chain during transcription

    Elongation in Prokaryotes

    The transcription elongation phase begins with the release of the σ subunit from the polymerase. The dissociation of σ allows the core RNA polymerase enzyme to proceed along the DNA template, synthesizing mRNA in the 5′ to 3′ direction at a rate of approximately 40 nucleotides per second. As elongation proceeds, the DNA is continuously unwound ahead of the core enzyme and rewound behind it. Since the base pairing between DNA and RNA is not stable enough to maintain the stability of the mRNA synthesis components, RNA polymerase acts as a stable linker between the DNA template and the nascent RNA strands to ensure that elongation is not interrupted prematurely.
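
The directionality and the quoted rate can be made concrete with a short Python sketch. It assumes the template strand is written 3′ to 5′, so that the complementary mRNA comes out written 5′ to 3′; the function names and the example template are illustrative, and the rate is the approximate figure from the text.

```python
COMPLEMENT_RNA = {"A": "U", "T": "A", "G": "C", "C": "G"}
NUCLEOTIDES_PER_SECOND = 40  # approximate rate quoted in the text


def transcribe(template_3_to_5):
    """Build the mRNA (5' to 3') from a DNA template written 3' to 5'."""
    return "".join(COMPLEMENT_RNA[base] for base in template_3_to_5)


def transcription_time(n_nucleotides, rate=NUCLEOTIDES_PER_SECOND):
    """Estimated elongation time in seconds for a transcript of given length."""
    return n_nucleotides / rate


# Made-up template: 3'-TACCGATTT-5' gives mRNA 5'-AUGGCUAAA-3'.
print(transcribe("TACCGATTT"))
# A 1200-nucleotide transcript takes roughly 30 seconds at this rate.
print(transcription_time(1200), "seconds")
```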

    Elongation in prokaryotes: During elongation, the prokaryotic RNA polymerase tracks along the DNA template, synthesizes mRNA in the 5′ to 3′ direction, and unwinds and rewinds the DNA as it is read.

    Termination in Prokaryotes

    Once a gene is transcribed, the prokaryotic polymerase needs to be instructed to dissociate from the DNA template and liberate the newly-made mRNA. Depending on the gene being transcribed, there are two kinds of termination signals: one is protein-based and the other is RNA-based.

    Rho-dependent termination is controlled by the rho protein, which tracks along behind the polymerase on the growing mRNA chain. Near the end of the gene, the polymerase encounters a run of G nucleotides on the DNA template and it stalls. As a result, the rho protein collides with the polymerase. The interaction with rho releases the mRNA from the transcription bubble.

    Rho-independent termination is controlled by specific sequences in the DNA template strand. As the polymerase nears the end of the gene being transcribed, it encounters a region rich in C–G nucleotides. The mRNA folds back on itself, and the complementary C–G nucleotides bind together. The result is a stable hairpin that causes the polymerase to stall as soon as it begins to transcribe a region rich in A–T nucleotides. The complementary U–A region of the mRNA transcript forms only a weak interaction with the template DNA. This, coupled with the stalled polymerase, induces enough instability for the core enzyme to break away and liberate the new mRNA transcript.
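
The signal described here (a C–G-rich stretch that can pair with itself, followed by a run of U's in the transcript) can be searched for with a deliberately simple Python sketch. Everything numeric in it is a hypothetical choice made for illustration: the fixed stem and loop lengths, the G/C threshold, and the minimum number of U's. Real terminators vary in length and tolerate mismatches, so this is not a usable terminator predictor.

```python
RNA_COMPLEMENT = {"A": "U", "U": "A", "G": "C", "C": "G"}


def reverse_complement(rna):
    """Sequence that would base-pair with rna when the strand folds back."""
    return "".join(RNA_COMPLEMENT[base] for base in reversed(rna))


def gc_fraction(rna):
    """Fraction of bases in rna that are G or C."""
    return (rna.count("G") + rna.count("C")) / len(rna)


def looks_like_terminator(rna, stem_len=6, loop_len=4, min_gc=0.6, min_u=4):
    """Scan for stem + loop + complementary stem, then a run of U's.

    All four thresholds are hypothetical illustration values.
    """
    span = 2 * stem_len + loop_len
    for i in range(len(rna) - span - min_u + 1):
        left = rna[i:i + stem_len]
        right = rna[i + stem_len + loop_len:i + span]
        tail = rna[i + span:i + span + min_u]
        if (right == reverse_complement(left)
                and gc_fraction(left) >= min_gc
                and tail == "U" * min_u):
            return True
    return False


# Made-up transcript end: G/C-rich stem, GAAA loop, matching stem, U run.
hairpin_end = "AA" + "GCCGCC" + "GAAA" + "GGCGGC" + "UUUUUU"
print(looks_like_terminator(hairpin_end))
```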

    Upon termination, the process of transcription is complete. By the time termination occurs, the prokaryotic transcript would already have been used to begin synthesis of numerous copies of the encoded protein because these processes can occur concurrently in the cytoplasm. The unification of transcription, translation, and even mRNA degradation is possible because all of these processes occur in the same 5′ to 3′ direction and because there is no membranous compartmentalization in the prokaryotic cell. In contrast, the presence of a nucleus in eukaryotic cells prevents simultaneous transcription and translation.

