The Structure and Mutation of Genes and the Control of Gene Expression
By Live Dr - Sun Jan 11, 12:45 pm
Gene: a basic unit of heredity. Chemically: a specific DNA fragment, which can be duplicated and mutated. Physically: arranged linearly on a chromosome; can be exchanged and transmitted to the next generation. Functionally: controls the expression of specific characteristics of a living organism.
Genome: the total genetic information in a living organism; the total genes in a haploid set of chromosomes.
Human genome: 3.2×10^9 bp, enough to encode 1.5×10^6 proteins.
Structural genes: genes directing the synthesis of proteins. There are only about 20,000 to 25,000 such genes, making up 1.5% to 2% of the genome, normally in non-repetitive or low-repetitive DNA.
1. Gene structure
1.1 Exon and intron
The double-helical structure of DNA serves as the repository for genetic information as well as the basis for DNA replication. These topics are addressed in detail in molecular biology texts and will not be reviewed exhaustively here. Of particular importance in considering genetic contributions to medicine is an appreciation of the structure of individual genes. Genes represent discrete regions of DNA; they may be quite short or may extend over hundreds of kilobases (kb; 1 kb = 1000 base pairs [bp]). Individual regions of genes are defined by specific sequence features. One of the most prominent features of most human genes is the presence of distinct segments, some of them carrying protein-coding information and others separating such coding sequences. The former (coding sequences) are referred to as exons. The latter (noncoding regions between exons) are referred to as introns. Noncoding sequences must be removed to assemble contiguous coding information. This is accomplished by transcribing the entire gene region from DNA into RNA and then subjecting the newly transcribed RNA to a “splicing” process. During splicing, introns are removed and exons are joined. In some complex gene systems, there may be more than one pattern of splicing. Such “alternative splicing” can generate a group of related, but still distinct, proteins from the same initial transcript. In addition to linear arrays of nucleotides, complex folded arrays can form, especially between single strands. Such arrays constitute additional control features and are intrinsic to the base sequence itself. Thus DNA sequence information encodes structural and control information in addition to specifying linear amino acid arrays.
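The removal of introns and joining of exons can be pictured with a small sketch. The transcript, the exon coordinates, and the function name `splice` are all made up for illustration; real splicing is carried out by cellular machinery that recognizes sequence signals, not by fixed coordinates.

```python
def splice(pre_mrna, exons):
    """Join exon segments of a primary transcript, discarding the introns between them.

    `exons` is a list of (start, end) positions, 0-based with `end` exclusive.
    """
    return "".join(pre_mrna[start:end] for start, end in exons)


# Hypothetical primary transcript: exons in upper case, introns in lower case.
pre_mrna = "AUGGCC" + "guaaguuuucag" + "GAUUAC" + "guaugacccuag" + "UGGUAA"
exon_1, exon_2, exon_3 = (0, 6), (18, 24), (36, 42)

# All three exons joined in order.
full_length = splice(pre_mrna, [exon_1, exon_2, exon_3])

# One possible alternative splicing pattern: exon 2 is skipped.
exon_2_skipped = splice(pre_mrna, [exon_1, exon_3])

print(full_length)      # AUGGCCGAUUACUGGUAA
print(exon_2_skipped)   # AUGGCCUGGUAA
```

Both products come from the same initial transcript, which is the sense in which alternative splicing yields related but distinct messages.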
Other critical features of gene structure include regions for controlling the initiation of transcription and signals for beginning and terminating translation. (All genes are transcribed and translated in the same direction [5′→3′].) These signals are necessarily based on nucleotide sequences, many of which must be recognized by proteins (transcription factors, enzymes, etc.). It is important to realize that these nucleotide sequences have important topological features. It is instructive to examine a model of double-helical DNA in order to appreciate that the three-dimensional profile of the edges of the bases in the major groove permits clear distinction among base pairs. If you do this, you will find that the surface of the space occupied by a sequence of nucleotides can provide highly specific interactions with protein transcription factors, permitting strict regulation of gene expression.
There are approximately three billion (3.2×10^9) base pairs in human DNA. Not all regions of DNA are responsible for encoding proteins. We already have considered introns, whose sequences are removed from RNA transcripts prior to protein synthesis. Other regions of the DNA serve large-scale structural functions. These include areas such as telomeres at the ends of chromosomes and other sequences near centromeres that are essential for cell division.
Telomeres, centromeres, and other regions in human DNA are characterized by repeated nucleotide sequences. Some repeats are quite short. The dinucleotide CA is found in stretches of as many as 20 or more repeats in over 50,000 positions throughout cellular DNA. These are usually designated (CA)n, where n is the repeat number. Repeats of longer sequences are also present. Generally, repeated sequences are not associated with protein-coding information, and at least some may be evolutionary vestiges of no particular significance. Other repeated sequences are likely to have important roles in DNA structure, in the packing of DNA in chromosomes, and in recombination and replication. The specific locations of these repeated sequences in the DNA are very important, however. Because repeated sequences often are found within regions of DNA that also contain unique sequences, it is possible to define specific repeated segments on the basis of the unique DNA sequences flanking them. Thus, repeated sequences can be assigned to unique positions in the linear array of human DNA based on their neighbors. As will be discussed below, this is a very important notion in terms of developing molecular markers for the gene map.
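The idea of identifying a (CA)n repeat by its unique neighbors can be sketched as follows. The sequence, the function name `find_ca_repeats`, and the choices of a 3-repeat minimum and 5-base flanks are assumptions made for this example only.

```python
import re


def find_ca_repeats(dna, min_repeats=3, flank=5):
    """Return (left_flank, n, right_flank) for each (CA)n run with n >= min_repeats."""
    dna = dna.upper()
    hits = []
    for match in re.finditer(r"(?:CA){%d,}" % min_repeats, dna):
        n = len(match.group()) // 2  # number of CA units in this run
        left = dna[max(0, match.start() - flank):match.start()]
        right = dna[match.end():match.end() + flank]
        hits.append((left, n, right))
    return hits


# Hypothetical sequence with a (CA)5 run and a (CA)3 run.
dna = "GGATCT" + "CA" * 5 + "TTGCAAGT" + "CA" * 3 + "GG"

for left, n, right in find_ca_repeats(dna):
    print(f"{left} (CA){n} {right}")
```

Two individuals could carry different values of n at the same flanked position, which is what makes such repeats usable as position markers.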
1.2 Gene clusters
Genes sometimes occur in clusters, with genes of similar function located near each other. Many genes are related to one another. They make up so-called gene families. These are groups of genes, often of similar structure and function. Some gene families are grouped as contiguous arrays. By far the best-studied gene clusters are the globin gene clusters.
For example, the globin gene clusters (the β-globin gene cluster and the α-globin gene cluster):
Figure 2.1 shows the human α- and β-globin gene clusters. The β-globin gene cluster on human chromosome 11 and the α-globin gene cluster on human chromosome 16 have been studied in great detail. Located near the β-globin gene are four other functional genes, namely ε, Gγ, Aγ, and δ, which also code for hemoglobin proteins but are expressed at different times during development. Similarly, the α-globin genes (there are actually two, α1 and α2, with identical coding regions) are also part of a gene cluster.
They represent a group of related genes and include globin genes that are transcribed and translated at different times in development. In general, adults express only adult globin genes, but fetal globin genes remain present, although unexpressed, in adults. Figure 2.1 presents an outline of the structures of the globin gene loci.
The ζ gene is an embryonic gene, and the function of the gene marked θ1 remains to be determined. In the β-globin cluster, an additional gene marked Ψβ1 is shown, and in the α-globin cluster, a Ψζ, a Ψα1, and a Ψα2 gene are indicated.
As is evident for the globin gene cluster shown in Figure 5.2, there are often relatively long stretches of DNA between transcribed genes. In the 80 kb (1 kb = 1000 bp) of DNA analyzed around the β-globin genes, only about 12% is actually transcribed and only 2.5% codes for protein. The nontranscribed DNA is termed “intergenic” DNA. Some of the sequences in intergenic DNA close to expressed genes are crucial for control of gene expression (as we shall see shortly), but a large amount of intergenic DNA seems to be rather dispensable and has no known function. Similarly, many genes contain very large introns, much of which also appears to be dispensable. For example, introns account for more than 99% of the 2400 kb dystrophin gene. Located within intergenic DNA, and sometimes also within introns, are repetitive sequences that occur dispersed throughout the genome in many thousands of copies and have no known function. The most ubiquitous of these, the so-called Alu repetitive sequence, is about 300 bp in length and occurs approximately 500,000 times in the human genome. Since their dispersal into the genome millions of years ago, the Alu sequences have diverged, so that one Alu repeat is about 80% identical to another. It is unusual to find a stretch of DNA longer than about 30 kb that does not contain at least one of these sequences.
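To give a feel for the proportions quoted above, the following sketch simply works through the arithmetic. It uses only the figures given in the text (80 kb, 12%, 2.5%, 2400 kb, 300 bp, 500,000 copies, and a 3.2×10^9 bp genome); the variable names are arbitrary, and the results are only as precise as those rounded figures.

```python
# Region analyzed around the beta-globin genes.
region_bp = 80_000
transcribed_bp = region_bp * 12 // 100    # about 12% is transcribed
coding_bp = region_bp * 25 // 1000        # about 2.5% codes for protein

# Dystrophin gene: introns account for more than 99% of 2400 kb,
# so the remaining (exonic) part is less than 1% of the gene.
dystrophin_bp = 2_400_000
max_exonic_bp = dystrophin_bp // 100

# Alu repeats: about 300 bp each, about 500,000 copies.
genome_bp = 3_200_000_000
alu_copies = 500_000
alu_total_bp = 300 * alu_copies
alu_fraction = alu_total_bp / genome_bp
mean_alu_spacing_bp = genome_bp // alu_copies

print(transcribed_bp, coding_bp)        # 9600 2000
print(max_exonic_bp)                    # 24000
print(alu_total_bp, alu_fraction)       # 150000000 0.046875
print(mean_alu_spacing_bp)              # 6400
```

On these figures, an average spacing of roughly 6.4 kb between Alu copies is consistent with the remark that a 30 kb stretch without one is unusual.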
These so-called pseudogenes, designated by Ψ, are DNA sequences that have some of the structures of expressed genes and were presumably once functional but have acquired one or more mutations during evolution that render them incapable of producing a protein product.
Also present within many but not all gene families are “pseudogenes.” Such sequences are no longer functional and cannot make proteins. Presumably, they have arisen as evolutionary derivatives of their parent functional genes but cannot themselves be transcribed or successfully translated. Sometimes, as in the globin gene clusters (see Figure 2.2), the pseudogenes are present in the same region as their parent. Other pseudogenes may be dispersed in unrelated areas of human DNA. Pseudogenes are examples of historical genetic remodeling and recombination events but appear to have no functional significance in themselves. They may be very similar to their parent genes, however, so that considerable study may be required to establish that they are, in fact, incapable of being expressed.
Although the α- and β-globin gene families are distinct, they also share many structural features. Together, they can be considered a “superfamily.” Similar relationships exist among members of the immunoglobulin superfamily. In the latter case, not all related genes are necessarily physically close or even on the same chromosome.
2. Control of Gene Expression
With a few notable exceptions, all of the cells of the human body contain the complete genome. Yet, in any given tissue only a subset of these genes are being expressed. Therefore, the control of gene expression is fundamental to understanding virtually all aspects of human biology.
In general, it is the mature protein product of a gene that carries out its function. The level of this mature protein can be altered by ① the rate of transcription of the gene into RNA; ② the processing of this RNA; ③ the transport of the mRNA from nucleus to cytoplasm; ④ the rate of translation of the mRNA into protein on cellular ribosomes; ⑤ the rate of degradation of the mRNA; ⑥ post-translational modifications of the protein; and ⑦ the rate of degradation of the protein. All of these control mechanisms have been implicated in specific instances. Perhaps the most economical method of control, however, and one that is widespread in eukaryotes, is to control the protein production at its earliest level, namely that of transcription of the gene. Figure 5.1 shows a schematic diagram of the control elements of an idealized human gene. The important sequence elements have been identified by a variety of methods, including mutational analysis, evolutionary comparison, and functional assays using gene transfer into cultured cells or transgenic mice.
2.1 The promoter
The promoter is somewhat loosely defined as the sequence elements located immediately 5′ to the gene that interact with RNA polymerase and other components of the transcription machinery. These elements fix the site of transcription initiation and control mRNA quantity and sometimes tissue specificity. While in some situations the promoter may extend for several kilobases, the important promoter elements are generally located in the region 100-200 bp 5′ to the gene.
Many human genes contain a conserved “TATA box” sequence, which is located 25-30 bp 5′ to the start of transcription and seems to be involved in the precise localization of the start. Further upstream, there is often a “CCAAT box” sequence located 75-80 bp 5′ to the start site, although this is less commonly present than the TATA box. In those genes with a CCAAT box, its presence seems to be required for quantitatively efficient transcription, at least in gene transfer experiments. Notably, some “housekeeping” genes, which encode enzymes that are present in virtually all cells, usually lack both of these boxes and contain promoters that are highly enriched in C and G nucleotides. The start site of transcription in genes lacking a TATA box often shows heterogeneity within a 10-20 bp region. A particular modified nucleotide, 7-methylguanosine, called a “cap,” is added to the 5′ end of the growing mRNA chain. Thus, the site of initiation of transcription is also often called the “cap site.”
As noted previously, most eukaryotic genes have their coding regions interrupted by introns, which must be removed in a process called splicing to generate a mature mRNA that can be translated into a functional protein. While the function of introns remains unclear, the mechanism of splicing is beginning to be understood. Certain nucleotide sequences are found at the beginning and end of an intron. The intron almost always begins with a GT (the splice donor) and ends with an AG (the splice acceptor), and other adjacent bases tend to follow a certain sequence (referred to as a consensus). However, these consensus sequences, while necessary, are not entirely sufficient for recognition by the splicing apparatus; one can find consensus splice donor or acceptor sequences in transcribed genes that are not used. Interestingly, inactivation of the normal splice signal by mutation occasionally activates one of these “cryptic” splice signals.
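The point that the GT...AG rule is necessary but not sufficient can be illustrated with a short sketch. The sequence and the helper names `follows_gt_ag_rule` and `dinucleotide_positions` are invented for this example, and only the two boundary dinucleotides are checked, not the fuller consensus around them.

```python
def follows_gt_ag_rule(intron):
    """True if the intron starts with GT (donor) and ends with AG (acceptor)."""
    intron = intron.upper()
    return len(intron) >= 4 and intron.startswith("GT") and intron.endswith("AG")


def dinucleotide_positions(seq, dinucleotide):
    """Return every 0-based position at which `dinucleotide` occurs in `seq`."""
    seq = seq.upper()
    return [i for i in range(len(seq) - 1) if seq[i:i + 2] == dinucleotide]


# Hypothetical gene fragment: exon, intron, exon (DNA sense strand).
exon_1, intron, exon_2 = "ATGGCC", "GTAAGTCTTTCAG", "GATTAC"
gene = exon_1 + intron + exon_2

print(follows_gt_ag_rule(intron))            # True
print(dinucleotide_positions(gene, "GT"))    # [6, 10]
print(dinucleotide_positions(gene, "AG"))    # [9, 17]
```

Here the donor actually used is the GT at position 6 and the acceptor is the AG at position 17, yet a second GT and a second AG also occur in the fragment. Matches of this kind are the raw material for the unused and "cryptic" sites described in the text.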
The mechanism by which a particular splice donor “finds” the correct acceptor remains unclear. A 5′ to 3′ scanning model would be one possibility, but is not consistent with the pattern of splicing seen in the presence of certain splice acceptor mutations. A random search mechanism, however, is not tenable, given the fact that some genes such as collagen contain up to 50 separate introns and yet always connect the correct donor to the correct acceptor.
Most messenger RNAs that code for protein are characterized by the addition of a string of about 200 adenosine residues at their 3′ end (polyadenylation). A hexanucleotide signal, AAUAAA, in the 3′-untranslated region is a consistent feature of such mRNAs, although other sequences in the vicinity also may play a role in correct polyadenylation. The A residues are added at a point 18-20 bp downstream from this AAUAAA signal. The “poly-A tail” appears to play a role in transport out of the nucleus and the regulation of mRNA stability.
Enhancers are DNA sequences defined by the following properties: (1) they increase transcription from a nearby gene; (2) they can operate over considerable distances and are relatively unaffected by altering this distance; and (3) they are effective even if inverted. The first enhancers characterized were those of certain DNA viruses such as SV40, which bears a 72-bp twice-repeated sequence meeting these criteria, capable of increasing transcription from a large number of genes in almost any tissue tested. More recently, tissue-specific enhancers have been discovered. An example of the latter is the enhancer located in the immunoglobulin gene, which has been shown to be functional in B lymphocytes (which synthesize immunoglobulin) but not in other tissue types.
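As a rough illustration of the polyadenylation signal, the sketch below locates AAUAAA in a transcript, cuts at a fixed distance downstream, and appends a tail of A residues. The function name `polyadenylate`, the transcript, the choice of 19 nucleotides (within the 18-20 range quoted above), and measuring that distance from the end of the hexanucleotide are all assumptions for the example. The other nearby sequences that influence the real process are ignored.

```python
POLY_A_SIGNAL = "AAUAAA"


def polyadenylate(transcript, offset=19, tail_length=200):
    """Cut `offset` nucleotides downstream of the first AAUAAA and add a poly-A tail.

    Returns None if the signal is absent or the transcript is too short to cut.
    """
    signal_start = transcript.find(POLY_A_SIGNAL)
    if signal_start == -1:
        return None
    cut_site = signal_start + len(POLY_A_SIGNAL) + offset
    if cut_site > len(transcript):
        return None
    return transcript[:cut_site] + "A" * tail_length


# Hypothetical 3' end of a transcript.
transcript = "GGCUAAUAAA" + "C" * 30
mature = polyadenylate(transcript)

print(len(transcript), len(mature))   # 40 229
```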
3. Gene Mutation
Mutations represent differences in DNA organization or sequence in an individual with respect to some standard sequence. Many differences lead to observable amino acid changes in proteins, but some do not. Nevertheless, identifying different mutations has proved useful both for diagnostic studies and for determining the locations of genes and their alterations.
3.1 Base substitution
The simplest mutations represent local DNA base changes. These can include the substitution of one purine for another (A for G or G for A) or one pyrimidine for another (T for C or C for T); these are called transitions. Alternatively, mutations may exchange a pyrimidine for a purine or vice versa (C for A, T for G, etc.); these are called transversions.
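The distinction between the two classes of base substitution can be written down as a small function. This is a minimal sketch; the name `classify_substitution` is chosen here for illustration.

```python
PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}


def classify_substitution(ref, alt):
    """Classify a single-base DNA substitution as a transition or a transversion."""
    ref, alt = ref.upper(), alt.upper()
    if ref == alt:
        raise ValueError("reference and altered base are identical")
    if not {ref, alt} <= PURINES | PYRIMIDINES:
        raise ValueError("bases must be A, C, G, or T")
    # Purine <-> purine or pyrimidine <-> pyrimidine is a transition;
    # any exchange between the two classes is a transversion.
    same_class = {ref, alt} <= PURINES or {ref, alt} <= PYRIMIDINES
    return "transition" if same_class else "transversion"


print(classify_substitution("A", "G"))   # transition
print(classify_substitution("C", "A"))   # transversion
```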
Such changes may lead to a change in the protein derived from that DNA sequence because one amino acid’s codon is substituted for another’s. Sickle cell anemia (OMIM #141900) is an example of a single base change that causes a single amino acid change. On the other hand, much of the coding information in DNA is said to be redundant, because there often are multiple triplet codes for the same amino acid (see Table 2.1). This situation can permit a base change to be invisible at the level of the protein because the “mutation” led to another codon for the same amino acid. Another consequence of base changes can be formation of a triplet that signals a stop to protein synthesis (see Table 2.1). Such a stop or premature termination results in a shorter protein, very frequently with aberrant properties. Still another result can be the substitution of a similar amino acid (e.g., alanine for valine). As described earlier, many such minor variations cause no meaningful changes in the protein or problems for the organism and are sometimes referred to as “conservative” changes.
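The three outcomes described above (a different amino acid, the same amino acid, or a stop signal) can be sketched using the standard genetic code. The function name `classify_codon_change` is invented here. The GAG to GTG example is the change usually described for sickle cell anemia, but it is not taken from this text. The sketch does not attempt to judge whether a substitution is "conservative."

```python
# Standard genetic code, written compactly: codons are ordered with
# T, C, A, G at each of the three positions; "*" marks a stop codon.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    a + b + c: AMINO_ACIDS[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}


def classify_codon_change(ref_codon, alt_codon):
    """Return (kind, ref_amino_acid, alt_amino_acid) for a change within one codon.

    Assumes the reference codon is not itself a stop codon.
    """
    ref_aa = CODON_TABLE[ref_codon.upper()]
    alt_aa = CODON_TABLE[alt_codon.upper()]
    if alt_aa == ref_aa:
        kind = "silent"       # redundant code: same amino acid
    elif alt_aa == "*":
        kind = "nonsense"     # premature stop
    else:
        kind = "missense"     # different amino acid
    return kind, ref_aa, alt_aa


print(classify_codon_change("GAG", "GTG"))   # ('missense', 'E', 'V')
print(classify_codon_change("GAG", "GAA"))   # ('silent', 'E', 'E')
print(classify_codon_change("GAG", "TAG"))   # ('nonsense', 'E', '*')
```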
3.2 Frame shift mutation
Another possible DNA alteration is the loss of one or more bases. Deletions can cause serious problems. The loss of three contiguous bases can either lead to the loss of a single amino acid codon (as occurs in the most common mutation for cystic fibrosis [OMIM #219700]) or affect two contiguous codons. Nevertheless, with the loss of three bases in a row (or any multiple of three), the reading frame of the gene remains intact. The loss of any other number of bases destroys the triplet reading frame and leads to complete aberrancy in the protein produced. Deletions also can occur on a larger scale, such that entire DNA regions can be lost. In some situations these losses are large enough to be seen as chromosomal changes; in other situations they are recognizable only with DNA studies. Large deletions generally have significant biological effects.
The opposite of a deletion is an insertion. In this case, one or more bases are added to the DNA strand. The consequences are predictable based on the same reasoning presented above for deletions. Current sequence studies indicate that the rate of local differences in DNA sequence (polymorphisms) between individuals is about one variation in 500 bases; about 15% of these variations are insertions or deletions.
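The effect of a deletion on the reading frame can be seen by splitting a sequence into triplets before and after the loss. The sequence and the helper names `codons` and `delete` below are made up for illustration.

```python
def codons(seq):
    """Split a sequence into complete triplets, dropping any leftover bases."""
    usable = len(seq) - len(seq) % 3
    return [seq[i:i + 3] for i in range(0, usable, 3)]


def delete(seq, start, length):
    """Remove `length` bases beginning at 0-based position `start`."""
    return seq[:start] + seq[start + length:]


gene = "ATGGCCGATTACTGG"   # hypothetical coding sequence

print(codons(gene))                  # ['ATG', 'GCC', 'GAT', 'TAC', 'TGG']
print(codons(delete(gene, 3, 3)))    # ['ATG', 'GAT', 'TAC', 'TGG']
print(codons(delete(gene, 5, 3)))    # ['ATG', 'GCT', 'TAC', 'TGG']
print(codons(delete(gene, 3, 1)))    # ['ATG', 'CCG', 'ATT', 'ACT']
```

The first deletion removes exactly one codon. The second removes three bases that straddle two codons, leaving one altered codon. In both cases the downstream triplets are unchanged. The single-base deletion shifts every triplet after it.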
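As a back-of-the-envelope check on what these rates imply, the sketch below combines them with the genome size of 3.2×10^9 bp given earlier. It treats the rounded figures as exact and the variable names are arbitrary, so the results are order-of-magnitude estimates only.

```python
genome_bp = 3_200_000_000        # genome size quoted earlier in the text
bases_per_variation = 500        # about one variation per 500 bases

expected_variations = genome_bp // bases_per_variation
expected_indels = expected_variations * 15 // 100   # about 15% are insertions or deletions

print(expected_variations)   # 6400000
print(expected_indels)       # 960000
```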
Local regions of DNA also can be duplicated. This may occur as part of the replication process for DNA or may be a result of errors in recombination. Duplications may add a region of amino acid sequence to a protein or may cause changes in the reading frame and/or a termination, as discussed above.
The division of most mammalian genes into introns and exons means that splicing is required to achieve a usable transcript, as discussed earlier. Errors in splicing have important effects on gene and protein structure. Because intron sequences are not coding sequences, their retention in messenger RNA causes an aberrant gene product. The signals for splicing are found in the base sequences at the junctions between introns and exons. Thus, mutations in these regions can cause absent or incorrect splicing, possibly with retention of the intron. Another change can involve the mutation of a base sequence distant from the normal splicing site into a new splicing site; this also causes large problems for the fidelity of gene transcription by changing the sequences in the final spliced product.
The movement of a small or large piece of DNA from one position to another, a process referred to as transposition, is another source of DNA variation. In bacteria and many less complicated organisms, transposition of DNA sequences from one position to another occurs relatively readily. Although this is less frequent in mammals, it has been recognized as a source of mutation in humans. The movement of DNA sequences can have important effects on the sequence that is moved. It may have lost its appropriate controlling elements or may be only a fragment of the mature gene. Such a movement also may affect the region into which the sequence is moved: the movement of a stretch of foreign DNA into a structural gene may disrupt that gene by inserting anomalous information or introducing inappropriate control sequences.
3.4 Dynamic mutation
A dynamic mutation is an increase in the copy number of a repeated nucleotide sequence in DNA, with the degree of amplification varying from case to case.
As mentioned earlier, having repeated DNA base sequences establishes a propensity for variation in their number. This now has been documented in several important instances and is being recognized more frequently. The process of amplification (which may be considered an extension of the process of duplication) can result in very large increases in the number of small repeated regions of DNA. Two types of sequences are particularly recognized as subject to amplification. As described earlier, the so-called dinucleotide repeat is typified by (CA)n repeats.
With 50,000 of these in noncoding regions of the human genome, individual variations in their length turn out to be useful position markers for mapping but do not generally cause diseases because they are not in coding regions.
In contrast, the category of “triplet repeat” disease is medically significant. In these conditions, amplification of adjacent groups of three base pairs may lead to aberrant gene products, aberrant gene control, or other pathological effects. With triplet repeats, the problems usually develop from an increase of the repeat number over a baseline (or threshold) level. Most clinically unaffected individuals have a relatively low number of repeats. Rarely, individuals have a higher repeat number, creating an unstable situation. Beyond this unstable intermediate level, further amplification leads to overt disease; thus, the intermediate level can be considered a “pre-mutation.” This is an important category of genetic illness and will be considered in more detail below. Triplet repeat disorders frequently have neurological manifestations.
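The threshold idea can be sketched as counting the longest run of a triplet and comparing it with two cut-offs. The helper names, the CAG unit, and especially the threshold values 30 and 40 are placeholders invented for this example. Actual thresholds differ from one disorder to another and are not given in this text.

```python
import re


def longest_repeat_run(dna, unit="CAG"):
    """Return the largest number of consecutive copies of `unit` found in `dna`."""
    runs = re.findall(r"(?:%s)+" % re.escape(unit), dna.upper())
    return max((len(run) // len(unit) for run in runs), default=0)


def classify_allele(repeat_number, premutation_min, disease_min):
    """Place a repeat number in one of three ranges using caller-supplied thresholds."""
    if repeat_number >= disease_min:
        return "full expansion"
    if repeat_number >= premutation_min:
        return "pre-mutation"
    return "normal range"


# Hypothetical allele with 12 CAG repeats; thresholds are illustrative only.
allele = "GGC" + "CAG" * 12 + "TTA"
n = longest_repeat_run(allele)

print(n, classify_allele(n, premutation_min=30, disease_min=40))   # 12 normal range
```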
The general properties of dynamic mutation include:
⑴ Mutation manifests as a change (usually an increase) in repeat copy number, with the mutation rate related to the initial copy number of the repeat.
⑵ Rare “founder” events (such as loss of a repeat interruption) lead to alleles with an increased likelihood of undergoing changes in repeat copy number.
⑶ The diseases caused by repeat expansion exhibit a relationship between the copy number of the repeat and the severity and/or age-at-onset of symptoms. These properties together account for anticipation, the increasing severity/incidence and/or decreasing age-at-onset in successive generations within an affected family.
What kinds of repeat sequences undergo dynamic mutation?
The first expanded repeats to be identified were:
⑴ The trinucleotides CCG/CGG and CAG/CTG. Initially this was taken as evidence that only trinucleotide repeats could undergo this form of mutation (which was sometimes referred to as trinucleotide repeat expansion).
⑵ Repeats that could form secondary structures [e.g., (CA)n]; it was initially assumed that the ability to form such structures was also a necessary condition for repeat instability.
⑶ 5 and 6 bp microsatellite repeats.
So far dynamic mutations have been detected in diseases (and at fragile site loci) with very high penetrance.
4. Genomic Imprinting
Genomic imprinting is the phenomenon of parent-of-origin gene expression: the expression of a gene depends upon the parent who passed on the gene. The allele from one parent is expressed and the allele from the other parent is not.
Genomic imprinting is contrary to Mendelian principles of inheritance. Genomic imprinting is the differential expression of genetic material depending on whether it was inherited from the male or female parent.
When to suspect genomic imprinting: Evidence of genomic imprinting comes from examining the pedigree. If a disorder is expressed only when inherited from the male parent, or only when inherited from the female parent, genomic imprinting should be suspected.
Genomic imprinting plays a critical role in fetal growth and development. Imprinting is regulated by DNA methylation and chromatin structure.
For instance, two different disorders, Prader-Willi syndrome (PWS) and Angelman syndrome (AS), are due to deletion of the same part of chromosome 15. When the deletion involves the chromosome 15 that came from the father, the child has Prader-Willi syndrome, but when the deletion involves the chromosome 15 that came from the mother, the child has Angelman syndrome.
PWS and AS are unusual, frightening disorders. Deletions, random loss of chromosomes, and nondisjunction play major roles in determining their phenotypes, and are helpful in diagnosis and genetic counseling. Imprinted genes, which show allele-specific differences in transcription and methylation, can be altered by deletions and uniparental disomy, leading to PWS and AS. These and other disorders can benefit from extensive studies into genomic imprinting.
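The parent-of-origin rule just described amounts to a simple lookup. The function name `syndrome_from_deletion` is made up, and the sketch covers only the deletion case stated above, not other mechanisms.

```python
# Outcome of a deletion in the relevant region of chromosome 15,
# keyed by the parent from whom the deleted chromosome was inherited.
DELETION_OUTCOME = {
    "paternal": "Prader-Willi syndrome",
    "maternal": "Angelman syndrome",
}


def syndrome_from_deletion(parental_origin):
    """Return the syndrome expected when the deleted chromosome 15 has this origin."""
    return DELETION_OUTCOME[parental_origin.lower()]


print(syndrome_from_deletion("paternal"))   # Prader-Willi syndrome
print(syndrome_from_deletion("maternal"))   # Angelman syndrome
```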