Genomic analysis of two bacterial pathogens

Mycoplasma genitalium

  1. Bacterium
  2. Phylogenetically among Gram positive bacteria
    1. However lacks a cell wall, so does not stain with Gram stain
    2. Class Mollicutes
    3. low G+C group
  3. Parasite
    1. Strain studied was isolated from a patient with non-gonococcal urethritis
  4. Genome
    1. Sequenced at TIGR in the mid-1990s
    2. Among the first complete genomes sequenced
    3. Very small, thought to be reduced in size from a larger ancestral genome
    4. 580 kb
    5. Roughly 470 open reading frames
    6. Seen as model for a minimal functional gene set

Minimal functional gene set

  1. One key question in genomics is the search for the minimal gene set
  2. What is the minimal complement of genes necessary to make a living organism?
  3. This is a complex question, and depends upon assumptions of what functions define a living organism

Sequencing strategy

  1. Construct two random libraries from DNA sheared to appropriate size
    1. Large -- 15-20 kb
    2. Small -- ca 2 kb
  2. Plate the libraries and verify their quality
  3. Sequence
    1. High throughput DNA sequencing using dye primer chemistry
    2. Sequence each clone from both ends, without attempting to sequence the whole clone
    3. 9846 sequencing reactions run in 8 weeks by 5 people running 8 ABI373 sequencers
    4. Got 8472 high quality sequences, combined these with 299 random marker sequences from the literature
    5. Mean sequence coverage was 6.5x
    6. 99% of the genome was sequenced with better than single-stranded coverage
  4. Assemble
    1. TIGR ASSEMBLER generated 39 contigs
    2. Ranged from 606-73,351 bp
    3. A total of 3,806,280 bp of primary DNA sequence data
    4. ASM_ALIGN links contigs on basis of orientation of reads from each end of a single clone
    5. All 39 gaps were covered by at least one clone from the small-size library
    6. Verification of assembly
      1. Checked location of marker sequences on known map
      2. Used GRASTA to look for small overlaps that would have been missed by TIGR ASSEMBLER
      3. This reduced gaps to 28
  5. Close gaps
    1. Physical Gaps
      1. In this case no physical gaps were present
    2. Sequence Gaps
      1. Selected clones that spanned gaps, and selectively sequenced these clones
  6. Edit
    1. Manually inspected sequencer output traces in alignment for ambiguities that could be resolved
    2. 53 ambiguities and 25 possible frameshifts were found
    3. These regions were re-sequenced with dye terminator chemistry
  7. Annotate
    1. Organization
      1. Circular chromosome of 580,070 bp
      2. G+C content 32% overall, with lower G+C regions flanking the origin of replication
        1. Ribosomal RNA and tRNA genes had higher G+C content, presumably because of functional constraints
      3. 74 EcoRI fragments
        1. Generally consistent with map; one discrepancy was resolved in favor of the sequence
      4. Precise origin of replication was not identified, but a 4 kb region probably containing it was identified
        1. This lies between dnaA and dnaN
        2. An untranscribed region between these was selected as the origin for numbering.
      5. There is a polarity to transcription: genes to the right are preferentially transcribed on the plus strand, and those to the left on the minus strand, with the distinction extending roughly half way around the chromosome
    2. Predicted Coding Regions
      1. Initial search for ORFs larger than 100 bp
        1. Translations were made assuming UGA encodes tryptophan (as is known to be the case in some Mycoplasmas)
      2. Predicted proteins were searched against a non-redundant bacterial protein database (NRBP)
      3. Sequences that were similar to a protein in NRBP were assigned that name
      4. Matches were aligned with PRAZE (a modified Smith-Waterman algorithm)
      5. GeneMark was trained with 308 M. genitalium sequences and used to evaluate 170 unidentified ORFs
      6. Peptide sequences from other genomes were also used to search all six reading-frames of the genome
    3. Sequence similarity searches support close relationship between Mycoplasma spp. and gram positive bacteria
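
The effect of reading UGA as tryptophan during the initial ORF search can be illustrated with a minimal sketch. It is not the TIGR pipeline: the function names, the ATG-only start rule, and the stop-inclusive coordinates are all assumptions made for illustration. The only points taken from the outline are the 100 bp minimum length, the six-frame scan, and the treatment of TGA (UGA) as a sense codon.

```python
# Stop codons when TGA (UGA) is read as tryptophan, as assumed for Mycoplasma.
MYCOPLASMA_STOPS = {"TAA", "TAG"}
# Stop codons under the standard genetic code, for comparison.
STANDARD_STOPS = {"TAA", "TAG", "TGA"}
START = "ATG"  # assumption: only ATG is accepted as a start codon

_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the reverse complement of an upper-case DNA string."""
    return seq.translate(_COMPLEMENT)[::-1]


def find_orfs(seq, stops=MYCOPLASMA_STOPS, min_len=100):
    """Scan all six reading frames for ATG...stop ORFs of at least min_len bp.

    Returns (strand, frame, start, end) tuples. Coordinates are 0-based,
    half-open, measured on the scanned strand, and include the stop codon.
    """
    seq = seq.upper()
    orfs = []
    for strand, s in (("+", seq), ("-", reverse_complement(seq))):
        for frame in range(3):
            orf_start = None
            for i in range(frame, len(s) - 2, 3):
                codon = s[i:i + 3]
                if orf_start is None:
                    if codon == START:
                        orf_start = i  # open a candidate ORF
                elif codon in stops:
                    if i + 3 - orf_start >= min_len:
                        orfs.append((strand, frame, orf_start, i + 3))
                    orf_start = None  # close it and look for the next start
    return orfs


# Toy sequence (hypothetical): an in-frame TGA sits in the middle of the ORF.
toy = "ATG" + "GCT" * 20 + "TGA" + "GCT" * 20 + "TAA"
print(find_orfs(toy))                        # TGA read through as Trp
print(find_orfs(toy, stops=STANDARD_STOPS))  # TGA treated as a stop
```

With the Mycoplasma stop set the toy sequence yields a single 129 bp ORF. With the standard stop set the ORF ends at the internal TGA, falls below the 100 bp cutoff, and is not reported, which is why the choice of genetic code matters for gene prediction in this genome.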

The genome of Mycoplasma genitalium compared with that of Haemophilus influenzae

  1. At the time the Mycoplasma genitalium sequence was determined, very few complete sequences were available, but that of Haemophilus influenzae had already been determined.
  2. Both are human pathogens. H. influenzae causes a form of meningitis.
  3. The two bacteria are not closely related
    1. Mycoplasma genitalium is in a group of obligate parasites embedded within the gram positive bacteria
    2. Haemophilus influenzae is a gamma proteobacterium (i.e., relatively closely related to Escherichia coli)
  4. M. genitalium has a more highly reduced genome
                   Size            Complexity
  M. genitalium    580,070 bp      470 predicted coding regions
  H. influenzae    1,830,137 bp    1743 predicted coding regions
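
The size and complexity figures above can be turned into a rough gene density with a few lines of arithmetic. This is a back-of-envelope sketch: the helper name bp_per_gene is hypothetical, and the calculation ignores gene lengths, RNA genes, and non-coding sequence.

```python
# Genome size (bp) and number of predicted coding regions, from the table above.
genomes = {
    "M. genitalium": (580_070, 470),
    "H. influenzae": (1_830_137, 1_743),
}


def bp_per_gene(size_bp, n_genes):
    """Average number of base pairs of genome per predicted coding region."""
    return size_bp / n_genes


for name, (size_bp, n_genes) in genomes.items():
    print(f"{name}: {bp_per_gene(size_bp, n_genes):.0f} bp per predicted coding region")

# Ratios of H. influenzae to M. genitalium for size and for gene count.
size_ratio = genomes["H. influenzae"][0] / genomes["M. genitalium"][0]
gene_ratio = genomes["H. influenzae"][1] / genomes["M. genitalium"][1]
print(f"size ratio {size_ratio:.2f}, gene-count ratio {gene_ratio:.2f}")
```

Both genomes come out at roughly one predicted coding region per 1.0 to 1.2 kb, so the smaller M. genitalium genome reflects fewer genes rather than more tightly packed ones.
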
  1. M. genitalium is missing many metabolic pathways necessary for independent growth.
    1. Biosynthesis of amino acids, cofactors, and cell wall are all reduced, and there are relatively few regulatory factors.
    2. The category of "unassigned" genes is also greatly reduced.
    3. Thus a relatively large fraction of its genome is devoted to DNA replication and gene expression.
    4. 90 genes not found in H. influenzae; most of these (60%) resemble genes found in other gram positive bacteria
    5. Relatively few unique genes in M. genitalium -- this is a function of the reduced genome
  2. Both have mechanisms to promote antigenic variation
    1. M. genitalium
      1. Stalked bacterium with an adhesin protein (MgPa) on the tip
      2. The adhesin elicits a strong immune response
      3. This protein is encoded in an operon along with two other open reading frames
      4. The arrangement is 29 kDa ORF -- 6 nt spacer -- MgPa (160 kDa) -- 1 nt spacer -- 114 kDa ORF
      5. Several copies of the MgPa gene were known to be scattered around the genome
      6. The complete genome included the complete MgPa operon and nine partial repeats, constituting 4.7% of the entire genome
      7. Sequence identity of the repeats to the intact MgPa operon ranges from 78 - 90%
      8. Recombination is thought to occur among the members of the gene family, resulting in increased antigenic variation
    2. H. influenzae
      1. Antigenic response is primarily to adhesins and lipo-oligosaccharides
      2. A key antigenic locus (lic-1) had been identified, which contains 4 genes
      3. The first gene in the lic-1 operon contains tandem tetramer repeats (CAAT)
      4. The number of these repeats varies, which shifts the genes in or out of frame
      5. This constitutes a translational switch
      6. Promotes antigenic variation
      7. Determination of the H. influenzae genome made it practical to identify all such repeats in the genome
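
The translational switch described for the CAAT repeats follows from simple arithmetic: each repeat adds 4 nt, so the reading frame of everything downstream shifts by one position per repeat and returns to the same frame every third repeat. The sketch below is a toy model, not a description of the actual lic-1 sequence. The function names are hypothetical, the start codon is assumed to lie immediately upstream of the repeats, and which frame offset corresponds to expression is an assumption (offset 0 here).

```python
REPEAT = "CAAT"  # tetramer repeat named in the outline


def downstream_frame(n_repeats, repeat=REPEAT):
    """Frame offset (0, 1 or 2) of the sequence following n tandem repeats,
    relative to a start codon assumed to sit immediately upstream of them."""
    return (n_repeats * len(repeat)) % 3


def is_in_frame(n_repeats, in_frame_offset=0):
    """True if the downstream coding sequence is read in the expressed frame.

    in_frame_offset is an assumption: the outline does not say which repeat
    counts put the gene in frame.
    """
    return downstream_frame(n_repeats) == in_frame_offset


# Adding or losing a single repeat flips the switch; three repeats restore it.
for n in range(3, 9):
    state = "in frame" if is_in_frame(n) else "out of frame"
    print(f"{n} repeats: offset {downstream_frame(n)} -> {state}")
```

Because 4 and 3 share no common factor, a change of one or two repeats always moves the downstream gene out of its current frame, which is what allows slippage in repeat number to act as an on/off switch.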

Fraser, C.M. (and 29 others). 1995. The Minimal Gene Complement of Mycoplasma genitalium. Science 270:397-403.

Weiser, J.N. et al. 1989. The molecular mechanism of phase variation of H. influenzae lipopolysaccharide. Cell 59:657-665.