Genomic analysis of two bacterial pathogens
Mycoplasma genitalium
  - Bacterium
  - Phylogenetically among the Gram-positive bacteria
    - However, it lacks a cell wall, so it does not stain with Gram stain
    - Class Mollicutes
    - Low G+C group
  - Parasite
    - The strain studied was isolated from a patient with non-gonococcal urethritis
  - Genome
    - Sequenced at The Institute for Genomic Research (TIGR) in the mid-1990s
    - Among the first complete genomes sequenced
    - Very small, thought to be reduced in size from a larger ancestral genome
    - 580 kb
    - Roughly 470 open reading frames
    - Seen as a model for a minimal functional gene set
 
Minimal functional gene set
  - One key question in genomics is the search for the minimal gene set
  - What is the minimal complement of genes necessary to make a living organism?
  - This is a complex question, and the answer depends on assumptions about which functions define a living organism
Sequencing strategy
  - Construct two random libraries from DNA sheared to appropriate size
    - Large -- 15-20 kb
    - Small -- ca. 2 kb
  - Plate each library and verify its quality
  - Sequence
    - High-throughput DNA sequencing using dye-primer chemistry
    - Sequence each clone from both ends, without attempting to sequence the whole clone
    - 9846 sequencing reactions were run in 8 weeks by 5 people running 8 ABI 373 sequencers
    - This yielded 8472 high-quality sequences, which were combined with 299 random marker sequences from the literature
    - Mean sequence coverage was 6.5x
    - 99% of the sequence had better than single-stranded coverage
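The coverage figure can be checked against numbers given elsewhere in these notes. The following is a minimal sketch, assuming the 3,806,280 bp of primary sequence reported in the assembly step and the 580,070 bp finished genome size. The Poisson (Lander-Waterman) estimate of unsequenced bases is an added assumption for illustration; the notes do not say that this model was used in the project.

```python
import math

# Figures taken from these notes.
TOTAL_SEQUENCED_BP = 3_806_280   # primary DNA sequence data
GENOME_SIZE_BP = 580_070         # finished circular chromosome


def mean_coverage(total_bp: int, genome_bp: int) -> float:
    """Average number of times each base of the genome was sequenced."""
    return total_bp / genome_bp


def expected_unsequenced_fraction(coverage: float) -> float:
    """Poisson estimate of the fraction of bases covered by no read.

    Assumption: reads start uniformly at random along the genome,
    which real clone libraries only approximate.
    """
    return math.exp(-coverage)


coverage = mean_coverage(TOTAL_SEQUENCED_BP, GENOME_SIZE_BP)
unsequenced = expected_unsequenced_fraction(coverage)
print(f"mean coverage: {coverage:.2f}x")
print(f"expected unsequenced fraction: {unsequenced:.4%}")
```

Under this idealized model, higher coverage drives the unsequenced fraction down exponentially, which is why random shotgun sequencing leaves only a small number of gaps to close by directed methods.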
 
  - Assemble
    - TIGR ASSEMBLER generated 39 contigs
    - Contigs ranged from 606 to 73,351 bp
    - A total of 3,806,280 bp of primary DNA sequence data
    - ASM_ALIGN links contigs on the basis of the orientation of reads from each end of a single clone
    - All 39 gaps were covered by at least one clone from the small-size library
    - Verification of assembly
      - Checked the location of marker sequences on the known map
      - Used GRASTA to look for small overlaps that would have been missed by TIGR ASSEMBLER
      - This reduced the number of gaps to 28
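The idea of linking contigs through the two end reads of a single clone can be sketched as follows. This is an illustration only and is not the ASM_ALIGN program: the `ReadPlacement` record, the `link_contigs` function, and the toy data are hypothetical, and the sketch ignores the read-orientation and insert-size checks that a real tool would apply.

```python
from collections import defaultdict
from typing import NamedTuple


class ReadPlacement(NamedTuple):
    clone_id: str   # clone the read was taken from
    end: str        # "fwd" or "rev" end of the clone insert
    contig: str     # contig in which the assembler placed the read


def link_contigs(placements):
    """Map each contig pair to the clones whose two end reads fall in
    different contigs. Such a clone physically spans the gap between them."""
    ends_by_clone = defaultdict(dict)
    for p in placements:
        ends_by_clone[p.clone_id][p.end] = p.contig

    links = defaultdict(list)
    for clone_id, ends in ends_by_clone.items():
        # Only clones with both ends placed, in two different contigs, are links.
        if "fwd" in ends and "rev" in ends and ends["fwd"] != ends["rev"]:
            pair = tuple(sorted((ends["fwd"], ends["rev"])))
            links[pair].append(clone_id)
    return dict(links)


# Toy data: c1 bridges two contigs, c2 lies within one, c3 has one end only.
toy_placements = [
    ReadPlacement("c1", "fwd", "contigA"),
    ReadPlacement("c1", "rev", "contigB"),
    ReadPlacement("c2", "fwd", "contigA"),
    ReadPlacement("c2", "rev", "contigA"),
    ReadPlacement("c3", "rev", "contigB"),
]
links = link_contigs(toy_placements)
print(links)
```

A clone that links two contigs is also the natural template for closing the sequence gap between them, which is how the gap-closing step in these notes proceeds.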
 
 
  - Close gaps
    - Physical gaps
      - In this case no physical gaps were present
    - Sequence gaps
      - Selected clones that spanned the gaps and selectively sequenced those clones
  - Edit
    - Manually inspected the aligned sequencer output traces for ambiguities that could be resolved
    - 53 ambiguities and 25 possible frameshifts were found
    - These regions were re-sequenced with dye-terminator chemistry
 
  - Annotate
    - Organization
      - Circular chromosome of 580,070 bp
      - G+C content is 32% overall, with lower G+C regions flanking the origin of replication
        - Ribosomal RNA and tRNA genes had higher G+C content, presumably because of functional constraints
      - 74 EcoRI fragments
        - Generally consistent with the map; one discrepancy was resolved in favor of the sequence
      - The precise origin of replication was not identified, but a 4 kb region probably containing it was
        - This lies between dnaA and dnaN
        - An untranscribed region between these genes was selected as the origin for numbering
      - There is a polarity to transcription: genes to the right are preferentially transcribed on the plus strand, and those to the left on the minus strand, with the distinction extending roughly halfway around the chromosome
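The G+C figures in the Organization notes above are simple base-composition ratios. A minimal sketch of how overall and windowed G+C content could be computed is given here; the function names, window size, and toy sequence are illustrative assumptions and are not taken from the paper.

```python
def gc_fraction(seq: str) -> float:
    """Fraction of bases in seq that are G or C (0.0 for an empty string)."""
    seq = seq.upper()
    if not seq:
        return 0.0
    return (seq.count("G") + seq.count("C")) / len(seq)


def windowed_gc(seq: str, window: int, step: int):
    """Yield (start, gc_fraction) for each full window along a linear sequence."""
    for start in range(0, len(seq) - window + 1, step):
        yield start, gc_fraction(seq[start:start + window])


# Toy sequence: an A+T-only stretch, a G+C-only stretch, then A+T again.
toy_seq = "ATATATATGCGCGCGCATATATAT"
overall = gc_fraction(toy_seq)
profile = list(windowed_gc(toy_seq, window=8, step=8))
print(f"overall G+C: {overall:.1%}")
print(profile)
```

Plotting such a windowed profile along a genome is one way local departures from the overall value, such as low G+C regions or higher G+C RNA genes, become visible.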
 
    - Predicted coding regions
      - Initial search for ORFs larger than 100 bp
        - Translations were made assuming UGA encodes tryptophan (as is known to be the case in some mycoplasmas)
      - Predicted proteins were searched against a non-redundant bacterial protein database (NRBP)
      - Sequences that were similar to a protein in NRBP were assigned that protein's name
      - Matches were aligned with PRAZE (a modified Smith-Waterman algorithm)
      - GeneMark was trained with 308 M. genitalium sequences and used to evaluate 170 unidentified ORFs
      - Peptide sequences from other genomes were also used to search all six reading frames of the genome
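The effect of reading UGA as tryptophan on the initial ORF search can be illustrated with a small six-frame scan. This is a sketch under stated assumptions, not the procedure used in the paper: an "ORF" here is any stop-free stretch between stop codons (no start codon is required), the length threshold is a parameter, and all names are hypothetical.

```python
STANDARD_STOPS = frozenset({"TAA", "TAG", "TGA"})
MYCOPLASMA_STOPS = frozenset({"TAA", "TAG"})   # TGA (UGA) is read as tryptophan

_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of an uppercase A/C/G/T sequence."""
    return seq.translate(_COMPLEMENT)[::-1]


def open_frames(seq: str, min_len: int, stops):
    """Yield (frame, start, end) for stop-free stretches of at least min_len nt.

    Coordinates are 0-based on the strand given; end is exclusive and
    does not include the terminating stop codon.
    """
    for frame in range(3):
        start = frame
        for pos in range(frame, len(seq) - 2, 3):
            if seq[pos:pos + 3] in stops:
                if pos - start >= min_len:
                    yield frame, start, pos
                start = pos + 3
        # Stretch left open at the end of the sequence.
        end = frame + (len(seq) - frame) // 3 * 3
        if end - start >= min_len:
            yield frame, start, end


def six_frame_orfs(seq: str, min_len: int = 100, stops=MYCOPLASMA_STOPS):
    """Scan both strands in all three frames; returns (strand, frame, start, end)."""
    seq = seq.upper()
    found = [("+", *orf) for orf in open_frames(seq, min_len, stops)]
    found += [("-", *orf)
              for orf in open_frames(reverse_complement(seq), min_len, stops)]
    return found


# Toy sequence with three in-frame TGA codons before the TAA stop.
toy_gene = "ATG" + "TGA" * 3 + "GCT" * 2 + "TAA"
with_trp = six_frame_orfs(toy_gene, min_len=15, stops=MYCOPLASMA_STOPS)
with_stop = six_frame_orfs(toy_gene, min_len=15, stops=STANDARD_STOPS)
print(with_trp)
print(with_stop)
```

With the standard stop set, the forward frame 0 of the toy gene is broken into fragments too short to report; treating TGA as a sense codon keeps it as one open frame.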
 
    - Sequence similarity searches support a close relationship between Mycoplasma spp. and Gram-positive bacteria
 
The genome of Mycoplasma genitalium compared with that of Haemophilus influenzae
  - At the time the Mycoplasma genitalium sequence was determined, very few complete genome sequences were available, but that of Haemophilus influenzae had already been determined.
  - Both are human pathogens. H. influenzae causes a form of meningitis.
  - The two bacteria are not closely related
    - Mycoplasma genitalium belongs to a group of obligate parasites embedded within the Gram-positive bacteria
    - Haemophilus influenzae is a gamma proteobacterium (i.e., relatively closely related to Escherichia coli)
  - M. genitalium has a more highly reduced genome
   
    | Organism | Size | Complexity (predicted coding regions) |
    | M. genitalium | 580,070 bp | 470 |
    | H. influenzae | 1,830,137 bp | 1743 |
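The table can be turned into a rough comparison of gene density. A minimal sketch, using only the sizes and coding-region counts given above; dividing genome size by the number of predicted coding regions is a crude measure that ignores RNA genes and intergenic sequence, and the dictionary layout and function name are illustrative.

```python
# Values from the comparison table in these notes.
genomes = {
    "M. genitalium": {"size_bp": 580_070, "coding_regions": 470},
    "H. influenzae": {"size_bp": 1_830_137, "coding_regions": 1_743},
}


def bp_per_coding_region(size_bp: int, coding_regions: int) -> float:
    """Average genome length per predicted coding region."""
    return size_bp / coding_regions


density = {
    name: bp_per_coding_region(g["size_bp"], g["coding_regions"])
    for name, g in genomes.items()
}
size_ratio = genomes["H. influenzae"]["size_bp"] / genomes["M. genitalium"]["size_bp"]
gene_ratio = (genomes["H. influenzae"]["coding_regions"]
              / genomes["M. genitalium"]["coding_regions"])

for name, value in density.items():
    print(f"{name}: {value:.0f} bp per predicted coding region")
print(f"genome size ratio: {size_ratio:.2f}")
print(f"coding region ratio: {gene_ratio:.2f}")
```

The two genomes come out at roughly one predicted coding region per kilobase, so the smaller M. genitalium genome reflects fewer genes rather than markedly tighter packing.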
  - M. genitalium is missing many metabolic pathways necessary for independent growth.
    - Genes for the biosynthesis of amino acids, cofactors, and the cell wall are all reduced in number, and there are relatively few regulatory factors.
    - The category of "unassigned" genes is also greatly reduced.
    - Thus a relatively large fraction of its genome is devoted to DNA replication and gene expression.
    - 90 genes are not found in H. influenzae; most of these (60%) resemble genes found in other Gram-positive bacteria
    - There are relatively few unique genes in M. genitalium -- this is a function of the reduced genome
 
  - Both have mechanisms to promote antigenic variation
    - M. genitalium
      - Stalked bacterium with an adhesin protein (MgPa) on the tip
      - The adhesin elicits a strong immune response
      - This protein is encoded in an operon along with two other open reading frames
      - The arrangement is 29 kDa ORF -- 6 nt spacer -- MgPa (160 kDa) -- 1 nt spacer -- 114 kDa ORF
      - Several copies of the MgPa gene were known to be scattered around the genome
      - The complete genome included the complete MgPa operon and nine partial repeats, constituting 4.7% of the entire genome
      - Sequence identity of the repeats to the intact MgPa operon ranges from 78% to 90%
      - Recombination is thought to occur among the members of the gene family, resulting in increased antigenic variation
 
    - H. influenzae
      - Antigenic response is primarily to adhesins and lipooligosaccharides
      - A key antigenic locus (lic-1) had been identified, which contains 4 genes
      - The first gene in the lic-1 operon contains tandem tetramer repeats (CAAT)
      - The number of these repeats varies, which shifts the genes in or out of frame
      - This constitutes a translational switch
      - The switch promotes antigenic variation
      - Determination of the H. influenzae genome made it practical to identify all such repeats in the genome
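The frame-shifting effect of the CAAT repeats follows from simple arithmetic: each 4 nt unit moves the downstream reading frame by one position, so the frame cycles with period three as repeats are gained or lost. A minimal sketch of scanning a sequence for such runs is given here; the minimum run length of two units, the function names, and the toy sequence are assumptions for illustration.

```python
import re

# Two or more tandem CAAT units; the minimum of two is an arbitrary choice.
CAAT_RUN = re.compile(r"(?:CAAT){2,}")


def caat_runs(seq: str):
    """Return (start, repeat_count) for each tandem CAAT run in seq."""
    return [(m.start(), len(m.group()) // 4)
            for m in CAAT_RUN.finditer(seq.upper())]


def frame_offset(repeat_count: int) -> int:
    """Downstream reading-frame offset (0, 1, or 2) relative to zero repeats."""
    return (4 * repeat_count) % 3


# Toy sequence: a start codon, five CAAT units, then arbitrary sequence.
toy_locus = "ATG" + "CAAT" * 5 + "GGCTAA"
runs = caat_runs(toy_locus)
offsets = [frame_offset(n) for n in (3, 4, 5, 6)]
print(runs)
print(offsets)
```

Gaining or losing a single repeat unit changes the offset, which is the sense in which repeat number acts as a translational switch; the same scan applied to a whole genome sequence would list every candidate locus of this kind.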
 
 
Fraser, C.M. (and 29 others). 1995. The minimal gene complement of Mycoplasma genitalium. Science 270:397-403.
Weiser, J.N. et al. 1989. The molecular mechanism of phase variation of H. influenzae lipopolysaccharide. Cell 59:657-665.