*Pamela Greenwell and Sanjiv Rughooputh
Molecular and Medical Microbiology Research Group
Corresponding author: Dr Pamela Greenwell. Email: email@example.com
Following hard on the heels of the human genome project, microbial genome projects are producing vast amounts of information on the nucleotide sequences of specific microbes. How useful is this information, and how can we wade through the millions of base pairs of sequence data to find genes or sequences of interest for either diagnostic or therapeutic strategies? In theory, the answer lies in the science of bioinformatics, which itself encompasses genomics, proteomics and metabolomics, terms which are more recognisable to many as molecular genetics and biochemistry.
Many scientists have the impression that once the genome of an organism is sequenced, everything is then known. Is this really true? At this stage, it is useful to look at an example of an organism whose genome has been completely sequenced. Chlamydia trachomatis serovar D (CT) has a genome of 1.045 Mb that was sequenced in 1998. Full details of the genes and proteins encoded, together with references, can be obtained simply by accessing http://www.TIGR.org and clicking on the icon for “Comprehensive microbial resource”. Clicking on the box “Visit a CMR page for an individual genome” will reveal the names of all the organisms sequenced. In our case we would go to Chlamydia trachomatis serovar D. From here you can explore everything that is known about the genome of that organism.
By now, you are probably wondering why anyone still needs to work on CT if everything is known. The truth is that we know everything about the nucleotide sequence of CT and can predict that it encodes 877 proteins, of which 103 are secreted (as judged by the presence of signal peptides) and 241 have transmembrane domains and therefore appear to be membrane bound. However, we still do not know the function of more than 20% of those proteins.
One way to identify genes is by homology with other cloned genes, on the assumption that bacteria have evolved from common ancestors and will therefore share genes, albeit with some changes due to evolution. To see such homology, we can employ an alignment search tool such as BLAST (basic local alignment search tool), which allows us to compare the nucleotide sequences or derived proteins from CT with all other genes or proteins lodged in databases worldwide, or with subsets of that data, for example just prokaryotic genes and proteins. The results show areas of homology in the form of alignments and give a statistical estimate of how likely each match is to have arisen by chance. We tend to work with the translated amino acid sequences derived from the nucleotide sequences, because different organisms have different codon preferences, which can make nucleotide comparisons difficult.
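The point about codon preference can be illustrated with a minimal Python sketch. It is not BLAST, only a position-by-position identity count, and the two sequences (`gene_a`, `gene_b`) are hypothetical examples invented here: they encode the same short peptide using different synonymous codons.

```python
from itertools import product

# Standard genetic code, built in the conventional TCAG order.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def translate(dna):
    """Translate a DNA sequence in frame 1; trailing partial codons are ignored."""
    dna = dna.upper()
    usable = len(dna) - len(dna) % 3
    return "".join(CODON_TABLE[dna[i:i + 3]] for i in range(0, usable, 3))


def percent_identity(a, b):
    """Percentage of identical positions between two equal-length sequences."""
    if len(a) != len(b):
        raise ValueError("sequences must have the same length")
    matches = sum(x == y for x, y in zip(a, b))
    return 100.0 * matches / len(a)


# Hypothetical sequences: same peptide, different codon choices.
gene_a = "ATGAAACTGAGCCGTGGC"
gene_b = "ATGAAATTATCAAGAGGA"

nucleotide_identity = percent_identity(gene_a, gene_b)
protein_identity = percent_identity(translate(gene_a), translate(gene_b))
print(f"nucleotide identity: {nucleotide_identity:.1f}%")
print(f"protein identity:    {protein_identity:.1f}%")
```

In this toy case the two proteins are identical while the nucleotide sequences differ at many positions, which is why a protein-level search can detect relationships that a nucleotide-level search would score poorly.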
Of course, you might expect that, as the databases contain billions of sequences, we could identify most proteins by their homology to other known proteins. Sadly, this is not true, and more than 20% of the genes in the CT genome cannot be identified using this technique. You might ask whether CT is just an unusual microbe, but we have also looked at the sequences of Trichomonas vaginalis, and in this case most of the genes cloned encode proteins with little homology to other cloned genes.
Does this mean bioinformatics is not useful? Of course not. Although we cannot identify all genes by this method, we can identify some proteins and derive other useful data. For example, when proteins encoded by CT genes are compared to those of Escherichia coli and Bacillus subtilis, 195 show better homology to E. coli proteins, whereas 259 have greater homology to proteins of B. subtilis. This implies CT has characteristics of both gram-positive and gram-negative organisms. When CT proteins are compared to those of other organisms that cause sexually transmitted infections (STIs), for example Treponema pallidum and Mycoplasma genitalium, 68 of the encoded proteins of CT have homology with proteins of M. genitalium and 286 with proteins of T. pallidum. However, the majority of the proteins of CT are not similar to those of either organism. Nevertheless, the proteins conserved between these organisms may be of great interest as targets for therapeutic strategies, allowing us to target more than one organism at once.
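Counts such as these come from asking, for each protein, which reference organism gives the better match. The sketch below shows one way that tally might be made in Python. The protein identifiers and scores are invented toy values, not results from the CT analysis, and the function name `tally_best_hits` is hypothetical.

```python
from collections import Counter


def tally_best_hits(scores):
    """Count, per reference organism, how many proteins match it best.

    `scores` maps a protein name to a dict of {reference organism: score},
    where a higher score means a better alignment. Proteins with no hits
    are counted under "no hit". On a tie, the organism listed first wins.
    """
    counts = Counter()
    for protein, by_reference in scores.items():
        if not by_reference:
            counts["no hit"] += 1
            continue
        best = max(by_reference, key=by_reference.get)
        counts[best] += 1
    return counts


# Toy values for illustration only.
toy_scores = {
    "protein_1": {"E. coli": 210.0, "B. subtilis": 180.0},
    "protein_2": {"E. coli": 95.0, "B. subtilis": 140.0},
    "protein_3": {"B. subtilis": 77.0},
    "protein_4": {},
}

for organism, count in sorted(tally_best_hits(toy_scores).items()):
    print(organism, count)
```

A real analysis would also apply a significance threshold before counting a match at all, so that weak chance alignments do not inflate either total.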
One of the most useful tools for the molecular microbiologist is whole genome comparison. This can be done using traditional tools such as BLAST. For the best visualisation, however, Artemis would be the tool of choice (http://www.sanger.ac.uk/Software/Artemis/), and ACT is the tool required for whole genome comparison (http://www.sanger.ac.uk/Software/ACT/). ACT allows us to compare two genomes directly. In the case of the mycobacteria, it has been possible to show that M. leprae and M. tuberculosis originally had similar genomes but that M. leprae went on to lose genes, resulting in its small genome size. Indeed, small fragments of these lost genes are still visible using ACT. In other organisms, comparisons of pathogenic and non-pathogenic bacteria have highlighted the presence of “pathogenicity islands”. These can then be investigated as potential targets for therapeutics. Similarly, we can look at the evolution of antibiotic resistance or determine areas of microbial genomes suitable as targets for diagnostics.
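The idea of spotting a region present in one genome but missing from another can be sketched in a few lines of Python. This is not how ACT works; it is a simplified stand-in that compares short fixed-length words (k-mers) on one strand only, and the sequences and function names are hypothetical, chosen for illustration.

```python
def shared_kmer_mask(query, reference, k):
    """Mark each base of `query` that lies inside a k-mer also found in `reference`."""
    ref_kmers = {reference[i:i + k] for i in range(len(reference) - k + 1)}
    covered = [False] * len(query)
    for i in range(len(query) - k + 1):
        if query[i:i + k] in ref_kmers:
            for j in range(i, i + k):
                covered[j] = True
    return covered


def unique_regions(query, reference, k=8):
    """Return half-open (start, end) intervals of `query` not covered by any shared k-mer."""
    covered = shared_kmer_mask(query, reference, k)
    regions = []
    start = None
    for pos, is_covered in enumerate(covered):
        if not is_covered and start is None:
            start = pos                      # an uncovered run begins
        elif is_covered and start is not None:
            regions.append((start, pos))     # the run ended just before pos
            start = None
    if start is not None:
        regions.append((start, len(covered)))
    return regions


# Toy "genomes": the query carries an extra block between two shared flanks.
left_flank = "ATGGCGTACGTTAGC"
island = "GGGGGCCCCCGGGGG"
right_flank = "TTGACCATGCAATCG"
reference_genome = left_flank + right_flank
query_genome = left_flank + island + right_flank

for start, end in unique_regions(query_genome, reference_genome, k=8):
    print(start, end, query_genome[start:end])
```

On real genomes the boundaries found this way can be off by a few bases when a junction happens to match by chance, and rearrangements or reverse-strand matches need proper alignment tools.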
Of course, having nucleotide information allows us to identify unique areas of the genome, and tools are available, for example http://www-genome.wi.mit.edu/cgi-bin/primer/primer3_www.cgi, for designing primers for PCR-based detection.
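As a rough illustration of the kind of checks involved, the Python sketch below computes GC content, a simple melting temperature estimate using the Wallace rule (2 °C per A or T, 4 °C per G or C), and the number of exact binding sites on both strands of a sequence. It is only a sketch under these assumptions: dedicated primer design tools apply considerably more refined criteria, and the primer and toy genome here are invented.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the reverse complement of a DNA sequence."""
    return seq.upper().translate(COMPLEMENT)[::-1]


def gc_percent(primer):
    """Percentage of G and C bases in the primer."""
    primer = primer.upper()
    return 100.0 * (primer.count("G") + primer.count("C")) / len(primer)


def wallace_tm(primer):
    """Rough melting temperature in degrees C by the Wallace rule (short oligos only)."""
    primer = primer.upper()
    at = primer.count("A") + primer.count("T")
    gc = primer.count("G") + primer.count("C")
    return 2 * at + 4 * gc


def count_sites(primer, genome):
    """Count exact occurrences of the primer on either strand (overlaps included)."""
    genome = genome.upper()
    total = 0
    # A set avoids double counting when the primer equals its reverse complement.
    for target in {primer.upper(), reverse_complement(primer)}:
        start = genome.find(target)
        while start != -1:
            total += 1
            start = genome.find(target, start + 1)
    return total


# Hypothetical primer and toy genome, for illustration only.
candidate_primer = "ATGGCGTACGTTAGCTTGAC"
toy_genome = "TTTT" + candidate_primer + "AAAACCCC"

print("GC %:", gc_percent(candidate_primer))
print("Tm estimate:", wallace_tm(candidate_primer))
print("binding sites:", count_sites(candidate_primer, toy_genome))
```

A primer that matches exactly one site in the target genome, and none in related organisms, is a candidate for a specific diagnostic assay.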
So is bioinformatics useful? The answer has to be yes, but it cannot tell us everything. It is simply a tool, albeit a very powerful one, to help us understand microbial genomes. It is interesting to ponder the value of the human genome project in the light of the limitations found in microbial analysis. The take-home message must be: “you can know everything about the nucleotides in the genome, but that doesn’t imply you know everything about the organism”.