
Brief Overview of Some DNA Analysis Tools


Harris Ramuth, BSc (Hons) MSc

Central Health Laboratory





The National Center for Biotechnology Information (NCBI) web site.




Computers are essential nowadays because they perform four interdependent functions that are of utmost importance in Bioinformatics: they are an important means of communication, they permit computation with the numerous algorithms implemented in software packages and supported by valid statistical evidence, they allow control of the data at hand, and they provide a means of storing information.


Scientists all over the world have access to computers and can use this information around the clock, and the same information is also available to the general public. The parallel developments taking place in computing and Artificial Intelligence have inextricably linked Molecular Biology to computers.


Gregor Mendel’s first principles, the deciphering of the genetic code and the Central Dogma of Molecular Biology as defined by Francis Crick are the fundamentals that have transformed the field of Genetics.




The advent of refined sequencing tools, enhanced by PCR techniques and cloning, supplied the data that enabled two major sequencing programmes: the publicly funded Human Genome Project, as originally proposed by twelve biologists in 1985, and the parallel effort by a private company, Craig Venter’s Celera Genomics.


By 2002, scientists had already sequenced, or provided a draft of, the entire genome of at least eight other model organisms, including Mus musculus (the house mouse) and Saccharomyces cerevisiae.

Mass spectrometry and two-hybrid screens are fast-evolving techniques, and the known cell map is growing rapidly with new data on the structure of cell signaling and metabolic networks.

To keep pace with these findings, visualization and analysis tools (e.g. BIND) are available to assist in understanding this complex data (1)(2). The NCBI contained more than 800,000 nonredundant protein sequences in 2002 (3), and this figure is increasing.

Various centers and Institutes have taken up this endeavour to ensure that the growing body of information from molecular biology and genome research is placed in the public domain and is accessible freely to all facets of the scientific community in ways that promote scientific progress.


Some examples are non-profit academic organisations such as the European Bioinformatics Institute (EBI), the Sanger Centre and the UK MRC Human Genome Mapping Project, to name only a few.


The National Center for Biotechnology Information is the leading information provider in America. It was established in 1988 as a division of the National Library of Medicine (NLM) and is located on the campus of the National Institutes of Health (NIH).


The specific aim of the NCBI is to create automated systems to gather, analyse and store biological information, and to facilitate access to its databases and software by the user. It maintains the GenBank database and collaborates with other large databases such as the EMBL and the DNA Data Bank of Japan (DDBJ). GenBank is the NIH genetic sequence database, an annotated collection of all publicly available DNA sequences.

Sequences from more than 165,000 named organisms are present, obtained from individual laboratories and from batch submissions by large-scale sequencing projects. Submissions are made with the Web-based BankIt or the standalone Sequin program, and GenBank staff assign accession numbers that are used to compile the data (4).

BankIt is normally used for short, uncomplicated sequences that require no sequence analysis tools. BankIt transforms the data into GenBank format for review and, when the record is complete, it can be submitted directly to GenBank.

However, for long or complex submissions, or when submitting mutation, phylogenetic, population, environmental or segmented sets, Sequin is used instead. It is also used when graphical viewing and editing options, including the alignment editor, or network access to related analytical tools are desired.



Sequence Retrieval and Seeking Information (Entrez)


The NCBI developed Entrez, an indexing and retrieval system similar to the EBI Sequence Retrieval System (SRS), which facilitates access to a range of bio-databanks. Both molecular biology data (e.g. Online Mendelian Inheritance in Man (OMIM)) and bibliographic citations from integrated databases (e.g. PubMed) are available through Entrez.


It integrates data from a large number of sources, formats and databases into a uniform information model and retrieval system.


Unlike the SRS, the Entrez system has a customised search in which only the selected databases are accessed. This can be a disadvantage to an unaccustomed user, who may assume that the search is made automatically across all the databases linked to Entrez and may therefore end up with no information.
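The point that each search targets one chosen database can be illustrated with a small Python sketch. It only builds query addresses and does not contact the network. The use of NCBI's E-utilities search service, the helper name build_entrez_queries and the search term are assumptions made for illustration; the text above describes only the web portal.

```python
from urllib.parse import urlencode

# Address of NCBI's E-utilities search service. Programmatic access is an
# assumption of this sketch; the text above only describes the web portal.
ESEARCH_URL = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"


def build_entrez_queries(term, databases, retmax=20):
    """Return one search URL per database: each query targets a single database."""
    queries = {}
    for db in databases:
        params = {"db": db, "term": term, "retmax": retmax}
        queries[db] = ESEARCH_URL + "?" + urlencode(params)
    return queries


# The same term has to be sent explicitly to every database of interest.
queries = build_entrez_queries("BRCA1", ["nucleotide", "protein", "pubmed"])
for db, url in queries.items():
    print(db, url)
```

A user who builds only the "nucleotide" query would see nothing of what the protein or bibliographic databases hold for the same term.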


Entrez has an advantage over conventional search engines (such as Google or Yahoo!) in that it is also able to select terms on its own. It is undemanding in its browser requirements and avoids returning information about other connotations of a term, so no time is wasted.


However, a basic knowledge of the database being accessed is mandatory to optimise the use of the Entrez portal. Biological data expand at such a rapid pace that not all databases are fully up to date. Since the system does not query all the databases in a single search, the user needs to know which databases are relevant for subsequent searches and analysis. A user who accesses the wrong database may end up with the wrong conclusions.


The Entrez system also maintains search histories, which can be used to review, revise or combine the results of the most recent 100 searches.

The Entrez system accommodates databases of nucleotide sequences, protein sequences, macromolecular structures and whole genomes. It supports both inter- and intra-database linking.


Results of an Entrez search can be displayed in a variety of formats and kept in a variety of ways (held on a clipboard feature, saved to a hard disk or floppy disk, or printed).

There are nonetheless certain policies that have to be respected before submitting a sequence for a search on a data bank.


3 BLAST and other tools to analyse DNA

(nucleotide sequences)

The most popular data formats in bioinformatics are FASTA, PHYLIP and MAML.

FASTA, BLAST and the Smith-Waterman algorithm are useful and powerful search tools, which are available at NCBI. They make it possible to align sequences rapidly against a variety of updated databases.

A sequence within a FASTA sequence file consists of three parts:

  1. A title line, beginning with one or more `>' symbols, which do not form part of the title.
  2. Optional annotation lines, beginning with `;'.
  3. The sequence itself, containing possible newlines and continuing until end of file or the next `>' is reached.

Example: the title line of a record in FASTA format.

>gi|282349|pir||A41961 chitinase (EC D - Bacillus circulans

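The three-part layout described above can be read with a few lines of Python. This is a minimal sketch, not a full parser: the function name parse_fasta is hypothetical, and the residues in the example are placeholders, not the real chitinase sequence (only the title line comes from the text).

```python
def parse_fasta(text):
    """Parse FASTA text into a list of (title, annotations, sequence) tuples."""
    records = []
    title, notes, chunks = None, [], []
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new title line closes the previous record, if any.
            if title is not None:
                records.append((title, notes, "".join(chunks)))
            # The leading '>' symbols are not part of the title.
            title, notes, chunks = line.lstrip(">").strip(), [], []
        elif line.startswith(";"):
            if title is not None:
                notes.append(line.lstrip(";").strip())
        elif title is not None:
            # Sequence lines are joined; newlines inside a sequence are ignored.
            chunks.append(line)
    if title is not None:
        records.append((title, notes, "".join(chunks)))
    return records


# Title line taken from the text; annotation and residues are placeholders.
example = """>gi|282349|pir||A41961 chitinase (EC D - Bacillus circulans
; placeholder annotation line
ACDEFGHIKL
MNPQRSTVWY
"""

records = parse_fasta(example)
for title, notes, sequence in records:
    print(title, notes, len(sequence))
```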

The Basic Local Alignment Search Tool (BLAST) finds regions of local similarity between sequences. The program compares nucleotide or protein sequences to sequence databases and calculates the statistical significance of matches.


BLAST can be used to infer functional and evolutionary relationships between sequences as well as to help identify members of gene families.


BLAST is based on mathematical statistics coupled with human intuition and the BLAST algorithm permits a straightforward and rapid turnaround of sequence searches in the various databases.


It scores segment pairs between the two sequences being compared, in what is known as a pairwise alignment technique. A segment pair is a pair of sub-sequences of the same length that form an ungapped alignment.
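The idea of a segment pair can be made concrete with a brute-force Python sketch that walks every diagonal of two sequences and keeps the highest-scoring ungapped stretch. This is not how BLAST itself searches (BLAST uses word hits and heuristics to avoid this exhaustive scan); the function name and the match and mismatch scores are illustrative assumptions.

```python
def maximal_segment_pair(a, b, match=1, mismatch=-3):
    """Return (score, start_in_a, start_in_b, length) of the best ungapped segment pair."""
    best_score, best_a, best_b, best_len = 0, 0, 0, 0
    # Each offset fixes one diagonal: position i in a is paired with i - offset in b.
    for offset in range(-(len(b) - 1), len(a)):
        i0, j0 = max(offset, 0), max(-offset, 0)
        diagonal_length = min(len(a) - i0, len(b) - j0)
        running, start = 0, 0
        for k in range(diagonal_length):
            # Restart the segment whenever the running score has dropped to zero or below.
            if running <= 0:
                running, start = 0, k
            running += match if a[i0 + k] == b[j0 + k] else mismatch
            if running > best_score:
                best_score = running
                best_a, best_b, best_len = i0 + start, j0 + start, k - start + 1
    return best_score, best_a, best_b, best_len


# Two toy sequences that share the ungapped stretch "gacgtacc".
seq_a = "ttgacgtacctg"
seq_b = "aagacgtaccaa"
score, start_a, start_b, length = maximal_segment_pair(seq_a, seq_b)
print(score, seq_a[start_a:start_a + length], seq_b[start_b:start_b + length])
```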


The concept of the E-value in the BLAST search is based on the view that the database is a collection of independent objects (sequences). BLAST and PSI-BLAST now permit calculated E-values to take into account the amino acid composition of the individual database sequences involved in reported alignments.


This improves E-value accuracy, thereby reducing the number of false positive results.
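As a rough guide to what an E-value measures, the relation usually quoted for BLAST statistics is E = m x n x 2^(-S'), where S' is the bit score, m the query length and n the database length. The Python sketch below applies that relation directly. It ignores the effective-length corrections and the composition-based adjustments mentioned above, and the lengths used in the example are hypothetical.

```python
def expect_value(bit_score, query_length, database_length):
    """Expected number of chance hits with at least this bit score: E = m * n * 2**(-S')."""
    return query_length * database_length * 2.0 ** (-bit_score)


# Hypothetical search: a 400-base query against a 100-million-base database.
for bits in (20, 30, 40, 50):
    print(bits, expect_value(bits, 400, 100_000_000))
```

Each additional bit of score halves the E-value, while doubling the size of the database doubles it, which is why the same alignment is less significant in a larger database.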


Mathematical results on the stochastic properties of maximal segment pair (MSP) scores allow an analysis of the performance of this method as well as of the statistical significance of the alignments it generates.


For protein comparisons, a variety of definitional, algorithmic and statistical refinements permits the execution time of the BLAST programs to be decreased substantially while enhancing their sensitivity to weak similarities. A new criterion for triggering the extension of word hits, combined with a new heuristic for generating gapped alignments, yields a gapped BLAST program that runs at approximately three times the speed of the original.


In addition, a method was introduced for automatically combining statistically significant alignments produced by BLAST into a position-specific score matrix, and searching the database using this matrix.


The resulting Position-Specific Iterated BLAST (PSI-BLAST) program runs at approximately the same speed per iteration as gapped BLAST, but in many cases is much more sensitive to weak but biologically relevant sequence similarities.
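What a position-specific score matrix is can be shown with a small Python sketch that turns a set of aligned segments into per-column log-odds scores. It is a simplified stand-in for the PSI-BLAST procedure: it assumes a uniform background frequency and a flat pseudocount, uses no sequence weighting, and the aligned peptides are invented for illustration.

```python
import math
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def build_pssm(aligned, alphabet=AMINO_ACIDS, pseudocount=1.0):
    """Build per-column log-odds scores from equal-length, ungapped aligned segments."""
    length = len(aligned[0])
    if any(len(segment) != length for segment in aligned):
        raise ValueError("all aligned segments must have the same length")
    background = 1.0 / len(alphabet)  # uniform background: a simplifying assumption
    total = len(aligned) + pseudocount * len(alphabet)
    pssm = []
    for column in range(length):
        counts = Counter(segment[column] for segment in aligned)
        pssm.append({
            letter: math.log2(((counts[letter] + pseudocount) / total) / background)
            for letter in alphabet
        })
    return pssm


def score_segment(pssm, segment):
    """Sum the column-specific scores of a segment of the same length as the matrix."""
    return sum(column[letter] for column, letter in zip(pssm, segment))


# Invented aligned peptides standing in for significant hits of a first search.
hits = ["MKVLA", "MKILA", "MRVLA", "MKVLG"]
pssm = build_pssm(hits)
print(score_segment(pssm, "MKVLA"), score_segment(pssm, "WWWWW"))
```

In an iterated search, the matrix would replace the original query for the next pass over the database, so that residues conserved across the first hits weigh more than variable ones.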


PSI-BLAST has been used to uncover several new and interesting members of the BRCT superfamily.


BLAST on the NCBI site is used extensively for a huge variety of purposes. As one example, it is used to compare the genome of Caenorhabditis elegans with that of its evolutionary counterpart Caenorhabditis briggsae. Extensive work exists on these nematodes, and WormBase, which contains the sequences of both worms, is widely used. This research has shed considerable light on a variety of fields, including developmental biology, apoptosis and evolution. The Sanger Centre and the Genome Sequencing Center at Washington University have now completed the sequencing of the Caenorhabditis elegans genome, and this is readily accessible at the NCBI.


This was the first multicellular eukaryotic genome to be completed. The genome in this release, WormBase WS97, is 100.274 Mb, organized in six chromosomes. The Sequencing Consortium has annotated the genome with 19,542 predicted coding sequences, 21,437 when counting 1,891 alternate splice forms.

WormBase contains all the nucleotide sequences; an example of a nucleotide sequence is shown below.

Sequence Details for: B0212

tatgtgcttt ttctaacttt tatgtgcttt gtttaagtat cacaaaagat gtacaatgat gacgagctca cggaattcat ggatttctta gtggaacaga caaaggattc gatttaccca atggtggcgg caaaagtatt caaacaattc cccaatcgtcgtttgatccc aaaagataag cgttatcaaa ggtaaaaaga cctggacttc gctttgttca ccaatattct gttcagattt gtcttacaac ttgcaccaaa aatgaacgat tggaataact acagtattga ggcacgaatt cgattgatgt atgcgctgcg tggaaaagtt gaagatgatt ttttggcaag gtgggaatga caccactttt caagattaaa tctgaaaaag gcgtccgcgg ccaacggaaa gtaccggcaa tttcggaaat tttgaattct aattacatat tgaaataaaa aaaaaacttt cacgttgatt tttagtattc ggttagtcat tcggctagac agaagttcct tcattcaatt tttcatctaa aattgatata aggatgtaga aaaaggtgtg ttctatccga acgtaacatt tcacactttt agcacttttg tggaaaacgg caactttcag gcactatacg aaattttttt gaagcttttc aaccaaaata attgttattt tcagatagaa ggaatctttt tctacattca tatatcagct ttgtttgaaa gaagttagtt tatcttgttg aaatatgaag ttttgtgcgt ttgccgaaaa aaaaatcgac gcccatccct gatcataact tcaatgatta aaccattttg agcatgaaaa gagcaaattg ttcagtacaa ttcatgttta ggctctgaaa tctacaaaaa gtatctaata tacattaccg aaaaattccg gaaatcaatt gccgcactcc cctgatctga ataatttgca gaattgaagc tcatggaact gttcagctca atgaaaaaca aagaatctcc aaatttacag cgaatgacgg aacatttttt ttggaagatg atcgtaaacg tttatcacta gacgctacaa gactcgaatt cgtcaagctc atgggctttc tcattgaaaa

aaccaaagat gccgttgaac cattgccgaa cacaaaacta atttttcagg agtttagcca aattgaacct gttaagaagc ctaatgatac tgacataatg tataagtgag atgagcattc catgttttct ttacacaaat ttaattcaga tttcgcagta




Submitting a sequence to the BLAST programme can reveal a variety of information. In this case a known sequence was entered and searched with BLAST against the Caenorhabditis elegans genome for sequence similarity. The result obtained is shown below:

 Query: 1    acttttagcacttttgtggaaaacggcaactttcaggcactatacgaaannnnnnngaag 60
            |||||||||||||||||||||||||||||||||||||||||||||||||       ||||
Sbjct: 595  acttttagcacttttgtggaaaacggcaactttcaggcactatacgaaatttttttgaag 654
Query: 61   cttttcaaccaaaataattgttattttcagatagaaggaatctttttctacattcatata 120
Sbjct: 655  cttttcaaccaaaataattgttattttcagatagaaggaatctttttctacattcatata 714
Query: 121  tcagctttgtttgaaagaagttagtttatcttgttgaaatatgaagttttgtgcgtttgc 180
Sbjct: 715  tcagctttgtttgaaagaagttagtttatcttgttgaaatatgaagttttgtgcgtttgc 774
Query: 181  cgnnnnnnnnntcgacgcccatccctgatcataacttcaatgattaaaccattttgagca 240
            ||         |||||||||||||||||||||||||||||||||||||||||||||||||
Sbjct: 775  cgaaaaaaaaatcgacgcccatccctgatcataacttcaatgattaaaccattttgagca 834
Query: 241  tgaaaagagcaaattgttcagtacaattcatgtttaggctctgaaatctacaaaaagtat 300
Sbjct: 835  tgaaaagagcaaattgttcagtacaattcatgtttaggctctgaaatctacaaaaagtat 894
Query: 301  ctaatatacattaccgaaaaattccggaaatcaattgccgcactcccctgatctgaataa 360
Sbjct: 895  ctaatatacattaccgaaaaattccggaaatcaattgccgcactcccctgatctgaataa 954
Query: 361  tttgcagaattgaagctcatggaactgttcagctcaatgaaaaacaaagaatctcc 416
Sbjct: 955  tttgcagaattgaagctcatggaactgttcagctcaatgaaaaacaaagaatctcc 1010
 Score = 42.1 bits (21), Expect = 1.3
 Identities = 27/29 (93%)
 Strand = Plus / Plus
The results have to be verified and expertly interpreted before any conclusions are drawn. E-values give a good indication of the validity of a similarity but are certainly not the sole criterion to consider in reaching a conclusion.
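The bars drawn between the Query and Sbjct rows, and the "Identities" figure, can be reproduced from two aligned rows with a short Python sketch. The function names are hypothetical. The fragment used is taken from the first row of the output above, where the runs of "n" in the query appear to be masked positions that no longer match the subject.

```python
def match_line(query_row, subject_row):
    """Draw a bar under every position where the two aligned rows agree."""
    return "".join("|" if q == s else " " for q, s in zip(query_row, subject_row))


def count_identities(query_row, subject_row):
    """Return (identical positions, aligned length, percent identity)."""
    if len(query_row) != len(subject_row):
        raise ValueError("aligned rows must have the same length")
    identical = sum(1 for q, s in zip(query_row, subject_row) if q == s)
    return identical, len(query_row), 100.0 * identical / len(query_row)


# Fragment of the first Query/Sbjct row shown above.
query_row = "cgaaannnnnnngaag"
subject_row = "cgaaatttttttgaag"
bars = match_line(query_row, subject_row)
identical, aligned, percent = count_identities(query_row, subject_row)
print(query_row)
print(bars)
print(subject_row)
print(identical, aligned, percent)
```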

The FASTA algorithm is based upon the idea of identifying short words, or k-tuples, common to both sequences under comparison. It was described by Lipman and Pearson in 1985 (Lipman, D.J. and Pearson, W.R. (1985) Rapid and sensitive protein similarity searches. Science, 227, 1435-1441). FASTA thus follows the tradition of combinatorial optimization.
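The k-tuple idea can be sketched in Python as a lookup table of the words in one sequence, followed by a count of shared words along each diagonal of the comparison. This covers only the first stage of such a search, not the rescoring and alignment steps that follow; the function names, the word length and the flanking bases in the example are illustrative assumptions.

```python
from collections import defaultdict


def ktuple_diagonals(query, subject, k=4):
    """Count shared k-tuples per diagonal (subject position minus query position)."""
    index = defaultdict(list)
    for i in range(len(query) - k + 1):
        index[query[i:i + k]].append(i)
    diagonals = defaultdict(int)
    for j in range(len(subject) - k + 1):
        for i in index.get(subject[j:j + k], ()):
            diagonals[j - i] += 1
    return dict(diagonals)


def best_diagonal(query, subject, k=4):
    """Return (diagonal, hit count) for the diagonal richest in shared k-tuples."""
    diagonals = ktuple_diagonals(query, subject, k)
    if not diagonals:
        return None
    return max(diagonals.items(), key=lambda item: item[1])


# The query is embedded in the subject after five arbitrary flanking bases.
query = "acttttagcacttttgtgg"
subject = "ccccc" + query + "ggg"
print(best_diagonal(query, subject))
```

A diagonal with many word hits marks a region where the two sequences are likely to align, so only those regions need the more expensive scoring.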


4 Visualising protein folding patterns

The sequence of residues of nascent polypeptide chains embodies the ultimate rationale behind all purposeful structures and behaviours of living beings.

Jacques Monod (1970), from Chance and Necessity.


Proteins are the main active agents, essential for the metabolic processes that we associate with life. Understanding the action of proteins is the key to understanding the spark of life itself. Proteins occupy a unique position between Chemistry and Biology. It does not take many non-living single proteins, associated together with a little nucleic acid, for life-like behaviour to emerge. On the borderline of life, HIV (Human Immunodeficiency Virus) has only about ten different types of proteins.

To understand proteins it is necessary to have information about:

1 basic structure and geometry

2 secondary structures


The Biomolecular Interaction Network Database (BIND) is designed to capture protein function, defined at the molecular level as the set of other molecules with which a protein interacts or reacts along with the molecular outcome. XML versions of all data with accompanying DTDs are supported with the NCBI programming toolkit.


 A collection of interaction information, such as BIND, enables the study of the relationships between protein domain architecture and protein–protein interactions. Specifically, it is possible to classify the interactors of a protein into distinct groups based on domain composition.


The Conserved Domain Architecture Retrieval Tool (CDART) performs similarity searches of the NCBI Entrez Protein Database based on domain architecture, defined as the sequential order of conserved domains in proteins. The algorithm finds protein similarities across significant evolutionary distances using sensitive protein domain profiles rather than by direct sequence similarity. Proteins similar to a query protein are grouped and scored by architecture. Relying on domain profiles allows CDART to be fast, and, because it relies on annotated functional domains, informative. Domain profiles are derived from several collections of domain definitions that include functional annotation(7).
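Grouping and scoring by architecture can be pictured with a toy Python sketch in which an architecture is simply the ordered tuple of a protein's domains. The scoring used here (the number of distinct domain types shared with the query) is a crude stand-in, since the text does not describe CDART's actual scoring, and the protein names and domain lists are hypothetical.

```python
from collections import defaultdict


def group_by_architecture(proteins):
    """Group protein names by their ordered tuple of domains."""
    groups = defaultdict(list)
    for name, domains in proteins.items():
        groups[tuple(domains)].append(name)
    return dict(groups)


def shared_domain_score(query_domains, architecture):
    """Crude score: number of distinct domain types shared with the query."""
    return len(set(query_domains) & set(architecture))


def rank_architectures(query_domains, proteins):
    """Return (architecture, proteins, score) tuples, best-scoring architecture first."""
    scored = [
        (architecture, names, shared_domain_score(query_domains, architecture))
        for architecture, names in group_by_architecture(proteins).items()
    ]
    return sorted(scored, key=lambda item: item[2], reverse=True)


# Hypothetical proteins and domain annotations, for illustration only.
proteins = {
    "protein_A": ["zinc_finger", "BRCT", "BRCT"],
    "protein_B": ["zinc_finger", "BRCT", "BRCT"],
    "protein_C": ["BRCT"],
    "protein_D": ["kinase"],
}
query_domains = ["zinc_finger", "BRCT", "BRCT"]
ranked = rank_architectures(query_domains, proteins)
for architecture, names, score in ranked:
    print(score, architecture, names)
```

Because the comparison works on domain labels and not on residues, two proteins can be grouped together even when their sequences have diverged too far for a direct sequence search to relate them.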


The CDART query page can be found on the NCBI web site. On this page, one may enter a sequence accession or a FASTA-formatted sequence.


For this example we enter the accession for human BRCA1 (a tumor suppressor protein), NP_009225. Pressing the search button runs RPS-BLAST on the sequence, comparing the sequence to the domain definitions in the CDD database. The search completes in a few seconds, and the results are displayed in the figure below (7).

Figure: Domains found in BRCA1 are shown in beads-on-a-string style at the top of the page and include zinc fingers and BRCT protein-protein interaction domains. Similar domain architectures are listed below using the same style. If an architecture contains more than one protein, it is preceded by a graphical icon, and clicking on the icon gives the full list of proteins with that architecture. At the bottom of the page are controls to subset the list of architectures by taxonomy and by domain.










Simulate 3D protein-protein interactions.

One of the current interests in Bioinformatics is to accelerate the expensive drug discovery process and the pharmaceutical industry has embraced genomics as a source of drug targets.

Binary relationships, that is, relationships between two objects, are used, and this allows an integrated analysis that can extract biological features more accurately.

Protein-protein interactions result in the formation of transient or stable multi-subunit complexes.

No research has yet been validated by bioinformatics alone to date, but the drug industry is planning to invest more money to make more useful and purposeful tools available on NCBI (e.g. Quantitative Structure Activity Relationships (QSAR)).


Cn3D is another helper application for the web browser that allows 3-dimensional structures from NCBI to be viewed. It simultaneously displays structure, sequence, and alignment, and now has powerful annotation and alignment editing features.

The application can be easily downloaded from the NCBI site. Cn3D runs on Windows, Macintosh, and Unix.

Other programmes such as RasMol, which have been cited in the literature, are however not available on the NCBI site.










6 Conclusion

The NCBI web site continuously keeps pace with new technology and trends by regularly updating its software programmes and developing new tools.

There is a tremendous amount of information at hand which can be used with relatively little knowledge of the art. However, this information does not always end up in the right hands. It was reported in July 2002 that scientists from the State University of New York at Stony Brook had successfully synthesised the polio virus using sequence data available on the net. This could have been misused for biological warfare purposes in the hands of unscrupulous individuals. Nevertheless, one has to accept that, although this risk is present, the fast pace that research has gained via this system is more important. This was recently illustrated when the SARS virus research was done at lightning speed, thanks to concerted efforts made possible by the use of tools such as those at NCBI.