Brief Overview on some DNA Analysis tools
Harris Ramuth, BSc (Hons) MSc
Central Health Laboratory
Candos
Email: bioanalyst@gmail.com
The
http://www.ncbi.nlm.nih.gov/
1 Introduction
Computers are essential nowadays because they perform four interdependent
functions that are of utmost importance in Bioinformatics: they are an
important means of communication, they permit computation with the numerous
algorithms in software packages supported by valid statistical evidence, they
allow control of the data at hand, and they provide a means of storing
information.
Scientists all over the world have access to computers and can therefore use
this information around the clock, and the same information is also available
to the general public. The parallel developments taking place in computing and
Artificial Intelligence have inevitably linked Molecular Biology to computers.
Gregor Mendel's first principles, the deciphering of the genetic code and the
Central Dogma of Molecular Biology as defined by Francis Crick are the
fundamentals that have transformed the field of Genetics.
The advent of refined sequencing tools, enhanced by PCR techniques and
cloning, supplied the data for large-scale sequencing efforts: the publicly
funded Human Genome Project, as originally proposed by twelve biologists in
1985, and the parallel programme of Craig Venter's Celera Genomics, a private
company.
By 2002, scientists had already sequenced, or provided a draft of, the entire
genome of at least eight other model organisms, including Mus musculus (the house mouse) and Saccharomyces cerevisiae (baker's yeast).
Mass spectrometry and two-hybrid screens are fast-evolving techniques, and the
known cell map is growing rapidly with new data on the structure of cell
signaling and metabolic networks.
To keep pace with these findings, visualization and analysis tools (e.g. BIND)
are available to assist in understanding this complex data. (1)(2) The NCBI
contained more than 800,000 nonredundant protein sequences in 2002(3), and this
figure is increasing.
Various centres and institutes have taken up the task of ensuring that the
growing body of information from molecular biology and genome research is
placed in the public domain and is freely accessible to all parts of the
scientific community in ways that promote scientific progress.
Examples include non-profit academic organisations such as the European
Bioinformatics Institute (EBI), the Sanger Centre and the UK MRC Human Genome
Mapping Project, to name only a few.
The National Center for Biotechnology Information (NCBI) is the leading
information provider within the National Library of Medicine (NLM) and is
located on the campus of the National Institutes of Health (NIH).
The specific aim of the NCBI is to create automated systems to gather, analyse
and store biological information, and it does so by facilitating user access
to databases and software. It maintains the GenBank database and is associated
with large databases such as the EMBL and the DNA Data Bank of Japan (DDBJ).
GenBank is the NIH genetic sequence database, an annotated collection of all
publicly available DNA sequences.
Sequences from more than 165,000 named organisms are present, obtained from
individual laboratories and from batch submissions by large-scale sequencing
projects. Submissions are made with the Web-based BankIt or the standalone
Sequin program, and GenBank staff assign accession numbers to the data(4).
BankIt is normally used for short, uncomplicated sequences that require no
sequencing tools. BankIt transforms the data into GenBank format for review,
and when the record is complete it can be submitted directly to GenBank.
However, Sequin is used instead for long or complex submissions and for
mutation, phylogenetic, population, environmental or segmented sets. It is
also used when graphical viewing, editing options (including the alignment
editor) or network access to related analytical tools are desired.
2 Sequence Retrieval and seeking information (Entrez)
The NCBI developed an indexing and retrieval system, Entrez, similar to the
EBI Sequence Retrieval System (SRS), that facilitates access to a range of
bio-databanks. Both molecular biology data (e.g. Online Mendelian Inheritance
in Man (OMIM)) and bibliographic citations from integrated databases (e.g.
PubMed) are available through Entrez.
It integrates data from a large number of sources,
formats and databases into a uniform information model and retrieval system.
Unlike the SRS, the Entrez system uses a customised search in which only the
databases selected by the user are accessed. This can be a disadvantage to an
unaccustomed user, who may assume that the search is made automatically across
all the databases linked to Entrez and may therefore end up with no
information.
Entrez has an advantage over conventional search engines (such as Google,
Yahoo! etc.) in that it is able to select terms on its own and is undemanding
in its browser requirements. It also avoids returning information about other
connotations of a term, so no time is wasted.
However, a basic knowledge of the database being accessed is mandatory to
optimise the use of the Entrez portal. Biological data expands at such a rapid
pace that not all databases are fully up to date. Since the system does not
check all the databases on a single query, the user needs to know which
databases are relevant so that subsequent searches can be made for analysis. A
user who accesses the wrong database may end up with the wrong conclusions.
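The point that each query targets one chosen database can be illustrated with a small sketch. It assumes programmatic access through NCBI's E-utilities `esearch` endpoint, which the text above does not mention; the helper name `build_esearch_url` and the example search term are hypothetical. The sketch only builds the query URL and makes no network request.

```python
from urllib.parse import urlencode

# Assumed endpoint: the E-utilities "esearch" service (not named in the text above).
EUTILS_ESEARCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"


def build_esearch_url(database: str, term: str, max_results: int = 20) -> str:
    """Build a search URL for ONE Entrez database.

    The database must be named explicitly; a query is not run
    across every Entrez database automatically.
    """
    params = {"db": database, "term": term, "retmax": max_results}
    return EUTILS_ESEARCH + "?" + urlencode(params)


# The same term sent to two different databases gives two different queries.
protein_url = build_esearch_url("protein", "BRCA1 AND human[organism]")
pubmed_url = build_esearch_url("pubmed", "BRCA1 AND human[organism]")
print(protein_url)
print(pubmed_url)
```

Running the same term against `protein` and against `pubmed` returns different kinds of records, which is why a user who picks the wrong database can come away with nothing useful.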
The Entrez system also maintains search histories, which can be used to
review, revise or combine the results of the most recent 100 searches.
The Entrez
system accommodates databases of nucleotide sequences, protein sequences,
macromolecular structures and whole genomes. It supports both inter- and
intra-database linking.
Results of an Entrez search can be displayed in a variety of formats and
handled in a variety of ways (placed on the clipboard feature, saved to a disc
or a floppy drive, or printed).
(nucleotide
sequences)
The most popular data formats in bioinformatics are FASTA, PHYLIP and MAML.
FASTA, BLAST and the Smith-Waterman algorithm are useful and powerful search
tools available at NCBI. They make it possible to align sequences rapidly
against a variety of updated databases.
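As a rough illustration of the Smith-Waterman idea (local alignment by dynamic programming), the sketch below computes only the best local alignment score. The scoring values (match 2, mismatch -1, linear gap -2) are assumptions chosen for the example, not parameters taken from NCBI tools, and real implementations use substitution matrices and affine gap penalties.

```python
def smith_waterman_score(a: str, b: str, match: int = 2,
                         mismatch: int = -1, gap: int = -2) -> int:
    """Return the best local alignment score between sequences a and b."""
    prev = [0] * (len(b) + 1)   # previous row of the scoring matrix
    best = 0
    for i in range(1, len(a) + 1):
        curr = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            # A cell never drops below zero: this is what makes the alignment local.
            curr[j] = max(0, diag, prev[j] + gap, curr[j - 1] + gap)
            best = max(best, curr[j])
        prev = curr
    return best


print(smith_waterman_score("TTACGTTT", "GGACGGG"))  # shared "ACG" scores 3 matches
```

Only two rows of the matrix are kept, which is enough for the score; recovering the aligned residues themselves would need the full matrix and a traceback.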
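A sequence within a FASTA sequence file consists of three parts: a header line
beginning with ">" followed by a sequence identifier, an optional description
on the same line, and one or more lines of sequence data.
Example of a sequence in FASTA format: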
>gi|282349|pir||A41961 chitinase (EC 3.2.1.14) D - Bacillus circulans
LNQAVRFRPVITFALAFILIITWFAPRADAAAQWQAGTAYKQGDLVTYLNKDYECIQPHTALTGWKYVGEGTGGGTPTPDTTPPTVPAGLTSSLVTDTSVNLTWNASTDNVGVTGYEVYRNGTLVTAVVTGLTAGTTYVFTVKAKDAAGNLSAASTSLSVTTSTGSSNPGPSGSKWLIGYWHNFDNQFEDSLKSIISTYGFNGLDIDLEGSSLSLNAGDTDFRSPTTPKIVNLINGVKALKSHFGANFVLTAYVQGGYLNYGGPWGAYLPVIHALRNDLTLLHVQHYNTGSMVGLDGRSYAQGTADFHVAMGGSSGPFFSPLRPDQIAIGVPASQQAAGGGYTAPAELQKALNYLIKGVSYGGSYTLRQLRAMSVSRAL
Source: http://www-hto.usc.edu/software/seqaln/doc/fasta-format.html
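A FASTA record of this kind can be read with a few lines of code. The following is a minimal sketch, not a full parser: the function names are hypothetical, and the short record used to exercise it is made up for illustration rather than taken from any database.

```python
from typing import Iterator, Tuple


def _split_header(header: str) -> Tuple[str, str]:
    """Split a header (without '>') into identifier and optional description."""
    parts = header.split(None, 1)
    identifier = parts[0] if parts else ""
    description = parts[1] if len(parts) > 1 else ""
    return identifier, description


def parse_fasta(text: str) -> Iterator[Tuple[str, str, str]]:
    """Yield (identifier, description, sequence) for each FASTA record in text."""
    header = None
    chunks = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:           # emit the previous record
                yield _split_header(header) + ("".join(chunks),)
            header = line[1:]
            chunks = []
        elif header is not None:
            chunks.append(line)              # sequence may span several lines
    if header is not None:                   # emit the last record
        yield _split_header(header) + ("".join(chunks),)


# Hypothetical two-line record, for illustration only.
example = ">seq1 hypothetical example protein\nMKT\nAYI\n"
records = list(parse_fasta(example))
print(records)
```

The identifier is taken to be everything up to the first whitespace in the header, which matches the chitinase example above, where `gi|282349|pir||A41961` is followed by a free-text description.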
The Basic Local Alignment Search Tool (BLAST)
finds regions of local similarity between sequences. The program compares
nucleotide or protein sequences to sequence databases and calculates the
statistical significance of matches.
BLAST can be used to infer functional and evolutionary relationships between
sequences as well as help identify members of gene families
(http://www.ncbi.nlm.nih.gov/BLAST/).
BLAST is based on mathematical statistics coupled with human intuition, and
the BLAST algorithm permits a straightforward and rapid turnaround of sequence
searches in the various databases.
It scores segment pairs (a segment pair is a pair of sub-sequences of the same
length that form an ungapped alignment) in what is known as a pairwise
alignment technique.
The concept of the E
value in the BLAST search is based on the view that the database is a
collection of independent objects (sequences).
BLAST and PSI-BLAST now permit calculated E-values to take into account
the amino acid composition of the individual database sequences involved in
reported alignments.
This improves E-value
accuracy, thereby reducing the number of false positive results.
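To make the E value more concrete, the sketch below uses the standard Karlin-Altschul relations, which the text above does not spell out: a raw score S is normalised to a bit score S' = (lambda * S - ln K) / ln 2, and the expected number of chance hits is E = m * n * 2^(-S'), where m and n are the query and database lengths. The function names are hypothetical, lambda and K depend on the scoring system, and real BLAST uses corrected "effective" lengths, so this is an approximation under those assumptions.

```python
import math


def bit_score(raw_score: float, lam: float, k: float) -> float:
    """Normalise a raw alignment score: S' = (lambda * S - ln K) / ln 2."""
    return (lam * raw_score - math.log(k)) / math.log(2)


def expect_value(bits: float, query_len: int, db_len: int) -> float:
    """Expected number of chance alignments scoring at least `bits`: E = m * n * 2**(-S')."""
    return query_len * db_len * 2.0 ** (-bits)


# Illustrative numbers only: a 10-bit hit in a 32 x 32 search space.
print(expect_value(10, 32, 32))
```

Because E scales with the size of the search space, the same bit score is less significant in a large database than in a small one, which is one reason an E value should not be read in isolation.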
Recent mathematical results on the stochastic properties of maximal segment
pair (MSP) scores allow an analysis of the performance of this method as well
as of the statistical significance of the alignments it generates.
For protein comparisons, a variety of definitional, algorithmic and
statistical refinements permit the execution time of the BLAST programs to be
decreased substantially while enhancing their sensitivity to weak
similarities. A new criterion for triggering the extension of word hits,
combined with a new heuristic for generating gapped alignments, yields a
gapped BLAST program that runs at approximately three times the speed of the
original.
In addition, a method is introduced for automatically
combining statistically significant alignments produced by BLAST into a
position-specific score matrix, and searching the database using this matrix.
The resulting Position-Specific Iterated BLAST
(PSI-BLAST) program runs at approximately the same speed per iteration as
gapped BLAST, but in many cases is much more sensitive to weak but biologically
relevant sequence similarities.
PSI-BLAST has been used to uncover several new and interesting members of the
BRCT superfamily.
BLAST on the NCBI site is used extensively for a huge variety of purposes. As
an example, it is used to compare the genome of Caenorhabditis elegans with
that of its evolutionary counterpart Caenorhabditis briggsae. Extensive work
exists on these nematodes, and WormBase, which contains the sequences of both
worms, is widely used. This research has shed considerable light on a variety
of fields, including Developmental Biology, apoptosis and Evolution.
The Sanger Center and the
The C. elegans genome was the first multicellular eukaryotic genome to be
completed. The genome in this release, WormBase WS97, is 100.274 Mb, organized
in six chromosomes. The Sequencing Consortium has annotated the genome with
19,542 predicted coding sequences, or 21,437 when counting 1,891 alternate
splice forms. (http://www.ncbi.nlm.nih.gov/mapview/map)
WormBase contains all the nucleotide sequences; an example of a nucleotide
sequence is shown below.
Sequence Details for: B0212
tatgtgcttt ttctaacttt tatgtgcttt gtttaagtat
cacaaaagat gtacaatgat gacgagctca cggaattcat ggatttctta gtggaacaga caaaggattc
gatttaccca atggtggcgg caaaagtatt caaacaattc cccaatcgtcgtttgatccc aaaagataag
cgttatcaaa ggtaaaaaga cctggacttc gctttgttca ccaatattct gttcagattt gtcttacaac
ttgcaccaaa aatgaacgat tggaataact acagtattga ggcacgaatt cgattgatgt atgcgctgcg tggaaaagtt
gaagatgatt ttttggcaag gtgggaatga caccactttt caagattaaa tctgaaaaag gcgtccgcgg ccaacggaaa
gtaccggcaa tttcggaaat tttgaattct aattacatat tgaaataaaa aaaaaacttt cacgttgatt tttagtattc
ggttagtcat tcggctagac agaagttcct tcattcaatt tttcatctaa aattgatata aggatgtaga aaaaggtgtg
ttctatccga acgtaacatt tcacactttt agcacttttg tggaaaacgg caactttcag gcactatacg aaattttttt
gaagcttttc aaccaaaata attgttattt tcagatagaa ggaatctttt tctacattca tatatcagct ttgtttgaaa
gaagttagtt tatcttgttg aaatatgaag ttttgtgcgt ttgccgaaaa aaaaatcgac gcccatccct gatcataact
tcaatgatta aaccattttg agcatgaaaa gagcaaattg ttcagtacaa ttcatgttta ggctctgaaa tctacaaaaa
gtatctaata tacattaccg aaaaattccg gaaatcaatt gccgcactcc cctgatctga ataatttgca gaattgaagc
tcatggaact gttcagctca atgaaaaaca aagaatctcc aaatttacag cgaatgacgg aacatttttt ttggaagatg
atcgtaaacg tttatcacta gacgctacaa gactcgaatt cgtcaagctc atgggctttc tcattgaaaa
aaccaaagat gccgttgaac cattgccgaa cacaaaacta
atttttcagg agtttagcca aattgaacct gttaagaagc ctaatgatac tgacataatg tataagtgag
atgagcattc catgttttct ttacacaaat ttaattcaga tttcgcagta
Source: http://unc.wormbase.org/db/seq/sequence
Submitting a sequence to the BLAST programme can reveal a variety of
information. In this case a known sequence was entered and searched with BLAST
against the Caenorhabditis elegans genome for sequence similarity. The result
obtained is shown below:
Query: 1    acttttagcacttttgtggaaaacggcaactttcaggcactatacgaaannnnnnngaag 60
            |||||||||||||||||||||||||||||||||||||||||||||||||       ||||
Sbjct: 595  acttttagcacttttgtggaaaacggcaactttcaggcactatacgaaatttttttgaag 654

Query: 61   cttttcaaccaaaataattgttattttcagatagaaggaatctttttctacattcatata 120
            ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
Sbjct: 655  cttttcaaccaaaataattgttattttcagatagaaggaatctttttctacattcatata 714

Query: 121  tcagctttgtttgaaagaagttagtttatcttgttgaaatatgaagttttgtgcgtttgc 180
            ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
Sbjct: 715  tcagctttgtttgaaagaagttagtttatcttgttgaaatatgaagttttgtgcgtttgc 774

Query: 181  cgnnnnnnnnntcgacgcccatccctgatcataacttcaatgattaaaccattttgagca 240
            ||         |||||||||||||||||||||||||||||||||||||||||||||||||
Sbjct: 775  cgaaaaaaaaatcgacgcccatccctgatcataacttcaatgattaaaccattttgagca 834

Query: 241  tgaaaagagcaaattgttcagtacaattcatgtttaggctctgaaatctacaaaaagtat 300
            ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
Sbjct: 835  tgaaaagagcaaattgttcagtacaattcatgtttaggctctgaaatctacaaaaagtat 894

Query: 301  ctaatatacattaccgaaaaattccggaaatcaattgccgcactcccctgatctgaataa 360
            ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
Sbjct: 895  ctaatatacattaccgaaaaattccggaaatcaattgccgcactcccctgatctgaataa 954

Query: 361  tttgcagaattgaagctcatggaactgttcagctcaatgaaaaacaaagaatctcc 416
            ||||||||||||||||||||||||||||||||||||||||||||||||||||||||
Sbjct: 955  tttgcagaattgaagctcatggaactgttcagctcaatgaaaaacaaagaatctcc 1010

Score = 42.1 bits (21), Expect = 1.3
Identities = 27/29 (93%)
Strand = Plus / Plus

The results have to be verified and handled expertly before any conclusion is drawn. E values give a good indication of the validity of a similarity, but they are certainly not the sole criterion to consider.
The FASTA algorithm is based upon the idea of identifying short words, or
k-tuples, common to both sequences under comparison. It was described by Lipman
and Pearson in 1985 (Lipman, D.J. and Pearson, W.R. (1985) Rapid and sensitive
protein similarity searches. Science, 227, 1435-1441.) FASTA thus follows the
tradition of combinatorial optimization.
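The k-tuple idea can be sketched as follows: index every word of length k in the query, look up each word of the subject, and count hits per diagonal (subject position minus query position). This covers only the word-matching stage, under the assumption of exact word matches; the function name is hypothetical, and the later rescoring and joining stages of FASTA are not shown.

```python
from collections import Counter, defaultdict


def ktuple_diagonals(query: str, subject: str, k: int = 2) -> Counter:
    """Count shared k-tuples per diagonal (subject offset minus query offset)."""
    index = defaultdict(list)
    for i in range(len(query) - k + 1):
        index[query[i:i + k]].append(i)          # word -> positions in the query
    diagonals = Counter()
    for j in range(len(subject) - k + 1):
        for i in index.get(subject[j:j + k], ()):
            diagonals[j - i] += 1                # hits on one diagonal line up ungapped
    return diagonals


hits = ktuple_diagonals("ACGT", "TTACGT", k=2)
print(hits.most_common(1))
```

A diagonal that collects many hits marks a region where the two sequences line up without gaps, so it is a natural place to start a more careful alignment.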
4 Visualising protein folding patterns
The sequence
of residues of nascent polypeptide chains embodies the ultimate rationale
behind all purposeful structures and behaviours of living beings.
Jacques Monod (1970), from Chance and Necessity.
Proteins are the main active agents, essential for the metabolic processes
that we associate with life to take place. Understanding the action of
proteins is the key to understanding the spark of life itself. Proteins occupy
a unique position between Chemistry and Biology.
It does not take many non-living single proteins, associated with a little
nucleic acid, for life-like behaviour to emerge. On the borderline of life,
HIV (Human Immunodeficiency Virus) has only about ten different types of
proteins.
To understand proteins it is necessary to have
information about:
1 basic structure and geometry
2 secondary structures
The Biomolecular Interaction Network Database (BIND) is designed to capture
protein function, defined at the molecular level as the set of other molecules
with which a protein interacts or reacts, along with the molecular outcome.
XML versions of all data, with accompanying DTDs, are supported with the NCBI
programming toolkit.
A collection
of interaction information, such as BIND, enables the study of the
relationships between protein domain architecture and protein–protein
interactions. Specifically, it is possible to classify the interactors of a
protein into distinct groups based on domain composition.
The Conserved Domain Architecture Retrieval Tool
(CDART) performs similarity searches of the NCBI Entrez Protein Database based
on domain architecture, defined as the sequential order of conserved
domains in proteins. The algorithm finds protein similarities across
significant evolutionary distances using sensitive protein domain
profiles rather than by direct sequence similarity. Proteins similar
to a query protein are grouped and scored by architecture. Relying
on domain profiles allows CDART to be fast, and, because it relies
on annotated functional domains, informative. Domain profiles are
derived from several collections of domain definitions that include
functional annotation(7).
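The notion of grouping proteins by domain architecture can be shown with a toy sketch. It assumes each protein is already annotated with an ordered list of domain names; the protein names, the domain lists and the overlap score are all hypothetical, and the scoring CDART actually uses is not described in the text above.

```python
from collections import defaultdict
from typing import Dict, List, Sequence, Tuple


def group_by_architecture(proteins: Dict[str, Sequence[str]]) -> Dict[Tuple[str, ...], List[str]]:
    """Group protein names by their ordered tuple of domains (the architecture)."""
    groups = defaultdict(list)
    for name, domains in proteins.items():
        groups[tuple(domains)].append(name)
    return dict(groups)


def shared_domain_score(query_domains: Sequence[str], architecture: Sequence[str]) -> int:
    """Toy score: number of distinct domain types shared with the query."""
    return len(set(query_domains) & set(architecture))


# Hypothetical annotations, for illustration only.
annotated = {
    "protA": ["RING", "BRCT", "BRCT"],
    "protB": ["RING", "BRCT", "BRCT"],
    "protC": ["BRCT"],
    "protD": ["kinase"],
}
groups = group_by_architecture(annotated)
query = ["RING", "BRCT", "BRCT"]
ranked = sorted(groups, key=lambda arch: shared_domain_score(query, arch), reverse=True)
print(ranked[0], groups[ranked[0]])
```

Comparing architectures rather than residues is what lets this kind of search relate proteins whose sequences have diverged too far for direct alignment.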
The CDART query page can be found on the Internet at http://www.ncbi.nlm.nih.gov/Structure/lexington/lexington.cgi.
On this page, one may enter a sequence accession or FASTA formatted
sequence.
For this example we enter the accession for human BRCA1 (a tumor suppressor
protein), NP_009225. Pressing the search button runs RPS-BLAST on the
sequence, comparing it to the domain definitions in the CDD database. The
search completes in a few seconds, and the results are displayed in the figure
below(7).
Figure: Domains found in BRCA1 are shown in
beads-on-a-string style at the top of the page and include zinc fingers and
BRCT protein-protein interaction domains. Similar domain architectures are
listed below using the same style. If an architecture contains more than one
protein, it is preceded by a graphical icon, and clicking on the icon gives the
full list of proteins with that architecture. At the bottom of the page are
controls to subset the list of architectures by taxonomy and by domain.

5 Simulate 3D protein-protein interactions
One of the current interests in Bioinformatics is to
accelerate the expensive drug discovery process and the pharmaceutical industry
has embraced genomics as a source of drug targets.
Binary relationships (that is, relationships between two objects) are used,
and this allows an integrated analysis that can extract biological features
more accurately.
Protein-protein interactions result in the formation of transient or stable
multi-subunit complexes.
To date, no research has been validated by bioinformatics alone, but the drug
industry is planning to invest more money to make more useful and purposeful
tools available on NCBI (e.g. Quantitative Structure Activity Relationships
(QSAR)).
Cn3D is a helper application for the web browser that allows 3-dimensional
structures from the NCBI to be viewed. It simultaneously displays structure,
sequence, and alignment, and now has powerful annotation and alignment editing
features.
The application can be easily downloaded from the NCBI site. Cn3D runs on
Windows, Macintosh, and Unix.
Other programmes cited in the literature, such as RasMol, are however not
available on the NCBI site.
6 Conclusion
The NCBI web site continuously keeps pace with new technology and trends by
regularly updating its software programmes and developing new tools.
There is a tremendous amount of information at hand which can be used with a
relatively small knowledge of the art. However, this information does not
always end up in the right hands. It was reported in July 2002 that scientists
from the State University of New York at Stony Brook had successfully
synthesised the polio virus using sequence data available on the net. In the
hands of unscrupulous individuals, this could have been misused for biological
warfare purposes. Nevertheless, although this risk has to be accepted as real,
the fast pace that research has gained via this system is more important. This
was recently illustrated by SARS virus research, which was carried out at
lightning speed thanks to combined efforts made possible by tools such as
those of the NCBI.