Genomic Contribution and information retrieval techniques in bioinformatics | Gene retrieval techniques in bioinformatics

Genomic Contribution and Information Retrieval Techniques in Bioinformatics for Big Data Analysis: A Complete Study of the Involvement of Algorithms in Retrieving Biological Data

Abstract

This article reviews the application of information retrieval (IR) techniques to high-throughput computational resources in bioinformatics. A number of methods have been developed in this area to interpret a user's query and to retrieve data by exploring linked systems such as relational databases. Bioinformatics provides portals through which life scientists access sequence, structure, and function data, and IR systems support structured queries over these databases, which in turn enables digital sequence analysis in genomics. On a user's request, similarity searches against these databases support tasks such as genome annotation, analysis of gene expression and regulation, protein expression, mutations in cancer, comparative genomics, computational evolutionary biology, modeling of biological systems, high-throughput image analysis, and literature search. IR technology is therefore used to make a number of databases interoperate on a relational basis so that data can be retrieved through a common methodology.

Keywords: information retrieval, bioinformatics, indexing, genomics, database

I. INTRODUCTION

Information retrieval (IR) is a set of processes in computer science in which a number of techniques are used to query a collection of objects, typically free-text documents, in order to retrieve relevant information [1,2]. The underlying data may be heterogeneous, structured, semi-structured, or unstructured, and is distributed across a large space [3]. Keywords are therefore used to make the search accurate and to retrieve the data of interest by processing the query issued by the end user [4]. For example, a user interested in scientific papers in computational biology can issue the query “homo species” to retrieve the available data about humans. Information retrieval is thus a very broad term [5].

IR also overlaps with many other fields, such as database technology and natural language processing. Mainstream information retrieval is embodied in search engines such as Yahoo, Google, AltaVista, and Excite, which are used by millions of users worldwide. In bioinformatics they are used for specific tasks such as finding information about a gene, a protein, its 3-D structure, or its coding sequence, all of which are already stored in different databases. Managed online databases also hold numerous documents and return the desired information in response to the end user's query. Information retrieval therefore supports logical tasks in bioinformatics in a way that maximizes throughput and minimizes time [6].

A. Traditional databases and IR

With the rapid growth of the World Wide Web (WWW), data retrieval has become more important, and IR technology is needed to relate data globally so that anyone can search it and obtain results for a given query. This ability rests on the relational organization of data offered by the major components of a data management system. Merging the two kinds of system gives the best retrieval for both homogeneous and heterogeneous data sets. IR indexing can be integrated with a relational database management system (RDBMS); for example, proximity and synonym indexing are now combined with RDBMS-based processing. An IR system can also use its index with an RDBMS as a front end for searching documents in various formats (HTML, PDF, and Word). IR is often preferred over an RDBMS for accessing text because it is simple to use and can retrieve data from long pages of three or more paragraphs. An RDBMS such as MS SQL, on the other hand, cannot search text the way an IR system does unless proprietary extensions are used; SQL is therefore used mainly for relational data.

Fig. 1 Overview of an integrated information retrieval system. Each line represents the extraction and conversion of information through the appropriate system. Entrez is also under continuous evolution to retrieve meaningful information based on logic; the RDBMS is therefore the main component that allows an IR system to extract biological data. Information in bioinformatics is not easy to handle, so combining these two contrasting fields makes the data accessible.

II. MATERIALS AND METHODS

A. Document indexing and normalization

A number of existing methods handle documents digitally and speed up their retrieval by using term indexing. In computer science, indexing supports the ranking of documents through techniques such as the vector space model, which measures the similarity between documents using the cosine of the angle between their term vectors. It is notable that results for a user query also rely on the top-returned documents [7]. Reliable results depend on ignoring stop words and other uninformative words, so that the index points to the locations of the items of interest, much like the electronic analog of a textbook index. Almost all indexing models compute the frequency of the query terms within each document and rank the documents by a weighting factor. For example, their weighting formula is given below.
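The weighting formula itself does not survive in the text. A common choice in the vector space model, assumed here purely for illustration, is TF-IDF, where the weight of term t in document d is tf(t, d) * log(N / df(t)), with N the number of documents and df(t) the number of documents containing t. The sketch below builds such vectors and compares documents by cosine similarity; the function names and the sample documents are hypothetical.

```python
import math
from collections import Counter


def tokenize(text):
    """Lowercase and split on whitespace (a deliberately simple tokenizer)."""
    return text.lower().split()


def tfidf_vectors(docs):
    """Return one {term: weight} dict per document, weight = tf * log(N / df)."""
    tokenized = [tokenize(d) for d in docs]
    n_docs = len(tokenized)
    df = Counter()
    for tokens in tokenized:
        df.update(set(tokens))  # count each term once per document
    vectors = []
    for tokens in tokenized:
        tf = Counter(tokens)
        vectors.append({t: tf[t] * math.log(n_docs / df[t]) for t in tf})
    return vectors


def cosine(u, v):
    """Cosine of the angle between two sparse term vectors."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


# Hypothetical three-document collection.
docs = [
    "gene expression in human cancer",
    "protein structure and gene function",
    "human protein structure database",
]
vectors = tfidf_vectors(docs)
```

With these sample documents, the second and third documents share two terms and therefore score a higher cosine with each other than the first and third, which share only one.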

Indexing can also be approached conceptually: various techniques find words and phrases and map them to synonyms by looking them up in dictionary-based knowledge sources. An index can be built in several ways, depending on how users search a particular data set in a specific field such as bioinformatics. The “global term frequency index” records how many times each distinct term occurs in the entire document collection. The “document term frequency index” records how often a particular term occurs in each document. The “proximity index” records the position of each individual term, as a word or paragraph offset, and allows a researcher to require that two or more terms occur within the same sentence or document.
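The three indexes described above can be sketched with plain dictionaries. This is a minimal illustration under simple assumptions (whitespace tokenization, word offsets only); the function names and document ids are hypothetical.

```python
from collections import Counter, defaultdict


def build_indexes(docs):
    """Build the three indexes from {doc_id: text}.

    Returns (global_tf, doc_tf, positions):
      global_tf[term]         -> occurrences in the whole collection
      doc_tf[doc_id][term]    -> occurrences in one document
      positions[term][doc_id] -> list of word offsets (proximity index)
    """
    global_tf = Counter()
    doc_tf = {}
    positions = defaultdict(lambda: defaultdict(list))
    for doc_id, text in docs.items():
        tokens = text.lower().split()
        doc_tf[doc_id] = Counter(tokens)
        global_tf.update(tokens)
        for offset, term in enumerate(tokens):
            positions[term][doc_id].append(offset)
    return global_tf, doc_tf, positions


def within_distance(positions, term_a, term_b, doc_id, max_gap):
    """True if the two terms occur within max_gap words of each other."""
    pos_a = positions.get(term_a, {}).get(doc_id, [])
    pos_b = positions.get(term_b, {}).get(doc_id, [])
    return any(abs(a - b) <= max_gap for a in pos_a for b in pos_b)


# Hypothetical two-document collection.
docs = {
    "d1": "brca1 gene mutation in breast cancer",
    "d2": "gene expression and gene regulation",
}
global_tf, doc_tf, positions = build_indexes(docs)
```

A proximity query such as `within_distance(positions, "brca1", "mutation", "d1", 2)` then checks the stored offsets instead of rescanning the text.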

It is noteworthy that many documents are retrieved and their features extracted on the basis of queries, but the resulting values are sometimes not comparable. In this case, we first normalize the values for each specific query using well-established methods.

B. Word indexing

In particular cases the list of stop words may need to be expanded. For example, “biology” would be a stop word in a science collection where it and related words occur very frequently. Such words are handled by “normalization”, which reduces the size of the index and makes searching easier for the researcher. Word transformation converts terms to lower case, whatever the case of the query, and removes variation in noun number or verb tense. Thus “biology, bioinformatics, and biotechnology” are normalized to “bio”, and “computer, computing, computers” to “comp”. Another technique is stemming, which reduces words to a root pattern by matching on suffixes and prefixes. Stemming can change results drastically, so a number of algorithms are used to improve them; the Porter stemming algorithm is the most common algorithm for stemming English, and its results are as good as those of other stemming options [8].

But in some cases, words that derive from the same origin share the same stem and character sequence and differ only in the suffix “s” or “es”. Such words are a limitation of stemming and can introduce ambiguity for English text [9,10].
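The normalization steps above (lowercasing, an extensible stop-word list, suffix stripping) can be sketched as follows. This is not the Porter algorithm; it is a deliberately naive suffix stripper, with an assumed stop-word list and assumed suffixes, meant only to show where the “s”/“es” ambiguity comes from.

```python
STOP_WORDS = {"the", "of", "in", "and", "a", "is"}  # assumed, minimal list
SUFFIXES = ("ing", "es", "s")  # checked in this order


def naive_stem(word):
    """Strip one common suffix, keeping at least three characters of stem."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def normalize(text, extra_stop_words=()):
    """Lowercase, drop stop words (optionally extended), and stem each token."""
    stop = STOP_WORDS | set(extra_stop_words)
    tokens = [t.strip(".,;:()\"'").lower() for t in text.split()]
    return [naive_stem(t) for t in tokens if t and t not in stop]
```

For example, `normalize("The Genes of Biology", extra_stop_words={"biology"})` treats “biology” as a collection-specific stop word. Note that this stemmer maps “genes” to “gen” but leaves “gene” unchanged, so the two forms no longer match: a small instance of the limitation described above.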

C. Word Indexing Techniques

A. Signature file

This procedure is based on bit strings: for every document a bit string is created by hashing its words and superimposing the resulting codes. The resulting signatures are stored in a separate file, called a signature file. The signature file is much smaller than the original file, so searching it is much faster [11].
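A minimal sketch of superimposed coding is shown below. The signature width, the number of bits set per word, and the use of MD5 as the hash are all assumptions chosen for illustration, not parameters taken from the text.

```python
import hashlib

SIGNATURE_BITS = 64  # assumed signature width
BITS_PER_WORD = 3    # assumed number of bit positions derived per word


def word_signature(word):
    """Hash a word to a bit mask with up to BITS_PER_WORD bits set."""
    digest = hashlib.md5(word.lower().encode("utf-8")).digest()
    mask = 0
    for i in range(BITS_PER_WORD):
        mask |= 1 << (digest[i] % SIGNATURE_BITS)
    return mask


def document_signature(text):
    """Superimpose (bitwise OR) the signatures of all words in the document."""
    sig = 0
    for word in text.split():
        sig |= word_signature(word)
    return sig


def may_contain(doc_sig, word):
    """False: the word is definitely absent. True: it may be present."""
    w = word_signature(word)
    return (doc_sig & w) == w
```

Because different words can set the same bits, a `True` result is only a candidate match and must be confirmed against the original document; a `False` result is definitive. That one-sided error is the price paid for the small, fast-to-scan signature.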

B. Inversion indices

In the inversion technique, keywords play the central role: every document is represented by a number of keywords that describe its contents. The keywords are stored alphabetically in an index file, and for each keyword the index keeps a list of postings that point to the documents containing it, so information can be retrieved very quickly. This method is used in almost all commercial systems [12].
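The idea can be sketched in a few lines. The document ids and texts below are hypothetical, and tokenization is reduced to lowercasing and splitting on whitespace.

```python
from collections import defaultdict


def build_inverted_index(docs):
    """Map each keyword to a sorted posting list of document ids."""
    postings = defaultdict(set)
    for doc_id, text in docs.items():
        for term in text.lower().split():
            postings[term].add(doc_id)
    # Store the keywords alphabetically, each with its sorted postings.
    return {term: sorted(postings[term]) for term in sorted(postings)}


def lookup_all(index, terms):
    """Return the ids of documents that contain every term (Boolean AND)."""
    result = None
    for term in terms:
        ids = set(index.get(term.lower(), []))
        result = ids if result is None else result & ids
    return sorted(result or [])


# Hypothetical collection.
docs = {
    1: "protein structure database",
    2: "gene sequence database",
    3: "protein sequence alignment",
}
index = build_inverted_index(docs)
```

A query is answered by intersecting posting lists, so its cost depends on the length of those lists rather than on the size of the whole collection.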

D. Evaluation Measures of Documents

Evaluation is a systematic process in which the documents returned for a user query are assessed. The search engine processes the query with a given set of algorithms and ranks the documents by weight or score; the ranking is then evaluated with measures such as precision at position n, normalized discounted cumulative gain (NDCG), and mean average precision (MAP), each of which takes values in the range [0, 1]. These measures are widely used to evaluate document ranking in information retrieval and serve as gold-standard methods.

First, we look at how many of the top documents in a ranked result are relevant; this is precision at rank n. In average precision, n is the number of retrieved documents, and the measure returns a value in [0, 1]; for each query it produces a single value for the document corpus [13].

Relevance here is binary: a value of 1 means the document is considered a match for the query, whereas 0 means it is not (no match was found for the submitted query). Mean average precision (MAP) is a macro-average: it averages the average precision over all queries in the data set, so that each query counts equally. It is used by many web search engines and research systems, and is commonly reported in research papers.
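The measures described so far can be sketched directly from binary relevance labels listed in ranked order. One assumption is made explicit: average precision here is normalized by the number of relevant documents that appear in the ranked list, which is one of several conventions in use.

```python
def precision_at_k(relevance, k):
    """Fraction of the top k results that are relevant (labels are 0 or 1)."""
    if k <= 0:
        return 0.0
    return sum(relevance[:k]) / k


def average_precision(relevance):
    """Mean of precision@k over the ranks k that hold a relevant document."""
    hits = 0
    total = 0.0
    for k, rel in enumerate(relevance, start=1):
        if rel:
            hits += 1
            total += hits / k
    return total / hits if hits else 0.0


def mean_average_precision(runs):
    """Average of average precision over queries; each query counts equally."""
    if not runs:
        return 0.0
    return sum(average_precision(r) for r in runs) / len(runs)
```

For the ranked labels `[1, 0, 1, 0]`, precision at rank 2 is 1/2, and average precision is (1/1 + 2/3) / 2.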

Normalized Discounted Cumulative Gain (NDCG) discounts the gain of each relevant document by the reciprocal of the logarithm of its position in the ranking, and is formulated as follows [14].

In this formulation, each document in the ranked list has a rating, and the normalization constant is chosen so that a perfectly ordered list obtains a score of 1. The measures are applied to data sets whose relevance judgments take the values 1 and 0.
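The formula itself does not survive in the text. One widely used form, assumed here, sums (2^r(j) - 1) / log2(1 + j) over ranks j = 1..n, where r(j) is the rating of the document at rank j, and divides by the same sum computed for the ideal ordering. The base of the logarithm is an assumption; it cancels in the ratio.

```python
import math


def dcg(ratings):
    """Discounted cumulative gain of ratings listed in ranked order."""
    return sum(
        (2 ** r - 1) / math.log2(j + 1)
        for j, r in enumerate(ratings, start=1)
    )


def ndcg(ratings):
    """DCG divided by the DCG of the ideal (descending) ordering."""
    ideal = dcg(sorted(ratings, reverse=True))
    return dcg(ratings) / ideal if ideal > 0 else 0.0
```

A list that is already in the ideal order, such as `[1, 1, 0]`, scores exactly 1, while placing the only relevant document second, as in `[0, 1]`, scores 1 / log2(3).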

E. Areas of Bioinformatics and IR applications

IR is a systematic process that supports the strategies of many kinds of organizations, such as universities, schools, hospitals, and global libraries. In general, IR is used to retrieve information in many applications involving literature, science, journals, books, and other documents [15]. The most common applications of information retrieval are as follows.

A. Library science on a digital basis

Many biological databases store information in digital form. A digital library (a type of information retrieval system) collects and stores data locally and makes it accessible over computer networks that are logically and physically designed on a global basis [15].

B. Search engines and media

Search engines and media search are among the most practical applications of IR for collecting huge numbers of documents. They are also used in information management for browsing, searching, and retrieving data such as digital images; web, desktop, enterprise, remote, mobile, and social search are common examples [15].

The diagram above shows a common framework of an IR system, in which a number of search engines worldwide are used to retrieve data and return a satisfactory result for the user query [16, 17].

C. Bio-business

Bioinformatics is an emerging field in which computational techniques are used to summarize biological data in digital form in order to address challenges in molecular biology. IR is therefore used in the life sciences to find new high-throughput approaches for dealing with the volume (genomes and their ligands) and complexity (from DNA to protein formation) of the data, and to give young researchers better access to analysis and computing tools in order to advance understanding of our genetic legacy and its role in health and disease [18]. Considering all of these processes, we can say that bioinformatics is the marriage of biology and informatics, the latter including mathematics, statistics, management, and computer science [19].

D. Major components of bioinformatics in the area of IR

· Creation of databases for storing and organizing large biological data sets.

· Development and analysis of algorithms, with attention to speed and capability, using statistics to determine the relationships among the members of large data sets.

· Interpretation of various analyses of biological data such as DNA, RNA, protein-protein interactions, protein sequence, structure, and function, gene expression, and biological pathways [20, 21].

IR is a major component in retrieving data for next-generation sequencing in bioinformatics, because databases are created to store nucleotide and amino acid information. Sequences are very important for worldwide research: by matching the most relevant content, many similarities can be found using information retrieval techniques, which supports complex inference. Researchers can retrieve existing data as well as submit new or revised data [22, 23, 24].

Bioinformatics contributes a number of methodologies for the development and implementation of tools that provide an efficient approach to managing information in the life sciences. For example, new algorithms (from computer science, mathematics, and statistics) give accurate results when retrieving data and determining relationships within large data sets, such as locating a gene within a sequence, predicting protein structure, and clustering protein sequences into families of related sequences [25].

IR techniques provide a high-throughput methodology that offers a number of facilities in bioinformatics for retrieving data for digital sequence analysis, such as genome annotation, analysis of gene expression and regulation, protein expression, mutations in cancer, comparative genomics, computational evolutionary biology, modeling of biological systems, high-throughput image analysis, and literature search [26, 27].

E. IR and challenges of bioinformatics

Biology poses a number of problems for researchers who explore large data sets, because species sometimes show biological redundancy and multiplicity: similarity in protein sequences, organisms with similar genes, multiple functions of similar genes, grouping of genes in pathways, and sequence redundancy in genomes [28,29].

III. RESULTS

A database is a collection of data organized in a regular manner so that information can easily be found and retrieved digitally. A relational database consists of tables with rows and columns; each table is composed of records, and each record is identified by a field holding a unique value. The tables are logically related, so every table must share at least one attribute with another table through a “one to one”, “one to many”, or “many to one” relationship. These logical relationships make it possible to build large biological databases that serve many users, such as DDBJ and SWISS-PROT [30, 31]. These databases provide various functions for nucleotide and amino acid sequence data and also support searching and retrieving biological information for high-throughput results [32].
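A one-to-many relationship of this kind can be sketched with an in-memory SQLite database. The schema, table names, accession numbers, and feature values below are all hypothetical and are not taken from DDBJ or SWISS-PROT.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE protein (
        accession TEXT PRIMARY KEY,   -- unique value identifying each record
        name      TEXT NOT NULL
    );
    CREATE TABLE sequence_feature (
        id        INTEGER PRIMARY KEY,
        accession TEXT NOT NULL REFERENCES protein(accession),  -- shared attribute
        feature   TEXT NOT NULL
    );
    """
)
# Hypothetical records: one protein row can have many feature rows.
conn.executemany(
    "INSERT INTO protein VALUES (?, ?)",
    [("X0001", "example kinase"), ("X0002", "example transporter")],
)
conn.executemany(
    "INSERT INTO sequence_feature (accession, feature) VALUES (?, ?)",
    [
        ("X0001", "ATP-binding site"),
        ("X0001", "signal peptide"),
        ("X0002", "transmembrane helix"),
    ],
)


def features_for(conn, accession):
    """Join the two tables on their shared attribute for one protein."""
    return conn.execute(
        "SELECT p.name, f.feature FROM protein AS p "
        "JOIN sequence_feature AS f ON f.accession = p.accession "
        "WHERE p.accession = ? ORDER BY f.feature",
        (accession,),
    ).fetchall()
```

The shared `accession` column is what lets a single query collect every feature recorded for one protein.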

A. Biological complexity

Fig. 2 shows the “complexity of biological databases”. They are built on the basis of the information retrieved by user queries; because some databases share common data that meet at a single point, we need to create relations among the different databases. For example, Ensembl is linked with major components such as TrEMBL, PIR-PSD, EMBL, and IPI to provide a set of relevant information for a particular protein [33, 34].

Biological databases are numerous and are classified by the kind of information they hold, so we need a flexible way to organize the data and create relations among them using the techniques of a database management system. These techniques support data retrieval and management and add functionality through tools to modify, delete, insert, and update data and to produce reports that summarize it. Some integrated database management sites are available that relate different sites to one another to produce results such as protein-protein interactions; the RAST database, for example, relates data spread across multiple areas, so the user can retrieve the information quickly from anywhere. Similarly, if we want to extract information about a protein, PDB provides information about all of its aspects, such as structure, function, and annotation, ranked for the interests of different searchers [35].

A large amount of homogeneous data can be retrieved on a single platform, because the platform connects to other databases through relations. For similar data, automatic pipelines use these relations to collect information for the user from different sources. NCBI, for example, uses its major components “UniGene, VAST, GEO, PubMed, CDD, COG, OMIM, Genomes, CGAP, dbEST, dbGSS, dbMHC, dbSNP, dbSTS, GenBank, Genes, HomoloGene, MeSH, MGC, MMDB, OMSSA, PubChem, RefSeq” to offer a searchable online collection of books, reports, databases, and other scholarly literature in biology, medicine, and the life sciences [36].

B. Querying and ranking text databases

IR technology offers important ways to retrieve documents of interest, but errors sometimes occur because of indexing artifacts: false positives (some non-relevant documents are retrieved) and false negatives (relevant documents are missed entirely). We try to limit these two types of error by measuring precision and recall, respectively. The same term can occur many times in the same document. When we search a complex resource such as the European Bioinformatics Institute (EBI), whose major components include BioMart, ChEBI, EMBLSVA, UniProt, ArrayExpress, ASD, CSA, GOA, IntAct, IntEnz, DALI, MSD, and MSDchem [37], a simple Boolean query requiring all of the terms A, B, and C is not very useful, because it can return a huge number of documents, even millions, that contain each term only once. In this case the search must be refined by relevance ranking, in which documents are ordered by score using a formulation such as the cosine similarity measure of the vector space model [38,39]. The scores are based on index frequencies: uncommon terms are given more weight than common ones, and documents that contain the query terms many times are weighted more than documents that contain them only once. Literature search systems in bioinformatics, such as those of NCBI, NIH, and KEGG, use the document vector method. Other high-throughput approaches are also used to weight terms within research manuscripts: a term in the title or abstract is given more weight than a term in the remainder of the manuscript.
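The contrast between a plain Boolean query and a ranked one can be sketched as follows, together with precision and recall computed from sets of document ids. The field weights and the sample documents are assumptions chosen for illustration, not values used by any of the systems named above.

```python
def precision_recall(retrieved, relevant):
    """Precision penalizes false positives; recall penalizes false negatives."""
    retrieved, relevant = set(retrieved), set(relevant)
    true_pos = len(retrieved & relevant)
    precision = true_pos / len(retrieved) if retrieved else 0.0
    recall = true_pos / len(relevant) if relevant else 0.0
    return precision, recall


FIELD_WEIGHTS = {"title": 3.0, "abstract": 2.0, "body": 1.0}  # assumed values


def boolean_and(docs, terms):
    """Ids of documents in which every term occurs in some field."""
    hits = []
    for doc_id, fields in docs.items():
        words = set(" ".join(fields.values()).lower().split())
        if all(t.lower() in words for t in terms):
            hits.append(doc_id)
    return hits


def field_weighted_score(fields, terms):
    """Sum of term counts, with title and abstract counting more than body."""
    score = 0.0
    for name, text in fields.items():
        tokens = text.lower().split()
        weight = FIELD_WEIGHTS.get(name, 1.0)
        score += weight * sum(tokens.count(t.lower()) for t in terms)
    return score


def ranked_search(docs, terms):
    """Boolean AND match, then order the hits by field-weighted score."""
    hits = boolean_and(docs, terms)
    return sorted(
        hits, key=lambda d: field_weighted_score(docs[d], terms), reverse=True
    )


# Hypothetical manuscripts.
docs = {
    "p1": {"title": "kinase inhibitors", "abstract": "a review",
           "body": "cancer is mentioned once with kinase"},
    "p2": {"title": "kinase signalling in cancer",
           "abstract": "kinase mutations in cancer",
           "body": "cancer kinase cancer"},
    "p3": {"title": "plant genomes", "abstract": "rice", "body": "arabidopsis"},
}
```

Both `p1` and `p2` satisfy the Boolean query for “kinase” and “cancer”, but the ranked search places `p2` first because the terms appear in its title and abstract and occur more often.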

Fig. 3 Venn diagram showing the intersection of genes that have the same function in different species, as occurs in biological data sets.

Remember that a document can be a very useful result, and be ranked accordingly, even if it does not contain a single term of the user query, because documents related to the query terms are sometimes also considered relevant [40,41]. As an aside, advanced web page indexing systems also use ranking algorithms that take term frequency into account, and they must guard against words that are repeated many times in an HTML page inside comments. Such HTML is invisible to a human reader of the page but visible to the computer that performs the indexing, so relevance ranking would otherwise pick up words planted for the program. Modern search engines such as Google use weighting algorithms that rank a web page on the basis of the hyperlinks that point to it [42].
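The text does not name the link-based algorithm. A well-known example of this kind of hyperlink weighting is PageRank, sketched below as a simple power iteration; the damping factor, the iteration count, and the tiny link graph in the usage note are assumptions for illustration only.

```python
def pagerank(links, damping=0.85, iterations=50):
    """links: {page: [pages it links to]}. Returns {page: score}; scores sum to 1."""
    pages = set(links) | {t for targets in links.values() for t in targets}
    n = len(pages)
    if n == 0:
        return {}
    rank = {p: 1.0 / n for p in pages}
    for _ in range(iterations):
        new_rank = {p: (1.0 - damping) / n for p in pages}
        for page in pages:
            targets = links.get(page, [])
            if targets:
                # Split this page's rank evenly among the pages it links to.
                share = damping * rank[page] / len(targets)
                for t in targets:
                    new_rank[t] += share
            else:
                # Dangling page: spread its rank evenly over all pages.
                for t in pages:
                    new_rank[t] += damping * rank[page] / n
        rank = new_rank
    return rank
```

For a hypothetical graph `{"a": ["c"], "b": ["c"], "c": ["a"]}`, page `c` receives links from two pages and ends up with the highest score, while `b`, which nothing links to, ends up with the lowest.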

Fig. 4 The Reactome database follows the same structure to store and retrieve data and offers experts much useful information. For example, a query about BRCA1 returns multiple linked pathways that help explain cancer occurrence in humans, with links to NCBI, Ensembl, UniProt, KEGG, ChEBI, and PubMed. The database also includes information on mouse, rat, chicken, puffer fish, worm, fly, yeast, rice, and Arabidopsis. All of this information can thus be retrieved on a single platform.

IV. CONCLUSION

Information retrieval is a set of processes in computer science in which a number of techniques are used to query a collection of objects, typically free-text documents, in order to retrieve relevant information. The underlying data may be heterogeneous, structured, semi-structured, or unstructured, and is distributed across a large space.

With the rapid growth of the World Wide Web (WWW), data retrieval has become more important, and IR technology is needed to relate data globally so that anyone can search it and obtain results for a given query. This ability rests on the relational organization of data offered by the major components of a data management system.

Biological databases are numerous and are classified by the kind of information they hold, so we need a flexible way to organize the data and create relations among them using the techniques of a database management system. These techniques support data retrieval and management and add functionality through tools to modify, delete, insert, and update data and to produce reports that summarize it. Some integrated database management sites are available that relate different sites to one another to produce results such as protein-protein interactions; the RAST database, for example, relates data spread across multiple areas. A large amount of homogeneous data can be retrieved on a single platform, because the platform connects to other databases through relations. For similar data, automatic pipelines use these relations to collect information for the user from different sources.

Information is now widely distributed in digital formats, and IR offers multiple techniques that improve productivity in biology. The many bioinformatics databases distributed worldwide can be searched easily by matching a query on the basis of similarity. Large databases such as UniProt, PDB, NCBI, and PubMed are broadly used for this purpose. Biology poses a number of problems for researchers who explore large data sets, because species sometimes show biological redundancy and multiplicity: similarity in protein sequences, organisms with similar genes, multiple functions of similar genes, grouping of genes in pathways, and sequence redundancy in genomes. Users are well served when IR provides a good environment in which researchers can retrieve information.

V. REFERENCES

[1] Salton G. Automatic text processing: The transformation, analysis, and
retrieval of. Reading: Addison-Wesley. 1989.

[2]
Rijsbergen V, Joost CK.
Information retrieval Butterworths London.

[3] Dominich
S, Dominich S. The modern algebra of information retrieval. Heidelberg:
Springer; 2008 Apr 18.

[4] Grossman
DA, Frieder O. Information retrieval: Algorithms and heuristics. Springer
Science & Business Media; 2012 Nov 12.

[5] Leser
U, Hakenberg J. What makes a gene name? Named entity recognition in the
biomedical literature. Briefings in bioinformatics. 2005 Dec 1;6(4):357-69.

[6] Cohen
AM, Hersh WR. A survey of current work in biomedical text mining. Briefings in
bioinformatics. 2005 Mar 1;6(1):57-71.

[7] He B, Ounis I. Term frequency normalisation tuning for BM25 and DFR
models. In European Conference on Information Retrieval 2005 Mar 21 (pp.
200-214). Springer, Berlin, Heidelberg.

[8] Porter
MF. An algorithm for suffix stripping. Program. 1980 Mar 1;14(3):130-7.

[9] Savoy
J. A stemming procedure and stopword list for general French corpora. Journal
of the American Society for Information Science. 1999;50(10):944-52.

[10] Xu J, Croft WB. Corpus-based stemming
using cooccurrence of word variants. ACM Transactions on Information Systems
(TOIS). 1998 Jan 1;16(1):61-81.

[11]
Faloutsos C, Oard DW. A survey of information retrieval and filtering
methods. 1998 Oct 15.

[12] Salton
G, McGill MJ. Introduction to Modern Information Retrieval McGraw-Hill New York
Google Scholar.

[13] Fan W, Gordon MD, Pathak P. Personalization of search engine services
for effective retrieval and knowledge management. In Proceedings of the twenty
first international conference on Information systems 2000 Dec 10 (pp. 20-34).
Association for Information Systems.

[14] Fan W, Gordon MD, Pathak P. Genetic programming-based
discovery of ranking functions for effective web search. Journal of Management
Information Systems. 2005 Apr 1;21(4):37-56.

[15]
Comer DE, Gries D, Mulder MC, Tucker A, Turner AJ, Young PR, Denning PJ.
Computing as a discipline. Communications of the ACM. 1989 Jan 1;32(1):9-23.

[16]
Rani K, Sharma R. Study of different image fusion algorithm.
International journal of Emerging Technology and advanced Engineering. 2013 May
5;3(5):288-91.

[17]
Rani K, Sharma R. Study of different image fusion algorithm.
International journal of Emerging Technology and advanced Engineering. 2013 May
5;3(5):288-91.

[19]
Langville AN, Meyer CD. Google’s PageRank and beyond: The science of
search engine rankings. Princeton University Press; 2011 Jul 1.

[20]
Henikoff JG, Greene EA, Pietrokovski S, Henikoff S. Increased coverage
of protein families with the blocks database servers. Nucleic acids research.
2000 Jan 1;28(1):228-30.

[21]
Pietrokovski S, Henikoff JG,
Henikoff S. The Blocks database—a system for protein classification. Nucleic
acids research. 1996 Jan 1;24(1):197-200.

[22] Berry MW, Browne M. Understanding search engines: mathematical modeling
and text retrieval. Siam; 2005.

[23]
Eldén L. Matrix methods in data
mining and pattern recognition. SIAM; 2007.

[24] Feldman
R, Sanger J. The text mining handbook: advanced approaches in analyzing
unstructured data. Cambridge university press; 2007.

[25]
Hand DJ, Mannila H, Smyth P. Principles of data mining (adaptive
computation and machine learning). Cambridge, MA: MIT press; 2001 Aug.

[26]
Langville AN, Meyer CD. Google’s PageRank and beyond: The science of
search engine rankings. Princeton University Press; 2011 Jul 1.

[27]
Weiss SM, Indurkhya N, Zhang T, Damerau F. Text mining: predictive
methods for analyzing unstructured information. Springer Science & Business
Media; 2010 Jan 8.

[28]
Couto BR, Ladeira AP, Santos MA. Application of latent semantic indexing
to evaluate the similarity of sets of sequences without multiple alignments
character-by-character. Genet Mol Res. 2007 Jan 1;6(4):983-99.

[29]
Rodrigues TD, Pacífico LG, Teixeira SM, Oliveira SC, Braga AD.
Clustering and artificial neural networks: classification of variable lengths
of Helminth antigens in set of domains. Genetics and Molecular Biology.
2004;27(4):673-8.

[30]
Boeckmann B, Bairoch A, Apweiler R,
Blatter MC, Estreicher A, Gasteiger E, Martin MJ, Michoud K, O’donovan C, Phan
I, Pilbout S. The SWISS-PROT protein knowledgebase and its supplement TrEMBL in
2003. Nucleic acids research. 2003 Jan 1;31(1):365-70.

[31]
Bairoch A, Boeckmann B. The SWISS-PROT protein sequence data bank.
Nucleic acids research. 1991 Apr 25;19(Suppl):2247.

[32] Tateno Y, Saitou N, Okubo K, Sugawara H,
Gojobori T. DDBJ in collaboration with mass-sequencing teams on annotation.
Nucleic acids research. 2005 Jan 1;33(suppl_1):D25-

[33]
Butler D. NIH pledges cash for global protein database.

[34]
Bairoch A, Apweiler R, Wu CH, Barker WC, Boeckmann B, Ferro S, Gasteiger
E, Huang H, Lopez R, Magrane M, Martin MJ. The universal protein resource
(UniProt). Nucleic acids research. 2005 Jan 1;33(suppl_1):D154-9.

[35] Berman HM, Westbrook J, Feng Z, Gilliland G, Bhat TN, Weissig H,
Shindyalov IN, Bourne PE. The protein data bank, 1999–. In International Tables
for Crystallography Volume F: Crystallography of biological macromolecules 2006
(pp. 675-684). Springer, Dordrecht.

[36] Berman HM, Westbrook J, Feng Z, Gilliland G, Bhat TN, Weissig H,
Shindyalov IN, Bourne PE. The protein data bank, 1999–. In International Tables
for Crystallography Volume F: Crystallography of biological macromolecules 2006
(pp. 675-684). Springer, Dordrecht.

[37]
Kanz C, Aldebert P, Althorpe N, Baker W,
Baldwin A, Bates K, Browne P, van den Broek A, Castro M, Cochrane G, Duggan K.
The EMBL nucleotide sequence database. Nucleic acids research. 2005 Jan 1;33(suppl_1):D29-33.

[38] Williams Jr JH, Perriens MP. Automatic full text indexing and searching
system. In Information System Symposium, Washington, DC Proceedings.
International Business Machines Corp., Gaithersburg, Md 1968 Sep 4 (pp.
335-350).

[39] Sparck
Jones K. A statistical interpretation of term specificity and its application
in retrieval. Journal of documentation. 1972 Jan 1;28(1):11-21.

[40] Jones
KS, Walker S, Robertson SE. A probabilistic model of information retrieval:
development and comparative experiments: Part 2. Information processing &
management. 2000 Nov 1;36(6):809

[41] Jones KS, Walker S, Robertson SE. A
probabilistic model of information retrieval: development and comparative
experiments: Part 2. Information processing & management. 2000 Nov
1;36(6):809-40.

[42] Elmuti D, Kathawala Y. An overview of
strategic alliances. Management decision. 2001 Apr 1;39(3):205-18.

Note: This article was written by Imran Zafar, Department of Bioinformatics and Computational Biology, Virtual University Pakistan, Lahore, Pakistan.
