Protein Annotators’ Assistant: A Novel Application of
Information Retrieval Techniques [1]
Michael J. Wise
Centre for Communications Systems Research [2]
Cambridge University
[email protected]
Abstract
The Protein Annotators’ Assistant (or PAA), http://www.ebi.ac.uk/paa/, is a software system which assists
protein annotators in the task of assigning functions to newly sequenced proteins. Working backwards from
SwissProt, a database which describes known proteins, and a prior sequence similarity search that returns a list of
known proteins similar to a query, PAA suggests keywords and phrases which may describe functions performed
by the query. In a preprocessing step, a database is built from the protein names that appear in the SwissProt
database, and against each protein are listed keywords and phrases that are extracted from the corresponding text
records. Common words, either in general English usage or from the biological domain, are removed as the
phrases are assembled. This process is assisted by the use of a simple stemming algorithm, which extends the list
of stop-words (i.e. reject words), together with a list of accept-words. At runtime, the search algorithm, invoked
by a user via a Web interface, takes a list of protein names and clusters the named proteins around
keywords/phrases shared by members of the list. The assumption is that if these proteins have a particular
keyword/phrase in common, and they are related to a query protein, then the keyword/phrase may also describe
the query. Overall, PAA employs a number of IR techniques in a novel setting and is thus related to text
categorization, where multiple categories may be suggested, except that in this case none of the categories are
specified in advance.
1. Introduction
Proteins are made up of long, folded chains (or sequences) of amino-acids. There are 20 amino-acids, and although
they are, in fact, 3-dimensional molecules, for some purposes they can be represented as letters drawn from the
set {A, C, D, E, F, G, H, I, K, L, M, N, P, Q, R, S, T, V, W, Y}. Proteins, therefore, can be viewed as strings drawn
from that alphabet. Information about proteins is found in text format databases such as SwissProt (Bairoch &
Apweiler, 1999); each record contains the sequence together with related information such as its name, an
accession number, bibliographic information, a descriptor and perhaps some keywords and comments. (For a
complete description see Bairoch & Apweiler (1998).)
A protein sequence is discovered either through physical/chemical means or, more commonly, through the
translation of an Open Reading Frame, or ORF, i.e. a DNA sequence that is potentially protein-coding. Whichever
way the protein is discovered, its function is not normally known so one way to hypothesize its function is to look
at the functions of similar proteins. Similarity to other proteins is typically established using sequence alignment
programs such as BLAST (Altschul et al, 1990) or FASTA (Pearson & Lipman, 1988), which are approximations of
the Smith-Waterman algorithm (Smith & Waterman, 1981) [3]. That is, using one of these systems the sequence is run
as a query against the sequences in a protein sequence database, and a list of matches is returned in order of
decreasing similarity. Table 1 below contains edited output from the BLAST program where the known sequence
CRB_DROME (accession number P10040) has been used as the query. (Protein identifiers are typically of the
form <protein type>_<species>, where DROME is short for Drosophila melanogaster.) The final column of the
output contains estimates of the probability that the matches could have occurred by chance. If sufficient
similarity is found between two sequences, an inference is made that the two sequences are homologous, i.e. have
diverged over evolutionary time from a common ancestor, and therefore that they may be functionally related. In
practice, if the experimenter has tried a query protein against SwissProt using BLAST or FASTA, and found that
many of the hits have similar descriptors, then it is reasonable to assume that these may represent a common
function. At other times the relationships between the database hits may not be clear, e.g. if the query protein is a
mosaic protein, i.e. contains multiple functional domains which together contribute to the overall function. When
this occurs, the task becomes one of discerning the functions shared between the query and the set of proteins
which are similar to it.

Table 1: Sample BLAST Output

Sequences producing High-scoring Segment Pairs:              Score  P(N)
CRB_DROME   P10040  drosophila melanogaster (fruit fly)....  12030  0.0
NOTC_XENLA  P21783  xenopus laevis (african clawed frog)...   1567  6.5e-232
NTC1_RAT    Q07008  rattus norvegicus (rat). neurogenic ...   1494  2.4e-218
NOTC_DROME  P07207  drosophila melanogaster (fruit fly)....   1495  5.5e-215
NOTC_BRARE  P46530  brachydanio rerio (zebrafish) (zebra...   1414  1.8e-205
NTC3_MOUSE  Q61982  mus musculus (mouse). neurogenic loc...   1375  7.0e-196
FBP1_STRPU  P10079  strongylocentrotus purpuratus (purpl...   1698  6.1e-184
NTC4_MOUSE  P31695  mus musculus (mouse). neurogenic loc...   1087  2.4e-158
FBN1_HUMAN  P35555  homo sapiens (human). fibrillin 1 pr...    668  5.4e-95
FBN2_MOUSE  Q61555  mus musculus (mouse). fibrillin 2 pr...    601  7.2e-89
FBN2_HUMAN  P35556  homo sapiens (human). fibrillin 2 pr...    604  4.2e-88
LI12_CAEEL  P14585  caenorhabditis elegans. lin-12 prote...    749  2.1e-74
SERR_DROME  P18168  drosophila melanogaster (fruit fly)....    733  1.8e-70
FBP3_STRPU  P49013  strongylocentrotus purpuratus (purpl...    643  2.6e-64
TGFB_HUMAN  P22064  homo sapiens (human). latent transfo...    447  3.5e-61
DLL1_RAT    P97677  rattus norvegicus (rat). delta-like ...    632  1.4e-59
DLL1_HUMAN  O00548  homo sapiens (human). delta-like pro...    632  2.6e-58
TGFB_RAT    Q00918  rattus norvegicus (rat). latent tran...    433  5.6e-58
DL_DROME    P10041  drosophila melanogaster (fruit fly)....    559  4.1e-51
GLP1_CAEEL  P13508  caenorhabditis elegans. glp-1 protei...    563  1.3e-50
YNX3_CAEEL  P34576  caenorhabditis elegans. hypothetical...    434  1.2e-49
SLIT_DROME  P24014  drosophila melanogaster (fruit fly)....    521  1.4e-48
LRP1_CHICK  P98157  gallus gallus (chicken). low-density...    264  9.2e-44
LRP1_HUMAN  Q07954  homo sapiens (human). low-density li...    281  1.3e-42
PGBM_MOUSE  Q05793  mus musculus (mouse). basement membr...    212  3.2e-36
LML2_CAEEL  Q21313  caenorhabditis elegans. laminin-like...    262  1.3e-35
PGBM_HUMAN  P98160  homo sapiens (human). basement membr...    212  1.6e-35
LMA_DROME   Q00174  drosophila melanogaster (fruit fly)....    232  1.7e-34
TENA_HUMAN  P24821  homo sapiens (human). tenascin precu...    409  2.2e-34

[1] This article appeared in the Journal of the American Society for Information Science (JASIS), 51(12), October 2000, pp. 1131-1136.
[2] Dr Wise is a Senior Research Fellow at Pembroke College, Cambridge and CCSR under a grant from Bristol-Myers Squibb. He is also a
Visiting Researcher at the European Bioinformatics Institute.
[3] For a tutorial introduction to biosequence comparison, see Setubal & Meidanis (1997).

The Protein Annotators’ Assistant, or PAA, assists biologists with the process of ascribing functions to unknown
proteins. PAA, which will be described in detail below, leverages the excellent work done by the curators of
SwissProt and performs keyword/phrase clustering based on a set of protein names (or accession identifiers),
drawn from SwissProt, that the user enters via a Web interface. What is returned to the user is a list of
keywords/phrases, and with each keyword/phrase, a list of the proteins containing that keyword/phrase. The
principle is that if these proteins have a particular keyword/phrase in common, and they are related to a query
protein, then the keyword/phrase may also describe the query.
A system with somewhat similar intent but rather different methodology is described in Andrade & Valencia
(1998). In this system, Medline articles are retrieved for 71 families of closely related sequences [4] and their
keywords extracted and compared to a background distribution of keywords extracted from other, unrelated
families of sequences. Specifically, for each word the frequency of its occurrences in the abstracts for a protein
family is compared to the occurrences across the database as a whole. A very simple form of stemming is used,
accepting words as being similar when their endings differ by at most two characters and the words are longer
than 5 characters. A word is then considered significant if it appears across a number of abstracts within a limited
number of families, but infrequently across the database as a whole. The members of the families are also
constrained so that no mosaic proteins are included as these can span families.
Another approach, adopted by many biologists, is to use the excellent genomic/proteomic database search
engines SRS (Etzold et al, 1996) or Entrez (Schuler et al, 1996) to do manually much of what PAA is able to achieve
automatically. Both SRS and Entrez accept either protein names or accession numbers, or combinations of
keywords, and return the matching records. In practice, after the user has extracted a number of the
BLAST/FASTA hits and noticed common words appearing, the same system can be used to extract other records
with the same keywords.
2. System and Methods
The software system that makes up PAA consists of two parts: an off-line keyword/phrase database creation
application and the online query-processing application.
2.1 Keyword/Phrase Extraction
Keyword/Phrase extraction from the source database, SwissProt (currently version 38, containing 80,000
sequences), is undertaken only once, by the systems administrator, upon the arrival of a new version of SwissProt.
The object of this stage is to produce a subsidiary database which maps each protein record into a set of
keywords/phrases that describe the protein.
Data for this process is extracted from the KW (keyword), DE (descriptor) and CC (comment) fields. The KW
field is structured - a semicolon-separated list of keywords/phrases - and has a controlled vocabulary. The semicolon has
therefore been adopted as the overall keyword/phrase separator. The DE and CC fields have a partially
controlled syntax and vocabularies (Bairoch & Apweiler, 1998). In the first stage, data from these three fields are
extracted and some syntactic preprocessing is carried out. For example, punctuation characters except for
hyphen are converted to word/phrase separators and text split over two lines is rejoined; hyphens are converted
to spaces. However, care must be taken even with this relatively simple task. For example, double quote can be
removed but single quote is generally retained (3’ and 5’ are significant), except for possessives (i.e. ’s). Care
must also be taken with chemical identities such as "NAD+" and "H(2)O", because parentheses and + are
otherwise translated into separators.
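
The first-stage rules above can be sketched in a few lines of Python. This is only an illustrative sketch, not PAA's actual code: the function name `preprocess`, the `PROTECTED` tuple (holding just the two chemical identities named in the text) and the placeholder trick are all assumptions, and only the ASCII apostrophe is handled.

```python
import re

# Hypothetical list of chemical identities to shield; the text names only these two.
PROTECTED = ("NAD+", "H(2)O")
SEP = ";"  # the overall keyword/phrase separator adopted from the KW field

def preprocess(text: str) -> list[str]:
    """Turn raw KW/DE/CC text into a list of separator-delimited chunks."""
    text = " ".join(text.split())               # rejoin text split over lines
    for i, name in enumerate(PROTECTED):        # hide "+" and "()" from the rules below
        text = text.replace(name, f"\x00{i}\x00")
    text = text.replace('"', "")                # double quotes are simply removed
    text = re.sub(r"'s\b", "", text)            # possessives go; 3' and 5' survive
    text = text.replace("-", " ")               # hyphens become spaces
    text = re.sub(r"[^\w\s'\x00]", SEP, text)   # other punctuation becomes a separator
    for i, name in enumerate(PROTECTED):        # restore the shielded names
        text = text.replace(f"\x00{i}\x00", name)
    chunks = (" ".join(chunk.split()) for chunk in text.split(SEP))
    return [chunk for chunk in chunks if chunk]
```

For instance, `preprocess("ELECTRON\nTRANSPORT (BY SIMILARITY)")` gives the two chunks `ELECTRON TRANSPORT` and `BY SIMILARITY`, because the parentheses act as separators.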
The second stage is where the keywords/phrases are recognized. The primary vehicle for this stage is a list of
stop-words, i.e. words that are common English particles, which are rejected by being converted to separators.
Because PAA undertakes categorization/keyword-clustering in a particular domain, a large stop-words list, 24,611
words, is used. The stop-words list is based on a lexicon taken from one of the Linux distributions and
includes most of the more common English words, which are of no interest in this context. Unfortunately, the
source list contained a number of terms from chemistry, biochemistry and biology (e.g. "alanine", "acetate") that
are potentially relevant, together with some specialist terms (often acronyms) that happen to also be
domain-relevant words, e.g. "air", "box", "camp", "dump" and "gap". These have been removed from the stop-words list. In
addition, the stop-words list is extended by the use of stemming rules, so, for example, because "group" is in the
stop-words list, "groups", "grouping" and "grouped" are assumed to also be in the stop-words list and, if
encountered, are converted to separators. This dictionary-based stemming is similar to Krovetz (1993), except that
the latter applies stemming to the final dictionary, rather than the stop-words list, and there is no counterpart to
the accept-words list as described below. The stemming rules currently implemented are simpler than Lovins
(1968) or Porter (1980) and cover the following suffixes:
• -ing (including the change from a word with a final "e", and double final consonants)
• -s (including "es" and "ies")
• -ed (including "ied", simple addition of "d" to a word already ending in "e", and double final consonants)
• -ly (including "ily")

[4] The starting points are proteins with known 3-D structure taken from the Protein Data Bank (Bernstein et al, 1977) which have very low
levels of similarity to each other. Around each of these seeds are grouped other proteins which have substantial similarity to their
respective seeds. Each such family must also have at least 5 members. The result is 71 families.

It should be borne in mind that deficiencies in either the stop-words list or in the stemming algorithm are less
serious when applied to a stop-words list than when applied to the final list of keywords/phrases; mistakes will
generally only result in some additional words being included in the list of keywords/phrases and thus flowing
on to the runtime clustering application.
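
One way to picture the stemming extension of the stop-words list is to run the four suffix rules in reverse: strip the suffix from the word in hand and ask whether any resulting base form is a stop-word. The sketch below makes that assumption; the names `candidate_bases` and `is_stop_word` are hypothetical, and the handling of each rule is a guess at the behaviour the text describes rather than PAA's real rules.

```python
VOWELS = "aeiou"

def _undouble(stem):
    """Undo a doubled final consonant, e.g. 'stopp' -> 'stop'."""
    if len(stem) >= 2 and stem[-1] == stem[-2] and stem[-1] not in VOWELS:
        return {stem[:-1]}
    return set()

def candidate_bases(word):
    """Base forms from which `word` could arise through one suffix rule."""
    bases = set()
    if word.endswith("ing"):
        stem = word[:-3]
        bases |= {stem, stem + "e"} | _undouble(stem)  # grouping, coding, stopping
    if word.endswith("ies"):
        bases.add(word[:-3] + "y")                     # bodies -> body
    if word.endswith("es"):
        bases.add(word[:-2])                           # boxes -> box
    if word.endswith("s"):
        bases.add(word[:-1])                           # groups -> group
    if word.endswith("ied"):
        bases.add(word[:-3] + "y")                     # carried -> carry
    if word.endswith("ed"):
        stem = word[:-2]
        bases |= {stem, stem + "e"} | _undouble(stem)  # grouped, seed (see + d), stopped
    if word.endswith("ily"):
        bases.add(word[:-3] + "y")                     # easily -> easy
    if word.endswith("ly"):
        bases.add(word[:-2])                           # slowly -> slow
    bases.discard("")
    return bases

def is_stop_word(word, stop_words, accept_words=frozenset()):
    """True if `word` is a stop-word directly or via a suffix rule.

    Words in `accept_words` are always kept, however they could be derived.
    """
    if word in accept_words:
        return False
    return word in stop_words or not candidate_bases(word).isdisjoint(stop_words)
```

With `stop_words = {"group", "see"}`, the words "groups", "grouping" and "grouped" are all rejected, and so is "seed" unless it is passed in `accept_words`. An over-eager rule here only lets an extra word through to the keyword list or drops one common word too many, which is the tolerance the paragraph above relies on.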
As well as the stop-words list, an accept-words list containing 176 words has also been created. The accept-words
list has a twofold purpose:
• It is possible that a word is significant, but a similar word also appears in the stop-words list and the significant
word can be manufactured from the stop-word using the stemming rules. In that case, the significant word
must be retained explicitly. One example is "seed", which is otherwise formed through the -ed stemming rule
applied to "see". The other 3 words in this category are "humps", "red" and "nod", which are all acronyms;
"hump", "re" and "no" are in the stop-words list. The alternative would be to complicate the stemming rules far
beyond what is warranted by the application.
• The majority of entries in the accept-words list are words that are marked as conditional. These are words, such
as "acid", "protein" or "sequence", that may be significant as part of a phrase, but which are too common in this
domain to convey useful information by themselves.
Phrases are built up by the concatenation of words which are in the accept-words list or at least are not in the
stop-words list. In a post-processing step, empty phrases are removed, as are any phrases which consist solely of
words that have been marked as conditional in the accept-words list, e.g. "amino-acid sequence". Note that
duplicate references to the same keyword/phrase are ignored.
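
Phrase assembly as just described might look like the following sketch. It assumes the text has already been cut into chunks at punctuation, that stop-word membership is a plain set lookup (the stemming extension is left out for brevity), and that phrases are lower-cased as in the examples in this paper; `build_phrases` and its parameter names are hypothetical.

```python
def build_phrases(chunks, stop_words, conditional_words):
    """Split chunks at stop-words and return the surviving phrases, without duplicates."""
    phrases, seen = [], set()

    def flush(words):
        # Keep a phrase only if it is non-empty and not made up solely of conditional words.
        if words and not all(w in conditional_words for w in words):
            phrase = " ".join(words)
            if phrase not in seen:          # duplicate references are ignored
                seen.add(phrase)
                phrases.append(phrase)

    for chunk in chunks:
        current = []
        for word in chunk.lower().split():
            if word in stop_words:          # a stop-word acts as a separator
                flush(current)
                current = []
            else:
                current.append(word)
        flush(current)
    return phrases
```

Given the conditional words "amino", "acid" and "sequence", the chunk "amino acid sequence" yields nothing, whereas "copper protein" is kept because "copper" is not conditional.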
By way of example, processing of the record for AZUR_ALCDE (Azurin Precursor, AC P00280), yields the list of
keywords/phrases:
azurin, azurin precursor, copper, cytochrome c551, cytochrome oxidase, electron transport, periplasmic,
plastocyanin, signal, transfer electrons
2.2 Runtime Application
The runtime application is invoked by users interacting via the Internet with a CGI script. Specifically, users enter
lists of protein names or accession numbers that they believe to be related.
In the first stage, the application takes the protein identifiers or accession numbers and retrieves the
corresponding lists of keywords/phrases. Then, working backwards from the keywords/phrases, for each
keyword/phrase a list is created containing the names of the proteins which mention that keyword/phrase. In
the next stage, the lists of protein names corresponding to sub-phrases of other phrases are collapsed based on
commonality of stemmed whole words. For example, the union of the lists of proteins corresponding to
cytochromes and cytochrome c551 becomes the list for cytochrome; the list for cytochrome c551 remains
untouched. If the shared sub-phrase does not already exist, a new keyword is created and has the union of the
input phrases recorded against it, e.g. electron transport and oxygen transport yields the new keyword transport.
On the other hand, the lists corresponding to trypsin and chymotrypsin will not be collapsed.
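
The inversion and collapsing steps can be illustrated with a small sketch. Several things here are assumptions rather than PAA's documented behaviour: the shared sub-phrase is taken to be the longest run of identical stemmed whole words, `stem` is a crude plural-stripping stand-in for the real stemming rules, conditional words are not treated specially, and every function name is hypothetical.

```python
from itertools import combinations

def stem(word):
    # Crude stand-in for the stemming rules: strip a plural "s" from longer words.
    return word[:-1] if word.endswith("s") and len(word) > 3 else word

def invert(keywords_by_protein):
    """Map each keyword/phrase to the set of proteins that mention it."""
    index = {}
    for protein, phrases in keywords_by_protein.items():
        for phrase in phrases:
            index.setdefault(phrase, set()).add(protein)
    return index

def shared_subphrase(a, b):
    """Longest run of whole words common to the word tuples a and b."""
    best = ()
    for i in range(len(a)):
        for j in range(len(b)):
            k = 0
            while i + k < len(a) and j + k < len(b) and a[i + k] == b[j + k]:
                k += 1
            if k > len(best):
                best = tuple(a[i:i + k])
    return best

def collapse(index):
    """Add the union of two phrases' protein lists under their shared sub-phrase."""
    stemmed = {}
    for phrase, proteins in index.items():
        key = tuple(stem(w) for w in phrase.split())
        stemmed.setdefault(key, set()).update(proteins)
    result = {" ".join(key): set(proteins) for key, proteins in stemmed.items()}
    for a, b in combinations(stemmed, 2):
        shared = shared_subphrase(a, b)
        if shared:
            result.setdefault(" ".join(shared), set()).update(stemmed[a] | stemmed[b])
    return result
```

Because whole words are compared, "trypsin" and "chymotrypsin" share nothing, while "electron transport" and "oxygen transport" give rise to a new "transport" entry. The pairwise loop is quadratic in the number of phrases, which is why a cheap pre-filter is worthwhile.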
The process of finding shared sub-phrases and collapsing their lists is the most time-consuming part of the
runtime application because there are typically several hundred unique keywords/phrases across the set of input
proteins. To speed up the process a superimposed code-word (Gabbe et al, 1978; Roberts, 1979), 256 bits in length,
is created for each keyword/phrase, where the stemmed, unconditional keywords in the phrase each contribute
one bit based on their hash-value. Then, as each pair of keywords/phrases is compared, a bitwise AND is
taken of their respective code-words; only if the result is non-zero are their hash values compared, and
only if that comparison succeeds are the words themselves compared. In the case of the UROT_HUMAN example which
will be discussed below, 52,179 non-matches are caught by the code-word test and 288 non-matches are resolved by
the hash-value comparisons on the component keywords, leaving 183 genuine cases of shared sub-phrases.
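
The three-stage filter can be sketched as follows. The hash function is not named in the text, so `zlib.crc32` is used purely as a stand-in, and the function names are hypothetical. In a real run the code-words would be computed once per phrase rather than on every comparison as they are here.

```python
import zlib

CODE_BITS = 256  # length of the superimposed code-word

def word_hash(word):
    # Stand-in hash; the text does not say which hash function is used.
    return zlib.crc32(word.encode("utf-8"))

def code_word(words):
    """Superimpose one bit per word into a 256-bit integer."""
    code = 0
    for word in words:
        code |= 1 << (word_hash(word) % CODE_BITS)
    return code

def share_word(a, b):
    """Three-stage test for whether word lists a and b have a word in common."""
    # Stage 1: disjoint code-words prove that no word is shared.
    if (code_word(a) & code_word(b)) == 0:
        return False
    # Stage 2: compare full hash values, which removes most bit collisions.
    hashes_b = {word_hash(w) for w in b}
    candidates = [w for w in a if word_hash(w) in hashes_b]
    # Stage 3: compare the words themselves.
    return any(w in b for w in candidates)
```

The first stage can give false positives (two different words may set the same bit) but never false negatives, so the later, more expensive stages are only reached for the small fraction of pairs that survive it.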
Once the check for shared sub-phrases has been completed, keywords common to at least two proteins and their
corresponding lists of proteins are reported. At this stage, the information input by the user can be further used
to provide an ordering for the keywords/phrases. If just the names of proteins were entered, the list of keywords
is ordered simply by the number of proteins which share a particular keyword/phrase. That is, each protein in a
list scores 1. On the other hand, if the user also enters match probability values, e.g. by cutting-and-pasting from
a BLAST output such as Table 1, these are converted into scores by taking -log2(x), with a ceiling value of 100 so
that strong matches do not swamp weaker ones, and a floor value of 1, because a weak match should not attract a
lower score than would be used if no probability information had been provided. Each keyword/phrase
associated with a particular protein is given the same score, so from the point of view of the (shared)
keywords/phrases, the total scores are the sums of the contributions from the proteins in their respective lists.
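
As a worked illustration of the scoring rule, the sketch below clamps -log2(x) to the range 1 to 100 and sums the per-protein scores for each shared keyword/phrase. The function names are hypothetical, and treating a reported probability of 0.0 as the ceiling score is an assumption, since the logarithm is undefined there.

```python
import math

CEILING, FLOOR = 100.0, 1.0

def protein_score(probability=None):
    """Score one protein: 1 with no probability, else -log2(p) clamped to [1, 100]."""
    if probability is None:
        return FLOOR
    if probability <= 0.0:            # e.g. a P(N) printed as 0.0 (assumed to mean "best")
        return CEILING
    return min(CEILING, max(FLOOR, -math.log2(probability)))

def rank_keywords(index, probabilities):
    """Order keywords shared by at least two proteins by their summed scores.

    `index` maps keyword/phrase -> set of protein names;
    `probabilities` maps protein name -> match probability (may be incomplete).
    """
    totals = {
        keyword: sum(protein_score(probabilities.get(p)) for p in proteins)
        for keyword, proteins in index.items()
        if len(proteins) >= 2
    }
    return sorted(totals.items(), key=lambda item: item[1], reverse=True)
```

A probability of 5.4e-95, as for FBN1_HUMAN in Table 1, has -log2 of roughly 313 and is therefore capped at 100, while a weak match with probability 0.9 is raised to the floor of 1.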
2.3 Implementation and What the User Sees
The keyword/phrase extraction application and the runtime application have both been written in Python
(www.python.org), which has greatly reduced the development time. As described above, users interact with a
Web page in which they enter the names or accession numbers of the proteins which are to be clustered. Once the
data has been entered, a CGI script (also written in Python) is invoked which strips away any extraneous
information, leaving a list of pairs of proteins and probability values (or just 1, as described above). However,
rather than passing the list directly to the clustering functions, it is transmitted via a socket pair to a server
application which performs the clustering. (Socket programming in Python is discussed in Watters et al (1996).
This implementation extends basic socket services by implementing buffered transmission of variable-length
messages.) Use of this internal client-server arrangement allows decoupling of the IO-bound CGI script from the
more compute-intensive clustering functions, so the two functions can now be placed on different computers.
The server application is currently single-threaded, but conversion to multi-threaded execution is also possible
and may be introduced in later versions of the application. Finally, the decoupling also helps improve the
security of the clustering application.
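
The paper does not give the wire format used between the CGI script and the clustering server. One common way to carry variable-length messages over a stream socket is a length prefix, sketched below with the standard `socket` and `struct` modules; the 4-byte header and the function names are assumptions made for illustration only.

```python
import socket
import struct

HEADER = struct.Struct("!I")  # assumed framing: 4-byte big-endian payload length

def send_message(sock, payload: bytes):
    """Send one message as a length prefix followed by the payload."""
    sock.sendall(HEADER.pack(len(payload)) + payload)

def _recv_exactly(sock, count):
    """Read exactly `count` bytes, buffering partial reads."""
    chunks = []
    while count:
        chunk = sock.recv(count)
        if not chunk:
            raise ConnectionError("peer closed the connection mid-message")
        chunks.append(chunk)
        count -= len(chunk)
    return b"".join(chunks)

def recv_message(sock) -> bytes:
    """Receive one length-prefixed message."""
    (length,) = HEADER.unpack(_recv_exactly(sock, HEADER.size))
    return _recv_exactly(sock, length)
```

A quick local check can use `socket.socketpair()` to obtain two connected endpoints, send a list of protein names with `send_message` on one and read it back with `recv_message` on the other.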
Once clustering has completed, users are presented, on a new Web page, with three lists together with the results
of the clustering. One list contains those proteins not represented in the database while a second has those
proteins which have no keywords/phrases in common with any other proteins in the input list. A third list
simply confirms the input list supplied by the user. The principal table contains keywords/phrases shared by at
least two of the proteins in the input list, and for each such keyword or phrase, a list of the proteins mentioning it.
Users can then click on a protein corresponding to a particular keyword/phrase and have the SwissProt record
for that protein presented with the keyword/phrase highlighted. (The retrieval of the SwissProt record and
marking with the keyword/phrase is also done by the clustering server.)
3. Examples
To see how PAA might be used, the (known) protein UROT_HUMAN (Human Tissue Plasminogen Activator
Precursor, AC P00750) is here used as a query. Tissue Plasminogen Activator Precursor, or TPA, is an example of a
mosaic protein. Specifically, TPA acts to convert inactive plasminogen into its active form, plasmin, which in turn
acts to dissolve blood clots. TPA contains five domains: a fibronectin type I domain, an EGF domain, two kringle
domains and a serine protease domain (Ellis, 1999a). The list of keywords/phrases extracted for UROT_HUMAN
is:
activase, acute ischemic, acute myocardial infarction, ais, alteplase, alternative splicing, arginine, bm, cell migration,
cleavage, destruction, disulfide bonds, ec3.4.21.68, egf, embolism, fibrin, fibrinolysis, fibronectin type, glycoprotein,
hydrolase, hydrolyzing, kringle, pe, peptidase, plasma, plasmin, plasminogen, plasminogen activation, plasminogen
activator precursor, protease domain, remodeling, retavase, reteplase, s1, serine protease, signal, t plasminogen
activator, tpa, trypsin, valine bond
The amino acid sequence for UROT_HUMAN was compared with the other proteins in SwissProt using the
Washington University implementation of BLAST, WU-BLAST 2.0 (http://blast.wustl.edu/). From the ranked
list of hits (similar in form to Table 1), ignoring all references to TPA from other species, to Salivary Plasminogen
Activator and to the closely related Urokinase (Ellis, 1999b), representatives were selected from the remaining
hits using the following strategy: when a new protein type is encountered in the list of hits, the first example is
selected and all other hits of the same protein type but different species are ignored on the assumption that, given
the level of control exercised by the SwissProt curators, all references for a given protein type will have similar
annotations. Using the list of representatives, which contains 32 proteins, PAA is largely able to reconstruct the
list of keywords/phrases taken directly from the record of UROT_HUMAN; the italicized entries in the list for
UROT_HUMAN above are among the keywords/phrases found by PAA based on the representative protein set.
Interestingly, the phrase blood coagulation does not appear in the record for TPA, but is discovered by PAA.
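
The selection strategy for representatives can be approximated in code, under the assumption that the protein type can be read off as the part of the SwissProt identifier before the underscore. That is a simplification: related types may carry different prefixes (NOTC and NTC1 in Table 1, for example), so a human selector would merge cases that this sketch keeps apart. The function and parameter names are hypothetical.

```python
def select_representatives(hits, excluded_types=frozenset()):
    """Keep the first hit of each protein type, in ranked order.

    `hits` is a ranked list of identifiers of the form <protein type>_<species>;
    `excluded_types` holds protein types to ignore altogether.
    """
    seen, representatives = set(), []
    for name in hits:
        protein_type = name.split("_", 1)[0]
        if protein_type in excluded_types or protein_type in seen:
            continue
        seen.add(protein_type)
        representatives.append(name)
    return representatives
```

Applied to the hits FBN1_HUMAN, FBN2_MOUSE and FBN2_HUMAN, this keeps FBN1_HUMAN and FBN2_MOUSE and drops FBN2_HUMAN as a second species for an already-seen type.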
Apart from being an interesting example of a mosaic protein, UROT_HUMAN was chosen because it is well
characterized. Of course, PAA will normally be used in cases where far less is known about the proteins in
question.
However, as a sort of null hypothesis, groups of 100 proteins have been selected at random and their names
presented to PAA. What typically occurs is that a small number are listed as being unrelated to any other protein
in the input list, while the majority are only related via quite general keywords and even then, the groups tend to
be small. For example, in one such experiment, 17 proteins of the 100 shared the keyword "transmembrane".
4. Discussion
Many of the strengths of PAA stem directly from the use of SwissProt as the source database, and stand in
contrast to the system of Andrade and Valencia, which draws its list of keywords from Medline abstracts.
Because the content of the KW (keyword) fields comes from a controlled vocabulary, while the CC (comment) and
DE (descriptor) fields have controlled syntax and semi-controlled vocabulary, common concepts are more likely to
be rendered by the same words/phrases than is possible for free-text abstracts written by individual authors and
collected by services such as Medline. This appears to contradict recent research which has queried the
usefulness of controlled vocabularies for information retrieval, e.g. Lewis & Sparck Jones (1996). However, in this
context, where the application is working backwards from proteins to the keywords that they might have in
common, it is because the vocabulary is relatively controlled that clustering is able to take place in the face of
terminology which itself is still a matter of debate; see, for example, the continuing work of the International
Union of Biochemistry and Molecular Biology (http://www.chem.qmw.ac.uk/iubmb/). Further robustness and
generality are added by the runtime algorithm for collapsing lists of proteins based on shared sub-phrases, because
a keyword may be used in slightly different contexts and it will still be recognized.
On the other hand, a problem with abstracts generally is that they must be short, and are therefore unable to
canvass the assumed knowledge behind the specific topic being addressed by a complete paper. However, even
full papers have generally to stay very focussed and therefore must assume that the reader already possesses
much of the background knowledge. By contrast, the SwissProt curators bring a range of resources to the task of
finding appropriate keywords and writing comments. For example, the SwissProt record for AZUR_ALCDE,
discussed above, lists five papers discussing AZUR_ALCDE. While AZUR_ALCDE can be identified through the
five Medline abstracts as a blue copper protein, the abstracts are predominantly interested in its structure; the
functions that are characteristic of blue copper proteins are not discussed. The SwissProt record, however, does
have some of this information.
Looking further at the two systems, the technique employed by PAA, in which the user specifies the set of
proteins that are said to be related, sidesteps the issue of gathering a sufficient number of similar proteins to form
a family, whose words can be contrasted with those of other protein families in the database. On the other hand,
PAA relies on the correctness of the a priori assignment of words into the stop-words database and to a lesser
extent the accept-words database, but given the defined domain, this can be achieved at reasonable cost and will
only change slowly. In addition, with PAA there is no limitation on the use of mosaic proteins, which might span
families, and PAA does not require families of related proteins to be found before an analysis can be carried out,
which means that it can better cope with proteins which are unrelated to any members of known families.
Most broadly, PAA illustrates what can be done if curated databases exist for particular domains. In its overall
function, PAA most resembles the recent work on automated text categorization, e.g. Lam & Ho (1998) or Dumais
et al (1998), which generally aims to assign texts such as news stories to one or more predefined categories. PAA
is able to leverage the fact that it operates in a limited domain with a well curated resource and is able, in real
time, to suggest multiple categories which have not been specified in advance. PAA can thus be viewed as
performing unsupervised classification.
5. References
Altschul, Stephen F., Gish, Warren, Miller, Webb, Myers, Eugene W., & Lipman, David J. (1990), Basic Local
Alignment Search Tool. Journal of Molecular Biology 215(3), pp. 403-410.
Andrade, Miguel A. & Valencia, Alfonso (1998), Automatic Extraction of Keywords from Scientific Text:
Application to the Knowledge Domain of Protein Families. Bioinformatics 14, pp. 600-607.
Bairoch, A. & Apweiler, R. (1999), The SWISS-PROT Protein Sequence Data Bank and its Supplement TrEMBL in
1999. Nucleic Acids Research 27, pp. 49-54.
Bairoch, Amos & Apweiler, Rolf (1998), The SWISS-PROT Protein Sequence Database User Manual (Release 37),
(http://www.expasy.ch/txt/userman.txt), December 1998.
Bernstein, F.C., Koetzle, T.F., Williams, G.J., Meyer, E.E., Jr, Brice, M.D., Rodgers, J.R., Kennard, O., Shimanouchi,
T., & Tasumi, M. (1977), The Protein Data Bank: a computer-based archival file for macromolecular structures.
Journal of Molecular Biology 112(3), pp. 535-542 (http://www.rcsb.org/pdb/).
Dumais, Susan, Platt, John, Heckerman, David, & Sahami, Mehran (1998), Inductive Learning Algorithms and
Representations for Text Categorization. Seventh International Conference on Information and Knowledge Management
(CIKM’98), Bethesda, MD, USA, pp. 148-155, ACM.
Ellis, V. (1999a), Plasminogen Activators. In Thomas E. Creighton (ed), Encyclopedia of Molecular Biology, pp.
1865-1866, John Wiley.
Ellis, V. (1999b), Urokinase. In Thomas E. Creighton (ed), Encyclopedia of Molecular Biology, pp. 2728-2729, John
Wiley.
Etzold, T., Ulyanov, A., & Argos, P. (1996), SRS: information retrieval system for molecular biology data banks. In
Russell F. Doolittle (ed), Computer Methods for Macromolecular Sequence Analysis, pp. 114-128, Academic Press.
Gabbe, J. D., London, T. B., Miller, R. E., & Beyer, J. D. (1978), Applications of Superimposed Coding to
Partial-Match Retrieval. Computer Software and Applications Conference (COMPSAC’78), Chicago, U.S.A., pp. 464-469,
IEEE.
Krovetz, Robert (1993), Viewing Morphology as an Inference Process. Sixteenth Annual International ACM SIGIR
Conference on Research and Development in Information Retrieval, Pittsburgh, USA, pp. 191-202, ACM.
Lam, Wai & Ho, Chao Yang (1998), Using a Generalized Instance Set for Automatic Text Categorization. 21st
Annual International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR’98),
Melbourne, Australia, pp. 81-89.
Lewis, David D & Jones, Karen Sparck (1996), Natural Language Processing for Information Retrieval.
Communications of the ACM 39(1), pp. 92-101.
Lovins, Julie Beth (1968), Development of a Stemming Algorithm. Mechanical Translation and Computational
Linguistics 11, pp. 22-31.
Pearson, William R. & Lipman, David J. (1988), Improved Tools for Biological Sequence Comparison. Proceedings
of the National Academy of Sciences U.S.A. 85, pp. 2444-2448.
Porter, M. F. (1980), An Algorithm for Suffix Stripping. Program 14(3), pp. 130-137.
Roberts, Charles S. (1979), Partial-Match Retrieval via the Method of Superimposed Codes. Proceedings of the IEEE
67(12), pp. 1624-1642.
Schuler, G. D., Epstein, J. A., Ohkawa, H., & Kans, J. A. (1996), Entrez: molecular biology database and retrieval
system. In Russell F. Doolittle (ed), Computer Methods for Macromolecular Sequence Analysis, pp. 141-162,
Academic Press (Methods in Enzymology, Vol 266).
Setubal, João & Meidanis, João (1997), Introduction to Computational Molecular Biology. PWS Publishing.
Smith, T. F. & Waterman, M. S. (1981), Identification of Common Molecular Subsequences. Journal of Molecular
Biology 147, pp. 195-197.
Watters, Aaron, van Rossum, Guido, & Ahlstrom, James C. (1996), Internet Programming with Python. M&T Books
(MIS Press).