Protein Annotators’ Assistant: A Novel Application of Information Retrieval Techniques [1]

Michael J. Wise
Centre for Communications Systems Research [2]
Cambridge University
[email protected]

Abstract

The Protein Annotators’ Assistant (or PAA), http://www.ebi.ac.uk/paa/, is a software system which assists protein annotators in the task of assigning functions to newly sequenced proteins. Working backwards from SwissProt, a database which describes known proteins, and a prior sequence similarity search that returns a list of known proteins similar to a query, PAA suggests keywords and phrases which may describe functions performed by the query. In a preprocessing step, a database is built from the protein names that appear in the SwissProt database, and against each protein are listed keywords and phrases that are extracted from the corresponding text records. Common words, either in general English usage or from the biological domain, are removed as the phrases are assembled. This process is assisted by the use of a simple stemming algorithm, which extends the list of stop-words (i.e. reject words), together with a list of accept-words. At runtime, the search algorithm, invoked by a user via a Web interface, takes a list of protein names and clusters the named proteins around keywords/phrases shared by members of the list. The assumption is that if these proteins have a particular keyword/phrase in common, and they are related to a query protein, then the keyword/phrase may also describe the query. Overall, PAA employs a number of IR techniques in a novel setting and is thus related to text categorization, where multiple categories may be suggested, except that in this case none of the categories are specified in advance.

1. Introduction

Proteins are made up of long, folded chains (or sequences) of amino-acids.
There are 20 amino-acids, and although they are, in fact, 3-dimensional molecules, for some purposes they can be represented as letters drawn from the set {A, C, D, E, F, G, H, I, K, L, M, N, P, Q, R, S, T, V, W, Y}. Proteins, therefore, can be viewed as strings drawn from that alphabet. Information about proteins is found in text format databases such as SwissProt (Bairoch & Apweiler, 1999); each record contains the sequence together with related information such as its name, an accession number, bibliographic information, a descriptor and perhaps some keywords and comments. (For a complete description see Bairoch & Apweiler (1998).)

A protein sequence is discovered either through physical/chemical means or, more commonly, through the translation of an Open Reading Frame, or ORF, i.e. a DNA sequence that is potentially protein-coding. Whichever way the protein is discovered, its function is not normally known, so one way to hypothesize its function is to look at the functions of similar proteins. Similarity to other proteins is typically established using sequence alignment programs such as BLAST (Altschul et al, 1990) or FASTA (Pearson & Lipman, 1988), which are approximations of the Smith-Waterman algorithm (Smith & Waterman, 1981) [3]. That is, using one of these systems the sequence is run as a query against the sequences in a protein sequence database, and the system returns a list of matches in order of decreasing similarity.

[1] This article appeared in the Journal of the American Society for Information Science (JASIS), 51(12), October 2000, pp. 1131-1136.
[2] Dr Wise is a Senior Research Fellow at Pembroke College, Cambridge and CCSR under a grant from Bristol-Myers Squibb. He is also a Visiting Researcher at the European Bioinformatics Institute.
[3] For a tutorial introduction to biosequence comparison, see Setubal & Meidanis (1997).

Table 1 below contains edited output from the BLAST program where the known sequence CRB_DROME (accession number P10040) has been used as the query. (Protein identifiers are typically of the form <protein type>_<species>, where DROME is short for Drosophila melanogaster.) The final column of the output contains estimates of the probability that the matches could have occurred by chance.

Table 1: Sample BLAST Output

Sequences producing High-scoring Segment Pairs:              Score  P(N)
CRB_DROME   P10040  drosophila melanogaster (fruit fly)....  12030  0.0
NOTC_XENLA  P21783  xenopus laevis (african clawed frog)...   1567  6.5e-232
NTC1_RAT    Q07008  rattus norvegicus (rat). neurogenic ...   1494  2.4e-218
NOTC_DROME  P07207  drosophila melanogaster (fruit fly)....   1495  5.5e-215
NOTC_BRARE  P46530  brachydanio rerio (zebrafish) (zebra...   1414  1.8e-205
NTC3_MOUSE  Q61982  mus musculus (mouse). neurogenic loc...   1375  7.0e-196
FBP1_STRPU  P10079  strongylocentrotus purpuratus (purpl...   1698  6.1e-184
NTC4_MOUSE  P31695  mus musculus (mouse). neurogenic loc...   1087  2.4e-158
FBN1_HUMAN  P35555  homo sapiens (human). fibrillin 1 pr...    668  5.4e-95
FBN2_MOUSE  Q61555  mus musculus (mouse). fibrillin 2 pr...    601  7.2e-89
FBN2_HUMAN  P35556  homo sapiens (human). fibrillin 2 pr...    604  4.2e-88
LI12_CAEEL  P14585  caenorhabditis elegans. lin-12 prote...    749  2.1e-74
SERR_DROME  P18168  drosophila melanogaster (fruit fly)....    733  1.8e-70
FBP3_STRPU  P49013  strongylocentrotus purpuratus (purpl...    643  2.6e-64
TGFB_HUMAN  P22064  homo sapiens (human). latent transfo...    447  3.5e-61
DLL1_RAT    P97677  rattus norvegicus (rat). delta-like ...    632  1.4e-59
DLL1_HUMAN  O00548  homo sapiens (human). delta-like pro...    632  2.6e-58
TGFB_RAT    Q00918  rattus norvegicus (rat). latent tran...    433  5.6e-58
DL_DROME    P10041  drosophila melanogaster (fruit fly)....    559  4.1e-51
GLP1_CAEEL  P13508  caenorhabditis elegans. glp-1 protei...    563  1.3e-50
YNX3_CAEEL  P34576  caenorhabditis elegans. hypothetical...    434  1.2e-49
SLIT_DROME  P24014  drosophila melanogaster (fruit fly)....    521  1.4e-48
LRP1_CHICK  P98157  gallus gallus (chicken). low-density...    264  9.2e-44
LRP1_HUMAN  Q07954  homo sapiens (human). low-density li...    281  1.3e-42
PGBM_MOUSE  Q05793  mus musculus (mouse). basement membr...    212  3.2e-36
LML2_CAEEL  Q21313  caenorhabditis elegans. laminin-like...    262  1.3e-35
PGBM_HUMAN  P98160  homo sapiens (human). basement membr...    212  1.6e-35
LMA_DROME   Q00174  drosophila melanogaster (fruit fly)....    232  1.7e-34
TENA_HUMAN  P24821  homo sapiens (human). tenascin precu...    409  2.2e-34

If sufficient similarity is found between two sequences an inference is made that the two sequences are homologous, i.e. have diverged over evolutionary time from a common ancestor and therefore that they may be functionally related.

In practice, if the experimenter has tried a query protein against SwissProt using BLAST or FASTA, and found that many of the hits have similar descriptors, then it is reasonable to assume that these may represent a common function. At other times the relationships between the database hits may not be clear, e.g. if the query protein is a mosaic protein, i.e. contains multiple functional domains which together contribute to the overall function. When this occurs, the task then becomes one of discerning the functions shared between the set of proteins which are similar to a query, and the query.

The Protein Annotators’ Assistant, or PAA, assists biologists with the process of ascribing functions to unknown proteins. PAA, which will be described in detail below, leverages the excellent work done by the curators of SwissProt and performs keyword/phrase clustering based on a set of protein names (or accession identifiers), drawn from SwissProt, that the user enters via a Web interface. What is returned to the user is a list of keywords/phrases, and with each keyword/phrase, a list of the proteins containing that keyword/phrase.
The principle is that if these proteins have a particular keyword/phrase in common, and they are related to a query protein, then the keyword/phrase may also describe the query.

A system with somewhat similar intent but rather different methodology is described in Andrade & Valencia (1998). In this system, Medline articles are retrieved for 71 families of closely related sequences [4] and their keywords extracted and compared to a background distribution of keywords extracted from other, unrelated families of sequences. Specifically, for each word the frequency of its occurrences in the abstracts for a protein family is compared to the occurrences across the database as a whole. A very simple form of stemming is used, accepting words as being similar when their endings differ by at most two characters and the words are longer than 5 characters. A word is then considered significant if it appears across a number of abstracts within a limited number of families, but infrequently across the database as a whole. The members of the families are also constrained so that no mosaic proteins are included as these can span families.

Another approach, adopted by many biologists, is to use the excellent genomic/proteomic database search engines SRS (Etzold et al, 1996) or Entrez (Schuler et al, 1996) to do manually much of what PAA is able to achieve automatically. Both SRS and Entrez accept either protein names or accession numbers, or combinations of keywords, and return the matching records. In practice, after the user has extracted a number of the BLAST/FASTA hits and noticed common words appearing, the same system can be used to extract other records with the same keywords.

2. System and Methods

The software system that makes up PAA consists of two parts: an off-line keyword/phrase database creation application and the online query-processing application.

2.1 Keyword/Phrase Extraction

Keyword/phrase extraction from the source database, SwissProt (currently version 38, containing 80,000 sequences), is undertaken only once, by the systems administrator, upon the arrival of a new version of SwissProt. The object of this stage is to produce a subsidiary database which maps each protein record into a set of keywords/phrases that describe the protein. Data for this process is extracted from the KW (keyword), DE (descriptor) and CC (comment) fields. The KW field is structured (a semicolon-separated list of keywords/phrases) and has a controlled vocabulary. Semicolon has therefore been adopted as the overall keyword/phrase separator. The DE and CC fields have partially controlled syntax and vocabularies (Bairoch & Apweiler, 1998).

In the first stage, data from these three fields are extracted and some syntactic preprocessing is carried out. For example, punctuation characters except for hyphen are converted to word/phrase separators and text split over two lines is rejoined; hyphens are converted to spaces. However, care must be taken even with this relatively simple task. For example, double quote can be removed but single quote is generally retained (3’ and 5’ are significant), except for possessives (i.e. ’s). Care must also be taken with chemical identities such as "NAD+" and "H(2)O", because parentheses and + are otherwise translated into separators.

The second stage is where the keywords/phrases are recognized. The primary vehicle for this stage is a list of stop-words, i.e. words that are common English particles, which are rejected by being converted to separators. Because PAA undertakes categorization/keyword-clustering in a particular domain, a large stop-word list of 24,611 words is used. The stop-words list is based on a lexicon taken from one of the Linux distributions and includes most of the more common English words, which are of no interest in this context.
Unfortunately, the source list contained a number of terms from chemistry, biochemistry and biology (e.g. "alanine", "acetate") that are potentially relevant, together with some common words that happen to coincide with domain-relevant specialist terms (often acronyms), e.g. "air", "box", "camp", "dump" and "gap". These have been removed from the stop-words list.

[4] The starting point is proteins with known 3-D structure taken from the Protein Data Bank (Bernstein et al, 1977) which have very low levels of similarity to each other. Around each of these seeds are grouped other proteins which have substantial similarity to their respective seeds. Each such family must also have at least 5 members. The result is 71 families.

In addition, the stop-words list is extended by the use of stemming rules, so, for example, because "group" is in the stop-words list, "groups", "grouping" and "grouped" are assumed to also be in the stop-words list and if encountered are converted to separators. This dictionary-based stemming is similar to Krovetz (1993), except that the latter applies stemming to the final dictionary, rather than the stop-words list, and there is no counterpart to the accept-words list as described below. The stemming rules currently implemented are simpler than Lovins (1968) or Porter (1980) and cover the following suffixes:
• -ing (including the change from a word with a final "e", and double final consonants)
• -s (including "es" and "ies")
• -ed (including "ied", simple addition of "d" to a word already ending in "e", and double final consonants)
• -ly (including "ily")

It should be borne in mind that deficiencies in either the stop-words list or in the stemming algorithm are less serious when applied to a stop-words list than when applied to the final list of keywords/phrases; mistakes will generally only result in some additional words being included in the list of keywords/phrases and thus flowing on to the runtime clustering application.

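The suffix rules listed above can be sketched in Python, the language in which PAA itself is written. This is only an illustration under stated assumptions, not PAA's actual code: the function names candidate_stems and is_stop_word are hypothetical, the small stop-words set is made up for the example, and the rules are approximated by generating every base form a suffix might have come from and testing each against the list.

```python
def candidate_stems(word):
    """Return base forms `word` could derive from under the four suffix rules."""
    w = word.lower()
    stems = set()
    # -s, including "es" and "ies"
    if w.endswith("ies"):
        stems.add(w[:-3] + "y")       # studies -> study
    if w.endswith("es"):
        stems.add(w[:-2])             # classes -> class
    if w.endswith("s"):
        stems.add(w[:-1])             # groups -> group
    # -ing and -ed share the final-"e" and doubled-consonant cases
    for suffix in ("ing", "ed"):
        if w.endswith(suffix):
            base = w[:-len(suffix)]
            stems.add(base)           # grouping -> group, grouped -> group
            stems.add(base + "e")     # making -> make, used -> use
            if len(base) >= 2 and base[-1] == base[-2]:
                stems.add(base[:-1])  # stopped -> stop
    if w.endswith("ied"):
        stems.add(w[:-3] + "y")       # studied -> study
    # -ly, including "ily"
    if w.endswith("ily"):
        stems.add(w[:-3] + "y")       # easily -> easy
    if w.endswith("ly"):
        stems.add(w[:-2])             # quickly -> quick
    stems.discard("")
    return stems


def is_stop_word(word, stop_words):
    """True if the word, or any base form it may derive from, is a stop-word."""
    w = word.lower()
    return w in stop_words or any(s in stop_words for s in candidate_stems(w))


# Illustrative stop-words only; the real list has 24,611 entries.
STOP_WORDS = {"group", "make", "stop", "study", "easy"}

for w in ("groups", "grouping", "grouped", "stopped", "easily", "kringle"):
    print(w, is_stop_word(w, STOP_WORDS))
```

A rule that this sketch misses only lets an extra word through to the clustering stage, which is the benign failure mode described above.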
As well as the stop-words list, an accept-words list containing 176 words has also been created. The accept-words list has a twofold purpose:
• It is possible that a word is significant, but a similar word also appears in the stop-words list and the significant word can be manufactured from the stop-word using the stemming rules. In that case, the significant word must be retained explicitly. One example is "seed", which is otherwise formed through the -ed stemming rule applied to "see". The other 3 words in this category are "humps", "red" and "nod", which are all acronyms; "hump", "re" and "no" are in the stop-words list. The alternative would be to complicate the stemming rules far beyond what is warranted by the application.
• The majority of entries in the accept-words list are words that are marked as conditional. These are words, such as "acid", "protein" or "sequence", that may be significant as part of a phrase, but which are too common in this domain to convey useful information by themselves.

Phrases are built up by the concatenation of words which are in the accept-words list or at least are not in the stop-words list. In a post-processing step, empty phrases are removed, as are any phrases which consist solely of words that have been marked as conditional in the accept-words list, e.g. "amino-acid sequence". Note that duplicate references to the same keyword/phrase are ignored.

By way of example, processing of the record for AZUR_ALCDE (Azurin Precursor, AC P00280) yields the list of keywords/phrases:

azurin, azurin precursor, copper, cytochrome c551, cytochrome oxidase, electron transport, periplasmic, plastocyanin, signal, transfer electrons

2.2 Runtime Application

The runtime application is invoked by users interacting via the Internet with a CGI script. Specifically, users enter lists of protein names or accession numbers that they believe to be related.
In the first stage, the application takes the protein identifiers or accession numbers and retrieves the corresponding lists of keywords/phrases. Then, working backwards from the keywords/phrases, for each keyword/phrase a list is created containing the names of the proteins which mention that keyword/phrase.

In the next stage, the lists of protein names corresponding to sub-phrases of other phrases are collapsed based on commonality of stemmed whole words. For example, the union of the lists of proteins corresponding to "cytochromes" and "cytochrome c551" becomes the list for "cytochrome"; the list for "cytochrome c551" remains untouched. If the shared sub-phrase does not already exist, a new keyword is created and has the union of the input phrases recorded against it, e.g. "electron transport" and "oxygen transport" yield the new keyword "transport". On the other hand, the lists corresponding to "trypsin" and "chymotrypsin" will not be collapsed, because only whole words are matched.

The process of finding shared sub-phrases and collapsing their lists is the most time-consuming part of the runtime application because there are typically several hundred unique keywords/phrases across the set of input proteins. To speed up the process a superimposed code-word (Gabbe et al, 1978; Roberts, 1979), 256 bits in length, is created for each keyword/phrase, where the stemmed, unconditional keywords in the phrase each contribute one bit based on their hash-value. Then, as each pair of keywords/phrases is to be compared, a bitwise AND is taken of their respective code-words; only if the result is non-zero are their hash values compared, and only if that comparison succeeds are the words themselves compared. In the case of the UROT_HUMAN example which will be discussed below, 52,179 non-matches are caught by the code-word test and 288 non-matches are resolved by the hash-value comparisons on the component keywords, leaving 183 genuine cases of shared sub-phrases.

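The three-stage test (code-word, then hash value, then the word itself) can be sketched as follows. This is a hedged illustration, not the PAA source: the names word_hash, signature and shared_words are hypothetical, CRC-32 merely stands in for whatever hash function PAA uses, and the phrases are assumed to have been stemmed and stripped of conditional words already.

```python
import zlib

CODE_BITS = 256  # code-word length given in the text


def word_hash(word):
    # Assumption: CRC-32 stands in for PAA's (unspecified) hash function.
    return zlib.crc32(word.encode("utf-8"))


def signature(phrase):
    """Precompute (code-word, hash values, words) once per keyword/phrase."""
    words = phrase.split()
    hashes = [word_hash(w) for w in words]
    code = 0
    for h in hashes:
        code |= 1 << (h % CODE_BITS)  # each word contributes one bit
    return code, hashes, words


def shared_words(sig_a, sig_b):
    """Return the whole words shared by two phrases, cheapest test first."""
    code_a, hashes_a, words_a = sig_a
    code_b, hashes_b, words_b = sig_b
    # Stage 1: disjoint code-words prove that no word is shared.
    if (code_a & code_b) == 0:
        return set()
    shared = set()
    for ha, wa in zip(hashes_a, words_a):
        for hb, wb in zip(hashes_b, words_b):
            # Stage 2: compare hash values; stage 3: compare the words.
            if ha == hb and wa == wb:
                shared.add(wa)
    return shared


print(shared_words(signature("electron transport"), signature("oxygen transport")))
print(shared_words(signature("trypsin"), signature("chymotrypsin")))
```

Because a shared word always sets the same bit in both code-words, the first stage can reject a pair wrongly only in the harmless direction: it may let a non-matching pair through to the later stages, but it never discards a genuine match.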
Once the check for shared sub-phrases has been completed, keywords common to at least two proteins and their corresponding lists of proteins are reported. At this stage, the information input by the user can be further used to provide an ordering for the keywords/phrases. If just the names of proteins were entered, the list of keywords is ordered simply by the number of proteins which share a particular keyword/phrase. That is, each protein in a list scores 1. On the other hand, if the user also enters match probability values, e.g. by cutting-and-pasting from a BLAST output such as Table 1, these are converted into scores by taking -log2(x), where x is the match probability, with a ceiling value of 100 so strong matches do not swamp weaker ones, and a floor value of 1, because a weak match should not attract a lower score than would be used if no probability information had been provided. Each keyword/phrase associated with a particular protein is given the same score, so from the point of view of the (shared) keywords/phrases, the total scores are the sums of the contributions from the proteins in their respective lists.

2.3 Implementation and What the User Sees

The keyword/phrase extraction application and the runtime application have both been written in Python (www.python.org), which has greatly reduced the development time. As described above, users interact with a Web page in which they enter the names or accession numbers of the proteins which are to be clustered. Once the data has been entered, a CGI script (also written in Python) is invoked which strips away any extraneous information, leaving a list of pairs of proteins and probability values (or just 1, as described above). However, rather than passing the list directly to the clustering functions, it is transmitted via a socket pair to a server application which performs the clustering. (Socket programming in Python is discussed in Watters et al (1996).
This implementation extends basic socket services by implementing buffered transmission of variable-length messages.) Use of this internal client-server arrangement allows decoupling of the IO-bound CGI script from the more compute-intensive clustering functions, so the two functions can now be placed on different computers. The server application is currently single-threaded, but conversion to multi-threaded execution is also possible and may be introduced in later versions of the application. Finally, the decoupling also helps improve the security of the clustering application.

Once clustering has completed, users are presented, on a new Web page, with three lists together with the results of the clustering. One list contains those proteins not represented in the database while a second has those proteins which have no keywords/phrases in common with any other proteins in the input list. A third list simply confirms the input list supplied by the user. The principal table contains keywords/phrases shared by at least two of the proteins in the input list, and for each such keyword or phrase, a list of the proteins mentioning it. Users can then click on a protein corresponding to a particular keyword/phrase and have the SwissProt record for that protein presented with the keyword/phrase highlighted. (The retrieval of the SwissProt record and marking with the keyword/phrase is also done by the clustering server.)

3. Examples

To see how PAA might be used, the (known) protein UROT_HUMAN (Human Tissue Plasminogen Activator Precursor, AC P00750) is here used as a query. Tissue Plasminogen Activator Precursor, or TPA, is an example of a mosaic protein. Specifically, TPA acts to convert inactive plasminogen into its active form, plasmin, which in turn acts to dissolve blood clots. TPA contains five domains: a fibronectin type I domain, an EGF domain, two kringle domains and a serine protease domain (Ellis, 1999a).
The list of keywords/phrases extracted for UROT_HUMAN is: activase, acute ischemic, acute myocardial infarction, ais, alteplase, alternative splicing, arginine, bm, cell migration, cleavage, destruction, disulfide bonds, ec3.4.21.68, egf, embolism, fibrin, fibrinolysis, fibronectin type, glycoprotein, hydrolase, hydrolyzing, kringle, pe, peptidase, plasma, plasmin, plasminogen, plasminogen activation, plasminogen activator precursor, protease domain, remodeling, retavase, reteplase, s1, serine protease, signal, t plasminogen activator, tpa, trypsin, valine bond The amino acid sequence for UROT_HUMAN was compared with the other proteins in SwissProt using the Washington University implementation of BLAST, WU-BLAST 2.0 (http://blast.wustl.edu/). From the ranked list of hits (similar in form to Table 1), ignoring all references to TPA from other species, to Salivary Plasminogen Activator and to the closely related Urokinase (Ellis, 1999b), representatives were selected from the remaining hits using the following strategy: when a new protein type is encountered in the list of hits, the first example is selected and all other hits of the same protein type but different species are ignored on the assumption that, given the level of control exercised by the SwissProt curators, all references for a given protein type will have similar annotations. Using the list of representatives, which contains 32 proteins, PAA is largely able to reconstruct the list of keywords/phrases taken directly from the record of UROT_HUMAN; the italicized entries in the list for UROT_HUMAN above are among the keywords/phrases found by PAA based on the representative protein set. Interestingly, the phrase blood coagulation does not appear in the record for TPA, but is discovered by PAA. Apart from being an interesting example of a mosaic protein, UROT_HUMAN was chosen because it is well characterized. 
Of course, PAA will normally be used in cases where far less is known about the proteins in question. However, as a sort of null hypothesis, groups of 100 proteins have been selected at random and their names presented to PAA. What typically occurs is that a small number are listed as being unrelated to any other protein in the input list, while the majority are only related via quite general keywords and even then, the groups tend to be small. For example, in one such experiment, 17 proteins of the 100 shared the keyword "transmembrane".

4. Discussion

Many of the strengths of PAA stem directly from the use of SwissProt as the source database, and stand in contrast to the system of Andrade and Valencia, which draws its list of keywords from Medline abstracts. Because the content of the KW (keyword) fields comes from a controlled vocabulary, while the CC (comment) and DE (descriptor) fields have controlled syntax and semi-controlled vocabulary, common concepts are more likely to be rendered by the same words/phrases than is possible for free-text abstracts written by individual authors and collected by services such as Medline. This appears to contradict recent research which has queried the usefulness of controlled vocabularies for information retrieval, e.g. Lewis & Sparck Jones (1996). However, in this context, where the application is working backwards from proteins to the keywords that they might have in common, it is because the vocabulary is relatively controlled that clustering is able to take place in the face of terminology which itself is still a matter of debate; see, for example, the continuing work of the International Union of Biochemistry and Molecular Biology (http://www.chem.qmw.ac.uk/iubmb/). Further robustness and generality are added by the runtime algorithm for collapsing lists of proteins based on shared sub-phrases, because a keyword may be used in slightly different contexts and it will still be recognized.

On the other hand, a problem with abstracts generally is that they must be short, and are therefore unable to canvass the assumed knowledge behind the specific topic being addressed by a complete paper. However, even full papers generally have to stay very focussed and therefore must assume that the reader already possesses much of the background knowledge. By contrast, the SwissProt curators bring a range of resources to the task of finding appropriate keywords and writing comments. For example, the SwissProt record for AZUR_ALCDE, discussed above, lists five papers discussing AZUR_ALCDE. While AZUR_ALCDE can be identified through the five Medline abstracts as a blue copper protein, the abstracts are predominantly interested in its structure; the functions that are characteristic of blue copper proteins are not discussed. The SwissProt record, however, does have some of this information.

Looking further at the two systems, the technique employed by PAA, in which the user specifies the set of proteins that are said to be related, sidesteps the issue of gathering a sufficient number of similar proteins to form a family, whose words can be contrasted with those of other protein families in the database. On the other hand, PAA relies on the correctness of the a priori assignment of words into the stop-words database and to a lesser extent the accept-words database, but given the defined domain, this can be achieved at reasonable cost and will only change slowly. In addition, with PAA there is no limitation on the use of mosaic proteins, which might span families, and PAA does not require families of related proteins to be found before an analysis can be carried out, which means that it can better cope with proteins which are unrelated to any members from known families. Most broadly, PAA illustrates what can be done if curated databases exist for particular domains.

In its overall function, PAA most resembles the recent work on automated text categorization, e.g.
Lam & Ho (1998) or Dumais et al (1998), which generally aims to assign texts such as news stories to one or more predefined categories. PAA is able to leverage the fact that it operates in a limited domain with a well curated resource and is able, in real time, to suggest multiple categories which have not been specified in advance. PAA can thus be viewed as performing unsupervised classification.

5. References

Altschul, Stephen F., Gish, Warren, Miller, Webb, Myers, Eugene W., & Lipman, David J. (1990), Basic Local Alignment Search Tool. Journal of Molecular Biology 215(3), pp. 403-410.
Andrade, Miguel A. & Valencia, Alfonso (1998), Automatic Extraction of Keywords from Scientific Text: Application to the Knowledge Domain of Protein Families. Bioinformatics 14, pp. 600-607.
Bairoch, A. & Apweiler, R. (1999), The SWISS-PROT Protein Sequence Data Bank and its Supplement TrEMBL in 1999. Nucleic Acids Research 27, pp. 49-54.
Bairoch, Amos & Apweiler, Rolf (1998), The SWISS-PROT Protein Sequence Database User Manual (Release 37), (http://www.expasy.ch/txt/userman.txt), December 1998.
Bernstein, F.C., Koetzle, T.F., Williams, G.J., Meyer, E.E., Jr, Brice, M.D., Rodgers, J.R., Kennard, O., Shimanouchi, T., & Tasumi, M. (1977), The Protein Data Bank: a computer-based archival file for macromolecular structures. Journal of Molecular Biology 112(3), pp. 535-542 (http://www.rcsb.org/pdb/).
Dumais, Susan, Platt, John, Heckerman, David, & Sahami, Mehran (1998), Inductive Learning Algorithms and Representations for Text Categorization. Seventh International Conference on Information and Knowledge Management (CIKM’98), Bethesda, MD, USA, pp. 148-155, ACM.
Ellis, V. (1999a), Plasminogen Activators. In Thomas E. Creighton (ed.), Encyclopedia of Molecular Biology, pp. 1865-1866, John Wiley.
Ellis, V. (1999b), Urokinase. In Thomas E. Creighton (ed.), Encyclopedia of Molecular Biology, pp. 2728-2729, John Wiley.
Etzold, T., Ulyanov, A., & Argos, P. (1996), SRS: information retrieval system for molecular biology data banks. In Russell F. Doolittle (ed.), Computer Methods for Macromolecular Sequence Analysis, pp. 114-128, Academic Press.
Gabbe, J. D., London, T. B., Miller, R. E., & Beyer, J. D. (1978), Applications of Superimposed Coding to Partial-Match Retrieval. Computer Software and Applications Conference (COMPSAC’78), Chicago, U.S.A., pp. 464-469, IEEE.
Krovetz, Robert (1993), Viewing Morphology as an Inference Process. Sixteenth Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, Pittsburgh, USA, pp. 191-202, ACM.
Lam, Wai & Ho, Chao Yang (1998), Using a Generalized Instance Set for Automatic Text Categorization. 21st Annual International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR’98), Melbourne, Australia, pp. 81-89.
Lewis, David D. & Sparck Jones, Karen (1996), Natural Language Processing for Information Retrieval. Communications of the ACM 39(1), pp. 92-101.
Lovins, Julie Beth (1968), Development of a Stemming Algorithm. Mechanical Translation and Computational Linguistics 11, pp. 22-31.
Pearson, William R. & Lipman, David J. (1988), Improved Tools for Biological Sequence Comparison. Proceedings of the National Academy of Sciences U.S.A. 85, pp. 2444-2448.
Porter, M. F. (1980), An Algorithm for Suffix Stripping. Program 14(3), pp. 130-137.
Roberts, Charles S. (1979), Partial-Match Retrieval via the Method of Superimposed Codes. Proceedings of the IEEE 67(12), pp. 1624-1642.
Schuler, G. D., Epstein, J. A., Ohkawa, H., & Kans, J. A. (1996), Entrez: molecular biology database and retrieval system. In Russell F. Doolittle (ed.), Computer Methods for Macromolecular Sequence Analysis, pp. 141-162, Academic Press (Methods in Enzymology, Vol 266).
Setubal, João & Meidanis, João (1997), Introduction to Computational Molecular Biology. PWS Publishing.
Smith, T. F. & Waterman, M. S. (1981), Identification of Common Molecular Subsequences. Journal of Molecular Biology 147, pp. 195-197.
Watters, Aaron, van Rossum, Guido, & Ahlstrom, James C. (1996), Internet Programming with Python. M&T Books (MIS Press).