Bioinformatics Search
Lichy Han, Winn Haynes, Maulik Kamdar, Alejandro Schuler

Background
GitHub has over 20 million repositories and is growing exponentially [1][2]. While this presents a great opportunity, a major challenge is sifting through vast amounts of source code. The bioinformatics domain relies heavily on shared software packages like Bioconductor to develop effective queries [3]. The 2000+ packages in Bioconductor are annotated with free-text descriptions and a structured classification into predefined categories. Our search engine uses these data to prioritize the GitHub projects most relevant to bioinformatics users. The development of effective search approaches for open-source code repositories will enable utilization of prior work to improve efficiency, promote collaboration, and encourage innovation.
Fig 1. GitHub repository growth rate [1]

Ontology
Using the existing hierarchy of packages in Bioconductor, we created an OWL-based ontology using WebProtégé [4]. For each package, we extracted its "biocViews", the keywords that Bioconductor assigns to each package, and these served as the basis for our ontology. For example, if we were interested in clustering, we would look at the "Clustering" node, which is a subclass of "StatisticalMethod" and "Software".
Fig 2. WebProtégé ontology
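
As a concrete illustration, the subclass relations behind such biocViews keywords can be represented directly in Python. This is only a minimal sketch: "Clustering", "StatisticalMethod", and "Software" are related as described above, but the other entries and the helper name "ancestors" are assumptions for illustration, not our ontology file.

# Minimal sketch of biocViews subclass relations (illustrative slice only).
# Each key is a biocViews term; its value lists that term's parent classes.
BIOCVIEWS_PARENTS = {
    "Software": [],
    "StatisticalMethod": ["Software"],
    "Clustering": ["StatisticalMethod", "Software"],
    "FlowCytometry": ["Software"],  # placement assumed for illustration
}

def ancestors(term, parents=BIOCVIEWS_PARENTS):
    """Return every ancestor of a term by walking its subclass relations."""
    found = set()
    for parent in parents.get(term, []):
        found.add(parent)
        found |= ancestors(parent, parents)
    return found

# ancestors("Clustering") == {"StatisticalMethod", "Software"}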

Problem Solving Methods
Fig 3. Overview of problem solving methods
1. The query is compared to all GitHub project documentation and Bioconductor package documentation, producing a query-to-project and a query-to-package similarity score for each project and package, respectively. Scores are calculated as the cosine similarity of the query's term frequency vector against precomputed term frequency-inverse document frequency (TF-IDF) indices of the package and project descriptions.
2. The query-to-package scores are diffused across the pre-built ontology. This process is conceptually equivalent to smoothing, where the distance between two packages is their distance in the ontology structure.
3. A project's final score is its project-to-query similarity score plus the smoothed query-to-package score for each package used in the project.
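
The three steps above can be sketched in Python. This is a hedged illustration, not our implementation: scikit-learn's TfidfVectorizer stands in for the precomputed TF-IDF indices, the function and parameter names (score_projects, package_distance, decay) are hypothetical, and the exponential decay is just one plausible smoothing kernel.

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def score_projects(query, project_docs, package_docs,
                   project_to_packages, package_distance, decay=0.5):
    """Rank GitHub projects for a query string.

    project_docs / package_docs: dicts mapping name -> description text.
    project_to_packages: dict mapping project -> packages it uses.
    package_distance: function giving the ontology distance between packages.
    """
    # Step 1: TF-IDF cosine similarity of the query to every document.
    def similarities(docs):
        names = list(docs)
        vectorizer = TfidfVectorizer()
        doc_matrix = vectorizer.fit_transform(docs[n] for n in names)
        sims = cosine_similarity(vectorizer.transform([query]), doc_matrix)
        return dict(zip(names, sims.ravel()))

    proj_sims = similarities(project_docs)
    pkg_sims = similarities(package_docs)

    # Step 2: diffuse query-to-package scores across the ontology; each
    # package also gets credit from other packages, discounted by distance.
    smoothed = {
        pkg: sum(s * decay ** package_distance(pkg, other)
                 for other, s in pkg_sims.items())
        for pkg in pkg_sims
    }

    # Step 3: final score = the project's own query similarity plus the
    # smoothed score of every Bioconductor package the project uses.
    final = {
        proj: sim + sum(smoothed.get(p, 0.0)
                        for p in project_to_packages.get(proj, ()))
        for proj, sim in proj_sims.items()
    }
    return sorted(final.items(), key=lambda kv: kv[1], reverse=True)

Any kernel that discounts credit with distance in the hierarchy would satisfy the description above; decay ** distance is simply the easiest to state.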

Results
>> python project_search.py 'flow cytometry cluster analysis' 10
Retrieving BioConductor package description search engine from package_search_engine.pickle...
Searching for "flow cytometry cluster analysis"...
Searching document 2050/2051
Ranking your results...
Retrieving GitHub project description search engine from project_search_engine.pickle...
Searching for "flow cytometry cluster analysis"...
Searching document 18416/18417
Ranking your results...
Retrieving project-to-package index from package_index.pickle...
1. https://github.com/caipine/FlowC (1.416)
2. https://github.com/BIOFAB/fcs-analysis (1.408)
3. https://github.com/mizumot/cluster (1.309)
4. https://github.com/RGLab/flowViz (1.228)
5. https://github.com/RGLab/flowQ (1.070)
6. https://github.com/RGLab/flowStats (0.935)
7. https://github.com/cran/clustsig (0.869)
8. https://github.com/bioinfoxtra/cytometry (0.856)
9. https://github.com/cran/curvHDR (0.802)
10. https://github.com/cran/Rcell (0.772)

In our first iteration, we implemented our project to be queried from the command line with the search term of interest and the number of results to display, as in the sample run above. This required the user to have our files stored locally in order to query the system. We then built a GUI, accessible at http://tinyurl.com/biosearch-ui, which overcomes the limitations of the command line interface. Both interfaces also display the score of each project, which was used to produce the final ranked list.
Fig 4. Command line and graphical user interface (GUI) with sample query

Evaluation
To evaluate the qualitative performance of our system, we assembled a group of 14 test users and asked them to fill out a qualitative assessment. Our survey displayed results from GitHub's search engine (A) and our search engine (B) for different sample queries. Users were asked to rate the relevance of the top results from both engines and to indicate which search engine they preferred overall.
Fig 5. Sample survey questions
Fig 6. Overall search engine preference
Fig 7. Relevance for individual results

References
[1] Doll, B. (2013). 10 Million Repositories. Retrieved from https://github.com/blog/1724-10-million-repositories
[2] GitHub. (n.d.). GitHub API v3. Retrieved February 27, 2015, from https://developer.github.com/v3/
[3] Bioconductor. (n.d.). Retrieved February 27, 2015, from http://www.bioconductor.org/packages/release/BiocViews.html#___Software
[4] WebProtege. (n.d.). Retrieved February 27, 2015, from http://webprotege.stanford.edu/#List:coll=Home;