Bioinformatics Search
Lichy Han, Winn Haynes, Maulik Kamdar, Alejandro Schuler

Background

GitHub hosts over 20 million repositories and is growing exponentially [1][2]. While this presents a great opportunity, a major challenge is sifting through vast amounts of source code. The bioinformatics domain relies heavily on shared software packages such as those in Bioconductor [3]. The 2000+ packages in Bioconductor are annotated with free-text descriptions and a structured classification into predefined categories. Our search engine uses these data to develop effective queries and prioritize the GitHub projects most relevant to bioinformatics users. The development of effective search approaches for open-source code repositories will enable utilization of prior work to improve efficiency, promote collaboration, and encourage innovation.

Fig 1. GitHub repository growth rate [1]

Ontology

Using the existing hierarchy of packages in Bioconductor, we created an OWL-based ontology with WebProtégé [4]. For each package we extracted its "biocViews", the keywords that Bioconductor assigns to every package, and these keywords served as the basis for our ontology. For example, a user interested in clustering would look at the "Clustering" node, which is a subclass of both "StatisticalMethod" and "Software".

Fig 2. WebProtégé ontology

Problem Solving Methods

Fig 3. Overview of problem solving methods

1. The query is compared to all GitHub project documentation and Bioconductor package documentation, producing a query-to-project and a query-to-package similarity score for each project and package, respectively. Scores are calculated as the cosine similarity of the query's term-frequency vector to precomputed term frequency-inverse document frequency (TF-IDF) indices of the package and project descriptions (a minimal sketch of this step appears below).
2. The query-to-package scores are diffused across the pre-built ontology. This process is conceptually equivalent to smoothing, where the distance between two packages is their distance in the ontology structure.
3. A project's final score is its query-to-project similarity score plus the smoothed query-to-package score of each package used in the project (a sketch of steps 2 and 3 follows the first sketch below).
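The following is a minimal sketch of how the step 1 scoring could be computed, assuming scikit-learn's TfidfVectorizer and cosine_similarity; the function names, the example package descriptions, and the preprocessing choices are illustrative assumptions, not the actual project_search.py implementation.

# Sketch of step 1: score a query against package/project descriptions
# using a precomputed TF-IDF index and cosine similarity (illustrative only).
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity


def build_index(descriptions):
    """Precompute a TF-IDF index over a list of free-text descriptions."""
    vectorizer = TfidfVectorizer(stop_words="english")
    matrix = vectorizer.fit_transform(descriptions)   # documents x terms
    return vectorizer, matrix


def query_scores(query, vectorizer, matrix):
    """Return one similarity score per indexed document for the query."""
    query_vec = vectorizer.transform([query])          # 1 x terms
    return cosine_similarity(query_vec, matrix)[0]


# Hypothetical data standing in for Bioconductor package descriptions.
package_docs = {
    "flowClust": "Model-based clustering of flow cytometry data",
    "limma": "Linear models for microarray and RNA-seq data",
}
vec, idx = build_index(list(package_docs.values()))
scores = query_scores("flow cytometry cluster analysis", vec, idx)
for name, score in zip(package_docs, scores):
    print(f"{name}: {score:.3f}")

The same indexing step would be run once for package descriptions and once for project descriptions, yielding the two similarity scores described in step 1.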
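One way to read steps 2 and 3 is as distance-weighted smoothing of the package scores over the biocViews hierarchy, followed by adding each project's smoothed package scores to its own query similarity. The sketch below assumes a tiny stand-in ontology represented as a networkx graph, an exponential decay with ontology distance, and hypothetical package-to-term assignments; none of these choices are specified by the poster.

# Sketch of steps 2-3: diffuse query-to-package scores over the biocViews
# ontology, then combine with the query-to-project score (illustrative only).
import networkx as nx

# Tiny stand-in for the biocViews hierarchy (edges between terms).
ontology = nx.Graph([
    ("Clustering", "StatisticalMethod"),
    ("StatisticalMethod", "Software"),
    ("FlowCytometry", "Technology"),
    ("Technology", "Software"),
])


def smooth_scores(package_scores, package_terms, decay=0.5):
    """Diffuse each package's score to the others, down-weighted by the
    shortest-path distance between their biocViews terms in the ontology."""
    smoothed = {}
    for pkg_a, term_a in package_terms.items():
        total = 0.0
        for pkg_b, term_b in package_terms.items():
            d = nx.shortest_path_length(ontology, term_a, term_b)
            total += package_scores[pkg_b] * (decay ** d)
        smoothed[pkg_a] = total
    return smoothed


def final_score(project_score, project_packages, smoothed):
    """Step 3: query-to-project similarity plus the smoothed score of every
    package used in the project."""
    return project_score + sum(smoothed.get(p, 0.0) for p in project_packages)


# Hypothetical inputs: query-to-package scores from step 1 and each package's
# primary biocViews term.
pkg_scores = {"flowClust": 0.8, "flowViz": 0.6}
pkg_terms = {"flowClust": "Clustering", "flowViz": "FlowCytometry"}
smoothed = smooth_scores(pkg_scores, pkg_terms)
print(final_score(0.4, ["flowClust", "flowViz"], smoothed))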
Results

>> python project_search.py 'flow cytometry cluster analysis' 10
Retrieving BioConductor package description search engine from package_search_engine.pickle...
Searching for "flow cytometry cluster analysis"...
Searching document 2050/2051
Ranking your results...
Retrieving GitHub project description search engine from project_search_engine.pickle...
Searching for "flow cytometry cluster analysis"...
Searching document 18416/18417
Ranking your results...
Retrieving project-to-package index from package_index.pickle...
1. https://github.com/caipine/FlowC (1.416)
2. https://github.com/BIOFAB/fcs-analysis (1.408)
3. https://github.com/mizumot/cluster (1.309)
4. https://github.com/RGLab/flowViz (1.228)
5. https://github.com/RGLab/flowQ (1.070)
6. https://github.com/RGLab/flowStats (0.935)
7. https://github.com/cran/clustsig (0.869)
8. https://github.com/bioinfoxtra/cytometry (0.856)
9. https://github.com/cran/curvHDR (0.802)
10. https://github.com/cran/Rcell (0.772)

Fig 4. Command line and graphical user interface (GUI) with sample query

In our first iteration, we implemented the project to be queried from the command line with the search term of interest and the number of results to display; this required the user to have our files stored locally in order to query the system. We then built a GUI, accessible at http://tinyurl.com/biosearch-ui, which overcomes the limitations of the command-line interface. Both interfaces display the score of each result, which is used to produce the final ranked list.

Evaluation

To evaluate the qualitative performance of our system, we recruited 14 test users and asked them to complete a qualitative assessment. The survey displayed results from GitHub's search engine (A) and from our search engine (B) for several sample queries. Users were asked to rate the relevance of the top results from both engines and to indicate which search engine they preferred overall.

Fig 5. Sample survey questions
Fig 6. Overall search engine preference
Fig 7. Relevance for individual results

References

[1] Doll, B. (2013). 10 Million Repositories. Retrieved from https://github.com/blog/1724-10-million-repositories
[2] GitHub. (n.d.). GitHub API v3. Retrieved February 27, 2015, from https://developer.github.com/v3/
[3] Bioconductor. (n.d.). Retrieved February 27, 2015, from http://www.bioconductor.org/packages/release/BiocViews.html#___Software
[4] WebProtege. (n.d.). Retrieved February 27, 2015, from http://webprotege.stanford.edu/#List:coll=Home;