*Note: This cover sheet is not to exceed one page.*

ALFRED P. SLOAN FOUNDATION
www.sloan.org | proposal guidelines

PROPOSAL COVER SHEET

Project Information

Principal Investigator: Holly Bik, Postdoctoral Researcher
UC Davis Genome Center, One Shields Ave, Davis, CA 95616
Phone: 530-752-8409
Email: [email protected]
Grantee Organization: University of California, Davis
Amount Requested: $247,189
Requested Start Date: July 1, 2013
Requested End Date: June 30, 2014

Project Goal
The overarching goal of this project is to produce a new web-based scientific visualization framework for the analysis of high-throughput biological sequence data (initially focusing on rRNA amplicon data).

Objectives
Leveraging a close collaboration between a scientific visualization studio (Pitch Interactive) and researchers (PI Bik and case study participants), we aim to produce intuitive, interactive visualization tools that can be used to explore and analyze biological patterns in high-throughput environmental datasets. We will take advantage of standard file formats from computational pipelines in order to bridge the gap between biological software (e.g. QIIME) and existing data visualization capabilities (harnessing the flexibility and scalability of WebGL and HTML5).

Proposed Activities
The major project activities focus on software engineering (construction of a database framework) and user interface design (construction of novel visual presentations of data); this will be accomplished via an iterative process that incorporates feedback and requests from end users (researchers including PI Bik and case study participants). Other activities include the broader dissemination of project activities via a portal website and social media discussions, and presentation of project outputs (software products and biological findings) at scientific conferences.

Expected Products
Expected products include 1) an open source and functional web-based visualization pipeline; 2) peer-reviewed scientific manuscripts; and 3) web-based metrics documenting website views, software usage, social media sharing, and dissemination of project activities via mainstream media outlets.

Expected Outcomes
The success of this project will set new paradigms for the analysis of environmental sequence data. We anticipate that visualization frameworks will help to address ongoing bottlenecks in the analysis of large sequence datasets, promoting significant improvements in research efficiency. We anticipate that Sloan-funded Microbiology of the Built Environment grantees will particularly benefit from project outcomes.

Project Title: A Research-Driven Data Visualization Framework for High-Throughput Environmental Sequence Data

Primary issue to be addressed

The advent of high-throughput sequencing data is now ushering in a veritable renaissance in biology. For the first time, we have the ability to deeply characterize the global biodiversity of historically neglected microbial taxa via environmental sequencing approaches (investigations of bacteria, archaea, and microscopic eukaryotes using 454/Illumina sequencing platforms, e.g. Creer et al. 2010, Sogin et al. 2006). However, the sheer volume of data produced by these new technologies will require fundamentally different approaches and new paradigms for effective data analysis (Akey & Shriver 2011). Scientific visualization represents an innovative method for tackling the current bottleneck in bioinformatics; in addition to giving researchers a unique approach for exploring large datasets, it stands to empower biologists to conduct powerful analyses without requiring a deep level of computational knowledge.
Effective, sophisticated visualization tools (taking advantage of human cognition and human information-processing capabilities) can help to link information from disparate fields, propelling scientific insight and spurring new discoveries (Munzner et al. 2006). Computer algorithms face significant difficulty in identifying even simple data patterns, and thus writing algorithms for complex, subtle patterns (the type that exist in biological systems) is almost impossible. The human eye, in contrast, is very adept at spotting subtle visual patterns and can quickly notice trends and outliers (Heer et al. 2010), especially when presented with intuitive, well-designed software tools and user interfaces (Heer & Shneiderman 2012). The increasing scale of data (hundreds of millions of raw DNA sequences, equating to tens of thousands of rows in a “small” Excel spreadsheet after data processing) makes it infeasible to conduct fine-scale analyses in most existing biological software packages.

Additionally, new sequencing technologies (454, Illumina) are fundamentally different from Sanger sequencing: the nature of the data requires specific computational considerations (e.g. accounting for intragenomic variation across 18S rRNA gene copies present in eukaryotic genomes, which inherently affects our interpretation of environmental sequence data; Bik et al. 2012a). Exploratory data visualization approaches are particularly well suited for high-throughput datasets, since we do not yet understand the regularities or “classes of behavior” inherent to the underlying biological sequences (Shoresh and Wong 2012).

The development of specific visualization tools for high-throughput sequencing data would ideally complement the existing initiatives and long-term vision for the Sloan Data and Computational Research program. Such new tools represent an important next step for our ability to manage, and produce meaningful, fine-scale analyses from, increasingly large volumes of data.
The move towards “big data” now impacts virtually all areas of science and society. As a few examples, global networks of remote sensing technology produce millions of pixels per year [1], citizen science initiatives are maintaining increasingly large databases (eBird alone contains 63 million records [1]), and the social media behemoth Twitter now reports 200 million active users sending 400 million tweets per day [2]. For all such fields where large data volumes currently impede interpretation, visual tools could offer an easier way to parse data points and spur information discovery. Visual data exploration should drive the questions being asked, enabling users to conduct quick and efficient analyses without needing a high degree of programming knowledge.

The overarching goal of this project is to build a system for connecting the outputs of standard data production pipelines to a modular visualization toolset. Such a generalized framework will be customizable for many diverse applications targeting a wide variety of users (researchers, educators, citizen scientists, journalists, etc.). Open source projects, coupled with an emphasis on data sharing features and data provenance tracking within the visualization frameworks, will enable the wide dissemination of scholarly information and encourage best practices in data management and reproducibility.

Major related work in the field

There is a significant history of visualization in the biological sciences, and the development of visual tools for data analysis has long been recognized as an inherent need for the effective interpretation of data (Wong 2012). Visual tools have gained particular popularity for the study of evolutionary relationships amongst species (the field of phylogenetics), where data must be viewed and interpreted in the context of a branching tree structure (Pavlopoulos et al. 2010).
Tree viewer software tools exist by the dozens [3], and new programs that claim to “solve” the tree-viewing problem continue to emerge on a regular basis. On the other hand, the development of new visualization approaches for high-throughput sequencing data has been stagnant; most scientific publications continue to summarize data using simplistic pie charts or bar charts (example in Caporaso et al. 2010). Visualizations specifically designed for environmental sequencing data are largely limited to overview schematics such as Principal Coordinate Analysis, UPGMA clustering, rarefaction curves and OTU heatmaps [4]. These approaches can provide powerful biological inferences, but because they inherently summarize data at the community level (e.g. the entire pool of species present in a given sample), such visuals are not appropriate for investigating more fine-scale patterns (pinpointing specific taxa which underlie community differences). While phylogenetic placement tools can allow the exploration of individual lineages (e.g. pplacer; Matsen et al. 2010), use of these tools requires a certain level of computational knowledge, the file outputs are large and complex, and useful visualization of tree placement results remains problematic.

Many biologists continue to be frustrated by the existing suite of visual tools they must rely on for analysis of high-throughput sequencing data (a common theme that surfaced at the October 2012 QIIME/VAMPS meeting in Boulder, sponsored by the Sloan Microbiology of the Built Environment program). Biological software has historically been built by biologists; while these researchers have knowledge of the problems they want to address, they typically lack formal training in software engineering or user interface design. Thus, the majority of biological software tools are difficult to install, poorly documented, and require expert knowledge or training to use effectively.
Well-designed, intuitive biological software tools do exist, but they are most often proprietary (closed-source products developed and sold by for-profit companies). Software such as Geneious, Sequencher, and CLC Workbench are convenient tools for the analysis of DNA sequence data, but use of these programs requires the purchase of a software license costing hundreds to thousands of dollars. Scalable software tools that are able to handle high-throughput sequence datasets are prohibitively expensive. For example, a single-computer academic software license for the CLC Genomics Workbench is currently priced at $4,995. Proprietary software also operates in a non-transparent manner; users typically do not have full control over parameter settings (being limited to a set of options pre-defined by the software company) and are not able to gain access to crucial information about how the software processes raw data (Nekrutenko and Taylor 2012).

In addition to the tradeoff between high cost and poor software interfaces, most existing visualization tools in biology were designed before the advent of high-throughput sequencing platforms. The newest DNA sequencing technologies (Illumina and 454 platforms, amongst others) churn out biological data at an unprecedented scale, returning millions of sequence reads per single run of an instrument. Datasets routinely become larger as new platforms and instrument upgrades come onto the market, but biological software capabilities have not scaled alongside this instrument progression. For example, most tree viewer software cannot handle evolutionary trees containing more than a few thousand taxa (Page 2012). Data formats have also changed alongside the evolution of sequencing technology.
In the past, the comparatively small size of datasets (tens to hundreds of taxa) meant that biologists could manage manual curation of their sequence data and annotate any useful metadata as needed (for example, adding information about taxonomic names and sample site characteristics onto a phylogenetic tree built from DNA sequences). Since the advent of high-throughput sequencing, large-scale manual curation of sequence data and sample metadata is no longer possible. File sizes are too large, and data must be prepared and processed in computationally amenable formats (e.g. programming-friendly data values nested within tab-delimited text files). Finally, many file formats used for high-throughput sequence data, such as Illumina FASTQ files for DNA sequences and phylogenetic trees in XML format, are not supported by older biological visualization software.

In contrast, the outputs from computational pipelines specifically designed for high-throughput data are becoming increasingly standardized, particularly for rRNA amplicon studies. The QIIME pipeline (Caporaso et al. 2010) is quickly becoming standard software for the processing and analysis of rRNA amplicon data; QIIME supports broad analyses across the tree of life, with workflows for analyzing 16S rRNA genes from bacteria/archaea, 18S rRNA genes from eukaryotes, and ITS rRNA from fungi. Raw sequence reads are typically clustered into Operational Taxonomic Units (OTUs, which can be thought of as molecular “species”) using a pairwise identity cutoff (e.g. 97%). Consensus OTU sequences are given a taxonomic assignment based on comparisons to public sequence databases (e.g. using naïve Bayesian classifier tools such as the RDP classifier; Wang et al. 2007).
QIIME supports some limited downstream visual analyses of data, primarily restricted to higher-level overviews of microbial community patterns, such as Principal Coordinate Analyses and UPGMA clustering used for comparing the similarity of samples in a given dataset. The majority of QIIME analyses produce computationally amenable, standard file formats such as OTU tables, which document the occurrence of OTU sequences across sample sites. QIIME has recently implemented the BIOM format (McDonald et al. 2012) for OTU tables; this new format is built on JSON, a file type that is easy to parse and widely supported across programming languages. This OTU information can then be linked to other types of standard-format text files containing taxonomic assignments (taxonomy mapping files), environmental metadata about sample sites (metadata mapping files), and evolutionary relationships (phylogenetic trees).

There continues to be a persistent, vast gap between data visualization fields and the biological sciences. Despite the “growing appetite for the visual display of information… advances in visualization are not adequately described and shared with the biological community” (Wong 2012). Flexible, robust visualization capabilities such as WebGL and HTML5 are commonly employed in the data visualization community, yet have seen limited application in biological software. This project aims to bridge these two areas, leveraging the critical expertise within each field via an interdisciplinary team of visual artists and software engineers (Pitch Interactive) and practicing biologists (PI Bik and case studies detailed in Appendix 2). Our work will incorporate the existing body of knowledge [5] related to effective software design (human-computer interaction, color theory, visual display of information) as we design a long-term, scalable framework for the visualization of environmental sequence data.
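
To illustrate why a JSON-based OTU table of the kind described above is straightforward to consume from a visualization layer, the following minimal Python sketch reads a sparse BIOM-style table and sums OTU counts per sample at a chosen taxonomic rank. The sample document, the OTU and sample identifiers, and the helper name `counts_by_rank` are hypothetical. The field names (`rows`, `columns`, `data`, `matrix_type`) follow our reading of the BIOM 1.0 layout, reduced to the fields needed here; the sketch is not part of QIIME or of the proposed framework.

```python
import json
from collections import defaultdict

# Hypothetical, heavily simplified BIOM-style document (sparse matrix).
# Each entry in "data" is [row_index, column_index, count].
BIOM_TEXT = """
{
  "matrix_type": "sparse",
  "shape": [3, 2],
  "rows": [
    {"id": "OTU_1", "metadata": {"taxonomy": ["Eukaryota", "Nematoda", "Enoplida"]}},
    {"id": "OTU_2", "metadata": {"taxonomy": ["Eukaryota", "Nematoda", "Monhysterida"]}},
    {"id": "OTU_3", "metadata": {"taxonomy": ["Eukaryota", "Arthropoda", "Harpacticoida"]}}
  ],
  "columns": [
    {"id": "SiteA", "metadata": null},
    {"id": "SiteB", "metadata": null}
  ],
  "data": [[0, 0, 12], [0, 1, 3], [1, 0, 5], [2, 1, 40]]
}
"""


def counts_by_rank(biom_text, rank_index):
    """Sum OTU counts per sample for each taxon at position rank_index of the lineage."""
    table = json.loads(biom_text)
    if table["matrix_type"] != "sparse":
        raise ValueError("this sketch only handles sparse tables")

    totals = defaultdict(lambda: defaultdict(int))
    for row, col, count in table["data"]:
        lineage = table["rows"][row]["metadata"]["taxonomy"]
        # Lineages shorter than the requested rank are grouped as "Unassigned".
        taxon = lineage[rank_index] if rank_index < len(lineage) else "Unassigned"
        sample = table["columns"][col]["id"]
        totals[sample][taxon] += count

    # Convert nested defaultdicts to plain dicts for easy serialization.
    return {sample: dict(taxa) for sample, taxa in totals.items()}


# Index 1 is the second lineage entry (the phylum in this toy example).
summary = counts_by_rank(BIOM_TEXT, rank_index=1)
print(json.dumps(summary, indent=2))
```

Changing `rank_index` switches the summary between coarser and finer taxonomic levels without touching the input file, which is the kind of interaction a browser-based tool could expose as a simple control.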
This project will avoid redundancy across past and present visualization initiatives, minimizing overlap with existing tools. We will specifically take advantage of standardized, computationally efficient file formats for DNA sequence data (OTU tables, metadata/taxonomy mapping files, and phylogenetic trees); our proposed visualization framework will pick up where QIIME functionality ends. This project will explore new, innovative ways to visualize high-throughput sequence data and simultaneously explore the biodiversity present in environmental samples. Potential avenues for these new visualizations include “Dot visualization” (Figure 2) and heatmapped “Treemaps” (Johnson and Shneiderman 1991), although we will explore a much wider breadth of graphical renderings. Since most environmental rRNA studies rely on arbitrarily clustered OTUs with taxonomy derived from annotations in public sequence databases (which are often incorrect or uninformative, e.g. Bik et al. 2012b), visualizing environmental sequences in the context of confidence values (e.g. as obtained through the RDP classifier; Wang et al. 2007) and phylogenetic trees will provide robust and much-needed tools for high-throughput studies. All visualizations will be highly interactive; users will be able to conduct real-time data filtering and updating of visual renderings.

Constructing new visualization tools for high-throughput sequencing data will have significant and far-reaching impacts for biological research. A streamlined visualization workflow and sleek user interface enabling novel explorations of large datasets will encourage researchers to use this application, regardless of their computational skill level. The ease of filtering data, coupled with data provenance tracking and the ability to export publication-quality graphics, will promote effective and efficient research.
In practice, data provenance capture equates to computational systems for tracking user interactions with the software and recording all commands, parameters, and data outputs. This simplifies data management for users; most critically, it encourages best practices and ensures scientific reproducibility for all analyses conducted. We will follow existing models such as VisTrails (Callahan et al. 2006) and IPython Notebook (http://ipython.org/notebook), both of which represent interactive and streamlined systems for data visualization and provenance capture.

Qualifications of project team

In this project, Pitch Interactive (a sought-after name in data visualization) will take the lead role in the design and development of visual frameworks (Figure 1), working closely with biologists (PI Bik and case study participants) to define research questions and prioritize the implementation of specific features. Our project team represents a highly integrative collaboration between research scientists and visual communicators; such tight interdisciplinary collaborations are exceedingly rare in biological research (Wong 2012). Biologists will not build software themselves, but will instead work in close collaboration with a proven and distinguished industry partner (Pitch Interactive). By relying on the expertise of Pitch Interactive, we will avoid common software pitfalls (unscalable tools with difficult user interfaces) while also being able to deeply explore the utility of diverse and novel visual renderings of data.

PI Holly Bik will lead the project from a biological perspective. As demonstrated by her CV, she has striven to develop and maintain a strong interdisciplinary viewpoint during her scientific career. This broad perspective has fostered “big picture” thinking, allowing her to tackle pressing scientific problems from a unique vantage point at the interface of many disciplines.
While PI Bik’s doctoral research centered on deep-sea nematode taxonomy and molecular phylogeny, she transitioned fields during her postdoctoral career and currently focuses on high-throughput environmental sequencing of microbial eukaryotes and computational biology research. In her current role at UC Davis, she works closely with computer scientists and software engineers to inform the development of cutting-edge tools for the primary analysis of large sequence datasets (millions of reads), contributing to the development of the PhyloSift pipeline (software for the phylogenetic analysis of genomes and metagenomes; Darling et al. submitted). She is personally and professionally committed to enabling efficient science, contributing needed tools and cyberresources to the research community, and assisting diverse groups of biologists in leveraging high-throughput sequencing approaches. PI Bik’s interdisciplinary background will be ideal for leading the development of the proposed visualization framework, enabling the project team to evaluate and balance computational considerations with biological research needs. In addition, her professional network (particularly her connection to the Sloan Microbiology of the Built Environment program) and social media presence will enable wide dissemination of all project activities and outputs.

The Pitch Interactive team will take the lead on software development, visual design, and project management. Pitch Interactive’s inventive and pioneering work is attributed to out-of-the-box creative thinking derived from a small, forward-thinking team representing a range of different yet complementary backgrounds. The team consists of Wesley Grubbs (Creative Director, Technical Director, owner), Nicholas Yahnke (Software Engineer) and Mladen Balog (Concept Artist).
Team leader Wesley Grubbs comes from an academic background in international economics and information systems followed by many years in the advertising world; during his career, he has always maintained a keen interest in science that, in part, stems from his upbringing by a geologist. Pitch Interactive’s work has been seen in WIRED magazine, Esquire, Scientific American, Popular Science, Fortune, Princeton Press books, and most recently at the Museum of Modern Art’s “Talk To Me” exhibit in New York City. Their client roster includes Activision, The Big History Project, ESPN, General Electric, General Motors, Google, Oracle, The Russian Avant-garde Foundation, Thermo Fisher Scientific, TomoTherapy and the Wisconsin Institutes for Discovery. Pitch Interactive has created work for interactive installations, touchscreen kiosks, smart phones, tablet devices, console games, websites, standalone applications, museum exhibits, projections, textiles and print. The team has experience in a wide variety of industries including banking, economics, health, sports, scientific research, advertising, politics and art. In addition, the Pitch Interactive team members are active participants in the data visualization community, speaking at conferences worldwide as well as giving lectures and workshops at academic institutions such as UC Berkeley, Stanford and NYU.

Pitch Interactive’s daily operations primarily involve analysis of collected data, consultation in data collection, visualizing data, software development and project management. The studio follows a rigorous project workflow that closely resembles the life cycle of software development. Reflecting the philosophy that data visualization should be strongly connected to the lives and events from which it is derived, Pitch Interactive dissects large data sets in search of meaningful and often hidden patterns that serve to determine the shape and form that best tells a story.
The team aims for visual depictions that not only inform, but also bridge the divide between science and art through a visual narrative in order to inspire, stimulate and engage minds. In working on the proposed visualization framework, Pitch Interactive will offer fresh perspectives and develop groundbreaking new ways to explore the immense scale of biodiversity and the evolutionary processes that have shaped life on Earth.

Project activities and description

The primary goal of this project is to produce a long-term, open source and scalable framework for biological visualizations of environmental sequence data, where the underlying format of the data is structured in a way that is agnostic of any downstream visualization. We will leverage standard QIIME file formats for high-throughput sequencing data as input: these formats will primarily include OTU tables and their associated metadata/taxonomy mapping files, and eventually the visualization framework will be expanded to support phylogenetic trees.

Our project timeline includes two discrete phases of development (Figure 1). The first phase (Tier One) will be the development of the underlying database framework, built specifically to parse standard input files and arrange data in a format that is amenable to downstream visualizations. This framework will be specifically engineered to handle extremely large and complex datasets (see Appendix 1 for further technical specifications), and will be constructed with a long-term vision to ensure that it is generalizable to other well-structured data formats (e.g. building in support for increasingly ubiquitous, language-independent file formats such as JSON and delimiter-separated text files).
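
One way to picture the Tier One step is as a small parser that joins a tab-delimited OTU table with a sample metadata mapping file and emits flat records that any downstream visualization could consume. The Python sketch below is illustrative only: the column names, the `min_count` filter, and the function name `parse_otu_table` are assumptions made for this example and do not describe the framework's actual design (Appendix 1 gives the planned technical specifications).

```python
import csv
import io
import json

# Hypothetical tab-delimited inputs, standing in for an OTU table and a
# sample metadata mapping file.
OTU_TABLE = "OTU_ID\tSiteA\tSiteB\nOTU_1\t12\t0\nOTU_2\t1\t7\n"
SAMPLE_METADATA = "SampleID\tHabitat\tpH\nSiteA\tsediment\t7.9\nSiteB\tseawater\t8.1\n"


def parse_otu_table(otu_text, metadata_text, min_count=1):
    """Join OTU counts with sample metadata and return one flat record per
    (OTU, sample) pair whose count is at least min_count."""
    metadata = {
        row["SampleID"]: row
        for row in csv.DictReader(io.StringIO(metadata_text), delimiter="\t")
    }

    records = []
    for row in csv.DictReader(io.StringIO(otu_text), delimiter="\t"):
        otu_id = row.pop("OTU_ID")
        # The remaining columns are samples; iterate in file order.
        for sample_id, raw_count in row.items():
            count = int(raw_count)
            if count < min_count:
                continue  # user-specified filter: drop rare or absent OTUs
            sample = metadata.get(sample_id, {})
            records.append({
                "otu": otu_id,
                "sample": sample_id,
                "count": count,
                "habitat": sample.get("Habitat"),
                "ph": float(sample["pH"]) if "pH" in sample else None,
            })
    return records


records = parse_otu_table(OTU_TABLE, SAMPLE_METADATA, min_count=2)
# Serializing to JSON gives a language-independent structure for a web front end.
print(json.dumps(records, indent=2))
```

Keeping the output as plain records, rather than shaping it for one particular chart, reflects the stated aim that the data structure stay agnostic of any downstream visualization.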
The second phase (Tier Two) will explore the visual presentation of data, iteratively assessing the most effective visual methods to facilitate exploratory data analysis, address research questions, and enable novel biological discoveries from large sequence datasets.

Users will access all tools (data parsing and visualizations) via a web browser. Data visualizations will have a built-in sharable component (custom links that preserve a specific visual rendering and can be e-mailed to colleagues), and all shared visuals will be accompanied by compatible versions for access on touchscreen devices (e.g. iPads). We will leverage the flexible, scalable capabilities of HTML5 and WebGL (and, where necessary, other data visualization programming languages such as Processing) to promote maximum accessibility; this scenario is preferable to requiring users to download and install a specific application, and enables us to leverage the processing power of the Pitch Interactive servers, if needed, to maximize the speed of visual renderings and minimize computational demands on users’ computers.

Project Schematic

Tier One: Building a data processing tool that converts the raw data of potentially several hundred thousand rows into a workable, formatted dataset to be used in the visualizations. With the use of a User Interface tool, a user can specify filters to help condense and extract only the necessary parts of the raw data into the formatted dataset that will be used in the visualizations.
[Schematic panels: Raw data; User Interface for setting filters for data parsing; Formatted data structured for visualization]

Tier Two: Data visualization components. We start by building 2 to 4 data visualizations. From the framework established with the data parser, additional visualizations can be added over time. Each visualization will have its own set of filters and sliders that the user can adjust, and users can then save or share their findings with other researchers or with the public. Visualizations can be shared or saved as PNG to embed in research papers.

Figure 1: Schematic detailing the software development process for the proposed project

As we work to construct the Tier One framework, we will define a list of scientific questions and priorities for data visualization. All downstream visualizations will fundamentally access this underlying framework in the same manner (via a purpose-built API), but the visual rendering and options for interacting with the data will be directly dependent on the scientific questions being asked (e.g. case study user needs, Appendix 2) and the exploration scale defined by the user (number of data points selected, higher versus lower level taxonomy, e.g. viewing patterns at the Phylum vs.
Genus level). Our priorities for visual tools will primarily focus on 1) maximizing interactivity between users and their datasets, and 2) including capabilities for filtering data and exploring biological patterns at fine-scale resolution. Many existing tools for visualizing high-throughput data are static in nature (e.g. 2D plots and pie charts) and return high-level overviews of biological patterns (taxonomic summaries at the Phylum level, or relationships between samples presented as ordination plots displaying overall microbial community similarity). The proposed fine-scale resolution would allow users to home in on specific taxonomic lineages, investigate patterns of OTU abundance across sample sites, and interact with overview summaries of microbial communities (e.g. expanding pie chart wedges, akin to user interactions in the HTML5-based Krona software; Ondov et al. 2011).

Figure 2: Mockup of dot visualization for high-throughput sequence data. Size = OTU abundance, color = sample site, shape = habitat metadata (temperature, pH). Mockup example taken from http://www.wefeelfine.org/

In the Tier Two project phase, we will undertake an iterative approach towards developing visualizations. One must start with questions, look into the data, ask more questions, and then repeat this process many times in order to find the best perspective and solution to visualize the information. We anticipate that the nature of data visualizations will progress and evolve over time. While it is difficult to predict the ultimate form of visualizations, we will begin by exploring visual presentations similar to existing web-based tools that are designed to incorporate large and complex volumes of data. Some examples include We Feel Fine [6], a real-time presentation of human emotions derived from Twitter, and 100,000 Stars [7], a WebGL astronomy visualization built by the Google Data Arts Team.
Figure 2 presents a mockup of one such prospective biological visualization, based on We Feel Fine. In this example, DNA data normally presented as basic text files (OTU tables and metadata/taxonomy mapping files) are instead explored in a visual context, where different shapes, sizes and colors represent distinct data attributes.

During software development, we will emphasize three foci to maximize software accessibility and data reproducibility for end users. First, we will take advantage of gestural interfaces for data filtration and exploration of visual renderings (e.g. “slider bars” as depicted in the center panel of the Figure 1 Tier One workflow). Initially, gestural interfaces will be primarily accessible through a web browser (manipulation via trackpad or mouse on desktop computers, and touchscreen interaction for shared visuals accessed on mobile device web browsers), but a long-term goal is the development of a dedicated mobile app to further leverage touchscreen interaction capabilities (since existing biological software has not yet taken advantage of touchscreen capabilities; Page 2012). As a second focus, we will track and record data provenance for all user interactions within the software. This feature will automatically generate outputs providing details of data filtering and manipulation (allowing researchers to access filtered data files and in-depth records of computational processing if needed), without ever requiring users to interact with software at the level of the underlying code. Finally, the overall framework will be specifically designed to promote and facilitate novel scientific discoveries. The fundamental rationale for the proposed visualization framework is to increase the pace of scientific discovery.
Instead of forcing biologists to struggle with Perl scripts and Unix commands, we aim to provide researchers with a powerful and easy-to-use framework for exploring hypotheses and generating/testing new scientific questions on the fly.

All software and research products will be maintained as open-source and open access, allowing us to build a strong community of users and developers. Building up a community around visualization tools will promote constructive conversations across user groups (e.g. amongst our project team and target audiences for different tools), encourage developers to leverage our database framework for new software that is outside the scope of this proposal, and lead to broad dissemination of scientific products (software and publications).

Management and Staffing plan

Pitch Interactive will drive software development of the Tier One and Two frameworks (programming and database construction outlined in Figure 1). PI Bik will drive the biological side of software development (prioritizing research questions to be addressed and features to be implemented), liaising with Pitch Interactive to explain data structures and give feedback on database frameworks and user interface design. PI Bik will additionally coordinate all interactions with case study participants. PI Bik and case study participants will test software products and report back to the Pitch Interactive team with feedback on user interface design, tool functionality, and visual presentation of data. Both Pitch Interactive and PI Bik will contribute to the construction and upkeep of the project portal website, and disseminate project updates and products via social media and blog updates.

At the outset of this project, the Pitch Interactive Team and PI Bik will meet in person to prioritize project goals, discuss specific user groups and audience needs, and review the anticipated “test cases” (Appendix 2) to be used as models towards defined project goals.
This in-person meeting will be repeated every month; each meeting will begin with an assessment of the previous month’s goals and then define new targets based on successful paths. Daily and weekly correspondence between Pitch Interactive and UC Davis will be conducted via e-mail and Skype.

For all projects, Pitch Interactive follows a Systems Development Life Cycle (SDLC) process⁸ for building digital projects and interactive data visualizations. This process involves significant upfront project planning and requirements definition that helps keep large projects manageable and maintainable. During the planning phase, key variables are defined, such as project milestones, technologies, programming methodologies, documentation requirements, and other task breakdowns in order to organize and manage the project and its deliverables. For everyday project management tasks such as maintaining project-related correspondence, discussions about features, to-do lists, sharing files, group coordination, and task accountability, this project will use Basecamp⁹, a proven project management tool used by Pitch Interactive for several years on dozens of projects.

Figure 3: Project milestones and timeline

Broader impacts activities (plans for dissemination and sustainability)

Project activities and software tools will primarily be disseminated via a dedicated portal website, while also leveraging social media tools (Twitter, Google+) and in-person presentations at workshops and scientific conferences to announce software updates and research products (peer-reviewed manuscripts).
The project website will be built using WordPress software, a flexible and customizable platform that will be used to host a project blog and software tutorials, pull in discussions from Twitter, link to our open source code repositories (GitHub, http://github.com, the repository we plan to use for managing software development and version control), and most importantly direct users to the visualization frameworks outlined in Figure 1. We also anticipate harnessing PressForward¹⁰ (a Sloan-funded project), a WordPress plugin that will allow us to amalgamate relevant content about scientific visualization on the project website.

Both Pitch Interactive and PI Bik (a postdoc in the lab of Jonathan Eisen at UC Davis) maintain a significant online presence (including Twitter, blogs, and professional websites), and are well suited to take a lead role in promoting broad dissemination of visualization software developed during this project. Pitch Interactive maintains strong connections across major media sources (including scientific magazines, news outlets, and technology magazines), and website traffic shows major peaks when high-profile visualizations are released (400,000 views in one week for a recent project). The Eisen lab is a leading voice in the movement towards open access science and social media-based online outreach; lab members consistently produce blog content for both scientific and general audiences. Lab head J. Eisen is active on his popular Tree-of-Life blog¹¹ (2000+ subscribers and 20,000-50,000 site visits per month) and also has a high-profile Twitter microblog (@phylogenomics) with >15,000 followers. Project PI Holly Bik contributes to the leading marine science blog Deep Sea News¹² (100,000-300,000 site visits per month) and maintains a Twitter account (@Dr_Bik) with >2400 followers.
Pitch Interactive and the Eisen Lab at UC Davis maintain a distinct base of followers in the technology and biological research sectors, respectively, and additionally maintain a strong network of media/journalism contacts; disseminating project activities via non-overlapping online channels will encourage engagement and participation from diverse audiences.

Harnessing social media tools during this project will be critical for promoting open science and disseminating all project outputs. Because Pitch Interactive is well known and well respected in the field of data visualization (Wesley Grubbs maintains a strong network across the technology and design sectors), their status will additionally allow us to solicit interest for continued support of visualization frameworks well beyond the initial project development phase. We anticipate that our mature software products, supported by a diverse user community and disseminated via a strong web presence (social media and coverage in science blogs/print journalism), will allow us to secure a financial partner (private sponsor, federal research funding or similar) and ensure long-term sustainability for the project.

Project outputs

This project aims to produce a variety of discrete outputs, including software products, peer-reviewed manuscripts, public outreach components, and links to the Sloan Microbiology of the Built Environment (MBE) Program. The primary output will be a two-tier, web-based interface for visualizing high-throughput sequencing data (Figure 1, as described previously). Tier One will be a scalable database framework; users will upload their standard-format data files within a web browser and subsequently adjust settings and select criteria for data filtering.
Once these criteria are submitted, users will be seamlessly transferred to the Tier Two visualization framework; the software will render 2-4 discrete visualizations, from which users will select a visual presentation to expand and explore more deeply.

In addition to software frameworks, this project will produce a parallel suite of research products. PI Bik and case study participants will evaluate visualization tools using their own environmental sequence datasets, generated from a diversity of habitats (marine ecosystems to the Built Environment). Since the goal of visualization tools is to provide deeper insights into biological patterns, we anticipate that users will discover surprising patterns while exploring their own data (interesting taxonomic patterns, subtle differences in microbial communities across different types of sample sites). Thus, another priority within this project will be to translate these biological findings into peer-reviewed scientific manuscripts; we will assist and encourage case study participants in this process as needed. Once visualization tools are mature, we also intend to publish our framework as a software note in a relevant open-access journal (e.g. BMC Bioinformatics); this will enable researchers to reference an appropriate citation in any future mention of visualization tool use.

For measuring “broader impact” outputs, mechanisms for tracking/analyzing web-based resources will be spearheaded by both PI Bik and Pitch Interactive. Given the pace of technology, the type and nature of web tracking data is likely to evolve over the course of the project, but at minimum we will collect basic statistics related to webpage views (e.g. using tools such as StatCounter¹³ and Google Analytics, which record page views and their geographic origin, referring links and relevant Google search terms) and wider dissemination of website/blog content (Tweets, reposts, formal coverage in science journalism).
In addition, we plan to utilize ImpactStory¹⁴ (a Sloan-funded project) to further track social media sharing and altmetrics related to our GitHub software repository; eventually, we anticipate that software users will also be able to track metrics for their shared visualizations via ImpactStory. All collected data will be amalgamated and housed on shared cloud-based servers (e.g. Dropbox) and locally backed up at UC Davis. As our project evolves, we will use web-tracking data to identify the most effective methods for content dissemination and target specific channels to ensure the broadest reach.

Finally, this project will collate survey data and feedback from case study participants, including requests for new features and support for other types of standard file formats (see Appendix 2). Interacting with researchers in the Sloan MBE program represents an integral component of this approach. These data will be collected with a long-term outlook and assumption of continued future development and expansion of the visualization frameworks outlined in Figure 1. All feature requests and bugs will be collated and monitored in Trello¹⁵, a dedicated issue tracking system that Pitch Interactive has used extensively for several years.

Appendix 1 – Technical Specifications

To undertake this project, several technical aspects must be considered. This document addresses the technical specifications and needs.

1. User Data (Initial Inputs)

The input data for this system will be static JSON files (OTU tables) and tab-delimited .csv files (taxonomy/metadata mapping files) generated by researchers using the QIIME pipeline (http://qiime.org) either locally or on Amazon EC2 cloud servers. Using pre-generated data files is a cost-effective solution that eliminates calls to an external server and allows the visualization system to run on- or off-line.
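To make the two input types concrete, the sketch below reads a small OTU table and a mapping file and condenses them into one compact JSON structure. It is an illustration under stated assumptions, not the planned implementation: the OTU table is assumed to follow a BIOM-style sparse layout (`rows`, `columns`, and `data` triples), the sample identifier column is assumed to be named `SampleID`, and all function names and example values are hypothetical. Python is used here only for brevity; the proposal envisions parsing in the browser.

```python
import csv
import io
import json


def load_otu_table(json_text):
    """Read a BIOM-style sparse OTU table into {otu_id: {sample_id: count}}."""
    table = json.loads(json_text)
    otu_ids = [row["id"] for row in table["rows"]]
    sample_ids = [col["id"] for col in table["columns"]]
    counts = {otu: {} for otu in otu_ids}
    # Assumption: sparse entries are [row_index, column_index, count] triples.
    for row_idx, col_idx, value in table["data"]:
        counts[otu_ids[row_idx]][sample_ids[col_idx]] = value
    return counts


def load_mapping(tsv_text, id_field="SampleID"):
    """Read a tab-delimited mapping file into {sample_id: {field: value}}."""
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    return {row[id_field]: row for row in reader}


def condense(counts, mapping, min_total=1):
    """Drop OTUs with fewer than min_total reads overall and bundle the metadata."""
    kept = {
        otu: per_sample
        for otu, per_sample in counts.items()
        if sum(per_sample.values()) >= min_total
    }
    return {"samples": mapping, "otus": kept}


# Made-up example inputs for illustration only.
EXAMPLE_OTU_JSON = json.dumps({
    "rows": [{"id": "OTU_1"}, {"id": "OTU_2"}],
    "columns": [{"id": "SiteA"}, {"id": "SiteB"}],
    "data": [[0, 0, 12], [0, 1, 3], [1, 1, 1]],
})
EXAMPLE_MAPPING_TSV = (
    "SampleID\tHabitat\tpH\n"
    "SiteA\tmarine\t8.1\n"
    "SiteB\tindoor\t7.0\n"
)

counts = load_otu_table(EXAMPLE_OTU_JSON)
mapping = load_mapping(EXAMPLE_MAPPING_TSV)
compact = condense(counts, mapping, min_total=2)
compact_json = json.dumps(compact, separators=(",", ":"))
```

Because everything is read from local text, a workflow of this shape needs no server round-trip, which is the property that allows on- or off-line use.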
The JSON and .csv files will follow specific formats that will be read by the data parsing system so that the data can be updated as frequently as necessary.

2. Data Parsing

For the Tier One framework (Figure 1), raw data files will be parsed in the client browser. JavaScript libraries, such as D3.js, support methods for parsing even very large data files. Once the data has been parsed, a new, smaller and optimized JSON-formatted data file will be created that will be used to generate the data visualizations more quickly and efficiently.

3. Data Storage

There are two options for approaching data storage. We may select both options or only one, depending on the needs of the researchers, in order to balance privacy concerns against the ease of sharing an analysis.

Option 1: The parsed, optimized JSON file will be saved to the user’s local hard drive. In order to share their analysis, the researcher must share the local file with fellow researchers. This can be good for the purpose of working offline and would address any privacy issues a researcher may have if they are uncomfortable with their files being uploaded to a server.

Option 2: The parsed, optimized JSON file will be uploaded to a server. This would be optimal for sharing with multiple researchers and will address the need to keep files centralized. However, any researcher would be required to work online to view the data file.

4. Data Visualization using Web-based Technologies

The primary target medium for the data visualization components will be modern web browsers that support WebKit and WebGL for advanced graphics rendering (e.g. Google Chrome). However, while building for web browsers, there will be some functionality engineered specifically for the iPad Safari browser and Android tablet device browsers.
The tablet browsers may not currently have the capacity to render high-level graphics the same way that a WebKit browser like Chrome can, but based on the advancing trends of these devices, it would be naïve to neglect these technologies for future purposes.

Appendix 2 - User Case Studies

In this project we will maintain close interactions with three primary case study groups. Case study participants are integral to the development of effective visualization tools, as they represent the target end-users of software products resulting from this project. Case study groups encompass a wide breadth of research expertise (biologists to computer scientists), and will serve two primary purposes:

1. Test software functionality and user interface design. Are tools easy and intuitive to use? Is there sufficient documentation to solve problems if users get stuck? Are users able to upload their standard data files as anticipated?
2. Give feedback on visual interfaces for data exploration (Tier Two visualizations). Are these visualizations appropriate given the nature of the data? Can researchers ask questions and explore hypotheses in an efficient and intuitive way? Are there any aspects of the data or metadata that users would like to explore, but cannot currently visualize?

Case Study Group 1: Eisen Lab members

The most immediate case study participants will be members of Jonathan Eisen’s lab at UC Davis (where PI Bik is a postdoc). Eisen lab members work with a variety of high-throughput sequencing data types, including shotgun metagenomes, 16S/18S rRNA amplicon data, and isolate microbial genomes. Graduate students and postdocs will be asked to test prototype visualization frameworks using their own computers and datasets. The close proximity of Eisen lab participants will allow our project team to solicit consistent and informal feedback on a daily/weekly basis.
Although the activities in this proposal focus on the visualization of rRNA amplicon data, our long-term vision would be to eventually expand this software to support other standard data types (e.g. from shotgun metagenomic analyses). Discussions with Eisen Lab members will enable us to gather ideas and visualization requests for planning future software development.

Case Study Group 2: Sloan Grantees within the Built Environment Program

This project will establish strong links to the Sloan Microbiology of the Built Environment (MBE) program, using researchers funded through this initiative as case study participants. PI Bik is already heavily involved with MBE initiatives; she contributes to the microBEnet project (http://www.microbe.net, a portal website established to catalyze interest in the MBE program and connect researchers working in diverse disciplines), and regularly interacts with MBE researchers at Sloan meetings (QIIME/VAMPS meeting in Boulder in October 2012, Annual Sloan MBE meetings, “Evolution of the Indoor Biome” meeting at NESCent in June 2013). These links to the Sloan MBE program will provide important test cases for visualization tools, enabling researchers working on diverse projects to test out prototypes and deliver feedback. Since many Sloan MBE projects are looking to conduct fine-scale analyses on microbial communities in the Built Environment, we believe that the proposed data visualization tools could address some of their main frustrations with existing computational tools; we aim to facilitate more rapid analysis and publication of data collected as part of MBE projects.

We specifically propose to work closely with the Sloan-funded Wild Life of Our Homes (WLOH) project (Rob Dunn and Holly Menninger at North Carolina State University). Researchers Dunn and Menninger have expressed a keen interest in trialing visualization tools using amplicon data that has already been generated as part of this project. The WLOH project is
looking to assess subtle patterns in bacterial and fungal communities across residential dwellings in their sample set, but currently has difficulty finding feasible and informative ways to compare microbial communities. Thus, the WLOH project has ideas for specific feature requests that will also represent user needs for other Built Environment projects, including:

1) Filtering data to look at outputs from single samples or small subsets of samples,
2) Visual presentation of taxonomic information that allows users to infer knowledge about an organism’s biology and ecology,
3) Real-time modification of OTU clustering parameters for a subset of taxa, for example when biological information suggests that a 97% pairwise identity clustering cutoff may be lumping several bacterial strains together, and
4) Display of geographic data that will allow novel spatial inferences of microbial species.

Case Study Group 3: Projects within the NSF-funded program “Advancing and Visualizing the Tree of Life”

The original basis for the visualization tools outlined in this proposal came from PI Bik’s participation in the National Science Foundation’s Ideas Lab for Analyzing and Visualizing the Tree of Life (AVAToL) held in August 2011 in Lake Placid, NY (Collins et al. 2013). Although visualization tools were not ultimately supported by NSF funding (despite several subsequent proposals submitted to the NSF Tree of Life program by PI Bik), new ways of visualizing data were recognized as a critical need for advancing our understanding of global biodiversity. PI Bik maintains close contact with PIs leading projects funded under the NSF AVAToL initiative (http://avatol.org), and will liaise with these researchers to disseminate visualization software products and solicit feedback. A long-term goal will be to facilitate cooperation and integration between this visualization project and NSF AVAToL projects (none of which are primarily focusing on exploratory visual tools), as follows:
1. Open Tree of Life (led by PI Karen Cranston at NESCent) – This project will provide the first reference guide tree (with critical metadata) for the entire Tree of Life. We propose to leverage the open-source Open Tree API to pull down relevant metadata (phylogenies, taxonomic annotations, and environmental metadata) to supplement and populate visualization tools described in this proposal.
2. Arbor (led by PI Luke Harmon at the University of Idaho) – This project will leverage phylogenetic trait data and species distribution data to design a software platform that allows the investigation of a) Evolutionary processes of spatial diversification, b) Evolution of symbiotic communities, and c) Evolution of complex interactions. We will work with the Arbor team to encourage the development of APIs that allow the two platforms to interact and access each other’s data.
3. Next Generation Phenomics (led by PI Maureen O’Leary at Stony Brook University) – This project aims to digitize morphological, taxonomic and paleontological data using machine learning, computer vision and crowd sourcing approaches. PI Bik is a “data provider” for the Phenomics project and will work with this project team to develop computationally-friendly metadata formats that are accessible for the visualization framework outlined in this proposal. In the long term, we specifically aim to pull in species image data and phenomic matrix data to populate interactive visualizations.

Collection of survey and assessment data for all case studies: User surveys and evaluation will be driven by Pitch Interactive, who have substantial experience in determining target audiences and assessing user needs for software tools (e.g. via questionnaires at SurveyMonkey, http://www.surveymonkey.com, or based within GoogleDocs). These data will be incorporated into project management software and used to specifically define targeted outcomes.

Appendix 3 References Cited

Akey, J.M. and Shriver, M.D.
(2011) A grand challenge in evolutionary and population genetics: new paradigms for exploring the past and charting the future in the post-genomic era. Frontiers in Genetics, 2:1-2.
Bik, H.M., Porazinska, D., Caporaso, J.G., Knight, R., Thomas, W.K. (2012a) Sequencing our way towards understanding global eukaryotic biodiversity. Trends in Ecology and Evolution, 27(4):233-243.
Bik HM, Halanych KM, Sharma J, Thomas WK. (2012b) Dramatic Shifts in Benthic Microbial Eukaryote Communities following the Deepwater Horizon Oil Spill. PLoS ONE, 7(6):e38550.
Callahan SP, Freire J, Santos E, Scheidegger CE, Silva CT, Vo HT. (2006) VisTrails: Visualization meets Data Management. Proceedings of ACM SIGMOD, June 27-29, 2006, ACM Press, New York, NY. pp. 745–747.
Collins T, Kearney M, Maddison D. (2013) The Ideas Lab Concept, Assembling the Tree of Life, and AVAToL. PLOS Currents Tree of Life. Published online March 7, 2013.
Caporaso JG, Kuczynski J, Stombaugh J, Bittinger K, Bushman FD, Costello EK, et al. (2010) QIIME allows analysis of high-throughput community sequencing data. Nature Methods, 7(5):335–336.
Creer S, et al. (2010) Ultrasequencing of the meiofaunal biosphere: practice, pitfalls and promises. Molecular Ecology, 19(s1):4–20.
Darling, A., Jospin, G., Matsen, F., Lowe, E., Bik, H.M., & Eisen, J.A. (Submitted) PhyloSift: A pipeline for phylogenetic taxonomy assignments from environmental metagenome data.
Heer, J., Bostock, M., & Ogievetsky, V. (2010) A tour through the visualization zoo. Association for Computing Machinery. http://queue.acm.org/detail.cfm?id=1805128
Heer, J. & Shneiderman, B. (2012) Interactive dynamics for visual analysis. Association for Computing Machinery. http://queue.acm.org/detail.cfm?id=2146416
Johnson B, Shneiderman B. (1991) Tree-maps: a space-filling approach to the visualization of hierarchical information structures. Visualization 1991, Proceedings, IEEE Conference on 1991, 284-291.
Matsen, F.A. et al.
(2010) pplacer: linear time maximum-likelihood and Bayesian phylogenetic placement of sequences onto a fixed reference tree. BMC Bioinformatics, 11:538.
McDonald D, Clemente JC, Kuczynski J, Rideout J, Stombaugh J, Wendel D, et al. (2012) The Biological Observation Matrix (BIOM) format or: how I learned to stop worrying and love the ome-ome. GigaScience, 1(1):7.
Munzner, T., et al. (2006) NIH-NSF visualization research challenges report summary. IEEE Computer Graphics and Applications, 26(2):20–24.
Nekrutenko A, Taylor J. (2012) Next-generation sequencing data interpretation: enhancing reproducibility and accessibility. Nature Reviews Genetics, 13(9):667–72.
Ondov BD, Bergman NH, Phillippy AM. (2011) Interactive metagenomic visualization in a Web browser. BMC Bioinformatics, 12(1):385.
Page RDM. (2012) Space, time, form: viewing the Tree of Life. Trends in Ecology & Evolution, 27(2):113–20.
Pavlopoulos GA, Soldatos TG, Barbosa-Silva A, Schneider R. (2010) A reference guide for tree analysis and visualization. BioData Mining, 3(1):1.
Shoresh N, Wong B. (2012) Points of view: Data exploration. Nature Methods, 9(1):5.
Sogin ML, et al. (2006) Microbial diversity in the deep sea and the underexplored "rare biosphere". Proc Natl Acad Sci USA, 103(32):12115–12120.
Wang Q, Garrity GM, Tiedje JM, Cole JR. (2007) Naive Bayesian classifier for rapid assignment of rRNA sequences into the new bacterial taxonomy. Applied and Environmental Microbiology, 73(16):5261–7.
Wong B. (2012) Points of view: Visualizing biological data. Nature Methods, 9(12):1131.
Footnotes (websites referenced in proposal text)

1 Data estimates and an overview of how "Big Data" is impacting diverse fields can be found at the NEON website http://www.neoninc.org/news/big-data-part-i
2 Twitter statistics were obtained from a Washington Post article published on 3/21/13: http://articles.washingtonpost.com/2013-03-21/business/37889387_1_tweets-jack-dorsey-twitter
3 An extensive list of visualization software for phylogenetic trees can be found at http://evolution.genetics.washington.edu/phylip/software.html
4 Examples of all microbial community visualizations listed in the proposal text can be found at http://qiime.org/tutorials/tutorial.html
5 An extensive list of publications related to software design and human-computer interactions can be found at http://www.infovis-wiki.net/index.php?title=Category:Publications
6 We Feel Fine Twitter visualization can be viewed at http://www.wefeelfine.org/
7 100,000 Stars visualization can be viewed at http://workshop.chromeexperiments.com/stars/
8 Full details of the Systems Development Life Cycle (SDLC) process can be found at http://en.wikipedia.org/wiki/Systems_development_life-cycle
9 Further details about the Basecamp software can be found at http://www.basecamp.com/
10 Further details about the PressForward plugin can be found at http://pressforward.org/
11 Jonathan Eisen’s Tree-of-Life blog can be accessed at http://phylogenomics.blogspot.com/
12 The Deep Sea News marine blog can be accessed at http://deepseanews.com/
13 Further details about the StatCounter web tracking tool can be found at http://statcounter.com
14 Further details about ImpactStory can be found at http://impactstory.it
15 Further details about Trello can be found at http://www.trello.com/

UNIVERSITY OF CALIFORNIA, DAVIS
BERKELEY ● DAVIS ● IRVINE ● LOS ANGELES ● MERCED ● RIVERSIDE ● SAN DIEGO ● SAN FRANCISCO ● SANTA BARBARA ● SANTA CRUZ
UC DAVIS GENOME CENTER
ONE SHIELDS AVENUE
DAVIS, CALIFORNIA 95616

May 8, 2013

Dear Josh,

We
would like to thank the external reviewers and Sloan staff members for their extensive and insightful comments on our proposal, “A Research-Driven Data Visualization Framework for High-Throughput Environmental Sequence Data.” Our responses to reviewer comments are as follows:

Response to Reviewers 1-3: Rationale for choosing a small-team, focused approach at the project outset

We appreciate reviewer concerns about engaging a larger community in this project, both in terms of software development and user feedback/testing. In the long term, both Pitch Interactive and I are committed to ensuring broad reach of visualization tools, and soliciting input from diverse users. However, we believe that including a larger community during the early development of this project would hinder project progress and endanger the chance of success. Based on their experience, Pitch Interactive believes that when a project tries to get as much feedback as possible at the outset, the decision-making process weakens and accountability suffers. We have therefore made a conscious choice to keep the project very focused from the start, and plan to initially work with a small, agile team of scientists (myself and case study participants) to help us address key issues. Once we build a stable prototype, the greater community can add to this framework and develop it further. Building a community of developers (and engaging with developers working on other existing software packages) will be a long-term, multiyear process; both Pitch Interactive and I believe that at our current stage, it is too early to discuss sustainability through the creation of a developer community. We envision this community building as “Stage 2” (contingent on continued funding), which would occur after the release of our final prototype at the end of project year 1.

Our focused, small-team approach is a successful model that has been previously applied in data visualization fields.
For example, Processing (a popular programming language for visualization) was initially developed by Ben Fry and Casey Reas in the MIT Media Lab. Following its conception by this small team in 2001, Processing has expanded to become a very rich toolset full of extensions and libraries created by the community, and maintains a strong following of developers.

Another critical aspect of this proposal is that software development and project management will be led by Pitch Interactive, an “outside” organization that is separate from the academic research community. Pitch Interactive will effectively provide a unique and detached view of the project; we will have ample opportunity to look at data and approach problems from new perspectives. Wesley Grubbs (CEO of Pitch Interactive) and I have had extensive discussions about the challenges of scientific visualization and the potential for new tools to transform the process of scientific research. We are excited about the opportunity to collaborate. Even though Pitch Interactive is a private organization, we both view this project as an intensely creative endeavor that is driven by mutual scientific passion and curiosity, rather than simply a contract service carried out by a “for-profit” company (a point raised by several reviewers).

Response to comments regarding technical specifications

Regarding WebGL, a majority of web browsers do currently support this technology (see http://en.wikipedia.org/wiki/Usage_share_of_web_browsers). As of April 2013, more than 70% of browsers support WebGL. Internet Explorer is the only remaining popular browser that does not support WebGL; however, the next version (IE11) will support it. The point of this project is to be forward thinking and plan for public usage around summer 2014. It is worth noting that the widely used WebGL 1.0 specification was released 2 years ago, in March 2011 (http://en.wikipedia.org/wiki/WebGL).
At this point, we do not believe the accessibility of WebGL is a major concern, since we anticipate that our users will all have access to WebGL browsers. Similarly, the discussions of “offline support” (Reviewer 1) are highly pertinent here, since many scientists (including myself) carry data files around on their laptops and have easy access to WebGL browsers. Here we meant to emphasize that the use of visualization tools will not necessarily require an internet connection (even though the toolkit will be accessed in a web browser).

Although Reviewer 1 asked for specific details about software libraries (e.g. D3), this will be determined during the course of the project as Pitch Interactive gains a better understanding of scientific data structures. We will likely use Three.js, D3, and jQuery, but beyond that, it will depend on the requirements gathering that we do in the planning phase. D3 is more of a visualization toolset, and we would not likely use it for data parsing. Three.js is used for WebGL and 3D rendering, while Two.js and jQuery are for general web support, such as form handling, animations, etc.

Web services to interact with online data (e.g. housed on Dropbox or GenBank) will not be an initial focus of this project. We had discussed this idea with Pitch Interactive while writing the proposal, and decided that accommodating such features would require a non-trivial amount of programming and auxiliary software development. This proposal aims to produce a prototype visualization toolkit in a short period of time, and thus to simplify development it makes the most sense to initially leverage local data files on a user’s computer. In the long term, however, we do hope to implement tools that will allow users to access online data and public databases.

Similarly, the integration of statistical analysis and visualization (e.g. support for data analysis using R) is another aspect that Pitch Interactive and I have discussed extensively.
We believe this is too lofty a goal to be addressed during the initial prototyping phase, and would be better tackled in a subsequent project phase (e.g. as a “wish list” feature that could be advertised to developers during the community-building phase).

Image export will offer options for high-resolution graphics (e.g. vector formats such as SVG, which are needed for journal submissions) in addition to other image outputs such as PNG. Server-based storage has been fully budgeted for within Pitch Interactive’s subcontract expenses.

Data collection, research questions, and use cases of visualization tools

Both reviewers and staff members enquired about the specific scientific questions that visualization tools would be designed to answer. These questions will be narrowed down during the initial phase of the project, when we discuss user needs with case study participants. However, in Case Study 2 (proposal page 30) we have included four examples of features that are not currently easy to implement in existing data analysis tools. These feature requests can be thought of as the scientific questions driving data analysis. For example, we could ask: 1) Can we obtain novel information about the spatial distribution of species by concurrently visualizing geographic data? or 2) If we can easily filter data according to taxonomy, can we discover hidden ecological patterns across sample sites? Such questions will be generally applicable to high-throughput environmental data regardless of study system, since they are common questions asked by scientists when designing a study. The scientific questions that we prioritize will represent the common themes that emerge when we talk to different case study participants. We anticipate that many researchers will want to conduct similar types of data filtering and visualization, so the needs of case study participants will extend to the greater community.
As described in the proposal, early-stage discussions with other community members (case study participants) will begin immediately at the outset of the project; we will collect research needs, identify use cases, and prioritize research questions during this process. We propose this restricted approach, as opposed to the extensive interactions with developers suggested by Reviewer 3, since we are convinced that it is in the best interests of the project (see the justification above for our small-team approach). The timeline presented in Figure 3 is intentionally simplified; our user testing will happen throughout the implementation, as we liaise with case study participants on a regular basis from the outset of the project.

Visualization tools will be designed for downstream analysis: we will pick up where QIIME functionality ends (e.g. after users have filtered and processed their data via standard workflows such as those described in the QIIME overview tutorial: http://qiime.org/tutorials/tutorial.html). As Reviewer 3 stressed, we will not begin with raw data, since there are many robust tools that already exist for this purpose (QIIME, MG-RAST, mothur, etc.), and users will be expected to have carried out these workflows beforehand.

Software products and innovations that will result from this project

Several reviewers mentioned the lack of specific details regarding user interfaces, software prototypes and the nature of scientific visualizations that will result from this proposal. We specifically avoided laying out a detailed schematic for the final toolkit (apart from some suggestions of existing tools and one mockup image). Our overarching goal is to build a visualization framework that improves research efficiency and enables rapid peer-to-peer communication of ideas and findings.
Thus, while mockups and specific details of visualizations may look pretty in a grant proposal, they most likely will not represent the most effective paths towards achieving these project goals. We emphasize here that the direction we will take in software development cannot be determined until we start working with the data. This is the normal development process employed by Pitch Interactive; Wesley Grubbs and I both believe that Pitch’s track record has proven this path to be highly successful.

Response to Staff Comments:

Support from lab head Jonathan Eisen – A letter of support has now been included from lab head Jonathan Eisen; as my current postdoctoral mentor, Jonathan has been kept fully informed of this Sloan proposal, and has enthusiastically supported the development of all visualization tools outlined.

Postdoctoral salary support – The budget requests salary support over the full course of the 1-year project; with a small team and short project timeline, this salary support will allow me to devote my full attention to the development of visualization tools (whereas partial funding would require me to split my time with other commitments). This scenario will ensure the successful development of software tools and the production of scientific outputs (e.g. research papers, see below) that will facilitate my pursuit of an academic career. Initially, my efforts will focus on contacting case study participants (phone conversations, emails, user surveys), organizing their thoughts and requests, and conveying this information to Pitch Interactive. I will also be required to process my own existing high-throughput datasets (rRNA amplicon and metagenomic data) and deliver and explain these data to the Pitch Interactive team. Developing and updating the project website will be another significant time commitment.
As the project matures, I will also be responsible for testing user interfaces and visualization tools, and publicizing the software across the scientific community (blog posts, conference presentations, etc.). The wide variety of tasks to be completed, coupled with the iterative nature of software development, will require my full attention throughout the year-long project timeline.

Academic career plans – I am dedicated to obtaining a tenure-track faculty position at a major research university, and I have carefully considered how this project fits in with my long-term career goals. Publications are thus a primary consideration, and I plan on leading the development of several manuscripts during this project (at minimum, a software paper describing the visualization framework, and 1-2 manuscripts where visualization tools are used to examine my existing datasets and build a deeper view of microbial eukaryote communities in understudied environments such as deep-sea abyssal plains). Project tools will increase my own efficiency in conducting research and lead to traditional scientific outputs. Spearheading this project will also allow me to gain valuable career expertise in preparation for a future faculty role, in the form of grant management and project management. Scientific visualization represents one of my long-term research interests; by leading this project I will continue to develop a unique niche as an independent scientist (distinct from the research interests of my postdoctoral mentors), providing further preparation for a future transition to an Assistant Professor position.

Demand and challenges for visualization software – With regard to staff member D’s comments, we believe we have addressed many of these questions in the “Major Related Work in the Field” proposal section (pages 3-7), outlining the issues with proprietary software and the non-scalable nature of much existing biological software.
In my view, the private sector does not have a deep understanding of the daily challenges that biologists face in data analysis (struggling with data formats and unwieldy software tools is common in the Eisen lab); many proprietary tools are designed for clinical/medical applications where the user base is larger and there is more potential to profit from the sale of software. Useful software is not necessarily difficult to produce (scalable, forward-thinking technology exists, such as WebGL); rather, it is difficult to find funding to embark on close collaborations between users who understand the data (biologists) and developers who are knowledgeable about the most cutting-edge technology and capable of building well-engineered software (studios such as Pitch Interactive). We will overcome this particular challenge given the nature of our project team. We will also design our visualization framework with a “horizontal-looking” view: by consciously gaining an awareness of similar data types in other fields and disciplines (JSON files, etc.) that would be amenable to analysis within our framework, we hope eventually to allow users to input many different types of “big data” and adopt the software for their specific research needs. Encouraging such diverse use would be an obvious goal of any future “Stage 2” community-building effort.

Final comments:

Finally, we appreciate the suggestions for tracking the impact and adoption of the visualization tools that result from this project. We believe this is a very important aspect to consider when developing such a novel biological tool, and we are committed to tracking both traditional metrics (peer-reviewed journal articles, citations, etc.) and altmetrics (social media tracking, contributions to code repositories, website and tool usage statistics, etc.).
If funded, it will be critical to discuss and define a suite of metrics with the Sloan Foundation at the outset of the project, for periodic assessment and evaluation of success. As stated in our proposal, we aim to produce an innovative and functional software toolkit at the end of year 1 that would be immediately useful for the scientific community (even in the absence of continued funding for this project). We believe that we currently have all the resources in place to accomplish this goal, and look forward to getting started. We thank you again for the comments, and look forward to your response.

Regards,
Holly Bik
Postdoctoral Researcher, UC Davis