ALFRED P. SLOAN FOUNDATION PROPOSAL COVER SHEET

*Note: This cover sheet is not to exceed one page.*

www.sloan.org | proposal guidelines

Project Information
Principal Investigator
Holly Bik, Postdoctoral Researcher
UC Davis Genome Center, One Shields Ave,
Davis, CA 95616
Phone: 530-752-8409
Email: [email protected]
Grantee Organization: University of California, Davis
Amount Requested: $247,189
Requested Start Date: July 1, 2013
Requested End Date: June 30, 2014

Project Goal
The overarching goal of this project is to produce a new web-based scientific visualization framework for
the analysis of high-throughput biological sequence data (initially focusing on rRNA amplicon data).
Objectives
Leveraging a close collaboration between a scientific visualization studio (Pitch Interactive) and researchers
(PI Bik and case study participants), we aim to produce intuitive, interactive visualization tools that can be
used to explore and analyze biological patterns in high-throughput environmental datasets. We will take
advantage of standard file formats from computational pipelines in order to bridge the gap between
biological software (e.g. QIIME) and existing data visualization capabilities (harnessing the flexibility and
scalability of WebGL and HTML5).
Proposed Activities
The major project activities focus on software engineering (construction of database framework) and user
interface design (construction of novel visual presentations of data); this will be accomplished via an
iterative process that incorporates feedback and requests from end users (researchers including PI Bik and
case study participants). Other activities include the broader dissemination of project activities via a portal
website and social media discussions, and presentation of project outputs (software products and biological
findings) at scientific conferences.
Expected Products
Expected products include 1) an open source and functional web-based visualization pipeline 2) peer-reviewed
scientific manuscripts 3) web-based metrics documenting website views, software usage, social
media sharing and dissemination of project activities via mainstream media outlets.
Expected Outcomes
The success of this project will set new paradigms for the analysis of environmental sequence data. We
anticipate that visualization frameworks will help to address ongoing bottlenecks in the analysis of large
sequence datasets, promoting significant improvements in research efficiency. We anticipate that Sloan-funded
Microbiology of the Built Environment grantees will particularly benefit from project outcomes.

Project Title: A Research-Driven Data Visualization Framework for High-Throughput Environmental Sequence Data
Primary issue to be addressed
The advent of high-throughput sequencing data is now ushering in a veritable renaissance
in biology. For the first time, we have the ability to deeply characterize the global biodiversity of
historically neglected microbial taxa via environmental sequencing approaches (investigations
of bacteria, archaea, and microscopic eukaryotes using 454/Illumina sequencing platforms, e.g.
Creer et al. 2010, Sogin et al. 2006). However, the sheer volume of data produced from these new
technologies will require fundamentally different approaches and new paradigms for effective
data analysis (Akey & Shriver, 2011). Scientific visualization represents an innovative method
towards tackling the current bottleneck in bioinformatics; in addition to giving researchers a
unique approach for exploring large datasets, it stands to empower biologists with the ability to
conduct powerful analyses without requiring a deep level of computational knowledge.
Effective, sophisticated visualization tools (taking advantage of human cognition and
human information-processing capabilities) can help to link information from disparate fields,
propelling scientific insight and spurring new discoveries (Munzner et al. 2006). Computer
algorithms face significant difficulty in identifying simple data patterns, and thus writing
algorithms for complex, subtle patterns (the type that exist in biological systems) is almost
impossible. The human eye, in contrast, is very adept at spotting subtle visual patterns, able to
quickly notice trends and outliers (Heer et al. 2010), especially when presented with intuitive,
well-designed software tools and user interfaces (Heer & Shneiderman 2012).
The increasing scale of data (hundreds of millions of raw DNA sequences, equating to
tens of thousands of rows in a “small” Excel spreadsheet after data processing) makes it
unfeasible to conduct fine-scale analyses in most existing biological software packages.
Additionally, new sequencing technologies (454, Illumina) are fundamentally different from
Sanger sequencing – the nature of the data requires specific computational considerations (e.g.
accounting for intragenomic variation across 18S rRNA gene copies present in eukaryotic
genomes, which inherently affect our interpretation of environmental sequence data; Bik et al.
2012a). Exploratory data visualization approaches are particularly well suited for
high-throughput datasets, since we do not yet understand the regularities or “classes of behavior”
inherent to the underlying biological sequences (Shoresh and Wong 2012).
The development of specific visualization tools for high-throughput sequencing data
would ideally complement the existing initiatives and long-term vision for the Sloan Data and
Computational Research program. Such new tools represent an important next step for our ability
to manage increasingly large volumes of data and to produce meaningful, fine-scale analyses from
them. The move towards “big data” now impacts virtually all areas of science and society. As a
few examples, global networks of remote sensing technology produce millions of pixels per
year1, citizen science initiatives are maintaining increasingly large databases (eBird alone
contains 63 million records1), and the social media behemoth Twitter now reports 200 million
active users sending 400 million tweets per day2. For all such fields where large data volumes
currently impede interpretation, visual tools could offer an easier way to parse data points and
spur information discovery. Visual data exploration should drive the questions being asked,
enabling users to conduct quick and efficient analyses without needing a high degree of
programming knowledge. The overarching goal of this project is to build a system for connecting
the outputs of standard data production pipelines to a modular visualization toolset. Such a
generalized framework will be customizable for many diverse applications targeting a wide
variety of users (researchers, educators, citizen scientists, journalists, etc.). Open source projects,
coupled with an emphasis on data sharing features and data provenance tracking within the
visualization frameworks, will importantly enable the wide dissemination of scholarly
information and encourage best practices in data management and reproducibility.
Major related work in the field
There is a significant history of visualization in biological sciences, and the development
of visual tools for data analysis has long been recognized as an inherent need for the effective
interpretation of data (Wong 2012). Visual tools have gained particular popularity for the study
of evolutionary relationships amongst species (the field of phylogenetics), where data must be
viewed and interpreted in the context of a branching tree structure (Pavlopoulos et al. 2010).
Tree viewer software tools exist by the dozens3 and new programs that claim to “solve” the
tree-viewing problem continue to emerge on a regular basis.
On the other hand, the development of new visualization approaches for high-throughput
sequencing data has been stagnant; most scientific publications continue to summarize data using
simplistic pie charts or bar charts (example in Caporaso et al. 2010). Visualizations specifically
designed for environmental sequencing data are largely limited to overview schematics such as
Principal Coordinate Analysis, UPGMA clustering, rarefaction curves and OTU heatmaps4.
These approaches can provide powerful biological inferences, but because they inherently
summarize data at the community level (e.g. the entire pool of species present in a given
sample), such visuals are not appropriate for investigating more fine-scale patterns (pinpointing
specific taxa which underlie community differences). While phylogenetic placement tools can
allow the exploration of individual lineages (e.g. pplacer; Matsen et al. 2010), use of these tools
requires a certain level of computational knowledge, the file outputs are large and complex, and
useful visualization of tree placement results remains problematic.
Many biologists continue to be frustrated by the existing suite of visual tools they must
rely on for analysis of high-throughput sequencing data (a common theme that surfaced at the
October 2012 QIIME/VAMPS meeting in Boulder, sponsored by the Sloan Microbiology of the
Built Environment program). Biological software has historically been built by biologists; while
these researchers have knowledge of the problems they want to address, they typically lack
formal training in software engineering or user interface design. Thus, the majority of biological
software tools are difficult to install, poorly documented, and require expert knowledge or
training to use effectively.
Well-designed, intuitive biological software tools do exist, but they are most often
proprietary (closed-source products developed and sold by for-profit companies). Software such
as Geneious, Sequencher, and CLC workbench are convenient tools for the analysis of DNA
sequence data, but use of these programs requires the purchase of a software license costing
hundreds to thousands of dollars. Scalable software tools that are able to handle high-throughput
sequence datasets are prohibitively expensive. For example, a single-computer academic
software license for the CLC Genomics Workbench is currently priced at $4,995. Proprietary
software also operates in a non-transparent manner; users typically do not have full control over
parameter settings (being limited to a set of options pre-defined by the software company) and
are not able to gain access to the crucial information about how the software processes raw data
(Nekrutenko and Taylor 2012).
In addition to the tradeoff between high cost and poor software interfaces, most existing
visualization tools in biology were designed before the advent of high-throughput sequencing
platforms. The newest DNA sequencing technologies (Illumina and 454 platforms, amongst
others) churn out biological data at an unprecedented scale, returning millions of sequence reads
per single run of an instrument. Datasets routinely become larger as new platforms and
instrument upgrades come onto the market, but biological software capabilities have not scaled
alongside this instrument progression. For example, most tree viewer software cannot handle
evolutionary trees containing more than a few thousand taxa (Page 2012). Data formats have also
changed alongside the evolution of sequencing technology. In the past, the comparatively small
size of datasets (tens to hundreds of taxa) meant that biologists could manage manual curation of
their sequence data and annotate any useful metadata as needed (for example, adding information
about taxonomic names and sample site characteristics onto a phylogenetic tree built from DNA
sequences). Since the advent of high-throughput sequencing, large-scale manual curation of
sequence data and sample metadata is no longer possible. File sizes are too large, and data must
be prepared and processed in computationally amenable formats (e.g. programming-friendly data
values nested within tab-delimited text files). Finally, many file formats used for high-throughput
sequence data (such as Illumina FASTQ files for DNA sequences and phylogenetic trees in XML
format) are not supported by older biological visualization software.
In contrast, the outputs from computational pipelines specifically designed for
high-throughput data are becoming increasingly standardized, particularly for rRNA amplicon studies.
The QIIME pipeline (Caporaso et al. 2010) is quickly becoming standard software for the
processing and analysis of rRNA amplicon data; QIIME supports broad analyses across the tree
of life, with workflows for analyzing 16S rRNA genes from bacteria/archaea, 18S rRNA genes
from eukaryotes, and ITS rRNA from fungi. Raw sequence reads are typically clustered into
Operational Taxonomic Units (OTUs, which can be thought of as molecular “species”) using a
pairwise identity cutoff (e.g. 97%). Consensus OTU sequences are given a taxonomic
assignment based on comparisons to public sequence databases (e.g. using naïve Bayesian
classifier tools such as the RDP classifier; Wang et al. 2007). QIIME supports some limited
downstream visual analyses of data, primarily restricted to higher-level overviews of microbial
community patterns, such as Principal Coordinate Analyses and UPGMA clustering used for
comparing the similarity of samples in a given dataset. The majority of QIIME analyses produce
computationally amenable, standard file formats such as OTU tables, which document the
occurrence of OTU sequences across sample sites. QIIME has recently implemented the BIOM
format (McDonald et al. 2012) for OTU tables; this new format is based on JSON, a file
format that is easy to parse and widely supported across programming
languages. This OTU information can then be linked to other types of standard format text files
containing taxonomic assignments (taxonomy mapping files), environmental metadata about
sample sites (metadata mapping files), and evolutionary relationships (phylogenetic trees).
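The linkage described above can be illustrated with a minimal Python sketch. It is not part of QIIME or of the proposed software: the table below is a hand-written, abbreviated example in the sparse BIOM 1.0 JSON layout, the mapping text is an invented tab-delimited file, and the function names (`load_sparse_biom`, `load_mapping`, `link_records`) are hypothetical.

```python
import json

# Hand-written, abbreviated example in the sparse BIOM 1.0 JSON layout (assumption:
# real QIIME output carries more fields and far more rows).
BIOM_TEXT = """
{
  "id": null,
  "format": "Biological Observation Matrix 1.0.0",
  "type": "OTU table",
  "matrix_type": "sparse",
  "matrix_element_type": "int",
  "shape": [3, 2],
  "rows": [
    {"id": "OTU_1", "metadata": {"taxonomy": ["Eukaryota", "Nematoda"]}},
    {"id": "OTU_2", "metadata": {"taxonomy": ["Eukaryota", "Fungi"]}},
    {"id": "OTU_3", "metadata": {"taxonomy": ["Bacteria", "Proteobacteria"]}}
  ],
  "columns": [
    {"id": "SiteA", "metadata": null},
    {"id": "SiteB", "metadata": null}
  ],
  "data": [[0, 0, 12], [0, 1, 3], [1, 1, 7], [2, 0, 40]]
}
"""

# Invented metadata mapping file: tab-delimited, one row per sample site.
MAPPING_TEXT = "#SampleID\tHabitat\tpH\nSiteA\tsediment\t7.9\nSiteB\tsoil\t6.4\n"


def load_sparse_biom(text):
    """Return ({(otu_id, sample_id): count}, {otu_id: taxonomy}) from sparse BIOM JSON."""
    table = json.loads(text)
    if table["matrix_type"] != "sparse":
        raise ValueError("this sketch only handles sparse tables")
    otu_ids = [row["id"] for row in table["rows"]]
    sample_ids = [col["id"] for col in table["columns"]]
    taxonomy = {
        row["id"]: (row["metadata"] or {}).get("taxonomy") for row in table["rows"]
    }
    # Each sparse entry is [row index, column index, value].
    counts = {(otu_ids[r], sample_ids[c]): value for r, c, value in table["data"]}
    return counts, taxonomy


def load_mapping(text):
    """Return {sample_id: {column: value}} from a tab-delimited mapping file."""
    lines = [line for line in text.splitlines() if line.strip()]
    header = lines[0].lstrip("#").split("\t")
    mapping = {}
    for line in lines[1:]:
        fields = line.split("\t")
        mapping[fields[0]] = dict(zip(header[1:], fields[1:]))
    return mapping


def link_records(counts, taxonomy, mapping):
    """Join counts, taxonomy and sample metadata into flat records for plotting."""
    records = []
    for (otu_id, sample_id), count in sorted(counts.items()):
        record = {"otu": otu_id, "sample": sample_id, "count": count,
                  "taxonomy": taxonomy.get(otu_id)}
        record.update(mapping.get(sample_id, {}))
        records.append(record)
    return records


counts, taxonomy = load_sparse_biom(BIOM_TEXT)
records = link_records(counts, taxonomy, load_mapping(MAPPING_TEXT))
```

Each resulting record carries an OTU, a sample site, an abundance, a taxonomic assignment and the site metadata in one flat structure, which is the kind of input a downstream visualization could consume.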
There continues to be a persistent, vast gap between data visualization fields and the
biological sciences. Despite the “growing appetite for the visual display of information…
advances in visualization are not adequately described and shared with the biological
community” (Wong 2012). Flexible, robust visualization capabilities such as WebGL and
HTML5 are commonly employed in the data visualization community, yet have seen limited
applications in biological software. This project aims to bridge these two areas, leveraging the
critical expertise within each field via an interdisciplinary team of visual artists and software
engineers (Pitch Interactive) and practicing biologists (PI Bik and case studies detailed in
Appendix 2). Our work will incorporate the existing body of knowledge5 related to effective
software design (human-computer interaction, color theory, visual display of information) as we
design a long-term, scalable framework for the visualization of environmental sequence data.
This project will avoid redundancy across past and present visualization initiatives, minimizing
overlap with existing tools. We will specifically take advantage of standardized, computationally
efficient file formats for DNA sequence data (OTU tables, metadata/taxonomy mapping files,
and phylogenetic trees); our proposed visualization framework will pick up where QIIME
functionality ends.
This project will explore new, innovative ways to visualize high-throughput sequence
data and simultaneously explore the biodiversity present in environmental samples. Potential
avenues for these new visualizations include “Dot visualization” (Figure 2) and heatmapped
“Treemaps” (Johnson and Shneiderman 1991), although we will explore a much wider breadth of
graphical renderings. Since most environmental rRNA studies rely on arbitrarily clustered OTUs
with taxonomy derived from annotations in public sequence databases (that are often incorrect or
uninformative, e.g. Bik et al. 2012b), visualizing environmental sequences in the context of
confidence values (e.g. as obtained through the RDP classifier, Wang et al. 2007) and
phylogenetic trees will provide robust and much-needed tools for high-throughput studies. All
visualizations will be highly interactive; users will be able to conduct real-time data filtering and
updating of visual renderings.
Constructing new visualization tools for high-throughput sequencing data will have
significant and far-reaching impacts for biological research. A streamlined visualization
workflow and sleek user interface enabling novel explorations of large datasets will immediately
encourage researchers to use this application, regardless of their computational skill level. The
ease of filtering data, coupled with data provenance tracking and the ability to export
publication-quality graphics will promote effective and efficient research. In practice, data
provenance capture equates to computational systems for tracking user interactions with the
software and recording all commands, parameters, and data outputs. This simplifies data
management for users; most critically, it encourages best practices and ensures scientific
reproducibility for all analyses conducted. We will follow existing models such as VisTrails
(Callahan et al. 2006) and IPython Notebook (http://ipython.org/notebook), both of which
represent interactive and streamlined systems for data visualization and provenance capture.
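To make the idea of provenance capture concrete, the following Python sketch records each command, its parameters, and a summary of its output as a user filters a dataset. It is an illustration only: the class `ProvenanceLog` and the function `filter_min_abundance` are hypothetical names, and the design does not reproduce the internals of VisTrails or IPython Notebook.

```python
import json
from datetime import datetime, timezone


class ProvenanceLog:
    """Append-only record of the operations applied to a dataset (hypothetical design)."""

    def __init__(self):
        self.entries = []

    def record(self, command, parameters, output_summary):
        entry = {
            "step": len(self.entries) + 1,
            "timestamp": datetime.now(timezone.utc).isoformat(),
            "command": command,
            "parameters": dict(parameters),
            "output": output_summary,
        }
        self.entries.append(entry)
        return entry

    def to_json(self):
        # A plain-text export that could accompany a filtered data file.
        return json.dumps(self.entries, indent=2)


def filter_min_abundance(otu_counts, minimum, log):
    """Keep OTUs with at least `minimum` reads and log what was done."""
    kept = {otu: n for otu, n in otu_counts.items() if n >= minimum}
    log.record(
        "filter_min_abundance",
        {"minimum": minimum},
        {"otus_in": len(otu_counts), "otus_out": len(kept)},
    )
    return kept


log = ProvenanceLog()
kept = filter_min_abundance({"OTU_1": 12, "OTU_2": 1, "OTU_3": 40}, 5, log)
```

Because every filtering call writes its own log entry, a user never has to record parameters by hand, and the exported JSON gives enough detail to repeat the same steps later.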
Qualifications of project team
In this project, Pitch Interactive (a sought-after name in data visualization) will take the
lead role in the design and development of visual frameworks (Figure 1), working closely with
biologists (PI Bik and case study participants) to define research questions and prioritize the
implementation of specific features. Our project team represents a highly integrative
collaboration between research scientists and visual communicators; such tight interdisciplinary
collaborations are exceedingly rare in biological research (Wong 2012). Biologists will not build
software themselves, but will instead work in close collaboration with a proven and distinguished
industry partner (Pitch Interactive). By relying on the expertise of Pitch Interactive, we will
avoid common software pitfalls (unscalable tools with difficult user interfaces) while also being
able to deeply explore the utility of diverse and novel visual renderings of data.
PI Holly Bik will lead the project from a biological perspective. As demonstrated by her
CV, she has striven to develop and maintain a strong interdisciplinary viewpoint during her
scientific career. This broad perspective has fostered “big picture” thinking, allowing her to
tackle pressing scientific problems from a unique perspective—at the interface of many
disciplines. While PI Bik’s doctoral research centered on deep-sea nematode taxonomy and
molecular phylogeny, she transitioned fields during her postdoctoral career, currently focusing
on high-throughput environmental sequencing of microbial eukaryotes and computational
biology research. In her current role at UC Davis, she works closely with computer scientists and
software engineers to inform the development of cutting-edge tools for the primary analysis of
large sequence datasets (millions of reads), contributing to the development of the PhyloSift
pipeline (software for the phylogenetic analysis of genomes and metagenomes; Darling et al.
submitted). She is personally and professionally committed to enabling efficient science,
contributing needed tools and cyberresources to the research community, and aiming to assist
diverse groups of biologists in leveraging high-throughput sequencing approaches. PI Bik’s
interdisciplinary background will be ideal for leading the development of the proposed
visualization framework, enabling the project team to evaluate and balance computational
considerations with biological research needs. In addition, her professional network (particularly
her connection to the Sloan Microbiology of the Built Environment program) and social media
presence will enable wide dissemination of all project activities and outputs.
The Pitch Interactive team will take the lead on software development, visual design, and
project management. Pitch Interactive’s inventive and pioneering work is attributed to
out-of-the-box creative thinking derived from a small, forward-thinking team representing a range of
different yet complementary backgrounds. Their team comprises Wesley Grubbs (Creative
Director, Technical Director, owner), Nicholas Yahnke (Software Engineer) and Mladen Balog
(Concept Artist). Team leader Wesley Grubbs comes from an academic background in
international economics and information systems followed by many years in the advertising
world; during his career, he has always maintained a keen interest in science that, in part, stems
from his upbringing by a geologist. Pitch Interactive’s work has been seen in WIRED magazine,
Esquire, Scientific American, Popular Science, Fortune, Princeton Press books, and most
recently at the Museum of Modern Art’s “Talk To Me” exhibit in New York City. Their client
roster includes Activision, The Big History Project, ESPN, General Electric, General Motors,
Google, Oracle, The Russian Avant-garde Foundation, ThermoFisher Scientific, TomoTherapy
and the Wisconsin Institutes for Discovery. Pitch Interactive has created work for interactive
installations, touchscreen kiosks, smart phones, tablet devices, console games, websites, standalone applications, museum exhibits, projections, textiles and print. The team has experience in a
wide variety of industries including banking, economics, health, sports, scientific research,
advertising, politics and art. In addition, Pitch Interactive team members are active participants in the
data visualization community, speaking at conferences worldwide as well as giving lectures and
workshops at academic institutions such as UC Berkeley, Stanford and NYU.
Pitch Interactive’s daily operations are primarily involved with analysis of collected data,
consultation in data collection, visualizing data, software development and project management.
The studio follows a rigorous project workflow that closely resembles the life cycle of software
development. Reflecting the philosophy that data visualization should be strongly connected to
the lives and events from which it is derived, Pitch Interactive dissects large data sets in search of
meaningful and often hidden patterns that serve to determine the shape and form that best tells a
story. The team aims for visual depictions that not only inform, but also bridge the divide
between science and art through a visual narrative in order to inspire, stimulate and engage
minds. In working on the proposed visualization framework, Pitch Interactive will offer fresh
perspectives and explore groundbreaking new ways to explore the immense scale of biodiversity
and the evolutionary processes that have shaped life on Earth.
Project activities and description
The primary goal of this project is to produce a long-term, open source and scalable
framework for biological visualizations of environmental sequence data, where the underlying
format of the data is structured in a way that is agnostic of any downstream visualization. We
will leverage standard QIIME file formats for high-throughput sequencing data as input: these
formats will primarily include OTU tables and their associated metadata/taxonomy mapping
files, and eventually the visualization framework will be expanded to support phylogenetic trees.
Our project timeline includes two discrete phases of development (Figure 1). The first
phase (Tier One) will be the development of the underlying database framework, built
specifically to parse standard input files and arrange data in a format that is amenable to
downstream visualizations. This framework will be specifically engineered to handle extremely
large and complex datasets (see Appendix 1 for further technical specifications), and will be
constructed with a long-term vision to ensure that this framework is generalizable for other
well-structured data formats (e.g. building in support for increasingly ubiquitous,
language-independent file formats such as JSON and delimiter-separated text files).
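The Tier One idea of parsing a standard input file and condensing it through user-specified filters can be sketched as follows. This is a small illustration under stated assumptions, not the proposed database framework: the tab-delimited table is invented, and `parse_otu_table` and `apply_filters` are hypothetical names.

```python
import csv
import io

# Invented tab-delimited OTU table: one row per OTU, one column per sample site.
RAW_TABLE = (
    "OTU_ID\tSiteA\tSiteB\tSiteC\n"
    "OTU_1\t12\t3\t0\n"
    "OTU_2\t0\t7\t25\n"
    "OTU_3\t40\t0\t1\n"
)


def parse_otu_table(text, delimiter="\t"):
    """Convert a wide delimiter-separated table into one record per (OTU, sample)."""
    reader = csv.reader(io.StringIO(text), delimiter=delimiter)
    header = next(reader)
    samples = header[1:]
    rows = []
    for fields in reader:
        if not fields:
            continue  # skip blank lines
        otu = fields[0]
        for sample, value in zip(samples, fields[1:]):
            rows.append({"otu": otu, "sample": sample, "count": int(value)})
    return rows


def apply_filters(rows, samples=None, min_count=1):
    """Keep records from the chosen samples with at least `min_count` reads."""
    return [
        row for row in rows
        if row["count"] >= min_count and (samples is None or row["sample"] in samples)
    ]


rows = parse_otu_table(RAW_TABLE)
formatted = apply_filters(rows, samples={"SiteA", "SiteB"}, min_count=5)
```

Keeping the parser separate from the filters means the same long-format records could be produced from JSON or any other delimiter-separated input without changing the filtering step.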
The second phase (Tier Two) will explore the visual presentation of data, iteratively
assessing the most effective visual methods to facilitate exploratory data analysis, address
research questions, and enable novel biological discoveries from large sequence datasets. Users
will access all tools (data parsing and visualizations) via a web browser. Data visualizations will
have a built-in sharable component (custom links that preserve a specific visual rendering and
can be e-mailed to colleagues), and all shared visuals will be accompanied by compatible versions
for accessing on touchscreen devices (e.g. iPads). We will leverage the flexible, scalable
capabilities of HTML5 and WebGL (and where necessary, use other data visualization
programming languages such as Processing) to promote maximum accessibility; this scenario is
favorable over requiring users to download and install a specific application, and enables us to
leverage processing power of the Pitch Interactive servers, if needed, to maximize the speed of
visual renderings and minimize computational demands on users’ computers.
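One simple way to build a custom link that preserves a specific visual rendering is to serialize the current filter and view settings into the URL itself. The Python sketch below shows this approach under stated assumptions: the base URL is a placeholder, the state fields are invented, and `state_to_link` and `link_to_state` are hypothetical names rather than part of any planned API.

```python
import json
from urllib.parse import parse_qs, urlencode, urlsplit

BASE_URL = "https://example.org/viz"  # placeholder address, not a real project site


def state_to_link(state):
    """Encode the visualization settings as a single query-string parameter."""
    payload = json.dumps(state, sort_keys=True, separators=(",", ":"))
    return BASE_URL + "?" + urlencode({"state": payload})


def link_to_state(link):
    """Recover the visualization settings from a shared link."""
    query = parse_qs(urlsplit(link).query)
    return json.loads(query["state"][0])


state = {"view": "dots", "rank": "Phylum", "min_count": 5, "samples": ["SiteA", "SiteB"]}
link = state_to_link(state)
restored = link_to_state(link)
```

A link of this kind needs no server-side storage, although very large filter states would eventually call for a stored identifier instead of an inline payload.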
Project Schematic
Tier One: Building a data processing tool that converts raw data (potentially several hundred
thousand rows) into a workable, formatted dataset to be used in the visualizations. Through a
user interface tool, a user can specify filters to condense and extract only the necessary parts
of the raw data into the formatted dataset. Workflow: raw data; user interface for setting
filters for data parsing; formatted data structured for visualization.
Tier Two: Data visualization components. We start by building 2 - 4 data visualizations. From the
framework established with the data parser, additional visualizations can be added over time.
Each visualization will have its own controls (filters and sliders) to adjust presets; the user
can then save or share their findings with other researchers or with the public, or save them as
PNG to embed in research papers.
Figure 1: Schematic detailing the software development process for the proposed project
As we work to construct the Tier One framework, we will define a list of scientific
questions and priorities for data visualization. All downstream visualizations will fundamentally
access this underlying framework in the same manner (via a purpose-built API), but the visual
rendering and options for interacting with the data will be directly dependent on the scientific
questions being asked (e.g. case study user needs, Appendix 2) and the exploration scale defined
by the user (number of data points selected, higher versus lower level taxonomy, e.g. viewing
patterns at the Phylum vs. Genus level). Our priorities for visual tools will primarily focus on 1)
maximizing interactivity between users and their datasets, and 2) including capabilities for
filtering data and exploring biological patterns at fine-scale resolution. Many existing tools for
visualizing high-throughput data are static in nature (e.g. 2D plots and pie charts) and return
high-level overviews of biological patterns (taxonomic summaries at the Phylum level, or
relationships between samples presented as ordination plots displaying overall microbial
community similarity). The proposed fine-scale resolution would allow users to home in on
specific taxonomic lineages, investigate patterns of OTU abundance across sample sites, and
interact with overview summaries of microbial communities (e.g. expanding pie chart wedges,
akin to user interactions in the HTML5-based Krona software; Ondov et al. 2011).
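The notion of an exploration scale (viewing patterns at the Phylum versus the Genus level) amounts to summing OTU abundances over lineages truncated at the chosen rank. The Python sketch below illustrates that aggregation; the OTU records and counts are invented for illustration, and `summarize_by_rank` is a hypothetical helper, not a QIIME function.

```python
from collections import defaultdict

RANKS = ["Domain", "Phylum", "Class", "Order", "Family", "Genus"]

# Invented example records: a lineage (possibly incomplete) plus per-site read counts.
OTUS = [
    {"lineage": ["Bacteria", "Proteobacteria", "Gammaproteobacteria",
                 "Enterobacterales", "Enterobacteriaceae", "Escherichia"],
     "counts": {"SiteA": 12, "SiteB": 3}},
    {"lineage": ["Bacteria", "Proteobacteria", "Gammaproteobacteria",
                 "Pseudomonadales", "Pseudomonadaceae", "Pseudomonas"],
     "counts": {"SiteA": 5, "SiteB": 0}},
    {"lineage": ["Bacteria", "Firmicutes", "Bacilli",
                 "Bacillales", "Bacillaceae", "Bacillus"],
     "counts": {"SiteA": 1, "SiteB": 9}},
    {"lineage": ["Bacteria", "Firmicutes"],
     "counts": {"SiteA": 2, "SiteB": 2}},
]


def summarize_by_rank(otus, rank):
    """Sum per-site abundances over lineages truncated at `rank`."""
    depth = RANKS.index(rank) + 1
    totals = defaultdict(lambda: defaultdict(int))
    for otu in otus:
        lineage = otu["lineage"]
        if len(lineage) >= depth:
            label = ";".join(lineage[:depth])
        else:
            # Lineage is not resolved down to the requested rank.
            label = ";".join(lineage) + ";unclassified"
        for sample, count in otu["counts"].items():
            totals[label][sample] += count
    return {label: dict(per_sample) for label, per_sample in totals.items()}


by_phylum = summarize_by_rank(OTUS, "Phylum")
by_genus = summarize_by_rank(OTUS, "Genus")
```

Switching the rank argument changes only the grouping key, so an interactive view could move between coarse and fine summaries without re-reading the underlying data.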
In the Tier Two project phase, we will undertake an iterative approach towards
developing visualizations. One must start with questions, look into the data, ask more
questions, and then repeat this process many times in order to find the best perspective and
solution to visualize the information. We anticipate that the nature of data visualizations
will progress and evolve over time. While it is difficult to predict the ultimate form of
visualizations, we will begin by exploring visual presentations similar to existing web-based
tools that are designed to incorporate large and complex volumes of data. Some examples include
We Feel Fine6, a real-time presentation of human emotions derived from Twitter, and 100,000
Stars7, a WebGL astronomy visualization built by the Google data arts team. Figure 2 presents a
mockup of one such prospective biological visualization, based on We Feel Fine. In this example,
DNA data normally presented as basic text files (OTU tables and metadata/taxonomy mapping files)
are instead explored in a visual context, where different shapes, sizes and colors represent
distinct data attributes.
Figure 2: Mockup of dot visualization for high-throughput sequence data. Size = OTU abundance,
color = sample site, shape = habitat metadata (temperature, pH). Mockup example taken from
http://www.wefeelfine.org/
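The mapping from data attributes to visual attributes in such a dot visualization can be sketched in a few lines of Python. This is an illustrative assumption about one possible encoding, not the design of the eventual tool: the palette, shape names, square-root size scaling, and the function `encode_dots` are all hypothetical choices.

```python
import math

PALETTE = ["#1b9e77", "#d95f02", "#7570b3", "#e7298a"]  # arbitrary example colors
SHAPES = ["circle", "square", "triangle", "diamond"]


def encode_dots(records, max_radius=20.0):
    """Map each record to a dot: size = abundance, color = site, shape = habitat."""
    if not records:
        return []
    sites = sorted({r["sample"] for r in records})
    habitats = sorted({r["habitat"] for r in records})
    largest = max(r["count"] for r in records)
    dots = []
    for r in records:
        dots.append({
            "otu": r["otu"],
            # Square-root scaling keeps dot area proportional to abundance.
            "radius": max_radius * math.sqrt(r["count"] / largest),
            "color": PALETTE[sites.index(r["sample"]) % len(PALETTE)],
            "shape": SHAPES[habitats.index(r["habitat"]) % len(SHAPES)],
        })
    return dots


example_records = [
    {"otu": "OTU_1", "sample": "SiteA", "habitat": "sediment", "count": 100},
    {"otu": "OTU_2", "sample": "SiteB", "habitat": "soil", "count": 25},
]
dots = encode_dots(example_records)
```

The output is a list of plain dictionaries, so the same encoding step could feed an HTML5 canvas, a WebGL scene, or a static export.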
During software development, we will emphasize three foci to maximize software
accessibility and data reproducibility for end-users. First, we will take advantage of gestural
interfaces for data filtration and exploration of visual renderings (e.g. “slider bars” as depicted in
the center figure of the Figure 1 Tier One workflow). Initially, gestural interfaces will be
primarily accessible through a web browser (manipulation via trackpad or mouse on desktop
computers, and touchscreen interaction for shared visuals accessed on mobile device web
browsers), but a long-term goal is the development of a dedicated mobile app to further leverage
touchscreen interaction capabilities (since existing biological software has not yet taken
advantage of touchscreen capabilities; Page 2012). As a second focus, we will track and record
data provenance for all user interactions within the software. This feature will automatically
generate outputs providing details of data filtering and manipulation (allowing researchers to
access filtered data files and in-depth records of computational processing if needed), without
ever requiring users to interact with software on the level of the underlying code. Finally, the
overall framework will be specifically designed to promote and facilitate novel scientific
discoveries. The fundamental rationale for the proposed visualization framework is to increase the
pace of scientific discovery. Instead of forcing biologists to struggle with Perl scripts and Unix
commands, we aim to provide researchers with a powerful and easy-to-use framework for
exploring hypotheses and generating/testing new scientific questions on the fly.
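To make the provenance-tracking focus concrete, the following is a minimal sketch of how each user interaction could be appended to a log and exported alongside the filtered data. The proposal does not specify a log format, so the class name, field names, action names and the example file name are all hypothetical, and the sketch is written in Python for readability rather than in the browser-side code that the final tool would use.

```python
import json
from dataclasses import dataclass, field, asdict
from datetime import datetime, timezone


@dataclass
class ProvenanceLog:
    """Hypothetical record of every filtering step applied to one input file."""
    source_file: str
    steps: list = field(default_factory=list)

    def record(self, action: str, **params) -> None:
        # Append one numbered, timestamped entry per user interaction.
        self.steps.append({
            "step": len(self.steps) + 1,
            "action": action,
            "params": params,
            "timestamp": datetime.now(timezone.utc).isoformat(),
        })

    def to_json(self) -> str:
        # Export the full history so it can be saved next to the filtered data.
        return json.dumps(asdict(self), indent=2)


# Illustrative usage: two slider/selection events recorded without the user
# touching any underlying code.
log = ProvenanceLog(source_file="otu_table.biom")
log.record("filter_min_abundance", min_count=10)
log.record("select_samples", sample_ids=["site_A", "site_B"])
report = log.to_json()
```

A record of this kind could be downloaded together with the filtered data file, giving researchers a step-by-step account of how a given visualization was produced.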
All software and research products will be maintained as open-source and open access,
allowing us to build a strong community of users and developers. Building up a community
around visualization tools will promote constructive conversations across user groups (e.g.
amongst our project team and target audiences for different tools), encourage developers to
leverage our database framework for new software that is outside the scope of this proposal, and
lead to broad dissemination of scientific products (software and publications).
Management and Staffing plan
Pitch Interactive will drive software development of the Tier One and Two frameworks
(programming and database construction outlined in Figure 1). PI Bik will drive the biological
side of software development (prioritizing research questions to be addressed and features to be
implemented), liaising with Pitch Interactive to explain data structures and give feedback on
database frameworks and user interface design. PI Bik will additionally coordinate all
interactions with case study participants. PI Bik and case study participants will test software
products and report back to the Pitch Interactive team with feedback on user interface design,
tool functionality, and visual presentation of data. Both Pitch Interactive and PI Bik will
contribute to the construction and upkeep of the project portal website, and disseminate project
updates and products via social media and blog updates. At the outset of this project, the Pitch
Interactive Team and PI Bik will meet in person to prioritize project goals, discuss specific user
groups and audience needs, and review the anticipated “test cases” (Appendix 2) to be used as
models towards defined project goals. This in-person meeting will be repeated every month; each
meeting will begin with an assessment of the previous month’s goals and then set new targets based
on the approaches that proved successful. Daily and weekly correspondence between Pitch Interactive and UC Davis
will be conducted via email and Skype.
For all projects, Pitch Interactive follows a Systems Development Life Cycle (SDLC)
process8 for building digital projects and interactive data visualizations. This process involves
significant upfront project planning and requirements definition that helps keep large projects
manageable and maintainable. During the planning phase, key variables are defined, such as
project milestones, technologies, programming methodologies, documentation requirements, and
other task breakdowns in order to organize and manage the project and its deliverables. For
everyday project management tasks (maintaining project-related correspondence,
discussing features, keeping to-do lists, sharing files, coordinating the group, and tracking task accountability),
this project will use Basecamp9, a proven project management tool used by Pitch Interactive for
several years on dozens of projects.
Figure 3: Project milestones and timeline
Broader impacts activities (plans for dissemination and sustainability)
Project activities and software tools will primarily be disseminated via a dedicated portal
website, while also leveraging social media tools (Twitter, Google+) and in-person presentations
at workshops and scientific conferences to announce software updates and research products
(peer-reviewed manuscripts). The project website will be built using WordPress software, a
flexible and customizable platform that will be used to host a project blog and software tutorials,
pull in discussions from Twitter, link to our open source code repositories (GitHub,
http://github.com, the repository we plan to use for managing software development and version
control), and most importantly direct users to the visualization frameworks outlined in Figure 1.
We also anticipate harnessing PressForward10 (a Sloan-funded project), a WordPress plugin that
will allow us to amalgamate relevant content about scientific visualization on the project website.
Both Pitch Interactive and PI Bik (a postdoc in the lab of Jonathan Eisen at UC Davis)
maintain a significant online presence (including Twitter, blogs, and professional websites), and
are well suited to take a lead role in promoting broad dissemination of visualization software
developed during this project. Pitch Interactive maintains strong connections across major media
sources (including scientific magazines, news outlets, and technology magazines), and website
traffic shows major peaks when high-profile visualizations are released (400,000 views in one
week for a recent project). The Eisen lab is a leading voice in the movement towards open access
science and social media-based online outreach; lab members consistently produce blog content
for both scientific and general audiences. Lab head J. Eisen is active on his popular Tree-of-Life
blog11 (2,000+ subscribers and 20,000-50,000 site visits per month) and also has a high-profile
Twitter microblog (@phylogenomics) with >15,000 followers. Project PI Holly Bik contributes
to the leading marine science blog Deep Sea News12 (100,000-300,000 site visits per month)
and maintains a Twitter account (@Dr_Bik) with >2,400 followers. Pitch Interactive and the
Eisen Lab at UC Davis maintain a distinct base of followers in the technology and biological
research sectors, respectively, and additionally maintain a strong network of media/journalism
contacts; disseminating project activities via non-overlapping online channels will encourage
engagement and participation from diverse audiences. Harnessing social media tools during this
project will be critical for promoting open science and disseminating all project outputs.
Because Pitch Interactive is well known and well-respected in the field of data
visualization (Wesley Grubbs maintains a strong network across the technology and design
sectors), their status will additionally allow us to solicit interest for continued support of
visualization frameworks well beyond the initial project development phase. We anticipate that
our mature software products, supported by a diverse user community and disseminated via a
strong web presence (social media and coverage in science blogs/print journalism), will allow us
to secure a financial partner (private sponsor, federal research funding or similar) and ensure
long-term sustainability for the project.
Project outputs
This project aims to produce a variety of discrete outputs, including software products,
peer-reviewed manuscripts, public outreach components, and links to the Sloan Microbiology of
the Built Environment (MBE) Program. The primary output will be a two-tier, web-based
interface for visualizing high-throughput sequencing data (Figure 1, as described previously).
Tier One will be a scalable database framework; users will upload their standard-format data
files within a web browser and subsequently adjust settings and select criteria for data filtering.
Once these criteria are submitted, users will be seamlessly transferred to the Tier Two
visualization framework; the software will render 2-4 discrete visualizations, from which users
will select a visual presentation to expand and explore more deeply.
In addition to software frameworks, this project will produce a parallel suite of research
products. PI Bik and case study participants will evaluate visualization tools using their own
environmental sequence datasets, generated from a diversity of habitats (marine ecosystems to
the Built Environment). Since the goal of visualization tools is to provide deeper insights into
biological patterns, we anticipate that users will discover surprising patterns while exploring
their own data (interesting taxonomic patterns, subtle differences in microbial communities
across different types of sample sites). Thus, another priority within this project will be to
translate these biological findings into peer-reviewed scientific manuscripts; we will assist and
encourage case study participants in this process as needed. Once visualization tools are mature,
we also intend to publish our framework as a software note in a relevant open-access journal
(e.g. BMC Bioinformatics); this will enable researchers to reference an appropriate citation in
any future mention of visualization tool use.
For measuring “broader impact” outputs, mechanisms for tracking/analyzing web-based
resources will be spearheaded by both PI Bik and Pitch Interactive. Given the pace of
technology, the type and nature of web tracking data is likely to evolve over the course of the
project, but at minimum we will collect basic statistics related to webpage views (e.g. tools such
as StatCounter13 and Google Analytics which record page views and their geographic origin,
referring links and relevant Google search terms) and wider dissemination of website/blog
content (Tweets, reposts, formal coverage in science journalism). In addition, we plan to utilize
ImpactStory14 (a Sloan-funded project) to further track social media sharing and altmetrics
related to our GitHub software repository; eventually, we anticipate that software users will also
be able to track metrics for their shared visualizations via ImpactStory. All collected data will be
amalgamated and housed on shared cloud-based servers (e.g. Dropbox) and locally backed up at
UC Davis. As our project evolves, we will use web-tracking data to identify the most effective
methods for content dissemination and target specific channels to ensure the broadest reach.
Finally, this project will collate survey data and feedback from case study participants,
including requests for new features and support for other types of standard file formats (see
Appendix 2). Interacting with researchers in the Sloan MBE program represents an integral
component of this approach. These data will be collected with a long-term outlook and
assumption of continued future development and expansion of the visualization frameworks
outlined in Figure 1. All feature requests and bugs will be collated and monitored in Trello15, a
dedicated issue-tracking system that Pitch Interactive has used extensively for several years.
Appendix 1 – Technical Specifications
To undertake this project, several technical aspects must be considered. This document helps
address the technical specifications and needs.
1. User Data (Initial Inputs)
The data required to run on this system will be static JSON files (OTU tables) and tab-delimited .csv files (taxonomy/metadata mapping files) generated by researchers using the
QIIME pipeline (http://qiime.org) either locally or on Amazon EC2 cloud servers. Using pre-generated data files is a cost-effective solution that eliminates calls to an external server and
allows the visualization system to run online or offline. The JSON and .csv files will follow
specific formats that will be read by the data parsing system so that the data can be updated as
frequently as necessary.
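As an illustration of the two input types, the sketch below reads a small OTU table and a tab-delimited mapping file. It assumes the JSON OTU table follows the sparse layout of the BIOM 1.0 format (McDonald et al. 2012), with "rows", "columns" and "data" triplets of [row index, column index, count]; the sample data, the "Habitat" column and the function names are invented for the example. The sketch is in Python for readability and is not the browser-side parser itself.

```python
import csv
import io
import json

# Tiny invented inputs standing in for QIIME outputs.
BIOM_TEXT = """{
  "rows": [{"id": "OTU_1", "metadata": null}, {"id": "OTU_2", "metadata": null}],
  "columns": [{"id": "site_A", "metadata": null}, {"id": "site_B", "metadata": null}],
  "matrix_type": "sparse",
  "shape": [2, 2],
  "data": [[0, 0, 12], [0, 1, 3], [1, 1, 7]]
}"""

MAPPING_TEXT = "#SampleID\tHabitat\nsite_A\tmarine\nsite_B\tindoor\n"


def load_otu_counts(biom_text):
    """Return {otu_id: {sample_id: count}} from a sparse BIOM-style JSON string."""
    table = json.loads(biom_text)
    otu_ids = [row["id"] for row in table["rows"]]
    sample_ids = [col["id"] for col in table["columns"]]
    counts = {otu: {} for otu in otu_ids}
    # Each sparse entry is [row index, column index, observed count].
    for row_idx, col_idx, value in table["data"]:
        counts[otu_ids[row_idx]][sample_ids[col_idx]] = value
    return counts


def load_mapping(mapping_text):
    """Return {sample_id: metadata row} from a tab-delimited mapping file."""
    reader = csv.DictReader(io.StringIO(mapping_text), delimiter="\t")
    return {row["#SampleID"]: row for row in reader}


counts = load_otu_counts(BIOM_TEXT)
mapping = load_mapping(MAPPING_TEXT)
```

Joining the two structures on the sample identifier is what lets a visualization color or group OTU counts by any metadata column.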
2. Data Parsing
For the Tier One framework (Figure 1), raw data files will be parsed on the client browser.
JavaScript libraries such as D3.js support the parsing of even very large data files.
Once the data have been parsed, a new, smaller, optimized JSON-formatted data file will be
created and used to generate the data visualizations more quickly and efficiently.
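The reduction step described above might look like the following sketch. The proposal does not define what "optimized" means, so this example assumes one plausible reading: OTUs below a minimum total abundance are dropped and the remainder is written as compact index triplets. The function name, the `min_total` threshold and the output keys are hypothetical, and the sketch is in Python rather than browser JavaScript.

```python
import json


def optimize_table(counts, min_total=5):
    """Drop OTUs whose summed count is below min_total; return compact JSON text."""
    kept = {otu: per_sample for otu, per_sample in counts.items()
            if sum(per_sample.values()) >= min_total}
    otus = sorted(kept)
    samples = sorted({s for per_sample in kept.values() for s in per_sample})
    sample_index = {s: i for i, s in enumerate(samples)}
    payload = {
        "samples": samples,
        "otus": otus,
        # Store counts as [otu index, sample index, count] to avoid repeating IDs.
        "data": [[otu_i, sample_index[s], n]
                 for otu_i, otu in enumerate(otus)
                 for s, n in sorted(kept[otu].items())],
    }
    # Compact separators strip the whitespace that default JSON output adds.
    return json.dumps(payload, separators=(",", ":"))


raw = {
    "OTU_1": {"site_A": 12, "site_B": 3},
    "OTU_2": {"site_B": 7},
    "OTU_3": {"site_A": 1},  # rare OTU, removed by the threshold
}
compact = optimize_table(raw)
```

Any filtering applied at this stage changes what the visualizations can show, so a threshold like this would need to be user-adjustable and recorded as part of the data provenance.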
3. Data Storage
There are two options for approaching data storage. We may implement both options or only one,
depending on how researchers weigh privacy concerns against the ease of sharing an analysis.
Option 1: The parsed, optimized JSON file will be saved to the user’s local hard drive. In order to
share their analysis, the researcher must share the local file with fellow researchers. This option
supports working offline and would address any privacy concerns a researcher may
have if they are uncomfortable with their files being uploaded to a server.
Option 2: The parsed, optimized JSON file will be uploaded to a server. This would be optimal
for sharing with multiple researchers and will address the need to keep files centralized.
However, any researcher would be required to work online to view the data file.
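The two storage options can be contrasted with a short sketch. Option 1 is shown as a local save and reload; Option 2 is shown only as a description of the upload request, without sending anything, because no server API has been defined. The function names, the file name and the example.org endpoint are placeholders, and the choice of an HTTP PUT is an assumption made for illustration.

```python
import json
import tempfile
from pathlib import Path


def save_local(parsed, directory, name="analysis.json"):
    """Option 1: write the parsed table to the user's own disk."""
    path = Path(directory) / name
    path.write_text(json.dumps(parsed), encoding="utf-8")
    return path


def load_local(path):
    """Reload a locally saved (or locally shared) analysis file."""
    return json.loads(Path(path).read_text(encoding="utf-8"))


def build_upload_request(parsed, endpoint):
    """Option 2: describe (but do not send) the upload a central server would need."""
    body = json.dumps(parsed).encode("utf-8")
    return {
        "method": "PUT",
        "url": endpoint,
        "headers": {
            "Content-Type": "application/json",
            "Content-Length": str(len(body)),
        },
        "body": body,
    }


parsed = {"samples": ["site_A"], "otus": ["OTU_1"], "data": [[0, 0, 12]]}

with tempfile.TemporaryDirectory() as tmp:
    saved = save_local(parsed, tmp)
    round_trip = load_local(saved)

request = build_upload_request(parsed, "https://example.org/analyses/demo")
```

Because both options operate on the same parsed JSON file, supporting both would mainly be a matter of where that file is written.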
4. Data Visualization using Web-based Technologies
The primary target medium for the data visualization components will be modern web browsers
that support WebKit and WebGL for advanced graphics rendering (e.g. Google Chrome).
However, while building for web browsers, there will be some functionality engineered
specifically for the iPad Safari browser and Android tablet device browsers. The tablet browsers
may not currently have the capacity to render high-level graphics the same way that a WebKit
browser like Chrome can, but based on the advancing trends of these devices, it would be naïve
to neglect these technologies for future purposes.
Appendix 2 - User Case Studies
In this project we will maintain close interactions with three primary case study groups. Case
study participants are integral to the development of effective visualization tools, as they
represent the target end-users of software products resulting from this project. Case study groups
encompass a wide breadth of research expertise (biologists to computer scientists), and will serve
two primary purposes:
1. Test software functionality and user interface design. Are tools easy and intuitive to
use? Is there sufficient documentation to solve problems if users get stuck? Are users
able to upload their standard data files as anticipated?
2. Give feedback on visual interfaces for data exploration (Tier Two visualizations).
Are these visualizations appropriate given the nature of the data? Can researchers ask
questions and explore hypotheses in an efficient and intuitive way? Are there any aspects
of the data or metadata that users would like to explore, but cannot currently visualize?
Case Study Group 1: Eisen Lab members
The most immediate case study participants will be members of Jonathan Eisen’s lab at
UC Davis (where PI Bik is a postdoc). Eisen lab members work with a variety of high-throughput sequencing data types, including shotgun metagenomes, 16S/18S rRNA amplicon
data, and isolate microbial genomes. Graduate students and postdocs will be asked to test
prototype visualization frameworks using their own computers and datasets. The close proximity
of Eisen lab participants will allow our project team to solicit consistent and informal feedback
on a daily/weekly basis. Although the activities in this proposal focus on the visualization of
rRNA amplicon data, our long-term vision would be to eventually expand this software to
support other standard data types (e.g. from shotgun metagenomic analyses). Discussions with
Eisen Lab members will enable us to gather ideas and visualization requests for planning future
software development.
Case Study Group 2: Sloan Grantees within the Built Environment Program
This project will establish strong links to the Sloan Microbiology of the Built
Environment (MBE) program, using researchers funded through this initiative as case study
participants. PI Bik is already heavily involved with MBE initiatives; she contributes to the
microBEnet project (http://www.microbe.net - a portal website established to catalyze interest in
the MBE program and connect researchers working in diverse disciplines), and regularly
interacts with MBE researchers at Sloan meetings (QIIME/VAMPS meeting in Boulder in
October 2012, Annual Sloan MBE meetings, “Evolution of the Indoor Biome” meeting at
NESCent in June 2013). These links to the Sloan MBE program will provide important test cases
for visualization tools, enabling researchers working on diverse projects to test out prototypes
and deliver feedback. Since many Sloan MBE projects are looking to conduct fine-scale analyses
on microbial communities in the Built Environment, we believe that the proposed data
visualization tools could satisfy some of their main frustrations with existing computational
tools; we aim to facilitate more rapid analysis and publication of data collected as part of MBE
projects.
We specifically propose to work closely with the Sloan-funded Wild Life of Our Homes
(WLOH) project (Rob Dunn and Holly Menninger at North Carolina State University).
Researchers Dunn and Menninger have expressed a keen interest in trialing visualization tools
using amplicon data that has already been generated as part of this project. The WLOH project is
looking to assess subtle patterns in bacterial and fungal communities across residential dwellings
in their sample set, but faces current difficulties finding feasible and informative ways to
compare microbial communities. Thus, the WLOH project has ideas for specific feature requests
that will also represent user needs for other Built Environment projects, including: 1) Filtering
data to look at outputs from single samples or small subsets of samples, 2) Visual presentation of
taxonomic information that allows users to infer knowledge about an organism’s biology and
ecology, 3) Real-time modification of OTU clustering parameters for a subset of taxa, for
example when biological information suggests that a 97% pairwise identity clustering cutoff may
be lumping several bacterial strains together, and 4) Display of geographic data that will allow
novel spatial inferences about microbial species.
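The first feature request in the list above (restricting the view to single samples or small subsets of samples) can be sketched as a simple filter over an OTU count table. The nested-dictionary layout, the function name and the sample identifiers are hypothetical, and the sketch is in Python for readability rather than in the browser-side code the tool would use.

```python
def filter_samples(counts, keep):
    """Return only the OTUs (and counts) observed in the requested samples."""
    keep = set(keep)
    filtered = {}
    for otu, per_sample in counts.items():
        # Keep non-zero counts from the selected samples only.
        subset = {s: n for s, n in per_sample.items() if s in keep and n > 0}
        if subset:  # drop OTUs absent from every selected sample
            filtered[otu] = subset
    return filtered


# Invented example: three OTUs across samples from two homes.
counts = {
    "OTU_1": {"home1_kitchen": 40, "home1_bathroom": 2, "home2_kitchen": 0},
    "OTU_2": {"home2_kitchen": 15},
    "OTU_3": {"home1_kitchen": 5, "home2_kitchen": 9},
}
kitchen_home1 = filter_samples(counts, ["home1_kitchen"])
```

The same function handles a small subset of samples by passing several identifiers, for example all kitchen samples in the dataset.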
Case Study Group 3: Projects within the NSF-funded program “Assembling, Visualizing and
Analyzing the Tree of Life”
The original basis for the visualization tools outlined in this proposal came from PI Bik’s
participation in the National Science Foundation’s Ideas Lab for Assembling, Visualizing and Analyzing the
Tree of Life (AVAToL) held in August 2011 in Lake Placid, NY (Collins et al. 2013). Although
visualization tools were not ultimately supported by NSF funding (despite several subsequent
proposals submitted to the NSF Tree of Life program by PI Bik), new ways of visualizing data
were recognized as a critical need for advancing our understanding of global biodiversity. PI Bik
maintains close contact with PIs leading projects funded under the NSF AVAToL initiative
(http://avatol.org), and will liaise with these researchers to disseminate visualization software
products and solicit feedback. A long-term goal will be to facilitate cooperation and integration
between this visualization project and NSF AVAToL projects (none of which are primarily
focusing on exploratory visual tools), as follows:
1. Open Tree of Life (led by PI Karen Cranston at NESCent) – This project will provide
the first reference guide tree (with critical metadata) for the entire Tree of Life. We
propose to leverage the open-source Open Tree API to pull down relevant metadata
(phylogenies, taxonomic annotations, and environmental metadata) to supplement and
populate visualization tools described in this proposal.
2. Arbor (led by PI Luke Harmon at the University of Idaho) – This project will leverage
phylogenetic trait data and species distribution data to design a software platform that
allows the investigation of a) Evolutionary processes of spatial diversification, b)
Evolution of symbiotic communities, and c) Evolution of complex interactions. We will
work with the Arbor team to encourage the development of APIs that allow the two
platforms to interact and access each other’s data.
3. Next Generation Phenomics (led by PI Maureen O’Leary at Stony Brook University) –
This project aims to digitize morphological, taxonomic and paleontological data using
machine learning, computer vision and crowd sourcing approaches. PI Bik is a “data
provider” for the Phenomics project and will work with this project team to develop
computationally-friendly metadata formats that are accessible for the visualization
framework outlined in this proposal. In the long term, we specifically aim to pull in
species image data and phenomic matrix data to populate interactive visualizations.
Collection of survey and assessment data for all case studies: User surveys and evaluation will
be driven by Pitch Interactive, who have substantial experience in determining target audiences
and assessing user needs for software tools (e.g. via questionnaires at SurveyMonkey,
http://www.surveymonkey.com, or based within Google Docs). These data will be incorporated
into project management software and used to specifically define targeted outcomes.
Appendix 3
References Cited
Akey, J.M. and Shriver, M.D. (2011) A grand challenge in evolutionary and population genetics:
new paradigms for exploring the past and charting the future in the post-genomic era.
Frontiers in Genetics, 2:1-2.
Bik, H.M., Porazinska, D., Caporaso, J.G, Knight, R., Thomas, W.K. (2012a) Sequencing our
way towards understanding global eukaryotic biodiversity. Trends in Ecology and
Evolution, 27(4):233-243.
Bik HM, Halanych KM, Sharma J, Thomas WK. (2012b) Dramatic Shifts in Benthic Microbial
Eukaryote Communities following the Deepwater Horizon Oil Spill. PLoS ONE,
7(6):e38550.
Callahan SP, Freire J, Santos E, Scheidegger CE, Silva CT, Vo HT. (2006) VisTrails:
Visualization meets Data Management. Proceedings of ACM SIGMOD, June 27-29,
2006, ACM Press, New York, NY. pp. 745–747.
Collins T, Kearney M, Maddison D. (2013) The Ideas Lab Concept, Assembling the Tree of
Life, and AVAToL. PLOS Currents Tree of Life. Published online March 7, 2013.
Caporaso JG, Kuczynski J, Stombaugh J, Bittinger K, Bushman FD, Costello EK, et al. (2010)
QIIME allows analysis of high-throughput community sequencing data. Nature Methods.
7(5):335–336.
Creer S, et al. (2010) Ultrasequencing of the meiofaunal biosphere: practice, pitfalls and
promises. Molecular Ecology.19(s1):4–20.
Darling, A., Jospin, G., Matsen, F., Lowe, E., Bik, H.M., & Eisen, J.A. (Submitted) PhyloSift: A
pipeline for phylogenetic taxonomy assignments from environmental metagenome data.
Heer, J., Bostock, M, & V. Ogievetsky. (2010) A tour through the visualization zoo. Association
for Computing Machinery http://queue.acm.org/detail.cfm?id=1805128
Heer, J. & Shneiderman, B. (2012) Interactive dynamics for visual analysis Association for
Computing Machinery http://queue.acm.org/detail.cfm?id=2146416
Johnson B, Shneiderman B. (1991) Tree-maps: a space-filling approach to the visualization of
hierarchical information structures. Visualization 1991, Proceedings, IEEE Conference
on 1991, 284-291.
Matsen, F.A. et al. (2010) pplacer: linear time maximum-likelihood and Bayesian phylogenetic
placement of sequences onto a fixed reference tree. BMC Bioinformatics, 11:538.
McDonald D, Clemente JC, Kuczynski J, Rideout J, Stombaugh J, Wendel D, et al. (2012) The
Biological Observation Matrix (BIOM) format or: how I learned to stop worrying and
love the ome-ome. GigaScience, 1(1):7.
Munzner, T., et al. (2006) NIH-NSF visualization research challenges report summary. IEEE
Computer Graphics and Applications, 26(2):20–24.
Nekrutenko A, Taylor J. (2012) Next-generation sequencing data interpretation: enhancing
reproducibility and accessibility. Nature Reviews Genetics, 13(9):667–72.
Ondov BD, Bergman NH, Phillippy AM. (2011) Interactive metagenomic visualization in a Web
browser. BMC Bioinformatics, 12(1):385.
Page RDM. (2012) Space, time, form: viewing the Tree of Life. Trends in Ecology & Evolution,
27(2):113–20.
Pavlopoulos GA, Soldatos TG, Barbosa-Silva A, Schneider R. (2010) A reference guide for tree
analysis and visualization. BioData Mining, 3(1):1.
Shoresh N, Wong B. (2012) Points of view: Data exploration. Nature Methods, 9(1):5.
Sogin ML, et al. (2006) Microbial diversity in the deep sea and the underexplored "rare
biosphere". Proc Natl Acad Sci USA.103(32):12115–12120.
Wang Q, Garrity GM, Tiedje JM, Cole JR. (2007) Naive Bayesian classifier for rapid assignment
of rRNA sequences into the new bacterial taxonomy. Applied and Environmental
Microbiology, 73(16):5261–7.
Wong B. (2012) Points of view: Visualizing biological data. Nature Methods, 9(12):1131.
Footnotes (websites referenced in proposal text)
1
Data estimates and an overview of how "Big Data" is impacting diverse fields can be found at
the NEON website http://www.neoninc.org/news/big-data-part-i
2
Twitter statistics were obtained from a Washington Post article published on 3/21/13:
http://articles.washingtonpost.com/2013-03-21/business/37889387_1_tweets-jack-dorsey-twitter
3
An extensive list of visualization software for phylogenetic trees can be found at
http://evolution.genetics.washington.edu/phylip/software.html
4
Examples of all microbial community visualizations listed in the proposal text can be found at
http://qiime.org/tutorials/tutorial.html
5
An extensive list of publications related to software design and human-computer interactions
can be found at http://www.infovis-wiki.net/index.php?title=Category:Publications
6
We Feel Fine Twitter visualization can be viewed at http://www.wefeelfine.org/
7
100,000 Stars visualization can be viewed at http://workshop.chromeexperiments.com/stars/
8
Full details of the Systems Development Life Cycle (SDLC) process can be found at
http://en.wikipedia.org/wiki/Systems_development_life-cycle
9
Further details about the Basecamp software can be found at http://www.basecamp.com/
10
Further details about the PressForward plugin can be found at http://pressforward.org/
11
Jonathan Eisen’s Tree-of-Life blog can be accessed at http://phylogenomics.blogspot.com/
12
The Deep Sea News marine blog can be accessed at http://deepseanews.com/
13
Further details about the StatCounter web tracking tool can be found at http://statcounter.com
14
Further details about ImpactStory can be found at http://impactstory.it
15
Further details about Trello can be found at http://www.trello.com/
UNIVERSITY OF CALIFORNIA, DAVIS
BERKELEY ● DAVIS ● IRVINE ● LOS ANGELES ● MERCED ● RIVERSIDE ● SAN DIEGO ● SAN FRANCISCO
● SANTA BARBARA ● SANTA CRUZ
UC DAVIS GENOME CENTER
ONE SHIELDS AVENUE
DAVIS, CALIFORNIA 95616
May 8, 2013
Dear Josh,
We would like to thank the external reviewers and Sloan staff members for their extensive and insightful
comments on our proposal, “A Research-Driven Data Visualization Framework for High-Throughput
Environmental Sequence Data.” Our responses to reviewer comments are as follows:
Response to Reviewers 1-3:
Rationale for choosing a small team, focused approach at the project outset
We appreciate reviewer concerns about engaging a larger community in this project, both in terms of
software development and user feedback/testing. In the long term, both Pitch Interactive and I are
committed to ensuring broad reach of visualization tools, and soliciting input from diverse users.
However, we believe that including a larger community during the early development of this project
would hinder project progress and endanger the chance of success. Based on their experience, Pitch
Interactive believes that when a project tries to get as much feedback as possible at the outset, the decision-making
process weakens and lines of responsibility break down. Thus, we have made a conscious choice to keep the
project very focused from the start: we plan to initially work with a small, agile team of scientists
(myself and case study participants) to help us address key issues. Once we build a stable prototype, the
greater community can add to this framework and develop it further. Building a community of developers
(and engaging with developers working on other existing software packages) will be a long-term, multi-year process; both Pitch Interactive and I believe that at our current stage, it is too early to discuss
sustainability through the creation of a developer community. We envision this community building as
“Stage 2” (contingent on continued funding), which would occur after the release of our final prototype at
the end of project year 1.
Our focused, small team approach is a successful model that has been previously applied in data
visualization fields. For example, Processing (a popular programming language for visualization) was
initially developed by Ben Fry and Casey Reas in the MIT Media Lab. Following its conception by this
small team in 2001, Processing has expanded to become a very rich toolset full of extensions and libraries
created by the community, and maintains a strong following of developers.
Another critical aspect of this proposal is that software development and project management will be led
by Pitch Interactive, an “outside” organization that is separate from the academic research community.
Pitch Interactive will effectively provide a unique and detached view of the project; we will have ample
opportunity to look at data and approach problems from new perspectives. Wesley
Grubbs (CEO of Pitch Interactive) and I have had extensive discussions about the challenges of scientific
visualization and the potential for new tools to transform the process of scientific research. We are excited
about the opportunity to collaborate. Even though Pitch Interactive is a private organization, we both view
this project as an intensely creative endeavor that is driven by mutual scientific passion and curiosity,
rather than simply a contract service carried out by a “for-profit” company (a point raised by several
reviewers).
Response to comments regarding technical specifications
Regarding WebGL, a majority of web browsers do currently support this technology
(see http://en.wikipedia.org/wiki/Usage_share_of_web_browsers). As of April 2013, more than 70% of
browsers support WebGL. Internet Explorer is the only remaining popular browser that does not support
WebGL; however, the next version (IE11) will support it. The point of this project is to be forward
thinking and plan for public usage around summer 2014. It is worth noting that the current WebGL
specification of 1.0 that is widely used was released 2 years ago, in March 2011
(http://en.wikipedia.org/wiki/WebGL). At this point, we do not believe the accessibility of WebGL is a
major concern, since we anticipate that our users will all have access to WebGL browsers. Similarly, the
discussions of “offline support” (Reviewer 1) are highly pertinent here, since many scientists (including
myself) carry data files around on their laptops and have easy access to WebGL browsers. Here we meant
to emphasize that the use of visualization tools will not necessarily require an internet connection (even
though the toolkit will be accessed in a web browser).
Although Reviewer 1 asked for specific details about software libraries (e.g. D3), this will be determined
during the course of the project as Pitch Interactive gains a better understanding of scientific data
structures. We will likely use Three.js, D3, and jQuery but beyond that, it will depend on the requirements
building that we do in the planning phase. D3 is more of a visualization toolset, and we would not likely
use it for data parsing. Three.js is used for WebGL and 3D rendering, while Two.js and jQuery are for
general web support, such as form handling, animations, etc.
Web services to interact with online data (e.g. housed on Dropbox or GenBank) will not be an initial
focus of this project – we had discussed this idea with Pitch Interactive while writing the proposal, and
decided that accommodating such features would require a non-trivial amount of programming and
auxiliary software development. This proposal aims to produce a prototype visualization toolkit in a short
period of time, and thus to simplify development it makes the most sense to initially leverage local data
files on a user’s computer. In the long-term, however, we do hope to implement tools that will allow users
to access online data and public databases. Similarly, the integration of statistical analysis and
visualization (e.g. support for data analysis using R) is another aspect that has been extensively discussed
by myself and Pitch Interactive. We believe this is too lofty a goal to be addressed during the initial
prototyping phase, and would be better tackled in a subsequent project phase (e.g. a “wish list” feature
that could be advertised to developers during the community-building phase).
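To illustrate what "leveraging local data files" could look like in practice, the following is a minimal sketch only; implementation choices will be made during the project. It assumes the input is a classic tab-delimited OTU table of the kind exported from QIIME workflows, with a header line beginning "#OTU ID" and an optional trailing "taxonomy" column. The names parseOtuTable and OtuTable are hypothetical and are not part of any existing library.

```typescript
// Hypothetical types and function names, used for illustration only.
interface OtuRow {
  otuId: string;
  counts: number[]; // one count per sample, in the order of sampleIds
  taxonomy: string | null;
}

interface OtuTable {
  sampleIds: string[];
  rows: OtuRow[];
}

// Parse the text of a tab-delimited OTU table.
// Assumption: the header line starts with "#OTU ID"; other "#" lines are comments.
function parseOtuTable(text: string): OtuTable {
  const lines = text.split(/\r?\n/).filter((l) => l.trim().length > 0);
  const headerIndex = lines.findIndex((l) => l.startsWith("#OTU ID"));
  if (headerIndex === -1) {
    throw new Error("No '#OTU ID' header line found");
  }
  const header = lines[headerIndex].split("\t");
  const hasTaxonomy =
    header[header.length - 1].trim().toLowerCase() === "taxonomy";
  const sampleIds = header.slice(1, hasTaxonomy ? header.length - 1 : header.length);

  const rows = lines
    .slice(headerIndex + 1)
    .filter((l) => !l.startsWith("#"))
    .map((line): OtuRow => {
      const fields = line.split("\t");
      const counts = fields.slice(1, 1 + sampleIds.length).map((f) => Number(f));
      if (counts.length !== sampleIds.length || counts.some((c) => Number.isNaN(c))) {
        throw new Error(`Malformed row for OTU '${fields[0]}'`);
      }
      const taxonomy = hasTaxonomy ? fields[1 + sampleIds.length] ?? null : null;
      return { otuId: fields[0], counts, taxonomy };
    });

  return { sampleIds, rows };
}
```

In a browser, the text passed to such a function could come from a file the user selects on their own computer (for example via the File API), so no internet connection or web service would be involved.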
Image export will offer options for high-resolution graphics (e.g. vector formats such as SVG, which are
needed for journal submissions) in addition to raster image outputs such as PNG.
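As a rough illustration of vector export (a sketch under stated assumptions, not the design the project will necessarily adopt), a chart can be serialized as a standalone SVG string that remains sharp at any print resolution. The function name barChartSvg, its parameters, and the fixed styling are all hypothetical, and the sketch assumes non-negative values such as sequence counts.

```typescript
// Escape characters that are special in XML so labels cannot break the markup.
function escapeXml(s: string): string {
  return s
    .replace(/&/g, "&amp;")
    .replace(/</g, "&lt;")
    .replace(/>/g, "&gt;")
    .replace(/"/g, "&quot;");
}

// Hypothetical helper: build a simple bar chart as a standalone SVG document string.
// Assumes values are non-negative (e.g. per-sample sequence counts).
function barChartSvg(
  labels: string[],
  values: number[],
  width = 400,
  height = 200
): string {
  if (labels.length !== values.length) {
    throw new Error("labels and values must have equal length");
  }
  const labelSpace = 20; // pixels reserved under the bars for text labels
  // Scale bars relative to the largest value, with a floor of 1 to avoid dividing by zero.
  const max = Math.max(...values, 1);
  const slot = width / Math.max(values.length, 1);

  const bars = values.map((v, i) => {
    const h = (v / max) * (height - labelSpace);
    const x = i * slot;
    const y = height - labelSpace - h;
    return (
      `<rect x="${x}" y="${y}" width="${slot * 0.8}" height="${h}" fill="steelblue"/>` +
      `<text x="${x + slot * 0.4}" y="${height - 5}" font-size="10" text-anchor="middle">` +
      `${escapeXml(labels[i])}</text>`
    );
  });

  return (
    `<svg xmlns="http://www.w3.org/2000/svg" width="${width}" height="${height}" ` +
    `viewBox="0 0 ${width} ${height}">${bars.join("")}</svg>`
  );
}
```

The resulting string could be saved directly as an .svg file for journal submission; a PNG version would typically be produced by rasterizing the same drawing in the browser.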
Server based storage has been fully budgeted for within Pitch Interactive’s subcontract expenses.
Data collection, research questions, and use cases of visualization tools
Both reviewers and staff members enquired about the specific scientific questions that visualization tools
would be designed to answer. These questions will be narrowed down during the initial phase of the
project when we discuss user needs with case study participants. However, in Case Study 2 (proposal
page 30) we have included four examples of features that are not currently easy to implement in existing
data analysis tools. These feature requests can be thought of as the scientific questions driving data
analysis. For example, we could ask: 1) Can we obtain novel information about the spatial distribution of
species by concurrently visualizing geographic data? or 2) If we can easily filter data according to
taxonomy, can we discover hidden ecological patterns across sample sites? Such questions will be
generally applicable to high-throughput environmental data regardless of study system, since they are
common questions asked by scientists when designing a study. The scientific questions that we will
prioritize will represent the common themes that emerge when we talk to different case study participants.
We anticipate that many researchers will want to conduct similar types of data filtering and visualization,
so the needs of case study participants will be extendable to the greater community.
As described in the proposal, early stage discussions with other community members (case study
participants) will begin immediately at the outset of the project; we will collect research needs, identify
use-cases, and prioritize research questions during this process. We propose this restricted approach, as
opposed to the extensive interactions with developers suggested by Reviewer 3, since we are convinced
that this is in the best interests of the project (see the justification above for our small-team
approach).
The timeline presented in Figure 3 is intentionally simplified; our user testing will happen throughout the
implementation, as we liaise with case study participants on a regular basis from the outset of the project.
Visualization tools will be designed for downstream analysis – we will pick up where QIIME
functionality ends (e.g. after users have filtered and processed their data via standard workflows such as
those described in the QIIME overview tutorial: http://qiime.org/tutorials/tutorial.html). As Reviewer 3
stressed, we will not begin with raw data, since there are many robust tools that already exist for this
purpose (QIIME, MG-RAST, mothur, etc.), and users will be expected to have carried out these
workflows beforehand.
Software products and innovations that will result from this project
Several reviewers mentioned the lack of specific details regarding user interfaces, software prototypes and
the nature of scientific visualizations that will result from this proposal. We specifically avoided laying
out a detailed schematic for the final toolkit (apart from some suggestions of existing tools and one
mockup image). Our overarching goal is to build a visualization framework that improves research
efficiency and enables rapid peer-to-peer communication of ideas and findings. Thus, while mockups and
specific details of visualizations may look pretty in a grant proposal, they most likely will not represent
the most effective paths towards achieving these project goals. We emphasize here that the direction we
will take in software development cannot be determined until we start working with the data. This is the
normal development process employed by Pitch Interactive; both Wesley Grubbs and I believe that
Pitch’s track record has proven that this path is a highly successful one.
Response to Staff Comments:
Support from lab head Jonathan Eisen – A letter of support has now been included from lab head
Jonathan Eisen; as my current postdoctoral mentor, Jonathan has been kept fully informed of this Sloan
proposal, and has enthusiastically supported the development of all visualization tools outlined.
Postdoctoral salary support – The budget requests salary support over the full course of the 1-year
project; with a small team and short project timeline, this salary support will allow me to devote my full
attention to the development of visualization tools (whereas partial funding would require me to split my
time with other commitments). This scenario will ensure the successful development of software tools and
the production of scientific outputs (e.g. research papers, see below) that will facilitate my pursuit of an
academic career. Initially, my efforts will focus on contacting case study participants (phone
conversations, emails, user surveys), organizing their thoughts and requests, and conveying this
information to Pitch Interactive. I will also be required to process my own existing high-throughput
datasets (rRNA amplicon and metagenomic data) and deliver/explain these data to the Pitch Interactive
team. Developing and updating the project website will be another significant time commitment. As the
project matures, I will also be responsible for testing user interfaces and visualization tools, and
publicizing the software across the scientific community (blog posts, conference presentations, etc.). The
wide variety of tasks to be completed, coupled with the iterative nature of software development, will
require my full attention throughout the year-long project timeline.
Academic career plans – I am dedicated to obtaining a tenure-track faculty position at a major research
university, and I have carefully considered how this project fits in with my long-term career goals.
Publications are thus a primary consideration, and I plan on leading the development of several
manuscripts during this project (at minimum, a software paper describing the visualization framework,
and 1-2 manuscripts where visualization tools are used to dig deeper into my existing datasets to build a
fuller view of microbial eukaryote communities in understudied environments such as deep-sea abyssal
plains). Project tools will increase my own efficiency in conducting research and lead to traditional
scientific outputs. Spearheading this project will also allow me to gain valuable career expertise in
preparation for a future faculty role, in the form of grant management and project management. Scientific
visualization represents one of my long-term research interests; by leading this project I will continue to
develop a unique niche as an independent scientist (distinct from the research interests of my postdoctoral
mentors), providing further preparation for a future transition to an Assistant Professor position.
Demand and challenges for visualization software – With regard to staff member D’s comments, we
believe we have addressed many of these questions in the “Major Related Work in the Field” proposal
section (pages 3-7), outlining the issues with proprietary software and non-scalable nature of much
existing biological software. In my view, the private sector does not have a deep understanding of the
daily challenges that biologists face in data analysis (struggling with data formats and unwieldy software
tools is common in the Eisen lab); many proprietary tools are designed for clinical/medical applications
where the user base is larger and there is more potential to profit from the sale of software. Useful
software is not necessarily difficult to produce (scalable, forward-thinking technology exists, such as
WebGL), but rather, it is difficult to find funding to embark on close collaborations between users who
understand the data (biologists) and developers who are knowledgeable about the most cutting-edge
technology and capable of building well-engineered software (studios such as Pitch Interactive). We will
overcome this particular challenge given the nature of our project team. We will also design our
visualization framework with a “horizontal-looking” view – we will consciously build an awareness of
similar data types in other fields and disciplines (JSON files, etc.) that would be amenable to analysis
within our framework, and eventually allow users to input many different types of “big data” and
adopt the software for their specific research needs. Encouraging such diverse use would be an obvious
goal of any future “Stage 2” community building effort.
Final comments:
Finally, we appreciate the suggestions for tracking the impact and adoption of the visualization tools that
result from this project. We believe this is a very important aspect to consider when developing such a novel
biological tool, and we are committed to tracking both traditional metrics (peer-reviewed journal articles,
citations, etc.) and altmetrics (social media tracking, contributions to code repositories, website and tool
usage statistics, etc.). If funded, it will be critical to discuss and define a suite of metrics with the Sloan
Foundation at the outset of the project, for periodic assessment and evaluation of success.
As stated in our proposal, we aim to produce an innovative and functional software toolkit at the end of
year 1 that would be immediately useful for the scientific community (even in the absence of continued
funding for this project). We believe that we currently have all the resources in place to accomplish this
goal, and look forward to getting started.
We thank you again for the comments, and look forward to your response.
Regards,
Holly Bik
Postdoctoral Researcher, UC Davis