CSDH/SCHN Cyberinfrastructure Conversations Summary

This is a high-level summary of the outcome of a series of conversations regarding the CFI
Cyberinfrastructure Initiative among Canadian Digital Humanists. The conversations emerged
from CSDH/SCHN consultations that began in the Spring of 2014. The document tries to reflect
the priorities and areas of emphasis that have emerged from these discussions, and suggests
several areas of focus for broad-based collaborative cyberinfrastructure that would serve the
needs of many in the digital humanities research community. The diversity of work in the digital
humanities makes it impossible to mention every need, but in the view of the CSDH executive,
this summary covers a number of pressing needs from a range of research groups across the
country, and balances the need to serve existing researchers with that of expanding access to
important datasets and cyberinfrastructure to leading humanities researchers who are
experimenting with advanced research computing.
This summary is not meant to be prescriptive, nor does it mean to limit the number of proposals
going forward to CFI. It may be that the diversity of work and the range of needs in our
interdisciplinary community makes it impossible to work within the parameters of one or two
applications. CSDH will not coordinate a national consortium or proposal, but it hopes to have
provided a foundation for the formation of a broad-based consortium of institutions to
collaborate in one or more successful proposals to the CFI Cyberinfrastructure Initiative.
In the view of the Executive, this should bring the CSDH/SCHN role in the coordination of
discussion related to this initiative to an end. The cyberinfrastructure listserv will remain
available as a resource to allow those participating in applications to communicate and
coordinate as they wish, but our assumption is that the next stage will involve participants
affiliating themselves with these and likely other areas, forming smaller working groups
associated with those areas, and discussing institutional participation and leadership, in
preparation for the upcoming EOI stage.
We thank all of those who have contributed to these discussions, to the production of the SPARC
White Paper, and to the response to the draft call.
General points
Consultation within the digital humanities community has produced a wide range of ideas for
what kinds of investments in cyberinfrastructure might have the most value and impact. The
richness and diversity of Digital Humanities research is both a strength and a challenge in the
context of the CFI Cyberinfrastructure Initiative.
One recurring theme during the consultation has been the need for a well-supported shared
digital infrastructure that would provide access to data in conjunction with tools, platforms and
training materials that would serve both leading digital humanities scholars and scholars keen to
incorporate digital methods into their research. Within Canada, the obstacles to getting up and
running with a digital project are still too daunting for many scholars in the humanities. There
are many projects which have created large amounts of data and sophisticated tools but whose
data and tools remain inaccessible and isolated for structural reasons, especially lack of a shared
and powerful cyberinfrastructure enabling collaborative and innovative research.
There is the need for a user-friendly platform, perhaps a cloud environment that offers both
infrastructure-as-a-service and software-as-a-service, that will enable a much greater number of
humanities researchers to make the most of existing and emerging (i.e. grant-funded) digital
datasets through shared web-based infrastructure. Just-in-time resources could be managed
centrally (providing economies of scale) and coupled with the technical and domain expertise
support that humanities researchers need. Compute Canada’s server and account infrastructure
could form the foundation for a welcoming platform customized for digital humanists. Among
the benefits would be large-scale sharing of content and datasets, as well as the capacity to spin
up project-specific or event-specific instances of tools to mobilize or combine particular datasets.
There is widespread consensus that the digital humanities research community needs personnel
with sufficient technical expertise and depth of experience working with digital humanities
projects to work with Compute Canada staff and equipment. A distributed group of data analysts
(similar to those in Compute Canada and perhaps on a trajectory to become part of the CC
complement of analysts) would be positioned to promote standards, adapt and adopt existing
open-source software solutions, and install emergent tools on Compute Canada equipment to
create accessible datasets, tools, and services with web-based, user-friendly front ends. This
would dramatically increase the capacity of Compute Canada to serve the humanities.
Potential areas of focus
The following areas of focus have emerged from our consultations as addressing the needs of a
number of researchers from more than one research group. They are not an exhaustive representation of the
infrastructure requirements of the community, but represent areas where there is significant need
cutting across several communities of researchers. Of course, the needs could be grouped and
represented differently. Needs of the community are diverse, and listing these areas of focus
together here does not suggest that they should all be part of a single application, even though
there are synergies and overlap between these areas.
Big humanities data accessibility and interoperability:
Aggregating and making available large-scale humanities datasets in an environment that
respects copyright and other rights restrictions by providing data management services, and
improving methods of data fusion and interoperability related to large collections of datasets that
cannot be directly accessed by researchers, but whose owners (holders of large collections such
as the HathiTrust, providers of commercial datasets made available for research purposes, and
heritage institutions around the world) are willing to lend them for research purposes.
Infrastructure as a service: big data storage services
Software-as-a-service: data conversion and ingestion tools; text mining tools such as Hadoop,
Weka, MALLET; visualization tools such as Voyant
Research Rights Management tool
Ancillary work by community and partners: access policy development
Possible partners: HathiTrust, ARC, Canadiana, research libraries, OCUL, CARL,
Archive.org or Canadian contributors thereto; Gale-Cengage; Érudit, CRKN; Mukurtu;
heritage and memory institutions
Possible research projects: Text Mining the Novel; HistoryCrawler; Textual Communities;
Editing Modernism in Canada; Hispanic Baroque Project
Multi-disciplinary collaboration with heterogeneous cultural and textual heritage data:
A number of projects in Canada and beyond either work collaboratively with heterogeneous
datatypes or work with data in a context that could benefit from collaboration with others
working with other datatypes. This data might include 2D photography, moving image
collections, gaming, 3D point clouds and meshes, and so on. The applications these projects
might involve include Digital Libraries, Serious Games, Immersive environments, data
visualisation, crowdsourcing applications, and other research and mobilization activities.
A strategic group of multi-disciplinary, national and international partners working with cultural
and textual heritage data in these contexts could collaborate on using existing research datasets to
develop appropriate tools and protocols for collaboration, visualisation, and "publication" (in the
broad sense of "sharing with end users"). The infrastructure will help place Canadian researchers
at the world forefront of research into cultural evolution, and provide data to permit researchers
in Computer Science and mathematics to explore various ways to traverse graphs and to model
heterogeneous multi-network data structures belonging to different semantic domains but
intersecting at various points.
Possible partners: CulturePlex Lab
Related research projects: Visionary Cross Project; Hispanic Baroque Project
Linking humanities data:
Semantic web infrastructure that will allow humanities researchers to participate in the emergent
semantic web, and leverage the growing knowledge graph to answer key questions about cultural
and social change that can only be addressed through the “internet of things.” Provide tools to
allow researchers to link their data, annotate others’ data, manage, curate, and accept/incorporate
annotations by others of their data, to expose the data and annotations on the web, and to manage
the complex relationship between such annotations and the dynamic content to which they refer.
Infrastructure as a service: National triple-store linked to SPARQL end-point, visualization
tools, reasoners
Software as a service: tools for mobilizing common humanities data formats as linked open
data to provide greater accessibility, interoperability, and contribute scholarly knowledge
to the public good via the emerging semantic web; tools for ontology building, ontology
management (including dynamic ontology management) and linking; tools for harvesting
annotations for data corrections and commentary
Possible partners: Canadiana, ARC, InPho, CWRC
Related research projects: Textual Communities; Linked Modernisms; Coding Character;
MARGOT; Mariposa Folk Festival Archives project
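As a purely illustrative aside, the linking and annotation model described above can be sketched in a few lines of Python. This is a minimal in-memory sketch, not a proposal for how a national triple-store would be built: the TripleStore class, the "ex:" identifiers, and the annotation property names are all hypothetical, and a real deployment would presumably rely on an RDF store queried through a SPARQL end-point rather than on code like this.

```python
from collections import namedtuple

# A statement in the subject-predicate-object form used by linked data.
Triple = namedtuple("Triple", ["subject", "predicate", "obj"])


class TripleStore:
    """Hypothetical in-memory stand-in for a shared triple-store."""

    def __init__(self):
        self._triples = set()

    def add(self, subject, predicate, obj):
        self._triples.add(Triple(subject, predicate, obj))

    def match(self, subject=None, predicate=None, obj=None):
        """Return matching triples in sorted order; None is a wildcard."""
        return sorted(
            t for t in self._triples
            if (subject is None or t.subject == subject)
            and (predicate is None or t.predicate == predicate)
            and (obj is None or t.obj == obj)
        )


store = TripleStore()

# One researcher's data (all identifiers are invented for illustration).
store.add("ex:letter42", "ex:author", "ex:personA")
store.add("ex:letter42", "ex:mentions", "ex:placeB")

# Another researcher's annotation, stored as separate statements that
# point at the target, so the original data is never overwritten.
store.add("ex:annotation1", "ex:hasTarget", "ex:letter42")
store.add("ex:annotation1", "ex:hasBody", "Date should read 1843")
store.add("ex:annotation1", "ex:creator", "ex:personC")

# The data owner can list the annotations made on their item and then
# decide whether to accept or incorporate each one.
annotations = [
    t.subject
    for t in store.match(predicate="ex:hasTarget", obj="ex:letter42")
]
print(annotations)
```

Keeping annotations as separate statements that refer to their target is one way to let the underlying data change while the commentary on it remains traceable to its author.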
Dynamic data management:
Addresses the challenge that humanities datasets are almost always incomplete, in dialogue with
ongoing scholarly debate, and in need of updating and curation.
Allowing data to evolve and tracking that evolution is key to maintaining robust datasets that
support investigation of how patterns of cultural information change and affect human behavior and
knowledge. Producing environments that foster adherence to standards and support community
curation of humanities datasets is key to ensuring the quality and interoperability of humanities
datasets. A step change in humanities research capacity could be produced by leveraging and
making interoperable key components of existing software suites associated with major
humanities data producers, aggregators, and disseminators.
Platform-as-a-service: ability to spin up virtual machine sandboxes, configured with particular
sets of compatible tools, for particular research and research training purposes; this is
key to growing the number of active data managers and curators among the humanities
research community
Software as a service: commonly needed applications for discovery, analysis and
visualization, plus tools for enhancing and/or preparing existing data for research through
metadata enhancement, automated XML markup, OCR cleanup, NER and triple
extraction, and annotation, as well as tools for managing crowd-sourcing of transcription,
collation and annotation in the making of digital editions.
Possible partners: PKP, Érudit, PREEO lab; CWRC
Related research projects: Textual Communities; INKE; MARGOT; Orlando