GRADE
JISC DEVELOPMENT PROGRAMMES
Project Document Cover Sheet
FINAL REPORT

Project
Project Acronym: GRADE
Project ID:
Project Title: Scoping a Geospatial Repository for Academic Deposit and Extraction
Start Date: 1 June 2005
End Date: 30 April 2007
Lead Institution: EDINA
Project Director: David Medyckyj-Scott, EDINA
Project Manager & contact details: Anne Robertson, EDINA, The University of Edinburgh, Causewayside House, 160 Causewayside, Edinburgh EH9 1PR; tel: 0131 651 3874; email: [email protected]
Partner Institutions: University of Edinburgh, University of Southampton
Project Web URL: www.edina.ac.uk/projects/grade
Programme Name (and number): Digital Repositories Programme 2005-07
Programme Manager: Neil Jacobs

Document
Document Title: Final Report
Reporting Period: End of project
Author(s) & project role: Anne Robertson, Project Manager; James Reid, Project Advisor; David Medyckyj-Scott, Project Director
Date: November 2007
Filename: GRADEfinalreport.pdf
URL: http://edina.ac.uk/projects/grade/GRADEfinalreport.pdf
Access: Project and JISC internal / General dissemination

Document History
Version | Date | Comments
0.1 | 30/03/07 | Initial draft for Programme Manager
1.0 | 31/10/07 | Final draft for review
1.1 | 30/11/07 | Final version for submission to Programme Manager
1.2 | 31/01/08 | Final version with edits requested by Programme Manager

Table of Contents
Acknowledgements
Executive Summary
Background
Aims and Objectives
Methodology
Implementation
Outputs and Results
Outcomes
Conclusions
Implications
Recommendations
References
Appendices

Acknowledgements
The project was funded by JISC under its Digital Repositories Programme.
The project wishes to acknowledge the support of the staff of the EDINA National Data Centre as well as the valuable contributions from the following consortium and associate partners:

Dr Charlotte Waelde, AHRC Research Centre for Studies in Intellectual Property and Technology Law
Mags McGingley, AHRC Research Centre for Studies in Intellectual Property and Technology Law
Jonathon Gu, AHRC Research Centre for Studies in Intellectual Property and Technology Law
Pauline Simpson, University of Southampton
Dr Mark Brown, University of Southampton
Dr Mike Smith, Kingston University
Dr Robin Smith, University of Sheffield
Dr Anthony Beck, Leeds University
Andrea Frank, Cardiff University
Bob Abrahart, University of Nottingham
James Batcheller, University of Edinburgh
Nick Groome, Ordnance Survey
Elsa Joao, Strathclyde University
Paul Adderley, Strathclyde University
Bryan Lawrence, British Atmospheric Data Centre
Owen MacDonald, University of Edinburgh
Steven Morris, North Carolina State University
Femke Reitsma, University of Edinburgh
Mike Sanders, Plymouth University
Graham Vowles, Ordnance Survey
Dr Pragya Agarwal, University College London
Alex Mordue

Executive Summary
The aim of GRADE has been to assist in developing policy and best-practice strategies for geospatial data sharing and reuse by providing demonstrable evidence of how, why and under what circumstances geospatial data are (and may be) managed via repositories. The broad areas for investigation included:
- Digital Rights and IPR issues related to use and reuse of primary and derived data;
- the technical and cultural milieu in which a geospatial repository might operate in order to better understand the costs and benefits of various approaches; and
- the technical and cultural issues around informal data sharing mechanisms.
An holistic approach to understanding the issues around storing and accessing geospatial data within a media-centric digital repository was taken, with work divided into distinct areas. It was hoped that this approach would afford an understanding of the breadth of issues that would arise in a repository environment and allow investigation of approaches acceptable to the community prior to any future creation and deployment of an infrastructure for geospatial data sharing.

Key findings of the project are:
• Specific user requirements exist for the effective management of geospatial data within repositories.
• Clear evidence exists of grass roots support for a national geospatial data repository.
• Geospatial data sharing is commonplace. The most commonly used methods are email attachment, CD/DVD and ftp.
• There appears to be a lack of departmental geospatial data management policies.
• More contemporary informal file sharing practices, such as those based upon peer-to-peer networks, are not in common use.
• There is a perception that current licences for geospatial data are overly complex.
• There is a great deal of uncertainty among those creating geospatial data as part of research as to what they can legally do with derived data products.
• A legal argument exists that the sui generis database right is the appropriate law for geospatial data, and that derived data products can therefore legally be made available for reuse throughout the academic community.
• UK institutional repositories do not currently manage geospatial data and are not yet set up to do so; this is compounded by the fact that they have not yet been offered any geospatial data to manage. Some are willing to accept this media type.
• The mature consensus-based approach to standards for geospatial data, particularly geospatial metadata, ensures geospatial data repositories are well placed to interoperate with institutional repositories within the JISC Information Environment and beyond.
However, more work needs to be done on standards for geospatial data preservation and the packaging of geospatial data.

The GRADE project has been successful in demonstrating to members of the UK academic geospatial community the value of a trusted repository for geospatial data sharing and reuse. The project has highlighted that such a data repository is an essential component of the UK academic spatial data infrastructure and that its absence is unsatisfactory. The project has successfully demonstrated that the law on copyright and licensing for geospatial data in the UK is neither as well understood nor as clear-cut as it might be, and that there is scope for clarifying fundamental end user rights. The project has also put forward a legal argument that copyright does not subsist in geospatial data; rather, the European database directive is the appropriate law.

Background
Over recent years there has been a growing interest in improving the discovery and sharing of data amongst researchers and end users. Data sharing is important for two reasons.

Firstly, data sharing has historically been considered a hallmark of good scientific practice. Openness in the scientific process allows for the confirmation of research findings, especially through the replication of results. Data sharing also makes it possible for scientists to build on the work of others. It is with this in mind that scientific funding agencies often require grant recipients to share the data produced in their studies.

Secondly, new “big science” projects involve data that are collected and analysed by multiple people, institutions, and research sites. Sharing data in these cases becomes more than just the exchange of finalised data sets. E-Science initiatives in the United States and the European Union are looking at ways to allow scientists to collaborate on the creation and use of very large research data sets.

Despite this importance, however, sharing data is not easy.
Many researchers have discussed the problems underlying this seemingly simple process, e.g. Louis et al 2002. Problems include the willingness to share, locating data, and mechanisms for sharing and accessing data.

One special subset of data of particular concern is digital geospatial data¹. Geospatial data are typically expensive to create (born digitally or not) and are thus valuable assets which can be best exploited if the infrastructure for their discovery, sharing and reuse is put in place. However, the way geospatial data are delivered to, and used by, end users is contingent upon a number of factors. These include how and why they were created or acquired; the agreements in place to co-operate, share or exchange data between different institutions; conditions and procedures required to meet legal and economic requirements; how and where they are stored; and software and hardware requirements. Geospatial data are characterised by inherent complexity and profound variability (format, size, temporality, quality, value, access restrictions), which from a user’s perspective makes data sharing logistically complicated and burdensome.

EDINA’s endeavours in this area have focused largely on redistributing (licensing permitting) data files produced by individuals or small teams and have proceeded very much in an ad hoc and reactive fashion. The reasons for this are twofold:
i. there has to date been no rigorous appraisal of the viable alternatives for geospatial data sharing within the JISC Information Environment (IE) and there is no default mechanism by which such data can easily be deposited, discovered and shared;
ii. Intellectual Property Rights (IPR) and copyright issues are a serious impediment to acquiring, preserving and sharing geospatial data. That is, there are currently concerns and confusion (either phantom or legitimate) over the assertion of IPR and copyright, particularly where the data includes third party data, e.g. from Ordnance Survey or National Census Offices.

At the time of project proposal, April 2005, the UK was somewhat lagging behind developments emerging elsewhere such as those in the US, e.g. the Cornell University Geospatial Information Repository (CUGIR) (Westbrooks 2003) or the National Digital Information Infrastructure and Preservation Program (NDIIPP) which was specifically targeting the ‘Collection and Preservation of At-Risk Digital Geospatial Data’ (Morris 2005). However, as well as the general need to improve the sharing of geospatial data between researchers in the UK, various EU directives and regulations, e.g. the Water Framework Directive and Global Monitoring for Environment and Security (GMES), require that the issues of exchange, sharing, access and use of spatial data be addressed. For example, INSPIRE (http://www.ec-gis.org/inspire/) requires Member States to adopt measures for the sharing and reuse of spatial data sets between Public Authorities including Universities. Given these external stimuli, the scoping of a Geospatial Repository for Academic Deposit and Extraction (GRADE) seemed both necessary and timely.

1 ‘Geospatial’ is a term used to describe a class of data that has a geographic or spatial nature e.g. paper maps, electronic maps, geo-referenced imagery, satellite data, data stored within a Geographic Information System (GIS).

Aims and Objectives
The aim of GRADE has been to assist in developing policy and best-practice strategies for geospatial data sharing and reuse by providing demonstrable evidence of how, why and under what circumstances geospatial data are (and may be) managed via repositories.
The broad areas for investigation included:
- Digital Rights and IPR issues related to use and reuse of primary and derived data;
- the technical and cultural milieu in which a geospatial repository might operate in order to better understand the costs and benefits of various approaches; and
- the technical and cultural issues around informal data sharing mechanisms.

Additionally, the project has endeavoured to:
- establish and evaluate prototype demonstrator repositories based on formal models;
- establish the mutuality between media-centric (geospatial) repositories and Institutional repositories; and
- establish how metadata standards deployed in various repositories may meet the requirements of the geospatial community (that is the HFE/GoGeo profile of ISO 19115) and what the minimum quality thresholds for useable spatial metadata (use case driven) may be.

Specific objectives agreed at the start of the project included:
• Establish detailed repository use cases and user based evidence for the requirements and functionality of a repository capable of managing licensed geospatial assets.
• Investigate and identify the technical and cultural issues surrounding the storage, management and accessibility of geospatial information derived from licensed data within a centralised digital repository.
• Synthesise the lessons into best practice and advice for the wider community, particularly those concerned with the establishment and operation of research data or media-centric repositories.
• Investigate the extent of current informal data publication and sharing and the ‘grey economy’ of geospatial information sharing.
• Investigate the relationships and potential interfaces between informal and formal (including Institutional) repositories.
• Pilot an informal geospatial data repository demonstrator with Associate Partners to inform understanding of the cultural and technical issues involved.
• Articulate intended use cases of sharing derived geospatial data.
• Develop a clear understanding of digital rights pertaining to data created entirely by a user or research team.
• Develop a clear understanding of digital rights issues for derived data respecting the licensing conditions of the source data.
• Develop a conceptual and technical framework for resolving the rights management issues raised in relation to repositories.
• Determine to what extent Institutional repositories currently manage geospatial assets.
• Determine to what extent Institutional repositories could or should manage geospatial assets.
• Investigate the arguments for and against Institutional versus media-centric repositories for geospatial data assets.
• Investigate and assess the role of existing JISC sponsored terminology services with respect to repositories.
• Investigate and assess interoperability between geospatial data repositories and other types of services/repositories e.g. metadata harvesting into geo-spatial portals, linking e-prints with the datasets referred to in the articles.
• Investigate the linking of geospatial repositories with e-Science infrastructures.
• Investigate and assess the potential of evolving industry driven geospatial interoperability standards, specifically Open Geospatial Consortium/ISO 19100 series standards, as a means of interoperating with repositories within and outside of academia.

At project outset, an objective had been to scope possible technical architectures for informal data sharing and investigate how security issues may be addressed, synthesising findings into a preferred policy statement on the use of Informal Repositories. This objective had to be curtailed somewhat², and was replaced by a workshop-based investigation into various peer-to-peer based file sharing products.
Methodology
The GRADE project consortium, led by EDINA with the Arts and Humanities Research Council Research Centre for Studies into Intellectual Property and Technology Law and the National Oceanography Centre, University of Southampton, demonstrated strengths in the key areas of understanding geospatial data, digital rights and data centres. To complement this strong consortium, project associate partners included select UK geospatial academics and representatives from relevant Higher Education Academy Subject Centres³. In addition to the expert knowledge brought by consortium members and associate partners to the project, small pieces of work were commissioned from external specialists where necessary.

Overall Approach
An holistic approach to understanding the issues around storing and accessing geospatial data within a media-centric digital repository⁴ was adopted, with project work divided into the following work packages:

Work Package 1: Investigation into formal repositories for sharing. Responsibility: EDINA with input from associate partners
Work Package 2: Scoping the role of informal repositories. Responsibility: EDINA with input from associate partners
Work Package 3: Digital Rights Issues. Responsibility: Arts and Humanities Research Council Research Centre for Studies into Intellectual Property and Technology Law
Work Package 4: Scoping the role of institutional repositories for geospatial data. Responsibility: National Oceanography Centre, University of Southampton
Work Package 5: Interoperability. Responsibility: EDINA

Detailed Methodology
Within work package 1, the intention was not to establish a geospatial repository per se, but rather to use the development of a demonstrator repository to explore the range of technical and cultural issues involved and to gather evidence via user feedback.
An advantage of this approach was that it acted as an intelligence gathering activity whilst offering an immediate advantage to the UK Higher and Further Education (HFE) geospatial community⁵ by helping to identify the most salient operational aspects of a media-centric repository as opposed to a ‘conventional’ general purpose repository. It should also be noted that the repository open access movement was something of an unknown quantity to most users and an element of user education was (of necessity) required.

In order to evaluate the scope of informal repositories for data sharing, work package 2 used a questionnaire to gather initial evidence of existing trends in informal data sharing. A file sharing experiment and subsequent workshop then provided the opportunity to make an assessment of peer-to-peer facilities for data sharing.

A compendium of use cases provided exemplars of geospatial data derived in the course of research projects. This compendium formed reference material within work package 3 from which the legal team formed an impression of the digital rights issues faced by those creating geospatial data within UK HFE. Site visits to the compendium author and to the organisation responsible for base mapping of the UK supplemented the background material within the compendium. A review of existing academic licences for access to geospatial content also contributed to the informed views of those responsible for development of the licensing strategy within work package 3.

In work package 4, a survey distributed to institutional repository managers was the principal method for scoping the role of institutional repositories for geospatial data.

Work package 5 considered interoperability issues and took the form of a technology watch throughout project duration.

2 The original intention had been to leverage work within the SPIRE project looking at LionShare as an informal repository.
3 http://edina.ac.uk/projects/grade/team.html
4 This term is used to denote a managed resource collection focusing on a particular content type – in this instance geospatial resources.
5 The term ‘geospatial community’ is used in the broadest sense to describe those working with geospatial data. In UK HFE that includes teaching academics, researchers and students.

Implementation
Work Package 1
The focus of Work Package 1 was on the storage and management of, and access to, data derived from licensed data within repositories. The original proposal⁶ submitted to JISC outlined the concept of building a demonstrator repository as a means of engaging potential users. Prior to building the demonstrator repository (the demonstrator), a literature review⁷ was undertaken to identify and possibly leverage work already taking place in this area. The review confirmed that in terms of geospatial data and repositories, very little activity was taking place globally. However, one of the project partners had been investigating ingest procedures for their geospatial data archiving project⁸ based upon the DSpace⁹ open source repository. It seemed appropriate to leverage any experience this group had had with storing geospatial data within that particular repository software and so DSpace was chosen as the basis for the first version of the demonstrator.

In terms of who could access the demonstrator, the initial concept described in the project proposal was to repurpose geospatial data derived from licensed content already available via the Digimap¹⁰ and UKBORDERS¹¹ services. With access to the information within these services restricted by subscription and/or registration, the vision was to have a controlled environment for engaging with an array of users/creators of geospatial data.
However, it was realised fairly early on in the development of the demonstrator that the complexity of UKBORDERS licensing¹² made this too hard to implement and that access would be granted to Digimap registered users only. To ensure legitimate data sharing via the demonstrator, only registered Digimap users were allowed to register as depositing/retrieving users of the demonstrator¹³. A manual check was made of every potential registrant against the Digimap registration database.

The methodology chosen for development of the repository was one of short, iterative development cycles based upon user feedback¹⁴.

Demonstrator Version 1
The first release of the demonstrator offered out-of-the-box DSpace functionality with minor modifications¹⁵. The demonstrator was populated with a selection of seed datasets¹⁶ to enable invited participants to interact with the demonstrator. Associate partners were invited to interact with the demonstrator (search, retrieve, upload) and were contacted via email/telephone for their feedback.

Demonstrator Version 2
Feedback from the first version of the demonstrator led to a series of mostly cosmetic enhancements being made to the demonstrator, including the front page layout and the addition of images of industry-recognisable logos to assist repository browsing¹⁷. The second version of the demonstrator was accompanied by an online questionnaire (Appendix A) which respondents were asked to complete having interacted with the demonstrator. With this second version of the demonstrator, the aim was to reach a wider audience. Associate partners were asked to publicise the demonstrator throughout their departments/institutions. Associate partners were also provided with an expanded template (Appendix B), not only to provide feedback on the demonstrator but also to provide an overview of their current departmental data management practices.

6 http://edina.ac.uk/projects/grade/GRADE_Proposal.doc
7 http://edina.ac.uk/projects/grade/RepositoryReviewFinal.doc
8 North Carolina Geospatial Data Archiving Project http://www.lib.ncsu.edu/ncgdap/
9 www.dspace.org
10 Digimap is a collection of EDINA services that deliver maps and map data of Great Britain to UK tertiary education. Data is available to download for use in desktop GIS or as maps for printing, inclusion in reports etc.
11 UKBORDERS, funded by ESRC, provides digitised boundary datasets of the UK in a variety of GIS formats for the UK HEFE community to download and use.
12 UKBORDERS users are licensed to have access to particular data series only, not the entire collection of UKBORDERS data.
13 The repository was open to all to browse and review item metadata.
14 A full description is found at http://edina.ac.uk/projects/grade/status1.html
15 Athens Authentication added at the point of data deposit/download.
16 Mostly sample environmental data.

Demonstrator Version 3
Feedback gathered from the second version of the demonstrator led to the final set of customisations being made to the demonstrator, and it was on this customisation that the bulk of software engineering effort was spent (a summary of this work can be found at Appendix C). The key feature users wanted from a repository capable of managing geospatial data was support for location-based searching. To meet this requirement, software engineering effort could either have focused on enhancing the existing UK academic geospatial metadata discovery portal¹⁸ with data deposit/download modules or continued customising DSpace. It was felt that work with DSpace would probably be of most interest to the wider repository community. In terms of offering map-based searching within the demonstrator, the decision was taken to use an open source mapping engine¹⁹ as opposed to an EDINA-based mapping service, again for wider programme relevance.
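The location-based searching that users asked for can be illustrated with a simple bounding-box overlap test. The sketch below is not the GRADE or DSpace implementation, which is summarised in Appendix C; the `BBox` class, the `search_by_location` function and the sample extents are hypothetical names and values introduced purely for illustration. It assumes extents held in decimal degrees that do not cross the antimeridian.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BBox:
    """Geographic extent in decimal degrees (hypothetical structure)."""
    west: float
    south: float
    east: float
    north: float

    def intersects(self, other: "BBox") -> bool:
        # Two boxes overlap unless one lies wholly to one side of the other.
        return (self.west <= other.east and other.west <= self.east
                and self.south <= other.north and other.south <= self.north)


def search_by_location(extents: dict[str, BBox], query: BBox) -> list[str]:
    """Return, in alphabetical order, titles whose extent overlaps the query box."""
    return sorted(title for title, box in extents.items() if box.intersects(query))


# Illustrative extents only; these are not records from the demonstrator.
SAMPLE_EXTENTS = {
    "Edinburgh land use": BBox(-3.45, 55.85, -3.05, 56.00),
    "Southampton Water bathymetry": BBox(-1.55, 50.75, -1.25, 50.95),
}
```

In this sketch a query box drawn around central Edinburgh, for example `BBox(-3.3, 55.9, -3.1, 55.97)`, would match only the first sample item. A production repository would more plausibly delegate the test to a spatial index than scan every record.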
Open-source geospatial data translator libraries were used to (i) verify GIS data format and (ii) check file size during the customised deposit process²⁰. In its final form the demonstrator offered a repository supporting location-based searching, automatic generation of geographic extent during data deposit, file validation and most-recently-deposited alerts on the front page. A new questionnaire (Appendix D) to accompany this third iteration of the demonstrator was made available via the project web site. The response rate to this questionnaire was poor.

Work Package 2
Work package 2, assessing the use of informal repositories for geospatial data sharing, commenced with the distribution of an anonymous questionnaire aimed at gaining insight into current data sharing practices.

The GISRUK²¹ conference series is the UK’s national GIS research conference, established in 1993. In 2006 it was held in Nottingham, from 5th to 7th April. It seemed sensible to leverage the opportunity of a large national gathering of GI practitioners to carry out such an investigation. At this event, the GRADE project officer used breakout times to circulate amongst conference delegates to ask if they would be prepared to answer a few questions related to data sharing. Copies of the questionnaire (Appendix E) were also left in the conference computing lab for delegates to complete. The anonymous questionnaire was also made available on the GRADE project web site. 101 responses were received.

As well as providing an assessment of current informal data sharing practices, it was hoped that the results of the survey would provide direction on the most appropriate way forward for the development of an informal repository demonstrator. However, the survey showed that more progressive file sharing practices are not commonplace amongst those working with geospatial data.
The GRADE team liaised with members of the JISC-funded SPIRE project²² on the possibility of leveraging their work with LionShare software. However, this partnership did not progress due to difficulties encountered in setting up the LionShare software itself.

17 Images depicted software vendors’ logos to help users quickly distinguish certain data format types.
18 www.gogeo.ac.uk
19 Google Maps
20 For demonstrator purposes, users were restricted to depositing only certain geospatial data types and files of a maximum data size.
21 http://www.geo.ed.ac.uk/gisruk/gisruk.html
22 http://spire.conted.ox.ac.uk/cgi-bin/trac.cgi

Other alternatives including instant messaging and bitTorrent technologies were considered as the technical basis for an informal demonstrator. GeoChat is an extension to one of the most commonplace GIS software packages²³. Geo-chatting is described as offering the ability to exchange geometries and georeferenced imagery with online contacts via P2P technology. However, the extension proved difficult to set up. The GRADE team then considered some form of demonstrator leveraging geoTorrent.org²⁴. GeoTorrent.org was set up in the second half of 2005 as an initiative to facilitate the distribution of large geospatial datasets²⁵ with the goal of providing ‘fast peer-to-peer sharing of geospatial data’. Initial contact was made with the organisation responsible for running geoTorrent.org, but with their base in Australia, it seemed the logistics required to facilitate the work were beyond the scope of the informal demonstrator.

Finally, as an alternative to an informal demonstrator per se, an expert in participatory GIS²⁶ was commissioned to carry out a 2-day data-sharing workshop with a group of twelve participants. On the first day participants used two peer-to-peer file sharing applications to attempt to share geospatial data remotely from their workplace.
On day two, participants came together to discuss their experiences and make an assessment of the suitability of informal methods for geospatial data sharing compared to the formal demonstrator repository.

Work Package 3
The initial step within Work Package 3 (Digital Rights Issues) was to create a compendium of derived data exemplars for the legal team. A lecturer in GeoSciences at Kingston University with an interest in the legal intricacies of derived data was commissioned to gather this content. Each use case within the compendium was presented in a standard way via a use case template (Appendix F). The use case template was based upon one provided by the digital rights working group of the leading international geo-standards organisation, the Open Geospatial Consortium (OGC). Part of the sustainability strategy for the project was to ensure knowledge exchange with the OGC geoDRM working group. Therefore it seemed appropriate to use their template to describe derived data use cases.

On completion, the use case compendium was passed on to the legal team. The legal team sought clarification on certain data processing techniques described in the compendium and undertook a site visit to the compendium author. During this information-gathering phase, the legal team also invested time reviewing current HEFCE licences²⁷ for geospatial data access, for example, via EDINA’s Digimap service²⁸. In December 2005, Dr Waelde visited Ordnance Survey to gain insight into that particular organisation’s internal processes for data creation.

Having amassed sufficient background material, the legal team commenced their study into developing a legal framework. In June 2006, initial findings were shared with the OGC geoDRM working group for their discussion and feedback.
In October 2006, final findings were shared with GRADE project partners at the GRADE all-partner mid-project meeting in Edinburgh (http://edina.ac.uk/projects/grade/meetings/301006.html) for discussion and feedback. The report was finalised with project partners’ comments.

Work Package 4
Project work aimed at scoping the role of institutional repositories for geospatial data commenced with the design of questions for a web-based survey. The questionnaire (Appendix G) was posted on the GRADE project web site during early 2006.

Project associate partners, on behalf of their institutions, were invited to complete the survey initially. Members of the wider repository network were also invited to complete the questionnaire. This was achieved by email canvassing of individual repository managers (details sourced using the Registry of Open Access Repositories²⁹), as well as making several calls on [email protected] and another to the SHERPA³⁰ project list.

23 ESRI’s (www.esri.com) ArcGIS is arguably the most popular desktop GIS used within academia.
24 www.geotorrent.org
25 Geospatial datasets, particularly image datasets, can be very large. A single compressed image can be many gigabytes in size.
26 Community-based management of spatial information.
27 http://edina.ac.uk/digimap/terms.shtml
28 Digimap is an EDINA service that delivers Ordnance Survey map data to UKHFE.
29 http://roar.eprints.org/

In October 2006, survey findings were presented to GRADE project partners at the GRADE all-partner mid-project meeting in Edinburgh³¹. Project partners were also asked to contribute to the SWOT analysis of institutional v. media-centric repositories at this meeting. Work package 4 was completed in January 2007.

Work Package 5
Work investigating how media-centric repositories might interoperate with external repositories and how interaction with the JISC IE may best be achieved took the form of a technology watch for the duration of the project.
The work was initially commissioned from an external resource at University College London. However, due to extenuating circumstances, it was completed by EDINA staff.

Outputs and Results

Work Package 1 – Formal media repositories for sharing

The aim of work package 1 was to generate user-based evidence of the requirements for a repository capable of managing licensed geospatial data, via the development of a demonstrator repository. Before considering the specific functional requirements of such a repository, it is worth reporting on the status of current departmental guidelines/policies on data sharing reported by associate partners. During the information-gathering phase of the second version of the demonstrator, associate partners were asked to provide an insight into their current departmental data management policies (Appendix H brings together each individual’s feedback). Similar issues were raised by each of the associate partners and can be summarised as:
• a lack of policies/guidelines for archiving and ongoing access to research data;
• poor metadata practices and a lack of policy on metadata creation;
• loss of knowledge/data, particularly work completed by postgraduates;
• data commonly stored on researchers’ personal computers;
• data curation is seen as the responsibility of the individual/group who carried out the research;
• for an interested person to access data, they need to contact the data creator directly;
• sharing data is based around people networks;
• one of the key barriers to sharing relates to concerns over breaking licence conditions;
• lack of tie-in with institutional practices, e.g. archives, asset management.
It is worth noting that these practices are not unique to those working with geospatial data. Indeed, many of the issues raised are highlighted within the recently published JISC-funded ‘Dealing with Data’ [32] (Lyon 2007) report, which has as one of its recommendations the need to “develop a Data Audit Framework to enable all Universities and colleges to carry out an audit of departmental data collections, awareness, policies and practice”.

Given this lack of formal infrastructure for data sharing, it is perhaps not surprising that an outcome from this work package has been the level of use generated by the demonstrator repository [33]. At the end of November 2007, the demonstrator had over 160 data deposits and over 170 registered users. An equivalent number of potential users had been refused access to deposit or download data from the repository because they were not registered Digimap users. These ‘rejected’ users are from a variety of sectors including non-Digimap registered UK institutions, international academic institutions, UK government, UK private sector and individuals. With EDINA still receiving weekly registration requests from interested users, options for sustaining the repository are being considered.

In the meantime, however, the key aim of work package 1 was to identify those functions necessary for a repository capable of managing geospatial data. The three cycles of development culminated in a list of key functions (both geo-specific and more general) being identified by associate partners and their invited reviewers [34].

[30] www.sherpa.ac.uk
[31] http://edina.ac.uk/projects/grade/ppt/Simpson_GRADEProject%20Meeting30Oct06.ppt
[32] http://www.jisc.ac.uk/whatwedo/programmes/programme_digital_repositories/project_dealing_with_data.aspx
[33] http://gradedemo.edina.ac.uk/dspace/index.jsp
Geo-specific user requirements, by repository function

Search
• Location-based searching: by drawing a box on a map, by entering a place name, by clicking an identifiable area on a map (e.g. county boundary), by entering a postcode.
• Location-based searching to be prominent on the repository ‘home page’.
• Location-based searching needs to offer high quality large scale data for locating oneself (the maps available within Google Maps, for example, are not sufficient).
• Geographic breakdown of collections within the repository, e.g. UK, Europe, World collections, and thus refined searching within a geographic collection (linked to this is the option to subscribe to RSS feeds on these more fine-grained collections).
• Searching via geospatial data type.
• While Dublin Core metadata is sufficient in the first instance for a user to discover a relevant data set, there need to be a few supplementary geo-specific ‘quick visuals’, including industry-recognisable logos for GIS data format and a small pictorial thumbnail indicating the actual geographic extent of the dataset.

Deposit
• Since storing the geographic extent of a dataset is necessary to enable location-based searching, the repository needs to be able to automatically extract the geographic extent of the dataset directly from the dataset as part of the upload process.
• Ability to deposit bundled ‘project’ data, that is, ‘project data’ that comprises several datasets as one deposit item.
• A standards-compliant geospatial metadata record (ISO 19115 profile [35]) should be part of the deposit item. Metadata of this quality is required to make a judgement on fitness for purpose and for confident data reuse.

Download
• On download, data needs to be in a form ready for immediate use in GIS software and other geospatial application software.

Policy
• No restrictions on GIS data format.
• No restrictions on geographic extent of data.
• Rules for naming data deposits (item title to include place name, country, scale).

General user requirements
• Search by institution or research groups
• Improved date searching, especially date range
• Quality rating system
• Subject searching assisted by controlled keywords
• Fast and simple submit process
• Automatic ingest of multiple files
• Ability to deal with data versioning
• Guidance on licensing
• Quick download
• Ability to inform user of uncompressed data set size prior to download
• Single authentication system
• Rules for dates
• Platform independent [36]
• Sustainable and trusted for long term and guaranteed access [37]

Table 1: Geo-specific and general user requirements for a repository capable of managing geospatial data.

[34] Detailed in the report http://edina.ac.uk/projects/grade/FormalGISRepositoryFeedbackFinal.pdf
[35] ISO19115 (www.tc211.org/metadata) is the international geospatial metadata standard. The UK academic profile of ISO19115 is promoted as the geospatial metadata standard to be used within UK HFE.
[36] For example, the current demonstrator expected data to be uploaded as a zip file; users of unix-based systems indicated that this restricted them from uploading.
[37] Of interest, survey respondents indicated that since the demonstrator repository was run by EDINA (EDINA run a suite of subscription geospatial data download/online mapping services for UK HFE), a degree of trust was already established in the repository even though it was only a demonstrator.
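Two of the geo-specific requirements in Table 1, extracting a dataset's geographic extent at upload and searching by a box drawn on a map, can be illustrated with a minimal Python sketch. It is not the demonstrator's implementation: the `Deposit` record, the function names and the (min_x, min_y, max_x, max_y) box layout are all assumptions made for illustration, and a real repository would read coordinates from GIS formats and handle coordinate reference systems.

```python
from dataclasses import dataclass
from typing import Iterable, List, Tuple

# Assumed layout: (min_x, min_y, max_x, max_y) in one shared coordinate system.
BBox = Tuple[float, float, float, float]


@dataclass
class Deposit:
    """Hypothetical repository item: a title plus its stored geographic extent."""
    title: str
    extent: BBox


def extent_of(points: Iterable[Tuple[float, float]]) -> BBox:
    """Return the bounding box of the (x, y) coordinates read from a dataset."""
    xs, ys = zip(*points)
    return (min(xs), min(ys), max(xs), max(ys))


def intersects(a: BBox, b: BBox) -> bool:
    """True when two boxes overlap or touch."""
    return a[0] <= b[2] and b[0] <= a[2] and a[1] <= b[3] and b[1] <= a[3]


def search_by_box(deposits: List[Deposit], query: BBox) -> List[Deposit]:
    """Return the deposits whose stored extent intersects the user's box."""
    return [d for d in deposits if intersects(d.extent, query)]
```

In this sketch the extent is computed once, when the item is deposited, and stored with it, so a map search only compares boxes and never has to open the underlying data files.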
Work Package 2 – Scoping the role of informal repositories

In order to scope the role of informal repositories for geospatial data sharing, work package 2 initially assessed current data sharing practices with information gathered from (i) the anonymous survey distributed at GISRUK and on the GRADE project web site, (ii) the 2-day workshop on informal data sharing and (iii) associate partners. Results [38] confirmed that data sharing is commonplace: 90% of those responding to the anonymous survey could recall a time they had shared data, and indeed many qualified this by saying they shared data often. Data sharing is predominantly between colleagues and amongst project partners. The survey also asked respondents to identify key data sharing issues (including barriers to sharing and what would make data sharing easier). Appendix I contains full results for these questions. However, it is relevant to highlight that confusion over licence restrictions and the lack of a national data repository to facilitate data sharing were key responses to both questions. These issues were raised again by workshop participants, as described below.

Informal repositories per se are not being used. Rather, conventional techniques are the preferred methods for sharing data, the most popular being email attachment and CD/DVD. More up-to-date file sharing techniques are not commonly used. Only a few were described within the anonymous questionnaire responses, including making data available as standards-compliant web services and sharing data via www.yousendit.com. Associate partner sites and workshop attendees reported using shared network drives, FTP, USB memory sticks, personal URLs, WebCT, unix links, external hard drives and Blackboard for data sharing.
Since there was no evidence of contemporary file-sharing software being widely used, it was decided the workshop should focus on a brief experiment using fairly new but readily available Internet-based technologies as ‘informal repositories’, mainly utilising peer-to-peer features [39]. Following tests of several products (with varied suitability), two were chosen for their contrasting approaches: one a browser plug-in, the other a desktop client. AllPeers [40], a free plug-in for Mozilla’s Firefox web browser, allows the user to invite friends to various groups so that data can be shared using a peer-to-peer approach. Exaroom [41], a portal-based sharing environment, involves the user connecting to a web page to configure settings and invite friends. It also involves installing a client that installs “Ground Control”, Exaroom’s file-sharing software.

On day one of the experiment, workshop participants attempted to share data remotely using AllPeers and Exaroom. For reasons described in the workshop report, the ‘chained’ sharing of data amongst a group of twelve individuals was unsuccessful. However, the desired outcome of enabling participants to contribute to the discussion of informal approaches to data sharing was achieved. Participants identified broad problems with the types of approaches to data sharing they had experimented with, including the amount of time taken to set up the software (especially in a situation with a real need to share data), complicated interfaces, the need to add friends, the ‘massive searching’ that would be needed for sharing data in a real setting, the need to communicate and negotiate sharing (that is, having to tell the other person the files are available), and privacy concerns.
More technical participants saw potential in such approaches, including solving the problem of large files (for email), solving the lack of immediacy for CD-ROM, the portability of a browser-based approach, enabling collaboration, more interactive/in-your-face sharing, potential use for browsing other people’s data, and the potential to reduce time spent tackling data requests from students. To expand upon these problems/potentials, participants came up with a SWOT analysis to summarise their views on these informal approaches to data sharing:

Strengths
• ease of installation
• ease of registration (through one login rather than multiple links)
• “ease of use”, the user interface of Exaroom being better than other examples
• “less data corruption” in the sharing process
• that the tools could be “always on”
• that AllPeers was available on any machine
• Speed and immediate access to data
• Control over access to data for groups or individual friends
• Good community building tools
• Able to assist in negotiating the technical aspects of sharing data
• Potential to assist in setting up project groups
• Potential to offer an information resource

Weaknesses
• “Too many features not really needed”
• Exaroom being “client based”
• Exaroom being both “…invasive and [requiring] the DotNet extra download”
• “Privacy” and “possible firewall issues”
• A need for administrator login for both installation and use
• Beta status
• Poor user interface
• Poor functionality (one participant needed to chat to many friends during the workshop, not just one-to-one)
• Need to have pc on 24x7
• Need to be logged on
• Require both sharers to be pre-registered as friends
• Require both sharers to be logged in
• Navigating across many friends to actually ‘find’ data
• Limited organization of data
• Not interoperable

Opportunities
• Aiding collaboration
• “Quick negotiation” of access to data: “A: ‘Have you got hospital locations?’ B: ‘Yeah, in my Exaroom folder’”
• Developing “data sharing communities” to expand “the personal network of GI users”
• Projects where a small group needs to share datasets
• “Inter- and intra-institute collaboration”, noting the need to share data within and between organisations
• “Working with non-academic partners on projects” and “non-academic access” to datasets

Threats
• the software was “untested”
• a dependence “…on 3rd party software”
• that alternatives existed to Exaroom and AllPeers
• that “personal preferences” would impact on selection of alternatives
• that better or more embedded examples may exist “…with more GI-related functionality”
• “… other systems [were] not compatible”
• the limitations of having an infrastructure which “relies on individuals to run and maintain [it]”
• the administrative overhead for valuable data assets could be prohibitive in some local settings
• system security
• “invasive” nature of the systems and “surveillance?” (especially when it offered too much scrutiny of colleagues’ work)
• the need to “trust” software providers who could, potentially, have access to the user’s data
• the loss of “… control over your data” once it is made accessible
• The “slow take-up” of technology and sharing practices, impacting on the culture of sharing in the academic community
• “Will the community really grow?”, limited number of users, limited resources, technologies’ difficulty of use
• The lack of development of best practice, with the current culture and “academics!”

Table 2: SWOT analysis of peer-to-peer file sharing applications for informal data sharing

[38] Survey results are described in full detail at http://edina.ac.uk/projects/grade/InformalQanalysisReport.pdf
[39] The workshop is described in detail at http://edina.ac.uk/projects/grade/Grade_reportRSSv2.pdf
[40] www.AllPeers.com
[41] www.exaroom.com/beta

During the first day of the experiment, participants had also been asked to interact with the formal demonstrator repository (developed within work package 1) as a comparable facility for data sharing. During discussions on the second day of the workshop, participants developed the following SWOT analysis of the formal demonstrator repository.

Strengths
• Quick and easy/straightforward to use
• “reliable” and “robust”
• “permanent” and “dedicated” resource in a “single location” (aiding “collaboration”)
• “make[s] data available” as part of an “always-on resource”, including, otherwise, “orphan datasets”
• “server based” using “single registration”
• the ease of adding and searching for datasets
• “reduced effort” and/or “removal of effort” in creating “shared” datasets
• GI centric / “specifically established to host UK HE [Higher Education] GI”
• Moderated content (i.e. datasets from inside or outside the UK) by a “known/trusted resource” (i.e. administered by EDINA)

Weaknesses
• a lack of “links between submissions” and users not being able to “…sign up to topics/themes”
• “lacks communication/community tools” present in the informal methods
• Time consuming to use
• Restrictions on GIS data types
• Limiting repository content to certain types of ‘GI’
• The repository’s underlying software “… cannot read GI in native formats” as .zip files need to be submitted
• General concerns about the management of “data quality issues”, “quality assurance” and “quality control” of data and metadata
• Unknown ‘meaning’ of datasets, again relating to the desire for formalised or community-mediated “semantics”
• Problems of “repeated metadata creation”
• The lack of opportunity to “… use previously created metadata”, possibly using the metadata already created in ESRI’s ArcCatalog product

Opportunities
• the community’s “… most common data delivery/sharing method”
• an experimental base for additional applications such as the “implementation of a powerful GI search engine”
• a chance to develop best practice in this arena
• data quality mediation through “Definitive datasets augmenting ‘official’ representations and descriptions”
• making data accessible to increase quality through reuse (alongside the reporting of errors, omissions and potential application areas)
• aid “collaborative research”/“collaboration”, with examples including “project based file share hosting”
• a “Potential major resource for e-science”
• “Integration” [of the GRADE demonstrator repository] with other EDINA GI collections, including the Digimap service
• “links with other metadata catalogues (Go-Geo!)”
• resources such as a “Teaching repository; Research repository”
• Once developed, considering the resource as part of a national spatial data infrastructure [42]

Threats
• potential “underuse” of a resource
• possible concerns about unequal levels of contribution
• data ‘free-loaders’
• the “need to be certified”
• the need to “… address community concerns of digital rights”
• “… licensing issues – perceived or real”
• “…data and software licensing” restrictions
• The sustainability of the resource linked to “funding”, “maintenance [costs]”, current “Project Shelf Life”
• Metadata, limited coverage, the need to adopt “Metadata Standards”, the need to utilise metadata already created in software such as ArcCatalog
• Data volumes (“Does the repository have the right tools to cope with increased data volumes?”)
• That such demonstrator resources may not be seen as “proper” and not worth participating in if they are not a mature and established resource
• that the resource could be “limited by access rights to academics only”
• the “Core purpose [of repositories] may not reflect research practice in terms of the [current] culture of sharing GI”
• the wider “need [for] a change in attitude” to sharing GI
• ‘cultural inertia’ issues, including “personal preference” and what is seen as “appropriate” methods to share GI
• Alternative technologies/resources that may exist in competition to the GRADE demonstrator repository (“There are already other geo-repositories in the UK e.g. NERC”)

Table 3: SWOT analysis of role of geospatial data repository for formal data sharing.

A series of recommendations concluded this work package:
1. There is a need to weigh the GIS community’s concerns and possible misconceptions about licensing restrictions against the need to share data. Certainly, the work carried out in WP3 of GRADE relating to digital rights and licensed users should help in what will need to be an educational and participatory process.
2. Currently, GIS users appear, in general, to have a sufficient mixture of approaches for sharing data. Informal repositories could have some role in geospatial data sharing for small group activities, but they appear to have limited utility as a distributed national resource.
More work is needed to explore and monitor the uptake and role of informal repositories in small group settings and how they could contribute to a wider infrastructure.
3. The development of a national geospatial data repository was well supported by the study’s participants and should be promoted. As such, there is a need to identify champions in local settings to promote and encourage its use. Regional training sessions and promotional resources should be developed to fit the varying audiences that are likely to emerge in both research and teaching.
4. There is a need to develop links between a national geospatial data repository and other EDINA resources. In particular, consideration could be given to the core/framework datasets held in, for example, Digimap and UKBORDERS, and to storage issues that may otherwise emerge through duplicated geometry. Such an approach would also aid comparisons of the spatial extent of topics, both in terms of the patchwork of geographical coverage of a given theme and the coverage of themes at a given location.
5. Similarly, the establishment of the GRADE repository should be considered around the role of Go-Geo!, where EDINA should envisage Go-Geo! as a geospatial ‘one-stop-shop’ for UK academia. In particular, the service should help users to find, evaluate and re-use geospatial assets held within a national facility. Go-Geo! should also be capable of searching and accessing geospatial data in any repository capable of managing licensed geospatial data, as well as exploring the provision of the social/community tools seen to be needed to negotiate, discuss and, potentially, instil ‘trust’ around GI, towards in silico research.
6. As such, there is a need to consider the metadata that this should involve, drawing on information that users have already provided in existing software. In addition, there is a need to raise awareness of formal metadata for both sharing and personal data maintenance/curation, something requiring greater promotion at training and grassroots levels, beyond just the GIS community.
7. The creation of an accessible permanent central geospatial data repository offers opportunities to link to wider data sources. The relationships between a national central repository and formal institutional repositories, alongside grey-sharing and the fostering of orphan datasets, should be explored in terms of OGC compliant registries and the means to draw neglected data into a place where it can be stored, re-used, validated and valued.
8. The creation of a successful national geospatial data repository offers opportunities to develop Geographical Information Retrieval tools capable of searching and making sense of both a plethora of data and, through middleware, the contents of massive datasets. Such ideas also link to the challenges of currently non-standard forms of GI being considered as part of the semantic web, and to how (community-mediated) semantic and technical interoperability play a role in having users readily access the data they need.
9. Consideration should be given to current research practice and the desire to develop a wider infrastructure in the context of in silico research and national spatial data infrastructures, particularly through the conditions that would need to be put in place to allow authorised access by public sector colleagues to metadata and dataset ‘discussion’ and, if appropriate, allow authenticated access to the actual datasets, especially given the potential roles of academia in activities such as INSPIRE.
10. Such activity should be recognised as an opportunity to continue research and exploration involving longitudinal and in situ qualitative approaches to better understand the demands of a variety of GIS users, in relation to concerns about trust relating to their data and that of others, and the development of appropriate theoretical models to understand the social and technical context and role that (academic) geospatial data repositories play in Europe’s wider ‘e-society’.

[42] A spatial data infrastructure (SDI) is generally viewed as an umbrella of policies, standards, and procedures under which organisations and technologies interact to foster more efficient use, management and production of geospatial data.

Work Package 3 – Digital Rights Issues

A work package on legal issues was included within the project because it was generally believed that the law on copyright and licensing for geospatial data in the UK is neither as well understood nor as clear-cut as it might be, and that there is scope for clarifying fundamental end user rights.

The provision of eleven (geospatial) use-case scenarios [43], describing the main actors, stakeholders, data sets and outputs, gave a basis for investigating the copyright issues surrounding the use and dissemination of derived data sets. In particular, the importance of the inheritance of copyright licensing for derived data sets was established. The interaction of a variety of stakeholders with varying implicit and explicit licensing conditions makes precise copyright boundaries difficult to establish. The requirement to adhere to the most severe licensing restriction poses significant problems for the establishment of a data repository.

[43] http://edina.ac.uk/projects/grade/usecasecompendium.pdf

The report was principally based around the presentation of eleven use-case examples from a variety of geographic disciplines, using national and international data sets, both from third parties and collected by the individual authors. The use cases were intended to outline how a project uses a variety of input data sets to perform a research task in order to produce an output data set.
It is assumed that one of the primary outcomes of such a project is the production of a new data set that may then be lodged with a digital data repository. Each use case identified key stakeholders, the goals of those stakeholders and the data processing techniques, described the end product (either the data set or some presentation of the data) and summarised possible licensing restrictions for reuse of the data.

The use cases provided a sound basis for the legal team to develop an understanding of the geospatial data licence landscape, leading to the legal argument developed as the key outcome of work package 3. The argument centres on the principle that there is no legal grounding for the continued strict limitations on the (re)use of digital geospatial data within research and teaching contexts. It is argued that this is because the common assumption that geospatial data are protected by UK copyright law may be founded on fundamental misconceptions of copyright and database law [44]. The same conclusion was reached quite separately by Janssen and Dumortier in their paper on the protection of maps and spatial databases in Europe and the US (Janssen and Dumortier, 2006).

This conclusion has far-reaching consequences for geospatial dataset (re)use in the UK, and could allow a national geospatial repository service to operate in a less constrained fashion than might otherwise prevail. The JISC IPR Consultancy [45] has acknowledged the work in their paper on licensing issues in derived data (Korn, Oppenheim and Duncan, 2007) and has recommended that JISC commission an in-depth study of IPR and licensing of derived data, building upon GRADE work [46]. To date, negotiations with content providers have not been based on the conclusions of the GRADE work on digital rights, though this is a complex area and we understand that discussions are continuing.
However, if at some point database law is indeed deemed the most appropriate legal interpretative context, then the following assumptions form the basis of a licensing framework for repositories managing licensed geospatial data assets:
• A repository capable of managing licensed geospatial assets (a repository) will be used in the HFE community for consultation, non-commercial research and teaching purposes.
• Geospatial data deposited in a repository will come from a variety of sources and is likely to have passed through various stages of manipulation.
• The researchers who deposit data in a repository will either have created the data themselves, used data which are not subject to re-use restrictions and/or will be lawful users of the geospatial databases from which extractions of geospatial data are made.
• So long as they are lawful users, those researchers are at liberty to extract an insubstantial amount from the contents of the source database for any purpose (including for deposit in a repository), and where the data are used for non-commercial research and illustration for teaching they may extract a substantial part from the source database.
• Given that only a limited number of researchers will be interested in any particular part of a geospatial database, making the extractions available to other researchers for the purposes of non-commercial research or illustration for teaching would seem not to infringe the re-utilisation right, even where those extractions are a substantial part of the source database.
• Data deposited by researchers may amount to only insubstantial parts of the source database, but if repeated deposits from the same source database are made, then these may, in total, amount to a substantial part of the source database. The Database Directive shields those who use them for the purposes of non-commercial research and illustration for teaching.
44 http://edina.ac.uk/projects/grade/gradeDigitalRightsIssues.pdf http://www.jisc.ac.uk/whatwedo/projects/ipr/iprconsultancy.aspx 46 It is doubtful how much more a thorough investigation can be built upon the concise well-developed legal discussion put th forward in the GRADE report. The work has widely distributed with an article appearing in the 5 April 2007 edition of Technology Guardian as part of The Guardians Free Our Data campaign http://www.guardian.co.uk/technology/2007/apr/05/freeourdata.intellectualproperty 45 18 • • • • • • • • Where a researcher or teacher extracts a substantial part of the source database to use for the permitted purposes then the source must be attributed. [This would be easier to manage if there is an obligation to attribute source no matter the size of the deposit] Where extractions are to be made, a standard access management technology, such as Shibboleth (via the UK Access Management Federation for Education and Research) or Athens would help to ensure that substantial parts of the source databases are used only within HFE and for the permitted purposes A repository facilitates the work of researchers and teachers but is not itself a lawful user of any of the source databases By depositing the contents in a repository, the repository does not thereby ‘extract’ contents from the original database within the meaning of the Directive By holding the geospatial data deposited by researchers, a repository thereby re-utilises the data within the meaning of the Directive (i.e. it makes the data available to the public through on-line transmission) Where only an insubstantial part of the source database is made available, this would not infringe the re-utilisation right. 
Where a repository makes a substantial part of the source database available to the public, whether the re-utilisation right is infringed will depend upon whether making it available to a limited number of researchers and teachers is considered as making it available ‘to the public’. Consultation of a database does not infringe the sui generis right. Therefore there would be no difficulty in having a repository ‘open’ to be consulted by all. Work Package 4 – Scoping the role of institutional repositories for geospatial data Work package 4 aimed to make an assessment of the role of institutional repositories for managing 47 geospatial data . The number of responses received to the web-based survey was poor despite several call exercises being made. Although aimed at UK institutional repositories, of the 35 responses received, several (2) were from Europe and elsewhere (5) and from institutional repositories and data centres (2). Survey results revealed that there are currently no UK geospatial subject-specific/community repositories in operation and that although there are growing numbers of institutional repositories, none of them currently manage any geospatial content (and would not be capable of doing so). It probably also has something to do with the evolving repository landscape. The report points to survey replies with many repositories managers admitting they are dealing first with publication output since with support from the Open Access Movement they are likely to achieve some measure of success in obtaining content. At the time of the report it was also suggested that it could also be resultant on the fact that the open source IR softwares do not have a ready made metadata schema to accommodate datasets. It was suggested that if the IR software vendors developed a Dataset plug-in it is possible . 
that Institutional Repositories would already have been challenged to manage them. JISC are now funding development of several application profiles to support repository searching of a variety of media including geospatial data.

The survey also revealed that whilst (some) institutional repositories (IRs) would be willing to accept geospatial data, little has been offered to date. Experience of geospatial data handling within IRs is thus largely non-existent and, given the specialisms involved (as detailed in the findings from the formal demonstrator user survey), institutional repositories may not be ideally suited to manage the data outputs of the geospatial research community. Indeed, where designated data centres exist it is unlikely that Institutional Repositories will be the archive of choice for datasets, but there are many disciplines where no formal data archive is available, or where the data centre has a strict scope on the size and subject of the datasets it accepts. The report suggests that in these cases, rather than a dataset's existence being hidden on a personal PC, IRs may have a role to play. The question of whether IRs are the most suitable repository for geospatial data was debated within the report. Table 4 provides a summary of that debate.
[47] http://edina.ac.uk/projects/grade/GRADE_Survey_Report.pdf

STRENGTHS of an IR dealing with geospatial data
• One repository – less administrative and technical overhead
• Linking text, datasets, images easier within one environment
• Showcase for all institutional research
• IR Software - Open Access – interoperability – visibility
• Software based on International Standards
• Metadata skills provided by Information community
• Formal Dataset citation
• Supports Citation analysis and metrics for research funding and personal promotion for data generators/managers

WEAKNESSES of an IR dealing with geospatial data
• Software not designed to cope with data
• No IR metadata schema for datasets yet
• IR staff without Data Processing skills
• IRs do not quality control content
• IRs not involved in production of information products
• Storage – Preservation (all media types)
• OA culture not yet extended to data, although OECD, EU and some Research Councils etc. mandate deposit of data emanating from funding

OPPORTUNITIES of an IR dealing with geospatial data
• Contribute to the design of an IR dataset metadata module
• To offer a data archive (where none exists)
• Treats ‘orphan’ datasets not accepted by Data Centres
• Enhancement of IR staff skills
• Showcase in one digital repository of all research output
• Ready host when dataset deposit is mandated
• Integration – joined up research
• Additional Funding opportunities from e-Research projects
• Input to the Data citation model
• Data and Information communities working together
• Collaboration between disciplines
• Dataset harvesting from IRs to Data Centres

THREATS of an IR dealing with geospatial data
• Turf war between IRs and Data Centres
• Will funding follow to IRs?
• Will funding stream for data management reduce?
• Too large an undertaking for IRs
• Data lost in publication ‘bucket’
• ‘Thematic’ datasets distributed
• No migration/preservation policy
• Datasets fall ‘between stools’

Table 4: SWOT analysis of institutional repositories managing geospatial data.
The report concluded that the repository landscape is rapidly evolving and that a pragmatic approach is appropriate now. If a researcher has access to an appropriate data centre in which to deposit a dataset, then that should be the preferred route, provided that the papers and publications resulting from the dataset are linked to it. However, if a researcher does not have access to an appropriate data centre, is it not better that the dataset is at least deposited in a trusted repository? Leaving the dataset on the researcher’s PC, as is often the case now, will ensure it is ‘lost’ forever.

Work Package 5

Work package 5 focused upon interoperability aspects for geospatial data repositories. Key findings of this work confirmed that geospatial data repositories are in a strong position to interoperate with other repositories/services within the JISC IE and beyond. Standards for exchanging geospatial data have been in development since the early 1990s through the work of the International Organisation for Standardisation’s Technical Committee No. 211 (ISO TC211) [48] and the Open Geospatial Consortium (OGC) [49]. ISO TC211 standards form the foundation/building blocks for geospatial data interoperability (e.g. setting the metadata standard for describing geospatial data and defining how to represent the coordinate systems that geospatial data exist in). OGC specifications implement these standards [50]. With this mature framework in place, repositories managing geospatial data are well positioned to exploit opportunities for interoperating. For example, the GRADE project repository demonstrator showed that a geospatial data repository can interoperate with other repositories. This was achieved by implementing the DSpace OAI-PMH interface on the demonstrator. The JISC-funded PerX project [51] then harvested metadata from the GRADE project demonstrator and made it searchable via the PerX project search interface.
This demonstrates that a geospatial data repository can interoperate with other repositories across the JISC IE via OAI-PMH and Dublin Core as a lowest common denominator metadata standard [52].

[48] http://www.isotc211.org/
[49] www.opengeospatial.org
[50] For example the OGC Web Map Service (WMS) specification delivers geospatial data as an image which can be viewed within any browser or overlaid with other geospatial data within a geographic information system. The WMS specification adheres to the ISO 19111 Coordinate Reference standard for defining the coordinate system so that the image is overlaid correctly with other georeferenced data.
[51] http://www.icbl.hw.ac.uk/perx/

Despite the mature standards-making framework for geospatial data, there are particular areas where work needs to focus to improve key interoperability issues that a repository environment brings to the fore for geospatial data, including:
• The lack of geospatial data standards for long term data preservation. Geography Markup Language (GML) is an interoperable geospatial data standard for data transfer. However, its verbose nature and the fact that it cannot be used natively within a GIS application without being converted mean it is not ideal for preservation purposes.
• Content packaging standards – standards-making organisations have not to date put effort into developing standard approaches for packaging geospatial data. For example, geospatial data can be somewhat meaningless without the appropriate visualisation/rendering information. In addition, as reported within work package 1, feedback indicated a strong desire to deposit project-related data as one deposit item. Development of a content packaging standard for geospatial data is essential to repository interoperability.
• Web services on deposited data – the geospatial community is fairly progressive in its take-up of accessing geospatial data as live data streams or web services.
Again during work package 1 reference was made to a weakness of the demonstrator repository: it was unable to offer access to repository items as web services rather than as data downloads. Resolving how to offer up web services of items within a repository would enhance interoperability of the repository.

Outcomes

Table 5 provides a summary view of project achievements against original project aims and objectives.

Work Package | Aims and objectives | Project Achievements
1.1 | Establish detailed repository use cases and user-based evidence for the requirements and functionality of a repository capable of managing licensed geospatial assets. | User-based evidence well documented, based upon iterative interaction with project demonstrator.
1.2 | To investigate and identify the technical and cultural issues surrounding the storage, management and accessibility of geospatial information derived from licensed data within a centralised digital repository. | Technical issues for geospatial data within digital repository well investigated and described. Cultural issues for geospatial data within repositories came to the fore more in WP2 during the investigation of informal data sharing practices.
1.3 | To synthesise the lessons into best practice and advice for those concerned with the establishment and operation of research data or media-centric repositories. | Compiling user-based evidence brought out many generic issues for data sharing of interest to all parties dealing with research data.
2.1 | To investigate the extent of current informal data publication and sharing and the ‘grey economy’ of geospatial information sharing. | Achieved a comprehensive investigation of current data sharing practices.
2.2 | To investigate the relationships and potential interfaces between informal and formal (including Institutional) repositories. | No investigation of potential interfaces between informal and formal repositories.
2.3 | To scope possible technical architectures for informal data sharing and investigate how security issues may be addressed (DRM issues will be addressed as part of WP3). | No scoping of possible architectures for informal data sharing.
2.4 | To pilot an informal geospatial data repository demonstrator with Associate Partners to inform understanding of the cultural and technical issues involved. | No demonstrator developed, rather a workshop using a variety of existing peer-to-peer software.
2.5 | To synthesise our findings into a preferred policy statement on the use of Informal Repositories. | Series of recommendations developed from workshop findings.
New | | Gained insight into departmental data management practices.
3.1 | To articulate intended use cases of sharing derived geospatial data. | Detailed compendium of derived geospatial data created.
3.2 | To develop a clear understanding of digital rights pertaining to data created entirely by a user or research team. | (3.2 and 3.3) Legal report presenting arguments for digital rights issues produced.
3.3 | To develop a clear understanding of digital rights issues for derived data respecting the licensing conditions of the source data. | (See 3.2.)
3.4 | To develop a conceptual and technical framework for resolving those described rights management issues raised in relationship to repositories. | Licensing framework proposed.
4.1 | To determine to what extent Institutional repositories currently manage geospatial assets. | Baseline audit of IRs carried out.
4.2 | To determine to what extent Institutional repositories could or should manage geospatial assets. | (4.2 and 4.3) Discussion paper investigates role of IRs.
4.3 | To investigate the arguments for and against Institutional versus media-centric repositories for geospatial data assets. | (See 4.2.)
5.1 | To investigate and assess the role of existing JISC sponsored terminology services with respect to repositories. | Relationship of data repository to other key elements of academic spatial data infrastructure investigated.
5.2 | To investigate and assess interoperability between geospatial data repositories and other types of services/repositories. | Interoperability between geospatial data repository and other types of repositories demonstrated within work package 1.
5.3 | To investigate the linking of geospatial repositories with eScience infrastructures. | Identification of a degree of community need for repository data to be made available as web service.
5.4 | To investigate and assess the potential of evolving industry driven geospatial interoperability standards, specifically Open Geospatial Consortium/ISO 19100 series standards, as a means of interoperating with repositories within and outside of academia. | Identification of key relevant OGC/ISO 19100 standards for repository interoperability.

Table 5 – Summary view of project outcomes against original aims and objectives

[52] Mappings between the UK academic profile of ISO 19115 and the Dublin Core and Data Documentation Initiative metadata standards are available at www.gogeo.ac.uk/

Work package 1 successfully identified user requirements for a repository capable of managing geospatial data. It also identified a series of more general repository user requirements relevant to any repository – institutional or media-centric.
Of greater importance perhaps was the opportunity the demonstrator afforded the wider geospatial community to interact with an infrastructure to facilitate formal sharing of derived data. Findings from work packages 1 and 2 demonstrated clear support for the creation of a national geospatial data repository [53]. Creating a demonstrator was a positive method for engaging the community and has shifted community focus onto the need for the provision of this key element of an academic spatial data infrastructure. Perhaps by focusing on the possibilities offered by the demonstrator, associate partners more readily acknowledged the lack of data management practices within their own departments.

Work package 2 provided concrete evidence that geospatial data sharing is commonplace, with 90% of survey respondents confirming they had shared data recently. The geospatial community will benefit from the knowledge that concerns about the (perceived) complexities of current data licences are commonplace throughout the community. Workshop participants have benefited from involvement in the active research aimed at scoping the role of informal methods for data sharing. This involvement not only offered them exposure to peer-to-peer approaches to file sharing but also offered them the opportunity to consider the potential use of informal sharing approaches versus the more formal sharing afforded by the demonstrator repository. Again this work reinforced workshop participants’ views of the need for a national geospatial data repository linked closely to other geo services within the JISC IE including Digimap and Go-Geo!

The investigation into licensing issues for geospatial data repositories within work package 3 has brought unprecedented focus on the legality of current licences for geospatial data reuse within UK HEFE.
While key data providers are unwilling to enter into such a debate, the academic geospatial community still stands to benefit from this work, if only by forcing the development of clear, plain English guidelines as to what end users can in fact do with derived data under current licences. It should be viewed favourably that the JISC IPR Consultancy have recommended that JISC fund further investigation into the area of licensing issues for derived data.

[53] Recommendation 3 from work package 2 concluded “The development of a national geospatial data repository was well supported by the study’s participants and should be promoted.”

Work package 4 has provided a benchmark assessment of levels of geospatial data within institutional repositories. This is of value as it can be used to assist in determining growth patterns of institutional repositories. The report identified that the lack of a metadata schema for data within open source repositories may also have hindered possible deposition rates. It is hoped this conclusion assisted the JISC in deciding to fund the development of a geospatial application profile for repository searching. The SWOT analysis of issues for storing geospatial data within institutional repositories should be of value to repository managers in their strategic planning for future repository expansion.

Work package 5 provided an overview of interoperability issues for geospatial data repositories. This work is valuable to the geospatial community because it demonstrates a commitment to aligning geo-industry standards (to which the geo community feels strong allegiance) with those standards found within the JISC IE. The identification of where geospatial standards are weak (content packaging standards, preservation standards) demonstrates that there is useful work within the e-Library world that the geospatial community can leverage and again shows commitment to investing in further standards-setting work.
Likewise the familiarity within the geospatial community of data being made available as web services is valuable for the wider JISC community in their strategic plan to push forward e-Science infrastructures. The demonstration that metadata for repository items within the project demonstrator could be harvested and presented for searching by a general repository should be of value to both the geospatial and the repository communities as demonstrating the value of OAI interfaces on repositories.

Conclusions

It is clear that the legal issues surrounding the use and reuse of geospatial data within the academic community are peculiar and distinct from other data sharing communities where IPR and digital rights issues are less pronounced (if relevant at all). Legal investigations suggest that copyright may not subsist in geospatial data (rather, the applicable law is the EU Database Directive). If this legal argument holds, derived data can be deposited legitimately in a repository for reuse by other members of academia for non-commercial use so long as acknowledgement of source data providers is made. Concerns relating to breaking licensing conditions are the major barriers to more formal data sharing.

There is a clear view from user-based evidence, amassed from the results of a questionnaire survey, from workshop participants and from direct discussions with associate partners, that there is a perceived need for the establishment of a national data repository to support the sharing and reuse of geospatial data. This data repository would fill a noticeable gap and provide good linkages to other elements within the UK academic spatial data infrastructure.

It was discovered that the geospatial community has particular user requirements for a repository capable of managing digital geospatial assets, particularly location-based searching and a degree of automatic generation of geo-related metadata during the deposit process.
Iterative enhancements were made to the formal demonstrator in order to elicit feedback on community needs and aspirations, resulting in a demonstrator repository with over 160 deposited datasets, over 170 registered users and an equal number of users unable to register (due to licensing restrictions).

Informal data sharing is commonplace amongst people networks. Sharing is predominantly via trusted methods including email attachments and CD/DVD. More up-to-date peer-to-peer file-sharing approaches are as yet unexploited. During the workshop on informal sharing, doubt was raised as to their value in their current beta form.

Institutional repositories do not currently manage geospatial data and are not set up to do so. However, no IR has yet been offered geospatial data as a deposit item, and over time IRs may have a role alongside media-centric repositories in managing scientific research data. The current development of a geospatial application profile to support repository searching may go some way to helping IRs accept geospatial data.

Geospatial standards (for metadata, and OGC-based for geoprocessing services) are key to interoperability of a geospatial repository within the JISC IE. With this in mind, the well-established relationship between the International Organisation for Standardisation and the Open Geospatial Consortium means geospatial data repositories are well positioned to interoperate with other repositories and facilities within the JISC IE. For improved interoperability, the development of standards for geospatial data preservation and content packaging needs to be prioritised.

Implications

Legal Implications

The legal argument put forward in GRADE has consequences that reach beyond the Repositories Programme. Indeed it is of significance not only to creators/users of geospatial data within academia but to any individual/organisation seeking to reuse/share geospatial data derived from licensed geospatial data.
It is appropriate that the JISC IPR Consultancy have identified the need to pursue this work further and encourage wider debate.

In the meantime, researchers remain uncertain as to what is a legally acceptable use of derived data. It is possible however to foresee the establishment of a geospatial data repository working within the circumscribed space afforded by existing licence agreements. Under this scenario, work would have to focus on resolving the issue of copyright inheritance of derived data, again resulting in guiding principles to help users decide what they can and cannot do with their derived data. Without these guidelines, researchers will increasingly have to deal with a conflict of interest: coming under increased pressure to deposit their research outputs whilst remaining uncertain as to what they can legally do with their derived data within the boundaries of the licence agreements they have entered into with data providers.

Interoperability

On the interoperability of geospatial data repositories, the project concluded that the mature geospatial standards organisations ensure repositories managing geospatial data are well placed to interoperate within the wider IE repository landscape. However, it is clear there are two key areas that future work should focus upon:
• Geospatial data sets are often not one single file but rather a group of files that must exist together to be valid. In addition, geospatial data often need to be kept with other files, for example files that describe how to classify and display the data and files that describe the projection system of the data. A third scenario, described in user feedback from work package 1, was that researchers wish to be able to deposit a bundle of project-related data. For this to work, attention needs to be given to developing standards for packaging geospatial data, associated files and metadata.
The international standards-making organisations have not so far given attention to content packaging for geospatial data, so this is a key area for future work.
• Related to the first point is another key area: standards for geospatial data preservation. Geospatial data exist in multiple formats. Currently there is no agreed format for data preservation. Geography Markup Language (GML) is an XML-based open standard for geospatial data exchange. A data exchange format is not ideal for preservation purposes, not least because of the size of the resultant file but also because of the possible loss of detail [54]. There is a need for work to focus on appropriate interoperable formats for geospatial data preservation.

Technical Advances

At the outset of the GRADE project, in mid 2005, a literature review was carried out to assess the global status of repositories and geospatial data. At the time, there were very few instances of geospatial data being stored within repositories. However, since June 2005 the data repository landscape has changed, and the challenge for any future work is how to leverage these advances, including:
• The Open Geo-Archives Initiative, looking at integrating earth science data centres into research portals. PANGAEA [55] is a public data library for science aimed at archiving, publishing and distributing georeferenced data with special emphasis on environmental, marine and geological basic research.
• Advances in semantic linking of data and journal publications. For example, the STD-DOI "Creating access to Scientific Data" project [56] uses DOIs to link datasets to articles and vice versa.
• The GEONGrid project [57]: a cyber-infrastructure facility to advance Earth science research and education with a data repository function.

Informal Repositories for Data Sharing

The other key technical advance during the GRADE project can be considered to have future implications for informal data sharing.
In early 2007 [58], Google announced that its main search engine was able to search and parse KML files, the native file format for Google Earth. More importantly, the Google search engine parses and understands the geographical data within the KML and returns geographically relevant results. Searching for and locating relevant geospatial data purely via Google could well overtake any other informal method for geospatial data sharing and indeed could affect the levels of geospatial data made available for deposit within repositories. New development work could consider the impact of Google’s new geo-enhanced searching.

[54] At the First International Workshop on Database Preservation (PresDB07), Peter Buneman’s talk “Why Current Database Technology Does not Support Preservation” noted that ‘although curated databases use database technology, the contents of the database seldom includes all the data of interest’
[55] www.pangaea.de
[56] www.std-doi.de
[57] www.geongrid.org

Recommendations

• There is clear support for a repository to facilitate geospatial data sharing and reuse throughout UK academia. The JISC should consider the role of a national geospatial data repository within the UK academic spatial data infrastructure (closely linked to the UK academic geo discovery portal).
• As recommended by the JISC IPR Consultancy, JISC should commission an in-depth study of IPR and licensing of derived data, building upon the legal work carried out within GRADE.
• There is a need to address the community’s concerns and possible misconceptions about licensing restrictions against a need to share data. Attempts need to be made to give the community clear direction on permissible use of derived data, specifically when it comes to depositing data in a repository for others to reuse.
• Informal repositories could have some role in geospatial data sharing for small group activities but they appear to have limited utility as a distributed national resource.
More work is needed to explore and monitor the uptake and role of informal repositories in small group settings and how they could contribute to a wider infrastructure.
• If IRs are to accept geospatial data they need to consider how they will meet the specific user requirements of the geospatial community (including location-based searching and automatic metadata generation).
• The GIS community should leverage standards developed within the e-Library world for developing content packaging standards for geospatial data.
• It is recommended that there should be UK HEFE representation on the international working group looking at standards for geospatial data preservation.

[58] http://www.gearthblog.com/blog/archives/2007/02/new_search_capabilit.html

References

Janssen, K. and Dumortier, J. (Winter 2006), The Protection of Maps and Spatial Databases in Europe and the United States by Copyright and the Sui Generis Right, John Marshall Journal of Computer and Information Law (24 J. Marshall J. Computer & Info. L. 195)

Korn, N., Oppenheim, C. and Duncan, C. (May 2007), IPR and Licensing issues in Derived Data, http://www.jisc.ac.uk/media/documents/projects/iprinderiveddatareport.pdf

Louis, K. S., Jones, L. M. and Campbell, E. G. (2002), Sharing in science, American Scientist 90, 4, 304-307.
Lyon, L. (2007) “Dealing with Data: Roles, Rights, Responsibilities and Relationships”, Consultancy Report, v1.0, June 2007, http://www.jisc.ac.uk/media/documents/programmes/digitalrepositories/dealing_with_data_reportfinal.pdf

Morris, S. (2005) “National Digital Information Infrastructure and Preservation Program Project Work Plan Collection and Preservation of At-Risk Digital Geospatial Data” <http://www.lib.ncsu.edu/news/gis.php?p=329&more=1>

Westbrooks, E. (2003) “Efficient Distribution and Synchronization of Heterogeneous Metadata for Digital Library Management and Geospatial Information Repositories”, Dublin Core 2003, Seattle, Washington, September 28, 2003 <http://www.siderean.com/dc2003/204_Paper78.pdf>

Appendices

Appendix A - Demonstrator Questionnaire Version 1

Appendix B - Template for demonstrator feedback from associate partners

GRADE Pilot Site Report and Feedback sheet
Completed by ___________________________ Date ________________________
Institution & Department ______________________________________________

1. Registered users of GRADE repository at your institution:
Columns: User Name | User email address | Date Registered | Uploaded – YES/NO | Downloaded – YES/NO

2. Uploading/Downloading geospatial data from the GRADE repository:
Columns: Geospatial data title | Uploaded | Downloaded | Data type/file format | Data description | Date
Rows: 1 to 5

Feedback (one column per person: Person 1; Person 2 feedback, etc.; Person 3 feedback, etc.)
Upload/depositor process: Ease, instructions clear? Problems? Metadata fields required – sufficient? Zipping up of files clear?
Download process: Search and found data easily? Search improvements? Download problems? Quality rating?
Attach any email/verbal correspondence on feedback.

Could you also list a top 10 personal wish-list of geospatial datasets you would ideally like a national GRADE repository to hold:
1) 2) 3) 4) 5) 6) 7) 8) 9) 10)
e.g. postcode boundary data, contaminated land data, forest data etc.
e.g. Particular companies i.e.
Environment Agency data etc.
e.g. Coverage? UK? World?

3. Complete User Survey on GRADE repository website – YES/NO
URL: http://gradedemo.edina.ac.uk/dspace/index.jsp

4. Informal Geospatial data-sharing current practices at your institution (at least 500 word report)
The suggestions/questions below are guides to assist with your report. Please feel free to supply as much information as you wish.
• Descriptions of geospatial data your institution/department holds (if part of Go-Geo! pilot study and you have already completed an Audit, state this).
• How are files stored? Where? What format? Data volumes? Is this listed/documented? Who uses them? (i.e. teaching purposes, ongoing research project)
• Is postgraduate/undergraduate research geospatial data stored? Is it searchable/accessible?
• What proportion of the geospatial datasets in your institution/department are: (1) Funded by research council grants/deposited there? (2) Derived from primary data sources (e.g. OS)? (3) Would have no data-sharing restrictions that you are aware of?
• Give examples of geospatial datasets that are re-used or derived datasets that you hold.
• What methods are used to exchange geospatial data (internally/externally), i.e. email, department server, WebCT, URL or websites, FTP, P2P?
• Do you have any formal sharing guidelines (i.e. what can be shared, what information must be added, i.e. metadata, contractual agreements/conditions, between particular restricted groups and who these are)?
• With whom do you exchange/share GIS data? E.g.: (1) colleague in department (2) other departments within institution – which (3) other institutions/colleges – please specify (4) non-academic bodies – please specify (5) others, individuals – outside UK etc. – where, how, what, why.
• Have you ever encountered problems acquiring geospatial data? (Provide details of dataset titles, issues, what happened, how it was solved if it was.)
• What sources does your institution use to get access to data? - E.g.
Athens authenticated services, internal passwords, where do you get raw GIS data from?
• If you do not share any geospatial data at all please explain the reasons why not, and state under what circumstances you would consider sharing GIS data.
Have you completed the GRADE Informal geospatial data-sharing questionnaire? – YES/NO
URL: http://edina.ac.uk/projects/grade/questionnaire.html

5. Case study examples of a derived geospatial dataset that was shared at your institution (include at least 2)

CASE STUDY 1:
Title of geospatial dataset
Actors: Primary, i.e. dataset creator/researcher; Secondary, i.e. Funding Body, or co-researcher etc.
Date created
Summary detailing what dataset is
Dataset type & format: i.e. Vector ESRI Shapefile, etc.
Stakeholders & Role: e.g. organisations/companies with licence or rights involved, i.e. EDINA Digimap or OS or NASA, etc. Roles, i.e. Creator, distributor, Education Institution, Grant body, publisher
Dataset components (columns: Name | Owner | Distributor | Licensing | Type | Area)
Layer 1, i.e. Land-form PANORAMA: OS, EDINA, Copyright JISC till 2009, Raster elevation derived
Layer 2, etc. (continue for as many dataset layers as there are)
Who shared with, why? Between who, who requested it, who did they approach, and how.
What was shared? Full raw data, map only, metadata, licence terms/conditions, other information, what file format?
How did you share? Method of exchange, how was it packaged?
How long did it take?
Issues/Problems: Barriers or difficulties or ones that had to be overcome.

CASE STUDY 2:
Title of geospatial dataset
Actors: Primary; Secondary
Date created
Summary detailing what dataset is
Dataset type & format
Stakeholders
Dataset components (columns: Name | Owner | Distributor | Licensing | Type | Area)
Layer 1
Layer 2
Who shared with, why?
How did you share?
How long did it take?
Issues/Problems

This form should aim to be well on the way to completion by the GRADE All-Partner Meeting on October 30th 2006, with the full report and all tasks finalised by the December 2006 deadline.
Appendix C – Project Demonstrator

The formal repository demonstrator consisted of one instance of DSpace. Initially set up with an out-of-the-box configuration, the demonstrator repository was further customised to meet project requirements, including: restricting item upload and download to registered users only, validating data deposits on upload, automatically populating geo-related deposit metadata, and map-based searching. The customisation work carried out to accomplish these functions is described below.

Initializing authentication barriers

Using the DSpace administration functions, settings were configured so that item download and item upload were restricted to registered users but item metadata was visible to all.

File Validation during upload process

The file validator code was built as a separate package that can run independently of DSpace. The validator takes the filename of a zip file containing a dataset, extracts the dataset, runs the appropriate geospatial info tool (ogrinfo for vector-based formats, gdalinfo for raster formats) and parses the results produced by these tools to create an XML report about the dataset. To slot this validation package neatly into DSpace, a modification for DSpace called the Configurable Submission System was used, which splits the DSpace submission process into individual Java servlets, one for each step of the submission process (e.g. one for the upload, one for the metadata entry, etc.). A new submission step called validation was created, which took the uploaded file and fed it to the validation package detailed above. The resulting XML report is used by the validation step servlet to produce a response page for the user, letting them know whether the file is valid and, if it is, creating the dataset boundary for them to tweak.
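The validation flow described above can be illustrated with a short sketch. This is not the project's code (which was a Java package called from a DSpace servlet); it is a minimal Python illustration under stated assumptions. The extension lists, the XML element names and the function names `run_tool`, `build_report` and `validate_zip` are hypothetical. Only the choice of ogrinfo for vector data and gdalinfo for raster data comes from the description. The extent is read from the `Extent:` line that ogrinfo prints in summary mode; raster extents are not parsed in this sketch.

```python
import re
import subprocess
import tempfile
import zipfile
import xml.etree.ElementTree as ET
from pathlib import Path

# Assumed extension lists; the report does not say which formats were accepted.
VECTOR_EXTS = {".shp", ".gml", ".kml"}
RASTER_EXTS = {".tif", ".tiff", ".asc"}

# ogrinfo summary output contains a line such as:
#   Extent: (minx, miny) - (maxx, maxy)
EXTENT_RE = re.compile(
    r"Extent: \(([-+\d.eE]+), ([-+\d.eE]+)\) - \(([-+\d.eE]+), ([-+\d.eE]+)\)"
)


def run_tool(args):
    """Run a command-line tool; return (exit_code, stdout)."""
    try:
        done = subprocess.run(args, capture_output=True, text=True)
    except FileNotFoundError:  # the tool is not installed
        return 1, ""
    return done.returncode, done.stdout


def build_report(name, tool, exit_code, output):
    """Turn the tool result into a small XML report (hypothetical schema)."""
    root = ET.Element("validation", dataset=name, tool=tool)
    ET.SubElement(root, "valid").text = "true" if exit_code == 0 else "false"
    match = EXTENT_RE.search(output)
    if match:
        west, south, east, north = match.groups()
        ET.SubElement(root, "extent", west=west, south=south, east=east, north=north)
    return ET.tostring(root, encoding="unicode")


def validate_zip(zip_path, runner=run_tool):
    """Extract a zipped dataset, run ogrinfo or gdalinfo on it, return XML."""
    with tempfile.TemporaryDirectory() as tmp:
        try:
            with zipfile.ZipFile(zip_path) as archive:
                archive.extractall(tmp)
        except zipfile.BadZipFile:
            return build_report(Path(zip_path).name, "none", 1, "")
        for path in sorted(Path(tmp).rglob("*")):
            ext = path.suffix.lower()
            if ext in VECTOR_EXTS:
                args = ["ogrinfo", "-so", "-al", str(path)]
            elif ext in RASTER_EXTS:
                args = ["gdalinfo", str(path)]
            else:
                continue
            exit_code, output = runner(args)
            return build_report(path.name, args[0], exit_code, output)
    # No recognised dataset file was found in the archive.
    return build_report(Path(zip_path).name, "none", 1, "")
```

The `runner` parameter exists only so the sketch can be exercised without GDAL installed; a submission step would call `validate_zip(path)` and use the `<valid>` and `<extent>` elements to build the response page and the initial bounding box.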
Display of Geographic Extent of Items

Google Maps integration is done entirely client side using JavaScript. It mainly consisted of writing a little JavaScript in the JSP pages (these create the actual HTML parts of DSpace, the parts that are presented to the user) and making sure that the servlets (which do the work of talking to the database, authentication, validation, etc.) passed the correct coordinates to the JSP pages so that the bounding boxes were displayed correctly on the map.

Geographic Searching

DSpace uses a qualified version of the Dublin Core schema. Geographic searching was made possible by extending the Dublin Core Metadata Element Set Coverage element with the DCMI Box encoding scheme [59], which allows the identification of a region of space using its geographic limits, representing that information as a value string. The actual searching was done by adding a geoindex table [60] to the database, with the coordinate information stored as floating point numbers and an entry for each item in the repository. The searching could have been done by searching through each of the DCMI Location strings; however, this would have made the search relatively slow. The combined keyword and geospatial search first does the regular keyword search; the geoindex rows for that result set are then searched to provide the geospatial capability. The DCMI Location was stored in order to have more complete metadata but isn't actually used in the demonstrator for searching.

Searching the geoindex is relatively straightforward: each row in the geoindex table stores a reference to the DSpace item it represents and 4 coordinates (representing the N, E, S, W limits of its bounding box). The coordinates from the Google Maps interface are fed to a Java servlet that does either an "intersects" or a "within" search using SQL.
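The geoindex approach described above can be sketched as follows. This is an illustration only, not the demonstrator's code: the demonstrator used a Java servlet against Postgres, whereas this sketch uses Python with an in-memory SQLite database so that it is self-contained. The table layout, column names, function names and sample coordinates are all assumptions. The component names `northlimit`, `eastlimit`, `southlimit` and `westlimit` are those of the DCMI Box encoding scheme. Boxes that cross the 180° meridian are not handled.

```python
import sqlite3


def parse_dcmi_box(value):
    """Parse a DCMI Box value string into (north, east, south, west) floats."""
    parts = {}
    for component in value.split(";"):
        if "=" in component:
            key, _, val = component.partition("=")
            parts[key.strip()] = val.strip()
    limits = ("northlimit", "eastlimit", "southlimit", "westlimit")
    return tuple(float(parts[name]) for name in limits)


# The item box overlaps the query box (qn, qe, qs, qw).
INTERSECTS = "west <= :qe AND east >= :qw AND south <= :qn AND north >= :qs"
# The item box lies entirely inside the query box.
WITHIN = "west >= :qw AND east <= :qe AND south >= :qs AND north <= :qn"


def geo_search(conn, item_ids, box, mode="intersects"):
    """Filter keyword-search hits (item_ids) by the query box (n, e, s, w)."""
    if not item_ids:
        return []
    predicate = INTERSECTS if mode == "intersects" else WITHIN
    placeholders = ",".join(f":id{i}" for i in range(len(item_ids)))
    params = {f"id{i}": item_id for i, item_id in enumerate(item_ids)}
    qn, qe, qs, qw = box
    params.update(qn=qn, qe=qe, qs=qs, qw=qw)
    sql = (
        f"SELECT item_id FROM geoindex "
        f"WHERE item_id IN ({placeholders}) AND {predicate} ORDER BY item_id"
    )
    return [row[0] for row in conn.execute(sql, params)]


# Hypothetical geoindex table: one row per repository item.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE geoindex (item_id INTEGER PRIMARY KEY, "
    "north REAL, east REAL, south REAL, west REAL)"
)

# Illustrative DCMI Box strings, as they might be stored in item metadata.
boxes = {
    1: "name=Edinburgh; northlimit=56.0; eastlimit=-3.0; southlimit=55.8; westlimit=-3.4",
    2: "northlimit=51.0; eastlimit=-1.2; southlimit=50.8; westlimit=-1.6",
    3: "northlimit=59.0; eastlimit=2.0; southlimit=49.0; westlimit=-8.0",
}
for item_id, value in boxes.items():
    conn.execute(
        "INSERT INTO geoindex VALUES (?, ?, ?, ?, ?)",
        (item_id, *parse_dcmi_box(value)),
    )
```

With a query box of (57.0, -2.0, 55.0, -4.0) and keyword hits 1, 2 and 3, `geo_search(conn, [1, 2, 3], box)` keeps the items whose boxes overlap the query, while `mode="within"` keeps only those fully contained in it. Storing the four limits as numeric columns lets the database evaluate the comparison directly, which is the speed advantage the text attributes to the geoindex over parsing each value string at search time.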
[59] This work is of value to the current JISC-funded Geospatial Data Application Profile to support a basic UK repository search project.
[60] The database used with DSpace was Postgres.

Appendix D – Second Demonstrator Questionnaire

Appendix E – Questionnaire on Informal Data Sharing

Appendix F – Use Case Template

All text in 10-point type refers to content areas. Within these areas, text in italic denotes variables that need to be completed by the author. Normal text denotes fixed variables that should not be changed. The use of bold denotes different selection options.

Authors
Author 1: Author name

Use case details
Title: Title of project/work
Date: Date
Application Area: Subject area of application
Summary: A researcher has received funding and wishes, or is required, to deposit output data from the project in a digital repository that can then be searched and accessed by other researchers.

Actors
Primary
  Type: Researcher
  Name: Name of primary actor
  Goals: Goals for completing the work
Secondary
  Type: End-user
  Goals: Potential use of output data. Broad areas to include research, teaching, class, institution or personal.

Stakeholders
Ordnance Survey
  Type: Creator or distributor or grant body
  Goals: Sales or Licensing restrictions or Marketing or Advancement of research or Dissemination of data

Dataset Details
Dataset 1
  Name: Dataset name
  Owner: Dataset owner
  Distributor: Dataset distributor
  Licensing: © or Creative Commons or Public Domain; Annual or Perpetual
  Processing: Quantitative (number of processes) or qualitative (number of processes)
  Type: Raster type or vector
  Area: Derived or original or presentation

Output Data
Type: Vector or raster
Format: File type

Descriptives
Context: Context for the generation of the dataset.
Processing: Processing performed
Key Points: Any key points raised concerning copyright/distribution issues.
References: List of references

Appendix G – Institutional Repository Questionnaire

GRADE will investigate and report on the technical and cultural issues around the reuse of geospatial data within the JISC IE in the context of media-centric, informal and institutional repositories. A Work Package requirement is to carry out an audit of geospatial asset management within institutional repositories. The survey below only takes five minutes to complete; all completed responses will go into a prize draw for a £30 Amazon book voucher to be drawn during the week commencing 13th February 2006.

For the purposes of this survey, geospatial data is defined as data explicitly containing coordinate geometry, i.e. vector, raster, geo-referenced images, or text files containing x,y coordinate values (e.g. electronic maps, geo-referenced imagery, satellite data, data stored within a Geographic Information System (GIS)).

Survey Questions
1. If you have a repository, what software do you use?
2. Is your repository publicly available? If not, who are your depositors and users?
3. Do you accept the deposit of geospatial datasets into your repository? If so, how much? If not, do you plan to?
4. What special metadata fields do you offer in your repository to describe geospatial data, and how can users search geographically for geospatial data?
5. If you have geospatial data, do you...
• Receive supporting documentation from the depositor?
• Have guidelines on file format required?
• QA the data at all?
• Require a declaration of ownership / copyright from depositors?
• Confirm if derived from other datasets?
• Have agreements to deal with issues of liability from use of the data?
6. Generally, do you think that archiving and providing access to research data is something institutions should do, or specialist data centres?
7.
What processes, if any, do you have in place to ensure that long-term access to research data will continue, e.g. migration of data so that it remains readable?
8. Please list any other Institutional Repositories you know that are managing geospatial datasets.

Appendix H – Summary of Data Management Practices at Associate Partner Sites

Site 1

Data Policies
• There are no formal procedures for storing or archiving spatial data or outputs of research projects.
• There are no guidelines concerning data sharing, beyond any formal restrictions imposed on secondary data (e.g. OS licensing).
• There are no requirements for metadata or any formal contractual agreements.

Data Management
• The primary method of data storage is ad hoc, whereby academics store data on a mixture of work PCs, home PCs and removable devices (e.g. HDDs or USB). Some academics may also use network storage at the university.
• Neither PG nor UG research geospatial data is audited or stored in any manner, other than for UG dissertation students in GIS, where data deposition within Blackboard is a requirement.
• We are currently implementing metadata input as well and are keen to trial a repository for these students. Of more urgency is the need to manage PG data.
• We currently have a small number of RC-funded projects and therefore this data should have been deposited.
• Far more research (perhaps 70%) is derived from university and department funding, as well as indirect research utilising secondary data sources (e.g. OS, census).
• There is also corporate, European and Knowledge Transfer funding, which accounts for a not insignificant amount of research.
• My impression is that very little research would have no data-sharing restrictions.
• Actual data usage must run well over 100 GB and possibly beyond 1 TB.

Data Sharing Methods
• Via email, Blackboard, FTP, CD/DVD, URL.

Data Sharing Patterns
• Data sharing occurs between colleagues within the department quite frequently, although it remains ad hoc.
• Very little data sharing occurs within the university.
• Moderate data sharing occurs with other institutions. This will primarily be research based and usually between project team members in institutions anywhere in the world.
• Data sharing will also occur with non-academic bodies, although this will be related to funded research projects and will usually involve the in-flow of data required for the project and out-flow in terms of final project outputs (e.g. Environment Agency, mining companies).
• Data sharing will occur with individuals who may contact department members “on spec”. Data sharing on this basis is much rarer but does happen when there are demonstrable gains to be made from data sharing.

Site 2
• There is no formal structure regarding the management of geospatial data.
• There is no formal framework regarding the sharing and exchange of geospatial data.
• Sharing guidance is informal and verbal regarding the IPR, especially for OS-derived data.
• Two departments interviewed state that their data-sharing policy is effectively adhering to the University’s guidance over ethical responsibility, thereby only using data for which consent was obtained.
• The university has an institutional repository (IR) for published academic papers. This is not used as a central depository for geospatial data.
• In general, there is the perception that an IR is unsuitable for depositing geospatial data because of its structure. Therefore, each department manages its own geospatial data.
• Data are commonly stored on a researcher’s PC hard drive, or portable hard drive, and access is protected through the university’s generic security system (username and password).
• Geospatial data used or developed in postgraduate research is not commonly stored, only the thesis or abstract.
• In terms of preventing data loss, each researcher is normally responsible for his/her data. Some researchers consider that depositing data with a data centre is a sufficient disaster recovery strategy.
• One department set up an informal library of geospatial data (on CD/DVD) to capture data acquired through research grants, derived/generated from research, obtained free of charge or supplied with software. However, confidential or restricted-licence data remain with the principal researcher.
• Another department has acquired a digital data server (DDS) with a capacity of 900 GB to manage geospatial data. It has two aims: (i) to provide secure storage for geospatial data; (ii) to improve access within the department, across the university and to external partners via FTP. Access to the DDS is restricted to approved users.
• Data volumes stored are in the range of 5 GB to 60 GB, with at least 90% of data derived from primary data.
• Internal data sharing is mainly with co-researchers at the university by DVD/CD, shared-access hard drives or USB memory sticks, and with research students by DVD/CD or e-mail.
• FTP to a central server.
• Researchers tend to use Microsoft Windows Explorer to search for their data.
• The sharing of geospatial data is dependent on the research undertaken; however, data are commonly shared, internally and externally, through informal networks within and outside a department.
• External data sharing practices are with project owners or clients, with subject-specific data centres, and with consortium members, e.g. when sharing with BODC, NERC policy guidelines are followed.
• Respondents acknowledge the benefits that a GIS user group could bring to data sharing.
• There are little or no formal methods of sharing geospatial data.
• There are no formal sharing guidelines. Certain research endeavours may have their own specific data submission stipulations (e.g.
depositing data with NERC as part of funding requirements), but nothing similar is in place at the Institute level.

Site 3
• Datasets held are predominantly used for teaching purposes and are mostly located in a central read-only repository. The repository is managed by the IT helpdesk, which allocates write permissions on a folder-by-folder basis. Users with an account on the local server may access the data by mapping a drive in Windows or navigating via the command line in UNIX (e.g. net use).
• There is not a well-defined folder structure: depositors create a sub-folder file system within the net data location based on a variety of schemes, e.g. by year, by location, by relevant semester and practical session in which the data is due to be used.
• Undocumented legacy datasets of uncertain ownership and origin also exist in the Institute, most of which are stored on CD-ROM. Such resources need to be documented and catalogued; however, allocating responsibility and assigning priority is not straightforward.
• There is no main index or catalogue of the data held in the central repository. Users are expected to find the required data themselves, or solicit guidance from the relevant data depositor, who is not immediately apparent to those new to the Institute. Students using data during practicals are informed of its location at the start of each session.
• No universal mechanism for searching the repository exists outside those facilities provided by proprietary software, e.g. ArcCatalog.
• Individual data holdings are mostly undocumented; metadata is sparse at best. When such metadata items exist, they are either minimally populated or are the default files generated automatically by the data’s host proprietary application (and are hence incomplete).
• The results of research by students are available but not readily accessible. Data may accompany academic submissions (i.e.
on CD-ROM) but such a practice is not compulsory unless specifically requested by the relevant supervisor.
• Once students (or researchers) depart the Institute, individual accounts are backed up into an archive and removed from the ‘live’ collection of accounts. Specific files or folders may be retrieved by IT personnel for others wishing to conduct further analysis on archived findings; this, however, necessitates logging a call with IT help and may take up to two weeks to execute.
• There is, however, de facto support for recording time-variant datasets, as archiving of current Institute accounts is carried out on a daily basis (although requests are subject to similar retrieval delays unless in case of emergency).
• Commonly used data exchange methods include sending datasets by email (as zipped attachments, or uncompressed if not particularly large), via WebCT, by URL, or passed by hand / posted on CD-ROM or pen drives.
• Example of sharing between staff, researchers and students using a shared network drive.
• There is little evidence of data passed outside the Institute apart from those datasets produced and deposited with funding bodies.
• Due to the ambiguity / legalese of licensing agreements, respondents to the current report expressed a reluctance to share data outside the School of GeoSciences.
• One respondent said that if they had to provide data, it would be predominantly images (rectified geo-tiffs) with standard human-readable metadata (destination-compliant format), compressed in a non-lossy format such as tar/gzip.
• There is uncertainty as to whether the Institute / funded researcher or group is entitled to certain licensed data.
• The Institute does occasionally generate its own data (aerial (e.g. blimp) photography, GPS survey, etc.) but these datasets by and large remain within the domain of the collecting researcher.
• There is little coordination between research groups; similarly, there is little awareness of what data does exist within the building (the shared repository aside).
• There are no given guidelines about how data need to be documented and in what format they have to be delivered.

Site 4
• The School does not have a central storage facility where all datasets can be uploaded or downloaded when necessary. Every researcher stores his/her own data locally on his/her desktop or on an external hard drive.
• Some shared sample datasets, mostly used for teaching purposes, are stored in shared locations on servers, where they can be accessed from departmental computer rooms. Unfortunately, the data are mostly very badly documented. However, it is possible to request more information about the datasets from the person who has deposited the data on the shared drive (the data are usually in a directory that is labelled with the instructor’s name).
• It is not compulsory for Masters/PhD students to submit well-documented datasets as part of their dissertations. Therefore, if somebody else would like to continue with the research or reuse some of the generated data, they need to contact the author of the work directly or his/her supervisor for more information.
• The exchange of geospatial datasets is mostly done on an ad hoc basis depending on the volume of the data.
• Teaching material is centrally distributed via WebCT; however, bigger datasets are stored on a local server. The department used to have a Unix machine upon which such datasets were stored, with uploads and downloads being undertaken using FTP. This machine has now been decommissioned and replaced with a departmental Windows server that permits file transfers using drag and drop.
• Popular ways of sharing research-related geospatial information are via email, using external hard drives, or burning CD and DVD disks.
• The main problem of acquiring geospatial data is closely connected with a general unwillingness of people to share and very poor documentation of the datasets that are provided for sharing. Therefore, acquiring and reusing geospatial data can often be problematic.
• At the School, researchers tend to collect and create their own datasets, which is thought in most cases to be easier and faster than looking for similar existing datasets that are held locally and which can be reused.

Appendix I – Survey results identifying key barriers to data sharing and key elements for improving data sharing (extracted from http://edina.ac.uk/projects/grade/status/workpackage2.html)

Figure 2: Barriers to sharing geospatial data (bar chart; x-axis: Issues, y-axis: Responses)

Figure 3: 'Wish list' of how to make geospatial data sharing easier (bar chart; x-axis: 'Wish list' items, y-axis: Scores). Items shown: Other; A full list of sources and informal methods to share geospatial data; Faster computer network speeds; More control over, or recognition of, your work; Geography forum similar to napster or myspace; University/departmental repository; Central find/locator portal; National geospatial repository; Less restrictive or no licence agreements.