Cover Sheet for Proposals (Please complete ALL sections) IE Programme

Name of Call Area Bidding For (tick ONE only): Strand A4: Repositories: rapid innovation
Name of Lead Institution: University of Southampton
Name of Proposed Project: Distributed Internet Archive System for Educational Repositories – DIASER
Name(s) of Project Partner(s): OMII-UK and Interlinux LTD
Full Contact Details for Primary Contact:
Name: Damian Brasher BSc RHCE MBCS
Position: Systems administrator / programmer.
Email: [email protected]
Tel: 023 8059 8862
Fax: 023 8059 8870
Address: OMII-UK, B32, University of Southampton, Southampton, SO17 1BJ, United Kingdom
Length of Project: 6 Months
Project Start Date: 1st April 2009
Project End Date: 30th September 2009
Total Funding Requested from JISC: £30,117
Funding Broken Down over Financial Years (April - March):
April 09 – March 10: £37,646
April 10 – March 11: -
April 11 – March 12: -
Total Institutional Contributions: £7,529
Outline Project Description:
The role of institutional and domain specific repositories as a mechanism for capturing the research
outputs from projects is growing, driven by initiatives such as the EC FP7 Open Access Pilot and
GRL2020. As noted by the JISC in the reports “Keeping research data safe” and “Economic Implications
of Alternative Scholarly Publishing Models”, it is vitally important that these repositories gain the trust and
respect of local and international users by properly safeguarding the information held in them.
Organisations responsible for repositories face the same challenges: how to store and manage
ever-increasing amounts of archive data within budget, and how to ensure that archived data is safe from
loss in the long term while maximising its availability in case restoration is required.
Infrastructure that can provide a high level of resilience can be expensive and slow. DIAP (Distributed
Internet Archive Protocol) software has been developed by Interlinux Ltd to address these challenges.
Elements of DIAP have been successfully introduced at OMII-UK, where they provide efficient archival of
software development infrastructure and display significant benefits over other Open Source archival
software. In response, development of the Distributed Internet Archive System for Educational Repositories
(DIASER) is proposed.
This project will build on the initial DIAP innovation to improve the long-term robustness and resilience of
repositories by realising the benefits of storing replica archives across networks in different geographical
locations, and by making those benefits accessible to learning and research organisations. It will provide
working tools and a consistent high-level architecture with which the wider JISC repository community can
assess this complementary, service-building archive technology and move towards establishing best practice.
I have looked at the example FOI form at Appendix B and included an FOI form in the attached bid (Tick Box): YES NO
I have read the Call and associated Terms and Conditions of Grant at Appendix D (Tick Box): YES NO
FOI Withheld Information Form
We would like JISC to consider withholding the following sections or paragraphs from disclosure, should the
contents of this proposal be requested under the Freedom of Information Act, or if we are successful in our
bid for funding and our project proposal is made available on JISC’s website.
We acknowledge that the FOI Withheld Information Form is of indicative value only and that JISC may
nevertheless be obliged to disclose this information in accordance with the requirements of the Act. We
acknowledge that the final decision on disclosure rests with JISC.
Section / Paragraph No.: na
Relevant exemption from disclosure under FOI: na
Justification: na
There is no information in this bid that we would wish to be withheld.
1. Introduction
With the growing uptake and usage of repositories, it is important that valuable research data held within the
repositories is kept safe. Existing repository systems such as EPrints, DSpace and Fedora Commons all
recommend a regular backup process. Mixed file stores also need a regular backup process. Archiving in this
proposal refers to the long term storage and management of data that has been backed up using existing
systems. Organisations responsible for repositories face the same challenges: how to store and manage
increasing amounts of archive data within budget, and how to ensure that archived data is safe from loss whilst
maximising its availability in case restoration is required. This project aims to improve the repository
manager’s ability to help the organisation responsible for repositories meet these challenges. Multi-site,
distributed archival infrastructure is often slow and/or expensive, necessitating new, innovative methods
of ensuring repository robustness. The JISC wishes to support the development of repository foundations by
stimulating growth of new technology designed to improve repositories. In response, the Distributed Internet
Archive System for Educational Repositories (DIASER) project will build on DIAP to achieve this. DIAP is a
new system that provides a long-term integrated distributed archiving solution via a single Open Source
Software (OSS) interface. It utilises new or existing commodity disk storage and saves the repository
manager manpower, cost, administration time and support time. The remainder of this proposal describes in
detail how this innovation will address these challenges, how local and national learning and research
institutions will benefit from the work, the technical development and specific usage scenarios.
2. Background
The role of institutional and domain-specific repositories as a mechanism for capturing the research outputs
from projects is growing, driven by initiatives such as the EC FP7 Open Access Pilot [1], which states that
“Researchers should deposit final articles or manuscripts into the institutional repository of the research
institution with which they are affiliated. If this is not possible, they should identify an appropriate subject
based/thematic repository”, and GRL2020 [2]. Since 2006, the Wellcome Trust “expects authors of research
papers to maximise the opportunities to make their results freely available” [3]. As noted by JISC in the reports
Keeping research data safe [4], “HEIs should consider federated structures for local data storage within their
institution comprising data stores at the departmental level and additional storage and services at the
institutional level”, and Economic Implications of Alternative Scholarly Publishing Models [5], “a system of
institutional repositories in UK higher education in which every institution had one publications-oriented
repository and all publications were self-archived once would cost around £20 million per annum”. It is vitally
important that these repositories gain the trust and respect of local and international users by properly
safeguarding the information held in them.
DIASER builds on DIAP, which exists to keep data safe. Using the experience of project partners and the
close collaboration of the EPrints team at Southampton, this work will explore ideas for improving the
robustness and resilience of repositories, develop novel ways of combining repositories with archival
services and investigate interoperability with existing archive systems.
3. Technical Overview of DIAP (DIASER)
3.1 DIAP enables the systems administrator to distribute full and differential volumes created using standard
backup software for enhanced interoperability (see Figure 1). It uses a well-defined and carefully tested
architecture which stores and manages replica archives across multiple geographically distributed nodes.
DIAP can contain mixed data sources, provided the data has been collected and fed into volumes (the disk
equivalent of a tape). DIAP is a disk-based archive system with a predefined structure, and it has many
advantages over standard disk-mirroring techniques, especially when managing medium- to long-term archives.
The DIAP archive administrator retains full administrative control, with the option of
encryption when generating volumes.
Unique properties of DIAP:
1) Designed to operate over multiple geographical locations.
2) Can be optimised to cope with long term archive requirements.
3) Allows automatic and semi-automatic adjustments during loss or migration of storage nodes.
4) Designed to allow adjustments over time to compensate for changing network conditions.
5) Able to quickly optimise its operation if data sets suddenly increase within set parameters.
6) Designed to provide accurate status reporting to users.
7) Single tool which combines the functionality of existing OSS backup tools, and provides additional
administrative capabilities making it suitable for managing important data.
[1] Open Access Pilot in FP7: http://ec.europa.eu/research/science-society/open_access
[2] Global Research Library 2020 – A Vision for a Global Research Library: http://www.grl2020.net/
[3] Wellcome Trust Open Access Policy: http://www.wellcome.ac.uk/About-us/Policy/Policy-and-position-statements/WTD002766.htm
[4] Keeping Research Data Safe: http://www.jisc.ac.uk/publications/publications/keepingresearchdatasafe.aspx
[5] Economic Implications of Alternative Scholarly Publishing Models:
http://www.jisc.ac.uk/publications/publications/economicpublishingmodelsfinalreport.aspx
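As a rough illustration of properties 1 and 3 in the list of unique properties, the sketch below places replicas of each volume on storage nodes in round-robin order and skips any node marked as unavailable. It is not taken from the DIAP source code: the function name `place_replicas`, the node and volume names, and the replica count are hypothetical and chosen only to show the idea.

```python
from itertools import cycle


def place_replicas(volumes, nodes, available, replicas=3):
    """Assign each volume to `replicas` distinct available nodes, round-robin.

    Hypothetical helper for illustration only; not part of DIAP.
    """
    # Drop nodes that are currently lost or being migrated.
    usable = [node for node in nodes if node in available]
    if len(usable) < replicas:
        raise ValueError("not enough available nodes for the requested replicas")

    # Walking a cycle of the usable nodes spreads volumes evenly; any run of
    # `replicas` consecutive picks is distinct because replicas <= len(usable).
    ring = cycle(usable)
    placement = {}
    for volume in volumes:
        placement[volume] = [next(ring) for _ in range(replicas)]
    return placement


if __name__ == "__main__":
    nodes = ["node-a", "node-b", "node-c", "node-d"]
    available = {"node-a", "node-c", "node-d"}  # node-b assumed lost
    for volume, targets in place_replicas(
        ["full-01", "diff-02", "diff-03"], nodes, available, replicas=2
    ).items():
        print(volume, "->", targets)
```

When a node returns or a new one is added, re-running the placement with the updated `available` set gives the adjusted allocation; a real system would also need to move existing replicas, which this sketch does not attempt.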
3.2 Repository administrators archive repositories – these can range from <100GB to several TB or more –
using tape and disk. For example, the University of Southampton’s Electronics and Computer Science (ECS)
EPrints repository is approximately 4.5TB and is archived to tape. Disk backup is often made to custom
network attached storage (NAS) devices. OSS mirroring and network backup tools may also be used. These
solutions are standardised and well understood, but they have limitations which DIAP does not. Tape drives
can have a high initial cost of ownership and large maintenance overheads. Tape positioning times for data
restoration are much slower than disk. To benefit from geographical resilience, tapes need to be manually
moved between sites. Tapes need to be manually changed or expensive auto-changers purchased. Long-term
tape storage can consume additional resources because tapes have fragile, magnetically and environmentally
sensitive components. NAS devices solve some of these limitations, but do not address the issue of
geographical robustness. Quality NAS disk arrays are expensive and prone to non-standard vendor
extensions. NAS devices also need to reside in environmentally controlled locations, which is not
feasible for many organisations.
3.3 DIAP architecture uses a round-robin allocation between nodes, where the data transfer rate is
dynamically controlled, including compression, according to load and availability. Long-term storage is
achieved by extending the volume retention period. Currently DIAP is implemented as a Perl application,
released under the GNU General Public License, which features an installation script with the following
functions: usage, node account and directory generation, RSA key generation, automatic key sharing between
nodes, crontab definitions, and uninstall. It is based on the IETF Networking Group Internet Draft
“Distributed Internet Archive Protocol (DIAP)” [6]. To maximise interoperability, the system conforms with
POSIX (Portable Operating System Interface). Perl was chosen due to its maturity, flexibility and ability to
work well in heterogeneous environments.
Initial storage calculations use a DIAP architecture designed to provide 30 days of data retention. Network
capacity varies considerably depending on network hardware and conditions, so estimates are provided here.
The software includes a function prototype to calculate bandwidth capabilities on an ad-hoc basis. LMB is
the lowest maximum bandwidth available between any two or three nodes. Differential volume sizes are
estimated at 500MB/day. Three nodes, triple redundancy and 12 hours of daily operation with an LMB of
1Gbit/s yield an approximate total DIAP storage node capacity of 10.8TB over 30 days. Source data without
compression is just under half total DIAP storage capacity. With optimised use of switches, and possibly
Ethernet channel bonding and data deduplication, storage capacity has the potential to be increased.
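The bandwidth side of these estimates can be checked with simple arithmetic. The sketch below computes the upper bound on data that a link can carry during a daily operating window, and the space taken by retained differential volumes, using the LMB, window length, differential size and retention period quoted in section 3.3. The function names are hypothetical, decimal units are assumed (1 TB = 10^12 bytes), triple redundancy is read as three copies of each volume, and protocol overhead and compression are ignored. It does not attempt to reproduce the 10.8TB capacity figure, which depends on sizing assumptions that are not spelled out in this section.

```python
def max_daily_transfer_tb(lmb_gbit_per_s, hours_per_day):
    """Upper bound on data moved per day over one link, in decimal terabytes."""
    bytes_per_second = lmb_gbit_per_s * 1e9 / 8  # convert bits to bytes
    seconds = hours_per_day * 3600
    return bytes_per_second * seconds / 1e12


def differential_storage_gb(mb_per_day, retention_days, copies):
    """Space used by retained differential volumes across all copies, in GB."""
    return mb_per_day * retention_days * copies / 1000


if __name__ == "__main__":
    # Figures quoted in section 3.3: 1 Gbit/s LMB, 12 hours of daily operation.
    print(max_daily_transfer_tb(1, 12), "TB per day at most")
    # 500 MB/day differentials, 30 days retention, three copies (assumed).
    print(differential_storage_gb(500, 30, 3), "GB of differentials")
```

On these assumptions the differential volumes are small compared with what the link can move in one window, which suggests that the full volumes, rather than the daily differentials, dominate the transfer and storage budget.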
3.4 DIAP competes directly with commercial and custom-built mid- to low-range NAS Virtual Tape Library
(VTL) backup hardware and software solutions, which in turn compete with mid- to low-range LTO tape drive
units, all in the ~10TB storage capacity range. The DIAP design is not based on these commercial solutions.
There are no OSS alternatives which provide the same functionality as DIAP. Some tools, including Rsync,
Rdiff, Bacula, Amanda and BackupPC, provide different types of backup functionality. Rdiff is the nearest
match to DIAP, but it does not interoperate with existing backup software or offer a comparable feature set or
scalability. DiGS is a distributed storage system designed to use metadata and application-specific
datasets; however, DiGS does not interoperate in a generic way. There are also internet-based data storage
services: Amazon S3, for example, offers competitive storage services. These types of service are limited by
internet bandwidth, cost, availability and issues of long-term provision of service.
[6] Distributed Internet Archive Protocol (DIAP): http://www.ietf.org/internet-drafts/draft-brasher-diap-04.txt
4. Appropriateness, Fit to Programme and Overall Value to the JISC Community
The strength of underlying infrastructure of all repositories is key to their long-term success. Archiving data
stored in repositories is a fundamental practice and there are few alternatives to traditional methods
described in this proposal. Systems’ environments are ever-changing and offer new opportunities to exploit
resources to help address these challenges. DIASER is a technical opportunity to stimulate change: an
alternative or complementary innovation to traditional archive methodology which retains interoperability with
existing technology and is flexible enough to benefit multiple institutions.
4.1. Permanent Research Output repository
At the University of Southampton, the School of Electronics and Computer Science (ECS) maintains a permanent
EPrints research repository in which academic work from within the school is deposited. This scenario is similar across
many institutions. The ECS EPrints repository is ideally suited to act as the initial evaluation site for
DIASER.
4.2. JISC Repository integration
Discussions with EPrints developers have highlighted opportunities to integrate the DIASER software with
EPrints in two ways:
1) The EPrints architecture includes a storage controller with a plug-in component for different types of
storage. A plug-in to allow EPrints to regularly write to Subversion (SVN) will potentially enable
DIASER to collect SVN contents for subsequent long-term archive storage.
2) A simpler method is to extend DIASER to periodically collect the contents of the EPrints Linux file
structure (/opt/eprints3) and an accompanying backup of the database instance, which is the current
recommended way of (manually) backing up an EPrints installation.
DSpace, when using PostgreSQL storage, can be backed up using standard tools such as pg_dump, at
which point DIASER management can be applied. Similarly, Fedora Commons, which uses MySQL, Oracle or
PostgreSQL for storage, can be placed under DIASER management.
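The file-plus-database collection step described in section 4.2 amounts to two standard operations: dump the database, then archive the file tree. The sketch below only builds the two command lines so that they can be inspected before anything is run. It is an assumption about how such a collection step might look, not DIASER code: the function name, the repository path, the database name and the output directory are all hypothetical, and the `pg_dump` form applies to PostgreSQL-backed repositories only (a repository on another database would use that database's own dump tool).

```python
import posixpath


def build_backup_commands(repo_dir, db_name, out_dir, stamp):
    """Return (pg_dump_cmd, tar_cmd) as argument lists; nothing is executed.

    Hypothetical helper for illustration only.
    """
    dump_path = posixpath.join(out_dir, f"{db_name}-{stamp}.dump")
    tar_path = posixpath.join(out_dir, f"files-{stamp}.tar.gz")

    # pg_dump's custom format is compressed and restorable with pg_restore.
    pg_dump_cmd = ["pg_dump", "--format=custom", f"--file={dump_path}", db_name]

    # Archive the repository tree relative to its parent directory.
    parent, name = posixpath.split(repo_dir.rstrip("/"))
    tar_cmd = ["tar", "-czf", tar_path, "-C", parent, name]
    return pg_dump_cmd, tar_cmd


if __name__ == "__main__":
    dump_cmd, tar_cmd = build_backup_commands(
        "/srv/repository", "dspace", "/var/archive/volumes", "2009-04-01"
    )
    print(" ".join(dump_cmd))
    print(" ".join(tar_cmd))
```

Each list can be passed to `subprocess.run(cmd, check=True)` once the paths have been reviewed; the resulting files are the kind of backup output that could then be placed under archive management.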
4.3. Mixed data and file collection
As the trend towards capturing not only publications, but also datasets and code, increases, we envisage a
scenario where storage of a mixed data and file collection is required. This could include versions of data,
configuration files and software (in source code and executable form) that have been generated by a
research group over the course of several years (e.g. OMII-UK backs up about 100GB to disk and 30GB to
tape). Cambridge University are exploring a similar scenario using Fedora to store entire virtual machine
images to ensure the ability to run code and regenerate datasets in the future.
4.4. Benefits for different users
Users will experience different benefits from using DIASER. Benefits filter in from the systems administrator
via management to the user – ultimately reflecting in advantages to the organisation responsible for
deployment as described previously in this document.
User: Trust and certainty in the reliability of the repository’s data retention; Capabilities will greatly increase
through successful usage over extended periods.
Repository manager: Strategic use of resources when working with restricted repository budgets; An effective
tool to help build robust disaster recovery plans; Consistent use and storage reports generated over many
years; Secure administrative control over archives; Reduced support time and systems administrator
overheads.
Systems administrator: A reliable standard to help build infrastructure; Fast planning times and installation of
an archive facility; Fast and reliable data retrieval; Good archive data validation; A tool to help use archive
allocated system resources efficiently; Providing the flexibility to re-allocate freed up resources; A much
needed consistent user interface; Long-term monitoring and reporting to track and facilitate management of
repository archives; Less time spent supporting and administering archives.
5. Project Plan
The plan is divided into the following four phases:
Phase 1 - (April) Analysis of existing IP, ensure correct partner acknowledgement plans and IPR agreements
are in place. Review of the code function prototypes. Ensure project development infrastructure is in place.
Grant administrative access to existing community website, diap.org.uk and other repositories to project
partners. Produce documentation plan. Ensure IETF-ID accurately reflects the current high level architecture
design. Pinpoint events, e.g. UKUUG Spring conference and 4th International Conference on OSS Systems
which will provide networking and interaction with potential users and evaluators. Arrange remote and on-site
live demonstrations with HE repository managers. Identify opportunities to provide short talks. Identify and
join relevant community mailing lists. Start diaser-dev-alpha application coding.
Phase 2 - (May - July) Weekly release cycle of diaser-dev-alpha. Ensure existing diap.org.uk properly
reflects JISC terms and conditions. Update ID. Incorporate and update existing project documentation, user
instructions, FAQ and diagrams. Deploy development server and virtual machines. Review software feature
list. Attend events and identify additional HE evaluators.
Phase 3 - (August) Maintain diaser-alpha release cycle. Trial software with identified HE evaluators. Identify
and join relevant IETF working group(s). Design and implement reporting and web monitoring interface.
Review software feature list. Attend events.
Phase 4 - (September) Consolidation of code repository, user documentation and FAQ. Apply for IETF RFC
status. Ensure all project materials reflect JISC terms and conditions. Ensure code and documentation are
appropriately released to JISC community with GPL. Publish evaluation reports. Carry out final IPR license
agreement analysis.
6. Project Deliverables
In addition to the standard documentation required by the JISC Project Management guidelines, the following
will be delivered during the course of the project:
Phase 1 – (April) JISC project website and revised supporting community website. Incorporation of IP made
available by Interlinux Ltd.
Phase 2 - (May - July) Refined project documentation. Software feature analysis and interoperability report.
Phase 3 – (August) Trials report based on evaluation of alpha software.
Phase 4 – (September) Perl software installer with at least these capabilities: on demand bandwidth
calculation, time zone adjustment, reporting, system compatibility checks, development mode for rapid
testing, restore tools, node migration tools, node network availability and average differential computation
and upgrade function. Simple web monitoring interface. Accompanying documentation, user guide and FAQ.
Final IETF-ID including extended data retention architecture and IETF-RFC application.
7. Risk Analysis
Risk | Probability (1-5) | Severity (1-5) | Score (P x S) | Action to Prevent/Manage Risk
Staffing. | 1 | 3 | 3 | Careful time management to coincide with staff availability. All staff are in place and available.
Minor technical problems: i.e. difficulty implementing SSH-agent functionality. | 3 | 1 | 3 | Draw on lead institution developer experience.
Major technical problems: i.e. unable to implement a main feature like node migration. | 1 | 4 | 4 | Draw on lead institution developer experience.
Lack of community engagement. | 3 | 2 | 6 | Improve community coverage by providing articles to websites and journals, deliver presentations, deploy demonstrators to potential users and evaluators. Attend networking events.
Time allocated to the project is too little. | 1 | 2 | 2 | Damian Brasher will mitigate this risk by ensuring careful resource allocation and time management.
Unable to exploit due to IPR issues | 1 | 3 | 3 | A Heads of Terms agreement has already been drawn up to ensure that IPR issues have been addressed before the project starts
Timescale to evaluate software at additional institutions is insufficient | 3 | 3 | 9 | The summer may be problematic due to staff absences, however we have already planned one HE repository evaluation.
8. Technical Development
The DIASER project will follow OSS development methodology, utilising a typical Linux development
environment including source control under SVN, and a community website with public releases, Bugzilla for
bug tracking, and developer and user mailing lists. The test environment will include deployment of virtual
machines in geographically different campus locations. The software will be developed with reference to ‘Perl
Best Practices’ (published by O’Reilly). Software will be released using the GNU GPL licence model.
9. Areas of Further Development
Previous exploratory work has identified areas of additional development that may benefit and validate
DIASER development in the future. These include: statistical risk analysis comparison between different
archive systems; chaining together DIASER pools to extend data retention periods and to allow sharing
between multiple deployments of DIASER allowing community services; tighter integration with other Open
Source backup software to streamline restore function; and tighter integration with Open Source repository
software to automate and integrate the database backup process with DIAP.
10. IPR Considerations
The project will have access to IP generated by the company Interlinux Ltd, which is owned and managed by
the project leader and primary technical developer, Damian Brasher. Damian and partners of the company
have agreed to a Heads of Terms agreement drawn up in conjunction with the University of Southampton’s
Research & Innovation Services. On receipt of funding, the Heads of Terms stipulates a second in-licence
agreement, again written in conjunction with Research & Innovation Services, to become operational for the
project duration and 1 year beyond completion of the project. If the University does not make commercial
use of the IPR within this year, then all IPR are returned to Interlinux Ltd. Proper recognition of the lead
institution and JISC involvement will be made according to JISC guidelines and institution in-licence
agreements as noted. IP and existing facilities to be made available by Interlinux Ltd for the purpose of this
project comprise:
• Source code of current implementation.
• Community website including all documentation and graphics – http://www.diap.org.uk
• DIAP® - UK registered trademark number 2466480.
• SourceForge project site including SVN repository and mailing lists – http://sourceforge.net/projects/diap
• IETF Internet-Draft V04 – http://www.ietf.org/internet-drafts/draft-brasher-diap-04.txt
11. Sustainability
After this six-month development phase, the GPL licensed source code will be available to the JISC and the
wider OSS community. By using the outputs of this project, and in conjunction with the EPrints team, the
partners hope to attract enough users and developers from these communities to maintain an up-to-date
code base in the future. Additional funding will be sought after the development period outlined in this
proposal to pursue specific areas of further development. The final IETF-ID and corresponding RFC
application will also be made available through the appropriate IETF channels.
12. Budget
See Appendix A for details. Full Economic Cost (FEC) of the project: £37,646. JISC funding sought (80% of
FEC): £30,117. Institutional contribution (20% of FEC): £7,529. This costing is based on
Damian Brasher (40%) as project manager and software developer, Simon Hettrick (10 days) for marketing
and documentation, and Neil Chue Hong (5%) as Principal Investigator and project advisor. All staff are in
position and available to start at the project start date.
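The funding split follows directly from the FEC. As a quick check of the arithmetic, the sketch below rebuilds the total from the cost lines in Appendix A and applies the 80%/20% split; the variable names are illustrative only, and rounding to the nearest pound is an assumption.

```python
# Cost lines taken from Appendix A (pounds).
directly_incurred = 14_666 + 2_200   # staff (A) + non-staff (B) = C
directly_allocated = 1_753 + 4_585   # staff + estates = D
indirect_costs = 14_442              # E

fec = directly_incurred + directly_allocated + indirect_costs

# 80% of FEC requested from JISC, rounded to the nearest pound using
# integer arithmetic to avoid floating-point surprises.
jisc_request = (fec * 80 + 50) // 100
institutional_contribution = fec - jisc_request

print(fec, jisc_request, institutional_contribution)
```

The three printed values correspond to the total project cost, the amount requested from JISC and the institutional contribution.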
13. Partners
Lead institution - OMII-UK (Southampton University) is an Open Source organisation that empowers the
UK research community by developing and maintaining software for researchers - software such as DIAP.
OMII-UK has brought together, developed and sustained popular and widely used software ranging across
the scientific software stack, from programming environments aimed at developers of scientific software to
high-level tools aimed at e-Scientists and research informaticians.
Association with OMII-UK provides DIAP with access to technical resources and decades of software
experience. OMII-UK is an organisation with a long history of understanding the nature and difficulties faced
by open distributed systems, and is highly experienced at maintaining OSS. OMII-UK are advocates of – and
adherents to – the OSS development methodology. This includes licence management and integration,
which will be of significant value to DIAP development. Security within distributed computing can be a
challenging problem, which means that users are understandably cautious about the security issues
surrounding the use of such software. This is an area in which OMII-UK has developed extensive experience
by overcoming the security problems experienced by different Grid and e-Science communities.
DIAP is a generic technology, which will benefit from development as a Grid/e-Science project by utilising
existing distributed-computing resources. OMII-UK is the UK’s leading organisation for development within
the Grid/e-Science community, meaning that collaboration with OMII-UK will be invaluable to the DIAP
project.
Interlinux Ltd - A small research company founded and run primarily by Damian Brasher with the help of
his partners. The company enables safe management of the IP generated during the exploratory phase of
DIAP development. Damian Brasher is the sole author of the DIAP IETF-ID. Partner Myles McClelland supported
the early phase of the project between 2005 and 2007.
14. Key Personnel
Damian Brasher BSc (Open) RHCE MBCS - Project lead, management and primary technical developer.
Damian is a Linux systems administrator for OMII-UK. He has nine years’ experience in the IT industry
designing and maintaining systems infrastructure for non-profit organisations, the public sector and small
business. His work in the non-profit sector, between 1999 and 2006, enabled the rapid growth of an
organisation providing much needed support to NHS services. During this time, he wrote and implemented
several ICT strategies and has made extensive use of OSS technologies throughout his career. Damian is
professionally qualified as a Red Hat Certified Engineer, one of the most widely respected industrial IT
certifications, and has begun studying at systems architecture level with Red Hat global learning services.
He obtained his Open University degree in IT and mathematics in 2003. Damian has been actively involved
in the local OSS community with Hampshire Linux Users Group since 2004, facilitating meetings and giving
talks, and is a regular contributor to numerous technical mailing lists. Damian is also a member of the
British Computer Society. Damian established the company Interlinux Ltd with his partners in 2005 (see
section 13 above).
Dr Simon J Hettrick - Documentation, graphics and marketing.
Simon Hettrick is the Publicity Coordinator for OMII-UK. He has three years’ experience of managing the
publicity strategy for OMII-UK and preparing publicity materials. Simon is responsible for the development of
a successful, quarterly newsletter, the OMII-UK website, interaction with the press, documentation and he
organises the company’s presence at conferences and events.
Neil P Chue Hong – Project advisor.
Neil Chue Hong is Director of OMII-UK, working with e-Research software projects to achieve sustainability.
He is co-chair of the “Grids meet Repositories” series of workshops. He sits on the boards of the Open Grid
Forum, UK National Grid Service, OSS-Watch, nanoCMOS, ADMIRE, and Globus Incubator Project. He
spent five years managing the data access and integration programme at EPCC, including OGSA-DAI. Prior
to this, he was a technology transfer consultant working with Scottish SMEs.
Appendix A - Budget
Directly Incurred Staff | Apr09–Mar10 | Apr10–Mar11 | Apr11–Mar12 | TOTAL £
Damian Brasher, Software Developer, 40% FTE | £7,728 | - | - | £7,728
Simon Hettrick, Technical Author, 10 days (~10% FTE) | £6,938 | - | - | £6,938
Total Directly Incurred Staff (A) | £14,666 | - | - | £14,666

Non-Staff | Apr09–Mar10 | Apr10–Mar11 | Apr11–Mar12 | TOTAL £
Travel and expenses | £400 | - | - | £400
Hardware/software | £1,200 | - | - | £1,200
Dissemination | - | - | - | -
Evaluation | - | - | - | -
Other | £600 | - | - | £600
Total Directly Incurred Non-Staff (B) | £2,200 | - | - | £2,200
Directly Incurred Total (C) (A+B=C) | £16,866 | - | - | £16,866

Directly Allocated | Apr09–Mar10 | Apr10–Mar11 | Apr11–Mar12 | TOTAL £
Staff | £1,753 | - | - | £1,753
Estates | £4,585 | - | - | £4,585
Other | £ | - | - | £
Directly Allocated Total (D) | £6,338 | - | - | £6,338

Indirect Costs (E) | £14,442 | - | - | £14,442
Total Project Cost (C+D+E) | £37,646 | - | - | £37,646
Amount Requested from JISC | £30,117 | - | - | £30,117
Institutional Contributions | £7,529 | - | - | £7,529

Percentage Contributions over the life of the project | JISC 80% FEC | Partners X% | Institution 20% FEC | Total 100%