Cover Sheet for Proposals
(Please complete ALL sections)

IE Programme

Name of Call Area Bidding For (tick ONE only): Strand A4: Repositories: rapid innovation
Name of Lead Institution: University of Southampton
Name of Proposed Project: Distributed Internet Archive System for Educational Repositories – DIASER
Name(s) of Project Partner(s): OMII-UK and Interlinux LTD

Full Contact Details for Primary Contact:
Name: Damian Brasher BSc RHCE MBCS
Position: Systems administrator / programmer
Email: [email protected]
Tel: 023 8059 8862
Fax: 023 8059 8870
Address: OMII-UK, B32, University of Southampton, Southampton, SO17 1BJ, United Kingdom

Length of Project: 6 Months
Project Start Date: 1st April 2009
Project End Date: 30th September 2009

Total Funding Requested from JISC: £30,117
Funding Broken Down over Financial Years (April - March):
April 09 – March 10: £37,646
April 10 – March 11: -
April 11 – March 12: -
Total Institutional Contributions: £7,529

Outline Project Description:
The role of institutional and domain-specific repositories as a mechanism for capturing the research outputs from projects is growing, driven by initiatives such as the EC FP7 Open Access Pilot and GRL2020. As noted by the JISC in the reports “Keeping research data safe” and “Economic Implications of Alternative Scholarly Publishing Models”, it is vitally important that these repositories gain the trust and respect of local and international users by properly safeguarding the information held in them.

Organisations responsible for repositories face the same challenge: how to store and manage ever-increasing amounts of archive data within budget, and how to ensure that archived data is safe from loss in the long term while maximising its availability in case restoration is required. Infrastructure that can provide a high level of resilience can be expensive and slow. DIAP (Distributed Internet Archive Protocol) software has been developed by Interlinux Ltd to address these challenges.

Elements of DIAP have been successfully introduced at OMII-UK, where they provide efficient archival of software development infrastructure and display significant benefits over other Open Source archival software. In response, development of the Distributed Internet Archive System for Educational Repositories (DIASER) is being proposed. This project will build on the initial DIAP innovation to improve the long-term robustness and resilience of repositories by realising, and making accessible to learning and research organisations, the benefits of storing replica archives across networks in different geographical locations. It will provide working tools and a consistent high-level architecture with which the wider JISC repository community can assess this complementary, service-building archive technology and move towards establishing best practice.

I have looked at the example FOI form at Appendix B and included an FOI form in the attached bid (Tick Box): YES / NO
I have read the Call and associated Terms and Conditions of Grant at Appendix D (Tick Box): YES / NO

FOI Withheld Information Form

We would like JISC to consider withholding the following sections or paragraphs from disclosure, should the contents of this proposal be requested under the Freedom of Information Act, or if we are successful in our bid for funding and our project proposal is made available on JISC’s website. We acknowledge that the FOI Withheld Information Form is of indicative value only and that JISC may nevertheless be obliged to disclose this information in accordance with the requirements of the Act. We acknowledge that the final decision on disclosure rests with JISC.

Section / Paragraph No.: n/a
Relevant exemption from disclosure under FOI: n/a
Justification: n/a

There is no information in this bid that we would wish to be withheld.

1. Introduction

With the growing uptake and usage of repositories, it is important that valuable research data held within the repositories is kept safe.
Existing repository systems such as EPrints, DSpace and Fedora Commons all recommend a regular backup process. Mixed file stores also need a regular backup process. Archiving in this proposal refers to the long-term storage and management of data that has been backed up using existing systems.

Organisations responsible for repositories face the same challenge: how to store and manage increasing amounts of archive data within budget, and how to ensure that archived data is safe from loss whilst maximising its availability in case restoration is required. This project aims to improve the repository manager’s ability to help the organisation responsible for repositories meet these challenges. Efficient multisite, distributed archival infrastructure is often slow and/or expensive, necessitating new, innovative methods of ensuring repository robustness.

The JISC wishes to support the development of repository foundations by stimulating growth of new technology designed to improve repositories. In response, the Distributed Internet Archive System for Educational Repositories (DIASER) project will build on DIAP to achieve this. DIAP is a new system that provides a long-term integrated distributed archiving solution via a single Open Source Software (OSS) interface. It utilises new or existing commodity disk storage and saves the repository manager manpower, cost, administration time and support time.

The remainder of this proposal describes in detail how this innovation will address these challenges, how local and national learning and research institutions will benefit from the work, the technical development and specific usage scenarios.

2. Background

The role of institutional and domain-specific repositories as a mechanism for capturing the research outputs from projects is growing, driven by initiatives such as the EC FP7 Open Access Pilot [1], which states that “Researchers should deposit final articles or manuscripts into the institutional repository of the research institution with which they are affiliated. If this is not possible, they should identify an appropriate subject based/thematic repository”, and GRL2020 [2]. Since 2006, the Wellcome Trust “expects authors of research papers to maximise the opportunities to make their results freely available” [3]. JISC has noted in the report Keeping research data safe [4] that “HEIs should consider federated structures for local data storage within their institution comprising data stores at the departmental level and additional storage and services at the institutional level”, and in Economic Implications of Alternative Scholarly Publishing Models [5] that “a system of institutional repositories in UK higher education in which every institution had one publications-oriented repository and all publications were self-archived once would cost around £20 million per annum”. It is vitally important that these repositories gain the trust and respect of local and international users by properly safeguarding the information held in them.

DIASER builds on DIAP, which exists to keep data safe. Using the experience of project partners and the close collaboration of the EPrints team at Southampton, this work will explore ideas for improving the robustness and resilience of repositories, develop novel ways of combining repositories with archival services and investigate interoperability with existing archive systems.

3. Technical Overview of DIAP (DIASER)

3.1 DIAP enables the systems administrator to distribute full and differential volumes created using standard backup software for enhanced interoperability (see Figure 1).
It uses a well-defined and carefully tested architecture which stores and manages replica archives across multiple geographically distributed nodes. DIAP can contain mixed data sources if the data has been collected and fed into volumes (the disk equivalent to a tape). DIAP has a predefined structure and has many advantages over standard disk-mirroring techniques, especially when managing medium- to long-term archives. DIAP is a disk-based archive system. The DIAP archive administrator retains full administrative control with the option of encryption when generating volumes.

Unique properties of DIAP:
1) Designed to operate over multiple geographical locations.
2) Can be optimised to cope with long-term archive requirements.
3) Allows automatic and semi-automatic adjustments during loss or migration of storage nodes.
4) Designed to allow adjustments over time to compensate for changing network conditions.
5) Able to quickly optimise its operation if data sets suddenly increase within set parameters.
6) Designed to provide accurate status reporting to users.
7) Single tool which combines the functionality of existing OSS backup tools, and provides additional administrative capabilities making it suitable for managing important data.

[1] Open Access Pilot in FP7: http://ec.europa.eu/research/science-society/open_access
[2] Global Research Library 2020 – A Vision for a Global Research Library: http://www.grl2020.net/
[3] Wellcome Trust Open Access Policy: http://www.wellcome.ac.uk/About-us/Policy/Policy-and-position-statements/WTD002766.htm
[4] Keeping Research Data Safe: http://www.jisc.ac.uk/publications/publications/keepingresearchdatasafe.aspx
[5] Economic Implications of Alternative Scholarly Publishing Models: http://www.jisc.ac.uk/publications/publications/economicpublishingmodelsfinalreport.aspx

3.2 Repository administrators archive repositories – these can range from <100GB to several TB or more – using tape and disk.
For example, the University of Southampton’s Electronics and Computer Science (ECS) EPrints repository is approximately 4.5TB and is archived to tape. Disk backup is often made to custom network attached storage (NAS) devices. OSS mirroring and network backup tools may also be used. These solutions are standardised and well understood, but they have limitations which DIAP does not.

Tape drives can have a high initial cost of ownership and large maintenance overheads. Tape positioning times for data restoration are much slower than disk. To benefit from geographical resilience, tapes need to be manually moved between sites. Tapes need to be changed manually, or expensive auto-changers purchased. Long-term tape storage can consume additional resources because tapes are fragile and magnetically and environmentally sensitive.

NAS devices solve some of these limitations, but do not address the issue of geographical robustness. Quality NAS disk arrays are expensive and prone to non-standard vendor extensions. NAS devices also need to reside in environmentally controlled locations, which is not feasible for many organisations.

3.3 DIAP architecture uses a round robin allocation between nodes where the data transfer rate is dynamically controlled, including compression, according to load and availability. Long-term storage is achieved by extending the volume retention period. Currently DIAP is implemented as a Perl application, released under the GNU General Public License, which features an installation script with the following functions: usage, node account and directory generation, RSA key generation, automatic key sharing between nodes, crontab definitions, and uninstall. It is based on the IETF Networking Group Internet Draft “Distributed Internet Archive Protocol (DIAP)” [6]. To maximise interoperability, the system conforms with POSIX (Portable Operating System Interface). Perl was chosen for its maturity, flexibility and ability to work well in heterogeneous environments.
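The round robin allocation described in 3.3 can be illustrated with a minimal sketch. It is written in Python purely for illustration (DIAP itself is a Perl application), and the function name, volume labels and node names are hypothetical rather than taken from the DIAP source or the Internet-Draft. The sketch covers only the placement of replicas in rotation; dynamic transfer rate control and compression are not modelled.

```python
from itertools import cycle, islice


def assign_replicas(volumes, nodes, replicas=3):
    """Place each volume on `replicas` distinct nodes, rotating the start node.

    Hypothetical helper for illustration only; not part of DIAP.
    """
    if not 1 <= replicas <= len(nodes):
        raise ValueError("replicas must be between 1 and the number of nodes")
    placement = {}
    for index, volume in enumerate(volumes):
        # Each successive volume starts one node further round the ring.
        start = index % len(nodes)
        # Take `replicas` consecutive nodes from the ring, wrapping at the end.
        placement[volume] = list(islice(cycle(nodes), start, start + replicas))
    return placement


if __name__ == "__main__":
    plan = assign_replicas(
        ["full-01", "diff-01", "diff-02"],
        ["node-a", "node-b", "node-c"],
        replicas=2,
    )
    for volume, targets in plan.items():
        print(volume, "->", ", ".join(targets))
```

Rotating the starting node spreads the write load evenly, and because every volume lands on distinct nodes, the loss of one node still leaves at least one replica whenever `replicas` is 2 or more.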
Initial storage calculations use a DIAP architecture designed to provide 30 days of data retention. Network capacity varies considerably with network hardware and conditions, so the figures given here are estimates. The software includes a function prototype to calculate bandwidth capabilities on an ad-hoc basis. LMB is the lowest maximum bandwidth available between any two or three nodes. Differential volume sizes are estimated at 500MB/day. For three nodes, triple redundancy and 12 hours of daily operation, an LMB of 1Gbit/s yields an approximate total DIAP storage node capacity of 10.8TB over 30 days. As a reference point, a 1Gbit/s link sustained for 12 hours can move at most about 5.4TB per day (125MB/s × 43,200s). Source data without compression is just under half of total DIAP storage capacity. With optimised use of switches, and possibly Ethernet channel bonding and data deduplication, storage capacity has the potential to be increased.

3.4 DIAP competes directly with commercial and custom-built mid- to low-range NAS Virtual Tape Library (VTL) backup hardware and software solutions, which in turn compete with mid- to low-range LTO tape drive units, all in the ~10TB storage capacity range. The DIAP design is not based on these commercial solutions. There are no OSS alternatives which provide the same functionality as DIAP. Some tools, including Rsync, Rdiff, Bacula, Amanda and BackupPC, provide different types of backup functionality. Rdiff is the nearest match to DIAP, but it does not interoperate with existing backup software or offer a feature set or scalability comparable to DIAP. DiGS is a distributed storage system designed to use metadata and application-specific datasets; however, DiGS does not interoperate in a generic way. There are also internet-based data storage services: Amazon S3 offers competitive storage services. These types of service are limited by internet bandwidth, cost, availability and issues of long-term provision of service.

[6] Distributed Internet Archive Protocol (DIAP): http://www.ietf.org/internet-drafts/draft-brasher-diap-04.txt

4. Appropriateness, Fit to Programme and Overall Value to the JISC Community

The strength of the underlying infrastructure of all repositories is key to their long-term success. Archiving data stored in repositories is a fundamental practice and there are few alternatives to the traditional methods described in this proposal. Systems environments are ever-changing and offer new opportunities to exploit resources to help address these challenges. DIASER is a technical opportunity to stimulate change: an alternative or complementary innovation to traditional archive methodology which retains interoperability with existing technology and is flexible enough to benefit multiple institutions.

4.1. Permanent Research Output repository

At the University of Southampton, the School of Electronics and Computer Science maintains a permanent EPrints research repository in which academic work from within the school is deposited. This scenario is similar across many institutions. The ECS EPrints repository is ideally suited to act as the initial evaluation site for DIASER.

4.2. JISC Repository integration

Discussions with EPrints developers have highlighted opportunities to integrate the DIASER software with EPrints in two ways:
1) The EPrints architecture includes a storage controller with a plug-in component for different types of storage. A plug-in to allow EPrints to regularly write to Subversion (SVN) will potentially enable DIASER to collect SVN contents for subsequent long-term archive storage.
2) A simpler method is to extend DIASER to periodically collect the contents of the EPrints Linux file structure, /opt/eprints3, and an accompanying backup of the database instance, which is the current recommended way of (manually) backing up an EPrints installation.

DSpace, when using PostgreSQL storage, can be backed up using standard tools, i.e. pg_dump, at which point DIASER management can be applied.
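The periodic collection described in 4.2 could look roughly like the following sketch, which packs a repository file tree into a dated volume and builds a pg_dump command for a PostgreSQL-backed repository. It is an illustration in Python under stated assumptions, not DIASER code: the function names, volume labels, paths and the database name "dspace" are hypothetical. Only the standard `pg_dump` options `--format=custom` and `--file` are used, and the command is built but not executed here.

```python
import tarfile
from datetime import date
from pathlib import Path


def archive_tree(source_dir, volume_dir, label):
    """Write a gzip-compressed tar volume of source_dir and return its path."""
    source = Path(source_dir)
    volume = Path(volume_dir) / f"{label}-{date.today():%Y%m%d}.tar.gz"
    with tarfile.open(volume, "w:gz") as tar:
        # arcname keeps paths inside the volume relative to the tree's own name.
        tar.add(source, arcname=source.name)
    return volume


def pg_dump_command(database, volume_dir, label):
    """Build (but do not run) a pg_dump command writing a custom-format dump."""
    dump = Path(volume_dir) / f"{label}-{date.today():%Y%m%d}.dump"
    return ["pg_dump", "--format=custom", f"--file={dump}", database]
```

In use, a scheduled job might call `archive_tree("/opt/eprints3", "/srv/volumes", "eprints-files")` and pass the result of `pg_dump_command("dspace", "/srv/volumes", "dspace-db")` to `subprocess.run(..., check=True)`; the resulting files are the kind of volumes that would then be handed over for distribution. The `/srv/volumes` directory is an assumed location.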
Similarly, Fedora Commons, which uses MySQL, Oracle or PostgreSQL for storage, can be placed under DIASER management.

4.3. Mixed data and file collection

As the trend towards capturing not only publications but also datasets and code increases, we envisage a scenario where storage of a mixed data and file collection is required. This could include versions of data, configuration files and software (in source code and executable form) that have been generated by a research group over the course of several years (e.g. OMII-UK backs up about 100GB to disk and 30GB to tape). Cambridge University are exploring a similar scenario using Fedora to store entire virtual machine images to ensure the ability to run code and regenerate datasets in the future.

4.4. Benefits for different users

Users will experience different benefits from using DIASER. Benefits flow from the systems administrator, via management, to the user, and are ultimately reflected in advantages to the organisation responsible for deployment, as described previously in this document.

User: Trust and certainty in the reliability of the repository’s data retention; Capabilities will greatly increase through successful usage over extended periods.

Repository manager: Strategic use of resources when working with restricted repository budgets; An effective tool to help build robust disaster recovery plans; Consistent use and storage reports generated over many years; Secure administrative control over archives; Reduced support time and systems administrator overheads.
Systems administrator: A reliable standard to help build infrastructure; Fast planning times and installation of an archive facility; Fast and reliable data retrieval; Good archive data validation; A tool to help use archive-allocated system resources efficiently; The flexibility to re-allocate freed-up resources; A much-needed consistent user interface; Long-term monitoring and reporting to track and facilitate management of repository archives; Less time spent supporting and administering archives.

5. Project Plan

The plan is divided into the following four phases:

Phase 1 - (April) Analysis of existing IP; ensure correct partner acknowledgement plans and IPR agreements are in place. Review of the code function prototypes. Ensure project development infrastructure is in place. Grant project partners administrative access to the existing community website, diap.org.uk, and other repositories. Produce documentation plan. Ensure IETF-ID accurately reflects the current high-level architecture design. Pinpoint events, e.g. UKUUG Spring conference and 4th International Conference on OSS Systems, which will provide networking and interaction with potential users and evaluators. Arrange remote and on-site live demonstrations with HE repository managers. Identify opportunities to provide short talks. Identify and join relevant community mailing lists. Start diaser-dev-alpha application coding.

Phase 2 - (May - July) Weekly release cycle of diaser-dev-alpha. Ensure existing diap.org.uk properly reflects JISC terms and conditions. Update ID. Incorporate and update existing project documentation, user instructions, FAQ and diagrams. Deploy development server and virtual machines. Review software feature list. Attend events and identify additional HE evaluators.

Phase 3 - (August) Maintain diaser-alpha release cycle. Trial software with identified HE evaluators. Identify and join relevant IETF working group(s).
Design and implement reporting and web monitoring interface. Review software feature list. Attend events.

Phase 4 - (September) Consolidation of code repository, user documentation and FAQ. Apply for IETF RFC status. Ensure all project materials reflect JISC terms and conditions. Ensure code and documentation are appropriately released to the JISC community under the GPL. Publish evaluation reports. Carry out final IPR licence agreement analysis.

6. Project Deliverables

In addition to the standard documentation required by the JISC Project Management guidelines, the following will be delivered during the course of the project:

Phase 1 – (April) JISC project website and revised supporting community website. Incorporation of IP made available by Interlinux Ltd.

Phase 2 - (May - July) Refined project documentation. Software feature analysis and interoperability report.

Phase 3 – (August) Trials report based on evaluation of alpha software.

Phase 4 – (September) Perl software installer with at least these capabilities: on-demand bandwidth calculation, time zone adjustment, reporting, system compatibility checks, development mode for rapid testing, restore tools, node migration tools, node network availability and average differential computation, and an upgrade function. Simple web monitoring interface. Accompanying documentation, user guide and FAQ. Final IETF-ID including extended data retention architecture and IETF-RFC application.

7. Risk Analysis

Risk | Probability (1-5) | Severity (1-5) | Score (P x S) | Action to Prevent/Manage Risk
Staffing. | 1 | 3 | 3 | Careful time management to coincide with staff availability. All staff are in place and available.
Minor technical problems: i.e. difficulty implementing SSH-agent functionality. | 3 | 1 | 3 | Draw on lead institution developer experience.
Major technical problems: i.e. unable to implement a main feature like node migration. | 1 | 4 | 4 | Draw on lead institution developer experience.
Lack of community engagement. | 3 | 2 | 6 | Improve community coverage by providing articles to websites and journals, delivering presentations and deploying demonstrators to potential users and evaluators. Attend networking events.
Time allocated to the project is too little. | 1 | 2 | 2 | Damian Brasher will mitigate this risk by ensuring careful resource allocation and time management.
Unable to exploit due to IPR issues. | 1 | 3 | 3 | A Heads of Terms agreement has already been drawn up to ensure that IPR issues have been addressed before the project starts.
Timescale to evaluate software at additional institutions is insufficient. | 3 | 3 | 9 | The summer may be problematic due to staff absences; however, we have already planned one HE repository evaluation.

8. Technical Development

The DIASER project will follow OSS development methodology, utilising a typical Linux development environment including source control under SVN, a community website with public releases, Bugzilla for bug tracking, and developer and user mailing lists. The test environment will include deployment of virtual machines in geographically different campus locations. The software will be developed with reference to ‘Perl Best Practices’, published by O’Reilly. Software will be released using the GNU GPL licence model.

9. Areas of Further Development

Previous exploratory work has identified areas of additional development that may benefit and validate DIASER development in the future. These include: statistical risk analysis comparison between different archive systems; chaining together DIASER pools to extend data retention periods and to allow sharing between multiple deployments of DIASER, enabling community services; tighter integration with other Open Source backup software to streamline the restore function; and tighter integration with Open Source repository software to automate and integrate the database backup process with DIAP.

10. IPR Considerations

The project will have access to IP generated by the company Interlinux Ltd, which is owned and managed by the project leader and primary technical developer, Damian Brasher. Damian and partners of the company have agreed to a Heads of Terms agreement drawn up in conjunction with the University of Southampton’s Research & Innovation Services. On receipt of funding, the Heads of Terms stipulates a second in-licence agreement, again written in conjunction with Research & Innovation Services, to become operational for the project duration and 1 year beyond completion of the project. If the University does not make commercial use of the IPR within this year, then all IPR are returned to Interlinux Ltd. Proper recognition of the lead institution and JISC involvement will be made according to JISC guidelines and institution in-licence agreements as noted.

IP and existing facilities to be made available by Interlinux Ltd for the purpose of this project comprise:
• Source code of current implementation.
• Community website including all documentation and graphics – http://www.diap.org.uk
• DIAP® - UK registered trademark number 2466480.
• SourceForge project site including SVN repository and mailing lists – http://sourceforge.net/projects/diap
• IETF Internet-Draft V04 – http://www.ietf.org/internet-drafts/draft-brasher-diap-04.txt

11. Sustainability

After this six-month development phase, the GPL-licensed source code will be available to the JISC and the wider OSS community. By using the outputs of this project, and in conjunction with the EPrints team, the partners hope to attract enough users and developers from these communities to maintain an up-to-date code base in the future. Additional funding will be sought after the development period outlined in this proposal to pursue specific areas of further development. The final IETF-ID and corresponding RFC application will also be made available through the appropriate IETF channels.

12. Budget

Budget (see Appendix A for details). This bid requests 80% of the Full Economic Cost (FEC).
Full Economic Cost (FEC): £37,646
JISC funding sought (80% of FEC): £30,117
Institutional contribution: £7,529

This costing is based on Damian Brasher (40%) as project manager and software developer, Simon Hettrick (10 days) for marketing and documentation, and Neil Chue Hong (5%) as Principal Investigator and project advisor. All staff are in position and available to start at the project start date.

13. Partners

Lead institution - OMII-UK (Southampton University) is an Open Source organisation that empowers the UK research community by developing and maintaining software for researchers, software such as DIAP. OMII-UK has brought together, developed and sustained popular and widely used software ranging across the scientific software stack, from programming environments aimed at developers of scientific software to high-level tools aimed at e-Scientists and research informaticians. Association with OMII-UK provides DIAP with access to technical resources and decades of software experience. OMII-UK has a long history of understanding the nature of, and difficulties faced by, open distributed systems, and is highly experienced at maintaining OSS. OMII-UK are advocates of, and adherents to, the OSS development methodology. This includes licence management and integration, which will be of significant value to DIAP development. Security within distributed computing can be a challenging problem, and users are understandably cautious about the security issues surrounding the use of such software. This is an area in which OMII-UK has developed extensive experience by overcoming the security problems experienced by different Grid and e-Science communities. DIAP is a generic technology, which will benefit from development as a Grid/e-Science project by utilising existing distributed-computing resources.
OMII-UK is the UK’s leading organisation for development within the Grid/e-Science community, meaning that collaboration with OMII-UK will be invaluable to the DIAP project.

Interlinux LTD - The small research company founded and run primarily by Damian Brasher with the help of his partners. The company enables safe management of the IP generated during the exploratory phase of DIAP development. Damian Brasher is the sole author of the DIAP IETF-ID. Partner Myles McClelland supported the early phase of the project between 2005 and 2007.

14. Key Personnel

Damian Brasher BSc (Open) RHCE MBCS - Project lead, management and primary technical developer. Damian is a Linux systems administrator for OMII-UK. He has nine years’ experience in the IT industry designing and maintaining systems infrastructure for non-profit organisations, the public sector and small business. His work in the non-profit sector, between 1999 and 2006, enabled the rapid growth of an organisation providing much-needed support to NHS services. During this time, he wrote and implemented several ICT strategies, and he has made extensive use of OSS technologies throughout his career. Damian is professionally qualified as a Red Hat Certified Engineer, one of the most widely respected industrial IT certifications, and has begun studying at systems architecture level with Red Hat global learning services. He obtained his Open University degree in IT and mathematics in 2003. Damian has been actively involved in the local OSS community with Hampshire Linux Users Group since 2004, facilitating meetings and giving talks, and is a regular contributor to numerous technical mailing lists. Damian is also a member of the British Computer Society. Damian established the company Interlinux Ltd with his partners in 2005; see section 13 above.

Dr Simon J Hettrick - Documentation, graphics and marketing. Simon Hettrick is the Publicity Coordinator for OMII-UK.
He has three years’ experience of managing the publicity strategy for OMII-UK and preparing publicity materials. Simon is responsible for the development of a successful quarterly newsletter, the OMII-UK website, interaction with the press and documentation, and he organises the company’s presence at conferences and events.

Neil P Chue Hong – Project advisor. Neil Chue Hong is Director of OMII-UK, working with e-Research software projects to achieve sustainability. He is co-chair of the “Grids meet Repositories” series of workshops. He sits on the boards of the Open Grid Forum, UK National Grid Service, OSS-Watch, nanoCMOS, ADMIRE, and Globus Incubator Project. He spent five years managing the data access and integration programme at EPCC, including OGSA-DAI. Prior to this, he was a technology transfer consultant working with Scottish SMEs.

Appendix A - Budget

Directly Incurred Staff | Apr09–Mar10 | Apr10–Mar11 | Apr11–Mar12 | TOTAL £
Damian Brasher, Software Developer, 40% FTE | £7,728 | - | - | £7,728
Simon Hettrick, Technical Author, 10 days (~10% FTE) | £6,938 | - | - | £6,938
Total Directly Incurred Staff (A) | £14,666 | - | - | £14,666

Non-Staff | Apr09–Mar10 | Apr10–Mar11 | Apr11–Mar12 | TOTAL £
Travel and expenses | £400 | - | - | £400
Hardware/software | £1,200 | - | - | £1,200
Dissemination | - | - | - | -
Evaluation | - | - | - | -
Other | £600 | - | - | £600
Total Directly Incurred Non-Staff (B) | £2,200 | - | - | £2,200
Directly Incurred Total (C) (A+B=C) | £16,866 | - | - | £16,866

Directly Allocated | Apr09–Mar10 | Apr10–Mar11 | Apr11–Mar12 | TOTAL £
Staff | £1,753 | - | - | £1,753
Estates | £4,585 | - | - | £4,585
Other | £ | - | - | £
Directly Allocated Total (D) | £6,338 | - | - | £6,338

Indirect Costs (E) | £14,442 | - | - | £14,442
Total Project Cost (C+D+E) | £37,646 | - | - | £37,646
Amount Requested from JISC | £30,117 | - | - | £30,117
Institutional Contributions | £7,529 | - | - | £7,529

Percentage Contributions over the life of the project
JISC: 80% FEC
Partners: X%
Institution: 20% FEC
Total: 100%