GGBN Data Readiness and Handling
Gabi Droege

Overview
• Behind the scenes, work in progress
• Basic Architecture
• GGBN Data Standard
• Harvester
• Data Quality
• Recommendations on collecting and storing
  • Specimen Data
  • Relationships
  • Basic DNA lab data
• Software for Biobanking Management: DNA Module

Best Practice Recommendations
• Working on guidelines related to GGBN data, e.g.
  • Collecting and storing data of voucher specimens and tissue samples
  • Collecting and storing DNA samples and basic molecular lab data (e.g. DNA extraction)
  • Collecting and storing extended molecular lab data (e.g. sequencing details)
  • Handling of data related to RNA, proteins etc.
  • Publishing data in journals with proper voucher/tissue/DNA information
  • ABS, CBD, and Nagoya Protocol (data) requirements
  • How to use GGBN data in other portals (technical issues, citation policy)
• Recommendations will be available in the GGBN Library

Standardised Data Exchange
• Two main systems in use by GBIF providers
  • IPT -> Darwin Core Archive
  • BioCASE -> ABCD (Access to Biological Collection Data)
• A mapping between Darwin Core and ABCD exists
• Both work perfectly for sharing specimen data, but DNA facts were missing
• 2009: DNA extension for ABCD published
• 2014: GGBN Data Standard published, based on ABCDDNA

GGBN & GBIF
• GGBN uses the established GBIF infrastructure
• GGBN members must become GBIF providers
• GGBN members must provide underlying specimen data to GBIF

Basic Architecture
[Diagram; labels: Provider, IPT, Harvester (HIT), Data Cleaning, Index (MySQL): to enable search functionality, to speed up queries; Get full single record]

GGBN Data Standard
• Aim: shared vocabulary, mandatory and recommended parameters
• Last review by GGBN community: March 2014
• To be used with ABCD or Darwin Core
• 83 new elements and 22 elements from MIxS, ABCD, DwC, and DC

GGBN Data Standard
• Coverage
  • Sample facts (preparation, extraction, preservation)
  • Permits and Loaning Conditions
  • Gel Image
  • Amplifications, Sequencings, Single Reads (incl.
    chromatograms)
  • Genetic Accession Numbers
  • DNA Cloning
• Note: Reference to voucher specimen can be done with ABCD or Darwin Core already

GGBN Data Standard
• http://terms.tdwg.org/GGBN_Data_Standard

GGBN Data Standard
• Implemented in IPT/DwC-A and BioCASE/ABCD
• Test data:
  • NMNH (IPT/DwC-A-GGBN)
  • AM (DwC-A-GGBN w/o IPT)
  • DSMZ (ABCDGGBN)
  • BGBM (ABCDGGBN & IPT/DwC-A-GGBN)
[Figure labels: DwC-A-GGBN, ABCDGGBN]

Harvesting and Indexing Toolkit (HIT)
• Developed by GBIF
• Can handle ABCD, DwC, DwC-A
• Expanded by BGBM for GGBN and other Special Interest Networks to
  • allow indexing of
    • Multiple determinations
    • Multimedia
    • Associated records (DNA -> tissue -> voucher)
    • Selected records of a dataset
    • GGBN facts
  • create reports for providers
  • include data cleaning routines
• Additional source code will be made available as open source

Data Quality
Done after harvesting, using available libraries and newly written routines
Cleaning/checking of:
• Country names, ISO codes
• Country name and coordinates (do they match?)
• Transform coordinates into decimal degrees
• Transform collection date into ISO format
• Extract collection year
• Seas and oceans (no ISO list available)
• Scientific names (GBIF name parser, taxonomic backbone)
• Associated records available?

Specimen Data, Collecting Event
• Typical GBIF record
  • Scientific Name including Author/Year
  • Country where collected
  • Locality and Coordinates* where collected
  • Collectors
  • Collector's Number*
  • CatalogNumber (e.g. barcode)
  • Place of deposit (InstitutionCode and CollectionCode)
  • Stable identifier (GUID)
• GBIF record elements are mandatory for GGBN
• If unknown, say "unknown", or "Indet." for the scientific name
• Data should be provided to GGBN and GBIF
*if applicable
[Slide annotation: GBIF identifiers]

Data Quality Recommendations: Scientific Names
• Check scientific names, make sure Author (and Year) is provided
  • -> helps to match with the Taxonomic Backbone!
• Store names structured (Genus, Specific Epithet, Rank, Infraspecific Authorship, Author and Year)
• Avoid "sp.", "sp. n." etc. (e.g. Abies, not Abies sp.)
• Store rank names of higher taxa in Latin (e.g. familia, not family)
• Store at least one higher taxon level (usually familia, ordo, or classis)
  • -> helps to clarify homonyms in the portal!

Data Quality Recommendations: Country
• Avoid free text fields in your database for country names!
• Use ISO country lists instead and let users select from this list
• Match countries that no longer exist with the ISO list, store the original country name in the locality field

Data Quality Recommendations: Country – Typical Examples
Provided        Cleaned into
England         Great Britain
Soviet Union    Unknown or unspecified Eurasia
Atlantic        Unknown or unspecified Country; Atlantic Ocean
Korea           Unknown or unspecified Asia
No data         Unknown or unspecified Country

Principles:
Record : Country       1:1
Record : Sea           1:1
Country : Continent    1:n
Country : Ocean        1:n
Sea : Ocean            1:n

Data Quality Recommendations
Use unique numbers for every physical object!
• Voucher specimen, tissue sample, and DNA sample should have different numbers/identifiers
• If you want to use the same numbers, add a prefix, e.g. "DNA 123" (DNA), "T-123" (Tissue)
• GGBN and GBIF are using a triple ID: CatalogNumber, CollectionCode, InstitutionCode
  • CollectionCode = "DNA Bank" or "Entomology"
  • InstitutionCode = "K" or "NMNH"

Data Quality Recommendations
Store the kind of relation between objects carefully!
• Distinguish between "same individual" and "same population"
• Distinguish between "in situ" and "ex situ"
• Example: "same in situ population"

Use Case 1: Single Voucher, multiple tissues
• Animals: all objects are related to each other as "same individual"
• Plants: all objects are related to each other as "same individual"

Use Case 2: Specimens as lots
• Trace back the samples!
  - Label sampled specimens individually
  - Or store sampled specimens individually
• Record the relation to each other carefully!
• If you can't label/store sampled specimens individually, all objects are related as "same population"

Use Case 3: Botanical population sampling
• All tissues are related to one main voucher, but mostly only one as "same individual", all others as "same population".

Use Case 4: Collecting Tissues Only
• Take photographs or, if useful, sound files of the sampled individual and, if applicable, its environment
• Store multimedia on a web server, and mind the license!

Use Case 5: Parasites and Sessile Organisms
• Store information on the host properly
  • Scientific name, not trivial name
  • Separate field, not in the notes field
• Homo sapiens L., 1758 as host species is valuable information
• GGBN can only enable search for host species if provided properly!

Use Case 6: Zoos and Botanical Gardens
• Be careful with coordinates and locality!
• A lion in London Zoo: no coordinates, locality=unknown OR coordinates from somewhere in Africa where the lion was kept
• A squirrel in London Zoo: London coordinates, locality=London Zoo
• Applies to e.g. zoos, botanical gardens, oyster beds, fish farms, roadside trees, rescue or rehabilitation centres, etc.
• Relation tissue -> voucher = "same ex situ individual" or "same ex situ population"

Specimen Data – Relations
• Try to use GBIF identifiers for referencing from DNA to specimen
• Avoid storing foreign specimen data in your system with different identifiers
• Check GBIF first: is the institution holding your voucher providing data to GBIF, and is your voucher available?
  • If so: use Catalog Number, Collection Code, and Institution Code
  • If not: store the data in your system with self-defined identifiers

DNA Module
• Open source (MPL), MySQL and PHP
• In use by 4 institutions
• 100% compatible with GBIF and GGBN
• Can interact with every GBIF-compliant database via GBIF web services and with BioCASE providers directly
• In accordance with the GGBN Data Standard

DNA Module
• Input (connect DNA and specimen)
• Search (your DNA bank)
• Sample requests
• Data Cleaning
• Specimen Tool (voucher/tissue not in GBIF)
• Help for users and admins
• Config Tool (user management, settings)

Input data
[Image-only slides]

Input data – Specimen information
[Image-only slides]

Try it out
• Find specimen records of a certain collection/institution in GBIF, for example your own

Input data – DNA facts
• Such lab numbers are often used in papers. Store them!
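
The "Check GBIF first" step and the "Try it out" exercise above can be sketched in a few lines of Python. This is only an illustration, not part of the DNA Module (which is written in PHP): it assumes the public GBIF occurrence search endpoint (API v1) and its `institutionCode`, `collectionCode` and `catalogNumber` query parameters. The function names and the catalogue number "123" are hypothetical; only the codes "NMNH" and "Entomology" are taken from the slides.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

# Assumption: the public GBIF occurrence search endpoint (API v1).
GBIF_OCCURRENCE_SEARCH = "https://api.gbif.org/v1/occurrence/search"


def occurrence_search_url(institution_code, collection_code=None,
                          catalog_number=None, limit=20):
    """Build a GBIF occurrence search URL from (parts of) the triple ID."""
    params = {"institutionCode": institution_code}
    if collection_code is not None:
        params["collectionCode"] = collection_code
    if catalog_number is not None:
        params["catalogNumber"] = catalog_number
    params["limit"] = limit
    return GBIF_OCCURRENCE_SEARCH + "?" + urlencode(params)


def fetch_results(url):
    """Fetch one page of results (needs network access; not called below)."""
    with urlopen(url) as response:
        return json.load(response).get("results", [])


def voucher_reference(results, local_id):
    """Use the triple ID if the voucher was found in GBIF, else a local ID."""
    if results:
        record = results[0]
        return {
            "source": "GBIF",
            "institutionCode": record.get("institutionCode"),
            "collectionCode": record.get("collectionCode"),
            "catalogNumber": record.get("catalogNumber"),
        }
    # Voucher not in GBIF: fall back to a self-defined identifier.
    return {"source": "local", "identifier": local_id}


if __name__ == "__main__":
    # Hypothetical voucher; prints the URL without contacting GBIF.
    print(occurrence_search_url("NMNH", "Entomology", "123"))
```

Leaving out the catalogue number gives the "Try it out" case: all records of one collection or institution. Passing the list returned by `fetch_results` to `voucher_reference` mirrors the "If so / If not" decision on the slide.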
Input data – DNA facts
• Sample data and loan information shown in GGBN Data Portal
• Sample data not shown in GGBN Data Portal
• If "Stock Gone" is selected -> automatically blocked for loaning

Sequence Data at GGBN
Provide raw sequence data, chromatograms, primer information, failed amplifications
• Not mandatory, but nice to have and very valuable for other scientists
• Only for published sequences

Relations between voucher/tissue
• Currently possible with BioCASE only
• Association between voucher and tissue not yet available via GBIF

Specimen Tool
Tissue here, voucher somewhere else

Summary
• Use existing open source specimen databases, do not create your own! E.g. Specify, BRAHMS, DiversityWorkbench
• You can use the Specimen Tool as your specimen or tissue database if you don't have one yet (e.g. RBI is doing this)
• The DNA Module can interact with every specimen database if it is a GBIF provider

Summary
1. Provide specimen data to GBIF (BioCASE or IPT)
   • You need a webserver!
2. Provide tissue and DNA data to GGBN
You need two separate mappings and a webserver!
Want to join GGBN as Core Member now? -> Use BioCASE
From June 2015 you can choose between BioCASE and IPT!

Summary
Technical Support
BioCASE for GBIF and GGBN    BGBM
IPT for GBIF and GGBN        GBIF/(BGBM)*
DNA Module                   BGBM
• October 2014: Beta Version of "DNA Module" available
• February 2015: Public release (planned)
*BGBM provides support for GGBN mapping questions only, not on how to use IPT in general.

Thank you
• GGBN Interim Executive Committee
• GGBN Members
• GGBN Collaborators
• GGBN Task Forces
• Natural History Museum of London
• Royal Botanic Gardens, Kew
• GGBN Conference Sponsors
• GGBN Conference Participants