GGBN Data Readiness and Handling Gabi Droege

Overview
• Behind the scenes, work in progress
  • Basic Architecture
  • GGBN Data Standard
  • Harvester
  • Data Quality
• Recommendations on collecting and storing
  • Specimen Data
  • Relationships
  • Basic DNA lab data
• Software for Biobanking Management: DNA Module
Best Practice Recommendations
• Working on guidelines related to GGBN data, e.g.
  • Collecting and storing data of voucher specimens and tissue samples
  • Collecting and storing DNA samples and basic molecular lab data (e.g. DNA extraction)
  • Collecting and storing extended molecular lab data (e.g. sequencing details)
  • Handling of data related to RNA, proteins etc.
  • Publishing data in journals with proper voucher/tissue/DNA information
  • ABS, CBD, and Nagoya Protocol (data) requirements
  • How to use GGBN data in other portals (technical issues, citation policy)
-> Recommendations will be available in GGBN Library
Standardised Data Exchange
• Two main systems are in use by GBIF providers
  • IPT -> Darwin Core Archive
  • BioCASE -> ABCD (Access to Biological Collection Data)
• A mapping between Darwin Core and ABCD exists
• Both work perfectly for sharing specimen data, but DNA facts were missing
-> 2009: DNA extension for ABCD published
-> 2014: GGBN Data Standard published, based on ABCDDNA
GGBN & GBIF
• GGBN is using established GBIF infrastructure
-> GGBN members must become GBIF providers
-> GGBN members must provide underlying specimen data to GBIF
Basic Architecture
[Diagram labels: Provider, Provider, IPT, Harvester (HIT), Data Cleaning, Index (MySQL) "to enable search functionality" and "to speed up queries", "Get full single record"]
GGBN Data Standard
• Aim: shared vocabulary, mandatory and recommended parameters
• Last review by GGBN community: March 2014
• To be used with ABCD or Darwin Core
• 83 new elements and 22 elements from MIxS, ABCD, DwC, and DC
GGBN Data Standard
• Coverage
  • Sample facts (preparation, extraction, preservation)
  • Permits and Loaning Conditions
  • Gel Image
  • Amplifications, Sequencings, Single Reads (incl. chromatograms)
  • Genetic Accession Numbers
  • DNA Cloning
• Note: Reference to voucher specimen can already be made with ABCD or Darwin Core
GGBN Data Standard
• http://terms.tdwg.org/GGBN_Data_Standard
GGBN Data Standard
• Implemented in IPT/DwC-A and BioCASE/ABCD
• Test data:
  • NMNH (IPT/DwC-A-GGBN)
  • AM (DwC-A-GGBN w/o IPT)
  • DSMZ (ABCDGGBN)
  • BGBM (ABCDGGBN & IPT/DwC-A-GGBN)
[Figure labels: DwC-A-GGBN, ABCDGGBN]
Harvesting and Indexing Toolkit (HIT)
• Developed by GBIF
• Can handle ABCD, DwC, DwC-A
• Expanded by BGBM for GGBN and other Special Interest Networks to
  • allow indexing of
    • Multiple determinations
    • Multimedia
    • Associated records (DNA -> tissue -> voucher)
    • Selected records of a dataset
    • GGBN facts
  • create reports for providers
  • include data cleaning routines
Additional source code will be made available as open source
Data Quality
Done after harvesting, using available libraries and newly written routines
Cleaning/checking of:
• Country names, ISO codes
• Country name and coordinates (do they match?)
• Transform coordinates into decimal degrees
• Transform collection date into ISO format
• Extract collection year
• Seas and oceans (no ISO list available)
• Scientific names (GBIF name parser, taxonomic backbone)
• Associated records available?
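Three of the cleaning steps listed above (coordinates into decimal degrees, collection date into ISO format, collection year) could look roughly like the sketch below. The function names and the accepted input formats are assumptions made for illustration; they are not the routines used in the harvester, and real provider data is far more varied.

```python
import re
from datetime import datetime

# Hypothetical pattern for degree/minute/second strings such as 52°27'20"N.
_DMS = re.compile(
    r"""^\s*(?P<deg>\d{1,3})[°\s]+(?P<min>\d{1,2})['′\s]+"""
    r"""(?P<sec>\d{1,2}(?:\.\d+)?)["″]?\s*(?P<hem>[NSEW])\s*$"""
)

# Assumed date layouts; slash-separated dates are read as day first.
_DATE_FORMATS = ("%Y-%m-%d", "%d.%m.%Y", "%d/%m/%Y")


def dms_to_decimal(text):
    """Convert a DMS coordinate to decimal degrees, or return None."""
    match = _DMS.match(text)
    if match is None:
        return None
    value = (
        int(match["deg"])
        + int(match["min"]) / 60
        + float(match["sec"]) / 3600
    )
    # Southern and western hemispheres are negative in decimal notation.
    return -value if match["hem"] in "SW" else value


def to_iso_date(text):
    """Return the date as YYYY-MM-DD, or None if no assumed layout fits."""
    for fmt in _DATE_FORMATS:
        try:
            return datetime.strptime(text.strip(), fmt).date().isoformat()
        except ValueError:
            continue
    return None


def collection_year(text):
    """Extract the year from a collection date, or return None."""
    iso = to_iso_date(text)
    return int(iso[:4]) if iso else None
```

Returning None instead of raising keeps unparseable values visible, so they can be listed in a report for the provider rather than silently dropped.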
Specimen Data, Collecting Event
• Typical GBIF record
  • Scientific Name including Author/Year
  • Country where collected
  • Locality and Coordinates* where collected
  • Collectors
  • Collector's Number*
  • CatalogNumber (e.g. barcode)
  • Place of deposit (InstitutionCode and CollectionCode)
  • Stable identifier (GUID)
  • GBIF identifiers
-> GBIF record elements are mandatory for GGBN
-> If unknown, say "unknown" or "Indet." (Sci. Name)
-> Data should be provided to GGBN and GBIF
*if applicable
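A minimal sketch of the "say unknown" rule above, applied before a record is published. The field names are hypothetical (loosely Darwin Core styled), and the split into descriptive fields that receive a placeholder and identifier fields that are only reported as missing is an assumption of this example, not a GGBN rule.

```python
# Hypothetical field names; the starred (optional) elements are left out.
DESCRIPTIVE_FIELDS = ("scientificName", "country", "locality", "collectors")
IDENTIFIER_FIELDS = ("catalogNumber", "institutionCode", "collectionCode", "guid")


def prepare_record(record):
    """Return (cleaned_record, missing_identifier_fields).

    Empty descriptive fields become "unknown" ("Indet." for the
    scientific name). Identifier fields are never invented; the ones
    that are empty are listed so the provider can fix them.
    """
    cleaned = dict(record)
    for field in DESCRIPTIVE_FIELDS:
        value = str(cleaned.get(field) or "").strip()
        if not value:
            value = "Indet." if field == "scientificName" else "unknown"
        cleaned[field] = value
    missing = [
        field
        for field in IDENTIFIER_FIELDS
        if not str(cleaned.get(field) or "").strip()
    ]
    return cleaned, missing
```

For example, a record with an empty scientific name and no locality comes back with "Indet." and "unknown" in those fields, plus a list of the identifier fields that still need attention.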
Data Quality Recommendations
Scientific Names
• Check scientific names, make sure Author (and Year) is provided
  -> helps to match with Taxonomic Backbone!
• Store names structured (Genus, Specific Epithet, Rank, Infraspecific Authorship, Author and Year)
• Avoid "sp.", "sp. n." etc. (e.g. Abies, not Abies sp.)
Data Quality Recommendations
Scientific Names
• Store rank names of higher taxa in Latin (e.g. familia, not family)
• Store at least one higher taxon level (usually familia, ordo, or classis)
  -> helps to clarify homonyms in the portal!
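The scientific name recommendations above amount to storing the name in separate fields and assembling the display string on demand. A minimal sketch, with hypothetical field names that are not taken from any particular database schema:

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class ScientificName:
    """Structured name; all field names are illustrative assumptions."""
    genus: str
    specific_epithet: Optional[str] = None
    rank: Optional[str] = None                    # e.g. "subsp.", "var."
    infraspecific_epithet: Optional[str] = None
    author: Optional[str] = None
    year: Optional[int] = None
    higher_taxon_rank: Optional[str] = None       # Latin, e.g. "familia"
    higher_taxon: Optional[str] = None

    def full_name(self) -> str:
        """Assemble the name; a genus-only record stays 'Abies', not 'Abies sp.'."""
        parts = [self.genus]
        if self.specific_epithet:
            parts.append(self.specific_epithet)
        if self.rank and self.infraspecific_epithet:
            parts += [self.rank, self.infraspecific_epithet]
        name = " ".join(parts)
        if self.author:
            name += f" {self.author}"
            if self.year:
                name += f", {self.year}"
        return name
```

With this layout, `ScientificName("Homo", "sapiens", author="L.", year=1758, higher_taxon_rank="familia", higher_taxon="Hominidae").full_name()` yields "Homo sapiens L., 1758", and the higher taxon is available separately for resolving homonyms.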
Data Quality Recommendations
Country
• Avoid free text fields in your database for Country names!
• Use ISO Country lists instead and let users select from this list
• Match countries that no longer exist with ISO list, store original country name in locality field
Data Quality Recommendations
Country – Typical Examples

Provided        Cleaned into
England         Great Britain
Soviet Union    Unknown or unspecified Eurasia
Atlantic        Unknown or unspecified Country; Atlantic Ocean
Korea           Unknown or unspecified Asia
No data         Unknown or unspecified Country

Principles:
Record : Country       1:1
Record : Sea           1:1
Country : Continent    1:n
Country : Ocean        1:n
Sea : Ocean            1:n
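The typical examples above can be read as a lookup from the provided value to a cleaned country plus an optional ocean. The sketch below is built only from those five examples; the function name and the pass-through behaviour for unlisted values are assumptions, and a real routine would work from the full ISO country list and a curated list of seas and oceans.

```python
UNKNOWN_COUNTRY = "Unknown or unspecified Country"

# provided value (lower case) -> (cleaned country, ocean or None)
CLEANED = {
    "england": ("Great Britain", None),
    "soviet union": ("Unknown or unspecified Eurasia", None),
    "atlantic": (UNKNOWN_COUNTRY, "Atlantic Ocean"),
    "korea": ("Unknown or unspecified Asia", None),
}


def clean_country(provided):
    """Return (country, ocean) for a provided country string."""
    key = (provided or "").strip().lower()
    if not key:
        # "No data" case from the examples
        return (UNKNOWN_COUNTRY, None)
    if key in CLEANED:
        return CLEANED[key]
    # Assumption: values not covered by the examples pass through unchanged.
    return (provided.strip(), None)
```

Keeping country and ocean as two separate outputs mirrors the principles listed above: a record has one country and one sea, while the ocean is derived.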
Data Quality Recommendations
Use unique numbers for every physical object!
-> Voucher specimen, tissue sample, and DNA sample should have different numbers/identifiers
-> If you want to use the same numbers, add a prefix, e.g. "DNA 123" (DNA), "T-123" (Tissue)
-> GGBN and GBIF use a triple ID: CatalogNumber, CollectionCode, InstitutionCode
   -> CollectionCode = "DNA Bank" or "Entomology"
   -> InstitutionCode = "K" or "NMNH"
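A minimal sketch of the triple ID and of a check that no two physical objects share one. The class and function names are hypothetical, and the example values only reuse the codes and prefixes mentioned above.

```python
from collections import Counter
from typing import NamedTuple


class TripleID(NamedTuple):
    """CatalogNumber, CollectionCode, and InstitutionCode as one key."""
    institution_code: str
    collection_code: str
    catalog_number: str


def find_duplicates(ids):
    """Return every triple ID that occurs more than once, in first-seen order."""
    counts = Counter(ids)
    return [triple for triple, n in counts.items() if n > 1]


# Illustrative values: voucher, tissue, and DNA of the same organism
voucher = TripleID("NMNH", "Entomology", "123")
tissue = TripleID("NMNH", "Entomology", "T-123")
dna = TripleID("NMNH", "DNA Bank", "DNA 123")
```

Here `find_duplicates([voucher, tissue, dna])` returns an empty list because the prefixes keep the catalog numbers apart, whereas reusing "123" for the tissue in the same collection would be reported.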
Data Quality Recommendations
Record the kind of relation between objects carefully!
-> Distinguish between "same individual" and "same population"
-> Distinguish between "in situ" and "ex situ"
-> Example: "same in situ population"
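The two distinctions above can be stored as separate values and combined into a relation label only when needed. The enum and function names below are assumptions for illustration; the labels produced follow the wording used on these slides, not a confirmed controlled vocabulary.

```python
from enum import Enum


class Scope(Enum):
    INDIVIDUAL = "individual"
    POPULATION = "population"


class Origin(Enum):
    IN_SITU = "in situ"
    EX_SITU = "ex situ"


def relation_label(scope, origin=None):
    """Compose a label such as 'same in situ population'.

    The origin is optional, so 'same individual' is also possible.
    """
    middle = f" {origin.value}" if origin else ""
    return f"same{middle} {scope.value}"
```

Storing scope and origin separately means a free text relation field is not needed, and a tissue from a zoo animal can later be told apart from one collected in the wild.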
Use Case 1: Single Voucher, multiple tissues
Animals
-> All objects are related as "same individual" to each other
Use Case 1: Single Voucher, multiple tissues
Plants
-> All objects are related as "same individual" to each other
Use Case 2: Specimens as lots
Trace back the samples!
- Label sampled specimens individually
- Or store sampled specimens individually
Record the relation to each other carefully!
If you can't label or store sampled specimens individually, all objects are related as "same population"
Use Case 3: Botanical population sampling
-> All tissues are related to one main voucher, but usually only one as "same individual" and all others as "same population".
Use Case 4: Collecting Tissues Only
-> Take photographs or, if useful, sound recordings of the sampled individual and, if applicable, its environment
-> Store multimedia on a web server and mind the license!
Use Case 5: Parasites and Sessile Organisms
-> Store information on the host properly
   -> Scientific name, not common (trivial) name
   -> Separate field, not in notes field
-> Homo sapiens L., 1758 as host species is valuable information
-> GGBN can only enable search for host species if provided properly!
Use Case 6: Zoos and Botanical Gardens
Be careful with coordinates and locality!
-> A lion in London Zoo: no coordinates, locality=unknown OR coordinates from somewhere in Africa where the lion was kept
-> A squirrel in London Zoo: London coordinates, locality=London Zoo
-> Applies to e.g. zoos, botanical gardens, oyster beds, fish farms, roadside trees, rescue or rehabilitation centres, etc.
-> Relation tissue -> voucher = "same ex situ individual" or "same ex situ population"
Specimen Data – Relations
• Try to use GBIF identifiers for referencing from DNA to Specimen
• Avoid storing foreign specimen data in your system with different identifiers
• Check GBIF first, if the institution holding your voucher is providing data to GBIF and if your voucher is available
-> If so: use Catalog Number, Collection Code, and Institution Code
-> If not: store data in your system with self-defined identifiers
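The "check GBIF first" step could be sketched as follows. This assumes the public GBIF occurrence search endpoint at api.gbif.org and its `institutionCode`, `collectionCode`, and `catalogNumber` query parameters; it is not necessarily how the DNA Module talks to GBIF, and the function names are hypothetical.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

# Assumption: public GBIF occurrence search endpoint
GBIF_OCCURRENCE_SEARCH = "https://api.gbif.org/v1/occurrence/search"


def build_query_url(institution_code, collection_code, catalog_number):
    """Build the search URL for one voucher identified by its triple ID."""
    params = {
        "institutionCode": institution_code,
        "collectionCode": collection_code,
        "catalogNumber": catalog_number,
        "limit": 5,
    }
    return f"{GBIF_OCCURRENCE_SEARCH}?{urlencode(params)}"


def voucher_in_gbif(institution_code, collection_code, catalog_number):
    """Return True if GBIF reports at least one matching record.

    Requires network access; errors are left to the caller.
    """
    url = build_query_url(institution_code, collection_code, catalog_number)
    with urlopen(url, timeout=30) as response:
        payload = json.load(response)
    return payload.get("count", 0) > 0
```

If `voucher_in_gbif(...)` returns True, the triple ID can be stored as the reference from the DNA sample to the voucher; otherwise the voucher data is kept locally under self-defined identifiers, as recommended above.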
DNA Module
• Open source (MPL), MySQL and PHP
• In use by 4 institutions
• 100% compatible with GBIF and GGBN
• Can interact with every GBIF compliant database via GBIF web services and with BioCASE providers directly
• In accordance with GGBN Data Standard
DNA Module
• Input (connect DNA and specimen)
• Search (your DNA bank)
• Sample requests
• Data Cleaning
• Specimen Tool (voucher/tissue not in GBIF)
• Help for users and admins
• Config Tool (user management, settings)
Input data
Input data – Specimen information
Try it out
• Find specimen records of a certain collection/institution in GBIF, for example your own
Input data – DNA facts
Such lab numbers are often used in papers. Store them!
Sample data and loan information shown in GGBN Data Portal
Sample data not shown in GGBN Data Portal
If "Stock Gone" is selected -> automatically blocked for loaning
Sequence Data at GGBN
-> Provide raw sequence data, chromatograms, primer information, failed amplifications
• Not mandatory, but nice to have and very valuable for other scientists
• Only for published sequences
Relations between voucher/tissue
• Currently possible with BioCASE only
• Association between voucher and tissue not yet available via GBIF
Specimen Tool
Tissue here, Voucher somewhere else
Summary
• Use existing open source specimen databases, do not create your own! E.g. Specify, BRAHMS, DiversityWorkbench
• You can use Specimen Tool as your specimen or tissue database if you don't have one yet (e.g. RBI is doing this)
• DNA Module can interact with every specimen database if it is a GBIF provider
Summary
1. Provide specimen data to GBIF (BioCASE or IPT)
• You need a webserver!
2. Provide tissue and DNA data to GGBN
-> You need two separate mappings and a webserver!
-> Want to join GGBN as Core Member now? -> Use BioCASE
-> From June 2015 you can choose between BioCASE and IPT!
Summary
Technical Support
BioCASE for GBIF and GGBN    BGBM
IPT for GBIF and GGBN        GBIF/(BGBM)*
DNA Module                   BGBM
-> October 2014: Beta Version of "DNA Module" available
-> February 2015: Public release (planned)
*BGBM provides support for GGBN mapping questions only, not for how to use IPT in general.
Thank you
GGBN Interim Executive Committee
GGBN Members
GGBN Collaborators
GGBN Task Forces
Natural History Museum of London
Royal Botanic Gardens, Kew
GGBN Conference Sponsors
GGBN Conference Participants