because good research needs good data
How to Cite Datasets and Link to Publications
A Report of the Digital Curation Centre
Monica Duke
Alex Ball
DCC/UKOLN, University of Bath
30 October 2012
Except where otherwise stated, this work is licensed
under Creative Commons Attribution 2.5 Scotland:
http://creativecommons.org/licenses/by/2.5/scotland/
CODATA 2012: Data Publication and Data Citation (2)
30 October 2012
Awareness Level
A Digital Curation Centre Briefing Paper
A Digital Curation Centre ‘working level’ guide
19th July 2011
Data Citation and Linking
By Alex Ball and Monica Duke, UKOLN, University of Bath
• Introduction
• Short-term Benefits and Long-term Value
• Perspectives on Data Citation
• Roles and Responsibilities
• Issues to be Considered
• Related Research
• Additional Resources
Introduction
On the surface, citing datasets is a trivially easy thing to do. Style manuals such as the Publication Manual of the American Psychological Association and the Oxford Manual of Style have provided sample citations for datasets since at least the early 2000s. The process of making datasets citable, however, is rather more difficult. In consequence of this and other factors, a culture of citing datasets has been slow to develop. Nevertheless, it is vital that researchers cite the datasets they use, if datasets are to be regarded as legitimate academic outputs in their own right.
Short-term Benefits and Long-term Value
There are several short-term benefits to making datasets citable, citing them in practice, and linking datasets to papers that make use of the data.
• If the authors of a scientific publication properly cite the data that underlies it, it is much easier for the reader to locate that data. This in turn makes it easier for the reader to validate and build on the publication’s findings.
• Data citations ensure that data contributors receive proper credit when their work is reused by other researchers.
• If a dataset links back to the paper that describes its collection, a reader coming to the dataset direct can use that link to put it in context and understand the methodology used.
• If a dataset links to other papers that make use of it, these links can be used by the contributors and data publishers to demonstrate the impact of the data. Potential reusers might use these links to discover critiques of the data or to provide inspiration for how to use them.
Once a culture of data citation has been established, several other benefits are likely to become apparent.
• The publishing infrastructure that makes the data citable will also help to ensure they are available for reference and reuse long into the future.
• There will be less danger of rival researchers ‘stealing’ results from those who publish their data openly, as failure to give due credit would amount to plagiarism and thus be punishable.
• Services built around data citation will make it easier for researchers to discover relevant datasets.
• Data citations could be used to measure the impact of both individual datasets and their contributors.
• Researchers could gain professional recognition and rewards for published data in the same way as for more traditional publications.
Taking these points together, there would likely be an increase in the quantity and quality of data published, with all the benefits this implies for the transparency and rate of scientific research.
Digital Curation Centre, 2011.
Licensed under Creative Commons Attribution 2.5 Scotland:
http://creativecommons.org/licenses/by/2.5/scotland/
How to Cite Datasets and Link to Publications
Alex Ball (DCC) and Monica Duke (DCC)
http://www.dcc.ac.uk/resources/how-guides/cite-datasets
Outline
Motivation
Elements of a data citation
Issues and challenges
Guidance for researchers
Guidance for data repositories
Putting it into practice
What’s great about journal papers?
• Awareness raising
• Protection from plagiarism
• Verification of results
• Basis for future research
• Reward models
• Permanent access
Data citations provide...
• Visibility for data
• Protection from plagiarism
• Possibility for verification of results
• Data on which to base future research
• Possibility for reward models
• Access
Citation styles
Four data citation styles: which elements do they use?
Altman and King (2007): Dataverse
Lawrence et al. (2008): BADC
Green (2010): OECD
Starr and Gastl (2011): DataCite
Citation styles
Elements used by each style:
• Author: Dataverse (Sidney Verba; NORC [Producer]); BADC (Iwi, A. and B. N. Lawrence); OECD (OECD); DataCite (Irino, T; Tada, R)
• Publication date: Dataverse (1998); BADC (2004); OECD (2009, plus an access date: Accessed on 14 September 2009); DataCite (2009)
• Title: all four styles
• Version: BADC (Version 1.); DataCite (V.2.)
• Feature: BADC ([GridSeries, http://ndg.nerc.ac.uk/csml2/GridSeries])
• Resource type: Dataverse (data set [Type (DC)]); OECD ((database)); DataCite (Dataset.)
• Publisher: Dataverse (ICPSR [Distributor]); BADC (BADC); DataCite (Geological Institute, University of Tokyo)
• Identifier: Dataverse (hdl:1902.4/00754); BADC (urn:badc.nerc.ac.uk_coapec500yr); OECD (doi: 10.1787/data-00039-en); DataCite (doi:10.1594/PANGAEA.726855)
• Location: BADC ([Available from http://badc.nerc.ac.uk/data/coapec500yr]); OECD (http://dx.doi.org/10.1787/data-00039-en); DataCite (http://dx.doi.org/10.1594/PANGAEA.726855)
• Unique Numeric Fingerprint: Dataverse (UNF:3:ZNQRI14053UZq389x0Bffg?==)
Complete citations:
Altman and King (2007): Dataverse
Sidney Verba. 1998. “U.S. and Russian Social and Political Participation Data,” hdl:1902.4/00754 NORC [Producer]; data set [Type (DC)] ICPSR [Distributor]. UNF:3:ZNQRI14053UZq389x0Bffg?==
Lawrence et al. (2008): BADC
Iwi, A. and B. N. Lawrence (2004). A 500 year control run of HadCM3. [GridSeries, http://ndg.nerc.ac.uk/csml2/GridSeries] Version 1. BADC. urn:badc.nerc.ac.uk_coapec500yr [Available from http://badc.nerc.ac.uk/data/coapec500yr].
Green (2010): OECD
OECD (2009), “Key short-term indicators”, Main Economic Indicators (database). doi: 10.1787/data-00039-en http://dx.doi.org/10.1787/data-00039-en (Accessed on 14 September 2009)
Starr and Gastl (2011): DataCite
Irino, T; Tada, R (2009): Chemical and mineral compositions of sediments from ODP Site 127-797. V.2. Geological Institute, University of Tokyo. Dataset. doi:10.1594/PANGAEA.726855. http://dx.doi.org/10.1594/PANGAEA.726855
Key citation elements
• Author
• Publication date
• Title
• Location (= identifier)
• Publisher
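As a rough illustration, the five key elements can be assembled into a citation string. The sketch below is a minimal example, not a prescribed style: the class and field names are hypothetical, and the layout (Author (Date): Title. Publisher. Location) is only loosely modelled on the DataCite example shown earlier.

```python
from dataclasses import dataclass


@dataclass
class DatasetCitation:
    """The five key citation elements (field names are illustrative)."""
    authors: list[str]
    year: int
    title: str
    publisher: str
    identifier_url: str  # location written as a resolvable identifier


def format_citation(c: DatasetCitation) -> str:
    """Render the elements as: Author (Date): Title. Publisher. Location"""
    authors = "; ".join(c.authors)
    return f"{authors} ({c.year}): {c.title}. {c.publisher}. {c.identifier_url}"


# Values taken from the DataCite example citation in the slides.
example = DatasetCitation(
    authors=["Irino, T", "Tada, R"],
    year=2009,
    title="Chemical and mineral compositions of sediments from ODP Site 127-797",
    publisher="Geological Institute, University of Tokyo",
    identifier_url="http://dx.doi.org/10.1594/PANGAEA.726855",
)
print(format_citation(example))
```

Version and resource type are left out here because they are not among the five key elements; a fuller formatter would add them as optional fields.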
Attributing datasets to many contributors
http://dx.doi.org/10.1038/ng.785
Granularity
[Figure: nested levels of granularity, from finest to coarsest: data points, data tables, data files, datasets, data collections.]
Granularity
• Cite datasets at the finest level that is appropriate and for which an identifier is provided.
• If that is not fine enough, provide details of the subset of data you are using at the point in the text where you make the citation.
Placement of data citations
• Special data resources section?
• Acknowledgements?
• Accession codes?
• Reference list?
• Alongside or independent of a reference to the related article?
Placement of data citations
• Include the citation in the reference list.
• When your data collection paper is published, notify the repository holding the dataset.
• When you publish a paper in which you reuse a prior dataset, notify the repository holding that dataset.
Dynamic datasets
Two types:
• Revised datasets
• Expanding datasets
Dynamic datasets
Three strategies:
1. Differentiate versions by access date rather than ID
2. Take time slices
3. Take snapshots
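The difference between time slices and snapshots may be easier to see in a small sketch. It assumes that a time slice is the set of records added during a given period and that a snapshot is the whole dataset as it stood on a given date; the record layout, dates, and function names are all hypothetical, and only an expanding dataset is modelled.

```python
from datetime import date

# Hypothetical expanding dataset: (date added, value) pairs.
records = [
    (date(2012, 1, 10), 11),
    (date(2012, 2, 15), 44),
    (date(2012, 3, 20), 99),
]


def time_slice(rows, start, end):
    """Records added in the half-open period [start, end)."""
    return [r for r in rows if start <= r[0] < end]


def snapshot(rows, as_of):
    """The whole dataset as it stood on the given date."""
    return [r for r in rows if r[0] <= as_of]


# A slice covers only February's additions; a snapshot covers everything so far.
feb_slice = time_slice(records, date(2012, 2, 1), date(2012, 3, 1))
feb_snapshot = snapshot(records, date(2012, 2, 29))
```

Under either strategy, each slice or snapshot would receive its own identifier, so a citation keeps pointing at the same content. A revised dataset, where existing records change, would need stored copies rather than a date filter.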
Guidance for researchers publishing a paper
• Deposit any data you have collected and used as evidence.
• Ask for a persistent ID/URL for your deposited data.
• When your data collection paper is published, notify the repository holding the dataset.
Guidance for researchers citing a prior dataset
• Use the data citation style required by the editor/publisher.
• If no style is specified, use a standard data citation style, adapted to match the style for textual publications.
• Default to writing IDs in the form of URLs if possible.
• Include the citation in the reference list.
• Cite datasets at the finest level that is appropriate and for which an identifier is provided.
• If that is not fine enough, provide details of the subset of data you are using at the point in the text where you make the citation.
• Cite the exact version of the dataset you need.
• When your paper is published, notify the repository holding the dataset you used.
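Writing IDs as URLs can be done mechanically for the common identifier schemes. The following is a minimal sketch: the function name is hypothetical, the DOI resolver is the one used in the example citations, and the Handle proxy address (hdl.handle.net) is an assumption not taken from the slides.

```python
def id_to_url(identifier: str) -> str:
    """Rewrite a 'doi:' or 'hdl:' identifier as a resolver URL."""
    resolvers = {
        "doi:": "http://dx.doi.org/",      # resolver used in the example citations
        "hdl:": "http://hdl.handle.net/",  # assumed: Handle System proxy
    }
    cleaned = identifier.strip()
    for prefix, base in resolvers.items():
        if cleaned.lower().startswith(prefix):
            # Drop the scheme prefix and any space that follows it.
            return base + cleaned[len(prefix):].strip()
    return cleaned  # already a URL, or an unknown scheme: leave untouched


print(id_to_url("doi: 10.1787/data-00039-en"))
```

Identifiers in other schemes, such as the BADC URN shown earlier, are returned unchanged because no general resolver is assumed for them.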
Guidance for data repositories
• Provide persistent IDs for the datasets you host.
  • The ID should remain unique.
  • The ID should always point to the same version.
  • The ID should resolve to a URL.
  • The URL should locate the dataset’s landing page.
• The explanatory metadata should not change for a dataset with a persistent ID.
• IDs should only be assigned once no further changes are expected.
• With dynamic datasets, provide IDs for snapshots or time slices.
• Provide sample citations on dataset landing pages.
• Link from landing pages to publications citing the dataset.
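The requirements that an ID stays unique and always points to the same version could be enforced along the following lines. This is a toy in-memory sketch under stated assumptions: the class, the `example:` identifiers, and the `repository.example` URLs are all hypothetical, and a real repository would rely on a registration service rather than a dictionary.

```python
class IdRegistry:
    """Toy registry: an ID, once assigned, never changes its target."""

    def __init__(self):
        self._landing_pages = {}

    def assign(self, identifier: str, landing_page_url: str) -> None:
        """Bind an ID to a landing page; refuse to rebind an existing ID."""
        if identifier in self._landing_pages:
            raise ValueError(f"{identifier} is already assigned")
        self._landing_pages[identifier] = landing_page_url

    def resolve(self, identifier: str) -> str:
        """Return the landing page URL for a previously assigned ID."""
        return self._landing_pages[identifier]


registry = IdRegistry()
registry.assign("example:42.v1", "https://repository.example/datasets/42/v1")
# A later snapshot gets a new ID instead of reusing the old one.
registry.assign("example:42.v2", "https://repository.example/datasets/42/v2")
```

Refusing reassignment is what keeps an existing citation pointing at the version its author actually used.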
Putting it into practice
[Figure: first pages of three 2012 articles:
Matthew S. Mayernik, Special Section (RDAP), Bulletin of the American Society for Information Science and Technology, June/July 2012, Volume 38, Number 5;
Edmunds et al., BMC Research Notes 2012, 5:223, http://www.biomedcentral.com/1756-0500/5/223;
Sneddon et al., “GigaDB: announcing the GigaScience database”, GigaScience 2012, 1:11, http://www.gigasciencejournal.com/content/1/1/11.]
set.
Po
ima
er
s,
©
uted
Tai
elle
in
CA
2012
t,
l
com
regards
distrib
journal
nt
Sned
denc
Stree
tricted use,
Cent arch (U service interest d so*cia
Dai Fu
Commons don et al.; licens
to what
data poli ply with but
Access articlepermits unres
Correspon ributors
Ltd., 16
e
also
can
is an Open
cies. Aut
, which
reproducti Attribution Licen ee BioMed Centr
†
Rese ch data research ent an
of the articl
Equal contce, BGI-Hong Kong
al Ltd. This /licenses/by/2.0)
on in any
se
hors not go beyond min
al
ar
the end
pm
ed Centr
1
medium, (http://creativeco Ltd. This is an
able at
ee BioM
rly cited.
mmons.org
GigaScien
rese ns. His develo .
only adh
Open Acces
provided
mmons.org
Kong
on is avail
et al.; licens (http://creativeco al work is prope
io
the origin
e
ered to
s
NT, Hong author informati
Edmunds
r.edu
License
al work /licenses/by/2.0) article distributed
citat ructur
ed the origin
© 2012
of
is prope
Attribution
under
st
>uca
um, provid
Full list
rly cited. , which permits
Commons n in any medi
infra nik<at
unrestrictethe terms of the
Creative
d use, distrib
er
reproductio
ution
may
, and
23
CO
NT
EN
TS
Putting it into practice
[Slide images: first pages of three publications]
• Edmunds, S. C., Pollard, T. J., Hole, B. and Basford, A. T. Adventures in data citation: sorghum genome data exemplifies the new gold standard. BMC Research Notes 2012, 5:223. http://www.biomedcentral.com/1756-0500/5/223
• Sneddon, T. P., Li, P. and Edmunds, S. C. GigaDB: announcing the GigaScience database (Editorial). GigaScience 2012, 1:11. http://www.gigasciencejournal.com/content/1/1/11
• Mayernik, M. S. Special Section article, Bulletin of the American Society for Information Science and Technology, June/July 2012, Volume 38, Number 5.
because good research needs good data
Thank you for your attention
DCC Website: http://www.dcc.ac.uk/
Monica Duke, Alex Ball: http://www.ukoln.ac.uk/ukoln/staff/
8th International Digital Curation Conference
“Infrastructure, Intelligence, Innovation: driving the Data Science agenda”
14–16 January 2013, Amsterdam
http://www.dcc.ac.uk/events/idcc13