Duke, M. and Ball, A. (2012) How to Cite Datasets... Publications : A Report of the Digital Curation Centre. In:...

Duke, M. and Ball, A. (2012) How to Cite Datasets and Link to
Publications : A Report of the Digital Curation Centre. In: 23rd
International CODATA Conference, 2012-10-27 - 2012-10-31,
Taipei.
Link to official URL (if available): http://codata2012.tw/
Opus: University of Bath Online Publication Store
http://opus.bath.ac.uk/
This version is made available in accordance with publisher policies.
Please cite only the published version using the reference above.
See http://opus.bath.ac.uk/ for usage policies.
Please scroll down to view the document.
How to Cite Datasets and Link to Publications
A Report of the Digital Curation Centre
Monica Duke
Alex Ball
30 October 2012
Hello, my name is Alex Ball and I work for the Digital Curation Centre in the UK
with my colleague Monica Duke. The Digital Curation Centre, or DCC, is a centre of
expertise in digital curation and research data management funded by JISC, which is an
agency that helps to develop and maintain information systems in the higher and further
education sectors. For about five years now JISC has been pushing to improve research
data management in the UK, and as part of that, we at the DCC are publishing a series of
guidance documents based on themes set by JISC.
One of the themes is data citation, and at about this time last year we published both a
Briefing Paper and a How-to Guide on the subject (slide). We have some copies here to
give away and you can also download a copy (Figure 1) from our website or read it online.
http://www.dcc.ac.uk/resources/how-guides/cite-datasets
Figure 1: How-to Guide on data citation, on the Web
As I only have twenty minutes, I’m not going to be able to go through the whole
document. Instead, I’ll pick out some of the more interesting issues we came across when
putting the guide together.
1
Motivation
I guess I don’t need to convince anyone here about the need for data publication and
citation, but to understand it we have to think about scholarly communications more
generally. Journals are the big success story in this area, but what made them so popular
(Figure 2)?
• Awareness raising
• Protection from plagiarism
• Verification of results
• Basis for future research
• Reward models
• Permanent access
Figure 2: What’s great about journal papers?
1
They provided a way of communicating research results such that others could verify
the results and build on them, while also ensuring authors received due credit, and in time
rewards, for their work. Formal publication also meant formal archiving could take place.
But as the process of conducting research has become more specialist and complicated,
your average scientific journal paper can no longer contain all the information it needs to
make the research reproducible (transition); we also need the underlying data. But we
won’t get data routinely shared until all these things apply to data as well as to journal
papers. I would argue (Figure 3) that, given time, data citations are what will make it
happen, because the citation model is well understood and trusted.
• Visibility for data
• Protection from plagiarism
• Possibility for verification of results
• Data on which to base future research
• Possibility for reward models
• Access
Figure 3: What data citations provide
What should data citations look like? Well, every journal has its own idea of what a
citation should look like so the important point is what a citation should include (slide).
2
Elements of a data citation
Here are four standard citation styles I found in the literature: see the Guide for the full
references. Which elements do they use?
Author, Publication date, Title, Version, Feature, Resource type, Publisher, Identifier,
Location, Unique Numeric Fingerprint.
Altman and King (2007): Dataverse
• Sidney Verba. 1998. “U.S. and Russian Social and Political Participation Data,”
hdl:1902.4/00754 UNF:3:ZNQRI14053UZq389x0Bffg?== NORC [Producer]; data
set [Type (DC)] ICPSR [Distributor].
Lawrence et al. (2008): BADC
• Iwi, A. and B. N. Lawrence (2004). A 500 year control run of HadCM3. [GridSeries, http://ndg.nerc.ac.uk/csml2/GridSeries] Version 1. BADC. urn:badc.nerc.ac.uk
_coapec500yr [Available from http://badc.nerc.ac.uk/data/coapec500yr].
Green (2010): OECD
• OECD (2009), “Key short-term indicators”, Main Economic Indicators (database).
doi: 10.1787/data-00039-en http://dx.doi.org/10.1787/data-00039-en (Accessed on
14 September 2009)
2
Starr and Gastl (2011): DataCite
• Irino, T; Tada, R (2009): Chemical and mineral compositions of sediments from ODP
Site 127-797. V.2. Geological Institute, University of Tokyo. Dataset. doi:10.1594/
PANGAEA.726855. http://dx.doi.org/10.1594/PANGAEA.726855
There are five elements that occur in all four styles, four of which have a long pedigree
in scholarly citation:
• Author
• Publication date
• Title
• Location (= identifier)
Despite the fact that we’ve had ISBNs since 1970 and online catalogues in widespread
use since the mid-1980s, identifiers didn’t really start to catch on in citations until the
introduction of DOIs in the last five to ten years. I’d guess this is because with things like
ISBNs there is no central register that allows you to look up the item; booksellers and
libraries have had to build them up for themselves. So identifiers have tended to be used,
if at all, more like checksums for making sure you had the right item, rather than as a way
of accessing resources.
But the Web is changing all that. We now have ways of making locations persistent
enough to be used as an identifier (transition). While it’s possible to do this by carefully
managing URLs, it’s more usual to achieve it by using a fake location, made up of a resolver
service and the identifier, that redirects to the real location. DOIs are getting the most
traction for datasets that are considered ‘published’, with Handles and ARKs being used
more for ephemeral datasets.
• Publisher
Another change wrought by the Web is that we are now used to getting scholarly
content direct from there rather than from library shelves (transition). This makes the
publisher more important than ever as both the host of information and the guarantor of
its quality.
That might seem straightforward enough, but of course it’s never as simple as that.
3
Issues and challenges
Take the author, for example. Authorship is a strange concept in the concept of a dataset.
More natural roles might be a compiler, or a principal investigator, or a corporate owner.
Furthermore, it is far easier to rack up a silly number of contributors with datasets than
with textual publications. In such cases, a simple citation like this isn’t going to cut the
mustard. Most likely you’ll need some sort of microattribution approach (slide).
This spreadsheet was submitted as part of the supplementary data for an article
published in Nature Genetics last year. You’ll see it attributes each genetic variation in
the dataset to its contributor, as identified by a Thompson Reuter ResearcherID (other
contributor ID schemes are available). This was very much a proof of concept. In future
we might hope for this sort of information to be made available as linked data, preferably
somewhere more accessible than supplementary data, like DataCite’s metadata store.
3
1 11
1 11
2 44
2 44
• Data points
3 99
3 99
• Data tables
• Data files
1 11
1 11
• Datasets
2 44
2 44
• Data collections
3 99
3 99
Figure 4: Granularity
Granularity can also be an issue. Just as you might cite only a sentence or a page of an
article, with data you might find yourself citing only a single data point, or a table, or a file
containing several tables, or dataset made up of many files. You might want to cite a more
abstract subset of data such as one of the Features I mentioned earlier, or you might want
to cite a whole collection of datasets.
The practical answer is:
• Cite datasets at the finest level that is appropriate and for which an identifier is
provided.
• If that is not fine enough, provide details of the subset of data you are using at the
point in the text where you make the citation.
So, now you have an in-text citation and a bibliographic reference. Where should that
reference go?
• Special data resources section?
• Acknowledgements? These are already mined for funder information, so could be
mined for data citations as well.
• Accession codes? In 2011, Nature published a data DOI for the first time (see
http://dx.doi.org/10.1038/nbt.1992 – an article on the genomes of rhesus macaques),
and later, in a paper on the recent outbreak of E-Coli in Germany, published the
DOI for a dataset held by the Beijing Genomic Institute for the first time. In both
cases the editors decided to put the citation in with accession codes rather than the
reference list as the datasets hadn’t been peer-reviewed.
• Reference list?
This is something that’s still being worked out by the movers and shakers, but if data is
to be thought of as a first-class research output, it really should be in the reference list.
While we’re on this topic, there’s a related issue in the case of data reuse that if the data
4
citation is in the reference list, should it appear alongside or independently of a reference
to the related article?
The data might well be useless without the kind of context that a journal article
provides, but in print journals with a limit on the number of references, one could be
consider it a waste of a slot to include citations to both the paper and the data. This is
an area where pervasive forward linking would solve a lot of problems. If publishers can
be sure that when a reader follows a link to a dataset, the landing page would forward
them on to the data collection paper and any other papers using it, or even other high
quality documentation, they might be more open to accepting a lone data citation where
it is appropriate.
That is why we are recommending the following:
• Include the citation in the reference list – some reference management packages
now include support for datasets, which should make this easier.
• When your data collection paper is published, notify the repository holding the
dataset.
• When you publish a paper in which you reuse a prior dataset, notify the repository
holding that dataset.
The other issue I want to talk about is dataset identifiers, and how they should be
applied to dynamic datasets. There are two ways a dataset can be dynamic (Figure 5). The
first (animate) is where the dataset is fairly stable in its extent, but points are revised every
so often. A table of the masses of subatomic particles would fall under that category.
• Revised datasets
• Expanding datasets
Figure 5: Types of dynamic datasets (Click on illustrations to animate them)
The other, more common case (animate) is where a dataset is continually expanded
with new data, such as with sensor data.
There are three ways of making such datasets citable.
1. Differentiate versions by access date rather than ID
A
2. Take time slices
A
B
C
5
3. Take snapshots
A
B
C
The first option I know is adopted by the National Snow and Ice Data Center in the
US, because first, in the disciplines they serve the dataset itself is more important than
the version, and second, the Federation of Earth Science Information Partners of which
they are a part believe that the identifiers they assign aren’t identifiers at all but locations,
because you can resolve them to addresses.1 It’s not a view I share, and so I’m not keen
on this option. The second approach really only makes sense with expanding datasets, and
even then works best if the researchers tend to use one slice of the set at a time. Even
so, it is possible to combine it with the first approach, or the third one which is the one I
reckon is most generally suitable; if the rate of change is particularly frequent, it would
probably be best to take these snapshots on demand rather than at predefined intervals.
The apparent downside of the third option is that it seems to involve massive duplication
of data, but there’s nothing to stop the data backend generating these snapshots on the fly
from a single master sequence.
There’s plenty more I could go on to talk about, but time is pressing so instead I’ll flash
the headlines before your eyes.
4
Guidance for researchers
When publishing a paper. . .
• Deposit any data you have collected and used as evidence.
• Ask for a persistent ID/URL for your deposited data.
• When your data collection paper is published, notify the repository holding the
dataset.
When citing a prior dataset. . .
• Use the data citation style required by the editor/publisher.
• If no style is specified, use a standard data citation style, adapted to match the style
for textual publications.
• Default to writing IDs in the form of URLs if possible.
• Include the citation in the reference list – some reference management packages
now include support for datasets, which should make this easier.
• Cite datasets at the finest level that is appropriate and for which an identifier is
provided.
• If that is not fine enough, provide details of the subset of data you are using at the
point in the text where you make the citation.
1 http://wiki.esipfed.org/index.php/Interagency_Data_Stewardship/Citations/provider_guidelines#
Note_on_Versioning_and_Locators
6
Sneddo
n et al.
GigaScie
http://w
nce 201
ww.giga
2, 1:11
sciencej
ournal.co
m/conte
nt/1
n
ctio
l Se
ecia
cess
Sp
Open Ac
5:223
s 2012,
arch Note -0500/5/223
756
BMC Rese
s et al.
al.com/1
edcentr
Edmund
ww.biom
http://w
/1/11
EDITO
GigaDB
genome DAP)
rghum
n (R ating and
tion: so
d
dsata cita gold standarservatios are invermstigsuggesetsth, isat
in
s
re
ew
tu
e
n
ticl
eies the
e te
ipline
s
BasfordPr
Adven Isp
disc on, as th ished ar tation
su
ic
xandra T &
lif
ci
Ale
ndem J Pollard , Brian Hole andccess across acs.adAemdata cietalitist of a.puFor,bleasmalilyndathtaan the soruurlece
deatsaaex
A
sultadly ptio
ions tation referenc
eir
berebro
s , Tom
a
ut
dat
iv
ce
a
und
it
ci
t
t any inst g datakes it essin ent
t Scott C Edm
ex of fer to th ion of a
arch
visieon
theial tha
a
sectic
ia
e, pro
th
re
re
by
ct
D
e
cy
y
it
pra
n
ed
se
in
or
ma
ic
ll
cien
ve
h
ud d scie
which
In
, ot
arch effi
pica etdat
ill m
hoads r. Data
a gintif
and
strese
m
rsittyextense ive
d to improv
irepe
arees pape
t lim
earc ility of informaprtionomitiotantitoonbeithinngatclrenlegoo
th of entpa
tories rt of
cy and, butstraints tha
tion
antagen ofrta they reposi
s in
adv
on con searchto the
Abstract
mous
e pa
Rsseis sdriven by theersavas inilabevery field. Inerimaddaentcials datrantaspaeve
io effo ers,
seri
r, m
re data ention e ct
and
Cita ernik
tabl
CORR
ESPON
DENC
E
Tam P Sne
2,3,4†
1*†
M
by M
n
leti
Bul
n
atio
form
for In
ciety
n So
erica
e Am
of th
logy
chno
d Te
ce an
Scien
–
2012
uly
ne/J
– Ju
5
ber
Num
38,
me
Volu
er Li and
>
seke the
anulated he
ting “G itories. Ho
em d st
libramakin
niveefitAR
gE practic specifically by
tama
s of
be calcbe reack accumula
implcan an
pos
ve dataa-sharin
dole
ed
ch da CThe
AR)/Uben
AG
n ’s wor
collecti
es
/UC licay an
hed datS Psup
hor
3],
(N
plement
sear
for the
e ca
as whpractics an
CARnity
establis
U n
A04684
po
a re search
. Haut
e Nmu
a citation
I O bee
s Vhas
k is
at
data [SR
com
datoward
data
ng ale tren
in th
section
proPces
erni
ic Re
lopirab
RE
the raw
etad search
veasu
reference
May ospher ork with g de
<
having
em
me
ive
w
into the in addition to
com
of re
clud
in
m
thew
the Creat
ejournal.
l Estate,
terms of
Mat r for At AR). His , includ s also in pects
gascienc
Thus,
Industria
under the distribution, and
t@gi
as
set.
Po
te
C
es
scot
st
Tai
buted
al
ce:
e distri
Street,
Cen arch (U servic intere d sociCorr
tricted use,
esponden ors
Dai Fu
Access articl
its unres
*
Ltd., 16
le
ribut
is an Open .0), which perm
artic
Kong
cont
Rese ch data research ent an
†
l
This
g
the
of
Hon
/by/2
Equa
ral Ltd.
ar
pm
ce, BGIthe end
rg/licenses
ed Cent
.
1
able at
rese ns. His develo .
GigaScien
see BioM
commons.o is properly cited
on is avail
et al.; licen (http://creative
g Kong
e
work
r.edu
NT, Hon author informati
original
Edmunds
citatio ructur
License
© 2012
ded the
of
st
>uca
s Attribution
um, provi
Full list
Common on in any medi
infra nik<at
ducti
er
may
23
TE
NT
e GigaS
Edmund *
s
properly
repro
N
CO
ncing th
Scott C
Open Ac
cess
cience d
atabase
With the
launch
of GigaSc
the inte
gra
ience jou
goals to tion of manus
rnal, her
crip
e we
pro
public rep mote open-d t publication wit provide insi
ght into
ata and
ository doe
h sup
the accom
s not exis reproducibility porting data
panying
and
t, for the
of
database
supporting research, GigaDB tools. Reinforc
Backgr
Gig
ing and
ound
also aim
data or
upholding aDB, which allo
tools feat
s to pro
Interne
ws
Gig
t pionee
vide
ured in
r Sir Tim
the journa a home, wh aScience’s
preciou
en a suit
s thing
Ber
l and bey
ners-Lee
and will
able
selves”
ond.
has stat
last lon
[1], and
DC
ed:
ger
C
"Da
des
produc
and
tha
ta is a
pite the
tion in
challen n the systems
the ope DataCite best
areas suc
tentially
ges created
the
n-d
pra
mctic
ata
h
as genom
e guidel
faster tha
movem
most ope
due to
ent
ines. In
attempts
ics
n
promoting
must still the ability to growing at rate data and ma n CC0 waiver, , data is also
released
much of
xim
s postore and
cutting
under the
these pre be made to cap
BGI’s ext izing its pot
any
proces
goals of
ential re-u legal red
ensive
GigaScienc cious resourc ture and safegua s it, been
tape [4],
se.
es as pos
populated computing
semina
e journa
tion
infrastr As GigaDB use
sible. Wi rd as it rele
with dat
l to ma
ucture,
and cur , and transpa
ased in
s
th the
ximize
asets
ren
ate all of
dat
Releasing a citable form produced by it has also
ing this
the sup cy, having som a reuse, disBGI, mu
pre-public
research
por
ewhere
of succes data in this
ch of
ation.
GigaDB
is essentia ting data and
to host
nov
ses
el
to
manner
(http://
dat
cing of
gigadb.org l, and the Gig tools surroundhas had
data from e, particularly
aScienc
a
) is key
num
spu
bre
the
e
ak (als
rring the
database,
ber
to achievi
dea
Main tex
crowdsou
ng this.
this lau o discussed in dly 2011 E. coli
t
rnch
Mik
0104:H
As can
e Schatz’
“open-sou issue [5]) resu
be seen
s comme 4 outin GigaSc
icle on
ntary in
ground rce genomics” lting in what
an
ience’s firs
has bee
and me
[6]. For
the raw epigenomics
t issue,
n term
please see
chanism
pipeline
data ava
ed
s surrou more on the
[2], in add a research art- sea
our rec
this and
ilable in
ent cor
backrch Not
all the
ition to
NCBI
respond nding data
suppor
es Data
the epi
having tion
[SRP00
citation
ence [7]
ting dat
gen
Sha
series,
,
a (totalin 5934], also has
pipeline[3 omics tracks
using the ring, Standardiz in the BMC
Gig
g
84
and
aD
Rerelease
atio
GB),
B and pub
the too
cited in ], hosted in Gig
of the sor n and Publica
ls created such as
lication
GigaDB
the pap
aDB. Thi
ghu
in
m
cur
er
Genom
Identifier)
for the
s datase
genom
through
rently
of these
e
t is link
is a hep comprises ove Biology last yea e by
additional , providing stab a citable DO
consists
r 30 dat
atocellular
I (Digita ed and
ility, and
discove
asets. The r [8].
of
ability
carcino
most imp l Object
individuals 15 Tb of nor
largest
to be trac rability and
ma dat
ma
ase
ortantly
trac
journal
these sam . Additional dat l and tumor raw t [9], which
,
citation ked in the sam eability thr
ough its
a derived
s.
ish Lib
dat
also be e individuals,
rary and Working and e manner as
and pro a from 88
e.g. tran
add
stan
par
org), the
DataCi
scriptome cessed from
users can ed to a DO
te con tnering with the dard
se dat
I rap
sequence,
sor
immedia
through
ase
Brit- project
tely acc idly after their
their cen ts are search tium (http://
in a sing
dat
generat can
environ
tral me
le, perma ess the data
The goa
me
tadata rep able and har acite.
from this ion so
nent plac
l of cen
vestable
area, and ntal sciences,
osit
ory
ongoing
e.
is
tral
data
. Ou
exemplifie
izin
we hav
BioMed
d by the g data and ma
e worked citation is stil tside of the
which we
Central
kin
mouse
provide
to ensure closely with l quite a new lish
methylom g it reproducib
all data
ed resu
our pub
* Correspo
le
that cita
e
lish
tion of
ment file lts. This includ necessary to rep dataset [3] in
GigaScien ndence: scott@gi
data foll er
es the raw
s, the Me
licate the
Estate, NT, ce, BGI-Hong Konggasciencejournal.
ows
rea
dus
fast
pubd-d
com
Hong Kong
a
q
epth
Co. Ltd,
rea
soft
16 Dai Fu
examples files. This and ware package, ds, bam alignStreet, Tai
and the
the sorghu
for futu
Po Indu
be done
strial
re
big
to not onl data submitters m study are exc wig
ima
© 2012
y
elle
in
l journa
comply
Sneddon
with but regards to what nt
l data
Common
et al.; licen
policies.
also go
can
reproductis Attribution Licen see BioMed Cent
Authors
beyond
on in any
se
ral
minnot onl
medium, (http://creative Ltd. This is an
comm
Open
provided
y adhere
the origin ons.org/license Access article
d to
distribute
al work
s/by/2.0),
is
d
whic
2,5
co
archde
timse
uc uld coun
of of ext m
cite
eases exp ly re
t of
ic progre ible to resede
r
are, how
rs tion
odwo
ore fou
ents
prers
Scientif
duc
-t amoun
honda
r ca way incr
ess
ac
thei ially
gm
a pro
ent
emdat
a major
rms.alThe m
inng
venientta experimfo
idly acc
Autthe sub
freetial itio
data me an to cite
ed fo
dgfor
s. for
as stan
tent
and rap
ingnaizcon
co vidi lierv
tion of
pitebepro
ogn
datreaco
owle nto many becoper
rec
licada
ginly can po h the
sistent
ip nee s,t pro
urge
such
kn
dup
y des
that, des
videsac
sc
are dit the
ssarciyes
supporting
l to as ers be
tha
been ece
straint is
the , appeal jouterna
sets
em
way
the crero
ss di
con
tion of
roug
ices
h rv
ntliaarticles
agen sucse
hasg unn
al syst
publica
arch
red
her tion in
taucin
lyt receive ardac
ent One
ent
se
ndrencedeiinr po . The rec
, a form
, data ofunity th
da
tionducerslorare
rnmion. strapro
in ot
refe
pmenIn this
en d he
the field
em
If re ctichae ni
veinat
m
arch diss
insm
andr ha
ta
tha output
d reg theda
a
s.
data. a m
m
go
se
ve
ir
gi
dat
cite
l
es.
e
se
,
of
dat
re
es
ot
pra
de
co
re
urc
ctiv
are
ra
nity
ge
c
e
e nowled
effe
oc tion meces howific
in reso it focure more of
mu
r orlves thack
ese
tifierating the
pemse
the
causgooio
ientifi veral fede com
m sha user data pa
nt
ndatpra cita
es ar the
iden el
perly s bele
illustrat
s, on
g sc
tation
al ich the
of icat d
pro
Summ to viduwh
scie ories.
Y
Se
cialso
er mp un
jectspendgucre
id in ntivvaize
citin
way titoon first
tion
theic reposit
ng so.alIt
MAR
tum.
ral
indi, in w
fundexa mm
ally
aradpotential citanotable
tal ob serv
on ntif
er a citaoftion
d scie
tione to ince
d for doi
form dar
SUM
ct
fede ar.
a arch ng y
form g momen its digi
Pres
s of dat
ss s,to is ge Biology is re
therent stan
l cita pro
dces
ent
se attilarlreqcouireug
a
OR’S
reitie
ple
e lack hic og
e of
h cur
impa
and
iona ce
The
on
ss an tio
ting dat Multi
EDIT portanc ly gainin
st ye
pr
n, thliograp
Genalom
a to
ho datth
and form
ro the r their indices.
rnat
Cite
suppor
iabl
d
ta e bib es gen
s.
e pa
slowomesdindata prafo
e sc
im
cticalities
sion of
fo
ta Ac identif
ta
, inte
Data
cent
mis
lt of the
ta ci sorgunhum
n th TA) an
ctorlack
ke es then. Ef rtsnt the subth
ever,sethe
The
ly re archers, ting data arch Da ort daera
lintrat
tation
l mm iti
ce da sessed cition
as a resu ]. Again, howan
y
d
A
on
ons
tio
s
invesme
withi
ur
sev
an
is
ta
se
pp
dem
aug
as
ci
l
ta
so
co ics andncy
m linking
Re
su
of
rese
onshors y (COD
er itio
but
da cita accessibleit[7,8
ta cican specia
ch om
inem
y gingngsyst
re ent ideentif
daiers
ofnal
tatiaut
n
IS&T asons to resear
gen
bett add
n by practice
pasist
v
e
ci
lly
ed
og
io
iv
tio
AS
tag
ns
on
ti
th
lica
gi
der
be
ct
ol
ta
at
ti
ta
c
ized
per
hin
pub
er
re
in
e
12
and es
ci
la ng
of
iate
effe
ta tra
ting to form
use
echn
to dandadlsTrela
e bei
mpi
ote th at the 20
tific
ny drecAognosit
prec
isedemdatatedhas
fuel
In]
strong ral inertia d for da
om
lity.
vertsall
a
ctic
s
co
en
ap
bi
to
d
uni
Gen
sca
[10
,
ite
a
l
ra
n
ci
an
dep
prom
la
s
on
ent
ea
of
ine
ma
P
an
ne
sp
an
ltu
ne
re Rec ence in meat
dica
re theirtati ditps[9].
rS
tions manag
[1],
the Hu
e cu
and n [1].
a pa
dem rs combi institu
n. De
tors to ta cidue
dat
creho
il afo
Inte
d
etected rch D
ned from ans community r to adtiga
and
tatio pervasiv growing
r Sci
de
to makenc in- ndards
g und
ed sionicity an
eiving works t lon
ons lear
utio
in
ta ci
Disizcus
recda
ta
a
ehol
eseaneed al Cou
ma
wenData fohlightRthe
pl key less e from the C. eleg ly available prioBro- tefrom
t the
the
rest d dat
m
n Sta d attrib d to da
stak m recogn
on da es and
aldthat on
on and to
hig
Onizeeofsi
a pag
atioto adve he tey e[11
] alsooa
and free the field of gen In falsifie
rd on vali
ce. Bu riety of
natiion Cmp
ntiv
erdat
e fro
, taking
on an relate c.
to
broadly
it
phas
ptan
va
B poses of
a it tcitaistithe
ince
es hapsychomlog
eatatte
spur
-Int
abl
es
e y valuable to s have tried enciand
t com ally em Project t making data ctur
ie
ma
D
acce
pi
not
for
ti
a
om
and
e
TA
us
m
d
dat
to
ble
fro
vi
ndl
On
ta
C
A . up onaccess da
to
[3]
essi de
nm
es ag
project
pro
was tha
broa essure
strufou
.S.ily acc
the
ying ed
actilODnce
d initi
te
Ueas
l Acatrust in
infra uent genomics the Bermuda Rulagreetion nwas
e Cscie
citatio
TAsmathe rt on ly stud
pr
Grople20
tl
ts an
11 onousDeAfor
The tain
ingk com
publica atioSub
public ith thof gain
seq
in
iona
as
with ing data data se
in po
ve
11 ti
t a storehCOtion
derdale
laid out
vesusas
form[2].
issuesation T
ot
ser
of
e Nat ssatthe
of lica enssus re en acti arch 20 ation
ed w
ug
Fort Lau munity has
ominics
t
ich
pub
th
ctice as
A
n
the
wh
h
pra
,
se
dre
prom teristics
io
in
wit
or repfo
be
com
rmory p in liatedrger
ns
this
M
ed
at
in
tegr
follow
a co
ac
In ositdire
also
ho affi a la
cts hrin logical science es, as outlined collab
motiv matelypeens
publi-hop in
d In
ctly
sent
Dryicad
al
ks
in
an
char
ha
bio
al
rec
as
lt
or
ulti
ts
of
ctic
l
n
er
ks
tion
ase
hn
es
pra
and ltura The wid
is the
a w ]. part
ortant itatio
resu
wor
regardienc the aimp
o Interna heldTec ler dat
similar
cu [4].
sornces [12
DS
is
C
nt
inlythis eoscins
al
Toront
follow
of
,
n
p
me
ed
on
OR
to
LE
ard
the
logy
tu
bee
scie
le
ts
sp
bio
nd omeCBio
from
se
yc
mpted
n has
r Ge stra F fu
kshostepl forw
TIC
KEYW
thre
evenes te
adoptio infrastructureto tThe
also atte ines published
data
AR
S in Gen
wornexdt wilgen
ns
om ra of fo publish
Life genomics
[5], but
del
arch
l
XT
ha on an
to lor The N ed
tatio
of the irec
rkshop
the gui
es eof the ful prorepository
NE
ionaincentives for T cation
rese
cticth
s. t pra
hic ci
Dghum bico
ease Wo
-access the Nat
ti
ent
loring
tionthe bes
and use
grap
enceerof
ic necessary to citafood cro
Data Rel lack of easy-to an ofabs
NpSFSor ta
ows
cita
Expporting raw ilable dat>a reagem
foll
biblio
rt
a
asrary
l lib
s: sup
ospheffo
T].heThisgwor
ava
da k ingatic
mand
t man
the
back by fields as wel
ers.
s
E longloping
e
e
and
At
oth
se
t
ta
G
tim
ce
[13
th
r
ve
to
hav
in
van
rm the rele
ler da
ur
the fo
ny
h
de ilabfo
ot nity by
inA the
data
eT P
on ava
list inughration easdily
for ma
reso
researc
Info
ommu
nity, this
firstN tim
prcom
ilable in
EX
eciathro
berse
planle tocythe
to go
ataa ava
ss to
orponly and
commu
re is a
tionilabrd
focu ataava
Cope
for the
s,
authoricses spwor
acce
sed-Ddat
ceseo
[6]:
d atthe
en
ir rsityk
wever,
rv
es of the
ry is g dat
da
integrating
: annou
ddon, Pet
Abstract
1
ay
a
Datatthew S. M
RIAL
cited.
h permits
under
unrestrict the terms of the
ed use,
distributio Creative
n, and
S
Figure 6: The How-to Guide has been cited in the literature
• Cite the exact version of the dataset you need.
• When your paper is published, notify the repository holding the dataset you used.
5
Guidance for data repositories
• Provide persistent IDs for the datasets you host.
– The ID should remain unique.
– The ID should always point to the same version.
– The ID should resolve to a URL.
– The URL should locate the dataset’s landing page. This URL should belong to
a landing page that contains descriptive information about the dataset, as well
as links or instructions for accessing it.
• The explanatory metadata should not change for a dataset with a persistent ID.
• IDs should only be assigned once no further changes are expected.
• With dynamic datasets, provide IDs for snapshots or time slices.
• Provide sample citations on dataset landing pages.
• Link from landing pages to publications citing the dataset. This may require collaboration with authors and publishers.
6
Putting it into practice
In the year since we published this guidance it has made quite an impression (Figure 6).
It was mentioned earlier this year by Matthew Mayernik in the Bulletin of the American
7
Society for Information Science and Technology2 in the same breath as the guidelines put
out some months earlier by the Federation of Earth Science Information Partners (ESIP).
Once they saw them, ESIP themselves called our guide ‘the most useful guide’ on data
citation (transition).3 The correspondence paper ‘Adventures in Data Citation’ by Edmunds,
Pollard, Hole, and Basford uses the guide as shorthand for best practice in data citation,4
as does the editorial in the inaugural issue of the data journal GigaScience.5 Most recently
(transition), Thomson Reuters refer to it in their essay on the selection policy for their new
Data Citation Index.6
It is good to see the guidelines being used in practice, but the landscape is developing
all the time. So we’re keeping a watchful eye on the evolution of data citation practices,
and hope to bring out an updated version of the guide in the first half of next year.
Monica Duke, Alex Ball. DCC/UKOLN, University of Bath. http://www.ukoln.ac.uk/ukoln/staff/
Except where otherwise stated, this work is licensed under Creative Commons Attribution 2.5 Scotland: http://creativecommons.org/licenses/by/2.5/
scotland/
The DCC is funded by JISC.
For more information, please visit http://www.dcc.ac.uk/
2 http://www.asis.org/Bulletin/Jun-12/JunJul12_MayernikDataCitation.html
3 http://wiki.esipfed.org/index.php/Interagency_Data_Stewardship/Citations/provider_guidelines#
Introduction_and_Summary
4 http://dx.doi.org/10.1186/1756-0500-5-223
5 http://dx.doi.org/10.1186/2047-217X-1-11
6 http://wokinfo.com/media/pdf/DCI_selection_essay.pdf
8