How to Cite Datasets and Link to Publications
A Report of the Digital Curation Centre
A Digital Curation Centre ‘working level’ guide
Monica Duke and Alex Ball, DCC/UKOLN, University of Bath
CODATA 2012: Data Publication and Data Citation (2), 30 October 2012
Except where otherwise stated, this work is licensed under Creative Commons Attribution 2.5 Scotland.

Awareness Level
A Digital Curation Centre Briefing Paper
19th July 2011
Data Citation and Linking
By Alex Ball and Monica Duke, UKOLN, University of Bath

• Introduction
• Short-term Benefits and Long-term Value
• Perspectives on Data Citation
• Roles and Responsibilities
• Issues to be Considered
• Related Research
• Additional Resources

Introduction

On the surface, citing datasets is a trivially easy thing to do. Style manuals such as the Publication Manual of the American Psychological Association and the Oxford Manual of Style have provided sample citations for datasets since at least the early 2000s. The process of making datasets citable, however, is rather more difficult. In consequence of this and other factors, a culture of citing datasets has been slow to develop. Nevertheless, it is vital that researchers cite the datasets they use, if datasets are to be regarded as legitimate academic outputs in their own right.

Short-term Benefits and Long-term Value

There are several short-term benefits to making datasets citable, citing them in practice, and linking datasets to papers that make use of the data.

• If the authors of a scientific publication properly cite the data that underlie it, it is much easier for the reader to locate those data. This in turn makes it easier for the reader to validate and build on the publication’s findings.
• Data citations ensure that data contributors receive proper credit when their work is reused by other researchers.
• If a dataset links back to the paper that describes its collection, a reader coming to the dataset direct can use that link to put it in context and understand the methodology used.
• If a dataset links to other papers that make use of it, these links can be used by the contributors and data publishers to demonstrate the impact of the data. Potential reusers might use these links to discover critiques of the data or to provide inspiration for how to use them.

Once a culture of data citation has been established, several other benefits are likely to become apparent.

• The publishing infrastructure that makes the data citable will also help to ensure they are available for reference and reuse long into the future.
• There will be less danger of rival researchers ‘stealing’ results from those who publish their data openly, as failure to give due credit would amount to plagiarism and thus be punishable.
• Services built around data citation will make it easier for researchers to discover relevant datasets.
• Data citations could be used to measure the impact of both individual datasets and their contributors.
• Researchers could gain professional recognition and rewards for published data in the same way as for more traditional publications.

Taking these points together, there would likely be an increase in the quantity and quality of data published, with all the benefits this implies for the transparency and rate of scientific research.

Digital Curation Centre, 2011.
Licensed under Creative Commons Attribution 2.5 Scotland.

Outline
• Motivation
• Elements of a data citation
• Issues and challenges
• Guidance for researchers
• Guidance for data repositories
• Putting it into practice

What’s great about journal papers?
• Awareness raising
• Protection from plagiarism
• Verification of results
• Basis for future research
• Reward models
• Permanent access

Data citations provide...
• Visibility for data
• Protection from plagiarism
• Possibility for verification of results
• Data on which to base future research
• Possibility for reward models
• Access

Citation styles

Four data citation styles: which elements do they use?

Altman and King (2007): Dataverse
• Sidney Verba. 1998. “U.S. and Russian Social and Political Participation Data,” hdl:1902.4/00754 NORC [Producer]; data set [Type (DC)] ICPSR [Distributor]. UNF:3:ZNQRI14053UZq389x0Bffg?==

Lawrence et al. (2008): BADC
• Iwi, A. and B. N. Lawrence (2004). A 500 year control run of HadCM3. [GridSeries,] Version 1. BADC. [Available from].

Green (2010): OECD
• OECD (2009), “Key short-term indicators”, Main Economic Indicators (database). doi: 10.1787/data-00039-en (Accessed on 14 September 2009)

Starr and Gastl (2011): DataCite
• Irino, T; Tada, R (2009): Chemical and mineral compositions of sediments from ODP Site 127-797. V.2. Geological Institute, University of Tokyo. Dataset. doi:10.1594/PANGAEA.726855.

Elements shown in the examples, and the styles that include them:
• Author: Dataverse (Sidney Verba; NORC [Producer]), BADC (Iwi, A. and B. N. Lawrence), OECD (OECD), DataCite (Irino, T; Tada, R)
• Publication date: Dataverse (1998), BADC (2004), OECD (2009, with access date 14 September 2009), DataCite (2009)
• Title: all four styles
• Version: BADC (Version 1), DataCite (V.2)
• Feature: BADC ([GridSeries,])
• Resource type: Dataverse (data set [Type (DC)]), OECD ((database)), DataCite (Dataset)
• Publisher: Dataverse (ICPSR [Distributor]), BADC (BADC), DataCite (Geological Institute, University of Tokyo)
• Identifier: Dataverse (hdl:1902.4/00754), OECD (doi: 10.1787/data-00039-en), DataCite (doi:10.1594/PANGAEA.726855)
• Location: BADC ([Available from])
• Unique Numeric Fingerprint: Dataverse (UNF:3:ZNQRI14053UZq389x0Bffg?==)
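The DataCite-style example above can be assembled mechanically from its elements. The following is a minimal sketch in Python, assuming the element order visible in that example (author, year, title, version, publisher, resource type, identifier); the class and function names are hypothetical and are not part of any DataCite tool.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class DataCitation:
    """Elements of a dataset citation (field names are illustrative)."""
    authors: List[str]
    year: int
    title: str
    publisher: str
    identifier: str
    version: Optional[str] = None
    resource_type: Optional[str] = None


def format_datacite_style(c: DataCitation) -> str:
    """Join the elements in the order used by the DataCite example above."""
    parts = [f"{'; '.join(c.authors)} ({c.year}): {c.title}."]
    if c.version:
        parts.append(f"{c.version}.")
    parts.append(f"{c.publisher}.")
    if c.resource_type:
        parts.append(f"{c.resource_type}.")
    parts.append(f"{c.identifier}.")
    return " ".join(parts)


# Element values copied from the DataCite example in the slides.
example = DataCitation(
    authors=["Irino, T", "Tada, R"],
    year=2009,
    title="Chemical and mineral compositions of sediments from ODP Site 127-797",
    publisher="Geological Institute, University of Tokyo",
    identifier="doi:10.1594/PANGAEA.726855",
    version="V.2",
    resource_type="Dataset",
)
print(format_datacite_style(example))
```

Optional elements (version, resource type) are simply skipped when absent, which mirrors how the four styles differ in the elements they include.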
Key citation elements
• Author
• Publication date
• Title
• Location (= identifier)
• Publisher

Attributing datasets to many contributors

Granularity

[Figure: grids of sample values illustrating each level of granularity]
• Data points
• Data tables
• Data files
• Datasets
• Data collections

• Cite datasets at the finest level that is appropriate and for which an identifier is provided.
• If that is not fine enough, provide details of the subset of data you are using at the point in the text where you make the citation.

Placement of data citations
• Special data resources section?
• Acknowledgements?
• Accession codes?
• Reference list?
• Alongside or independent of a reference to the related article?

• Include the citation in the reference list.
• When your data collection paper is published, notify the repository holding the dataset.
• When you publish a paper in which you reuse a prior dataset, notify the repository holding that dataset.

Dynamic datasets

Two types:
• Revised datasets
• Expanding datasets

Three strategies:
1. Differentiate versions by access date rather than ID
2. Take time slices
3. Take snapshots
[Figure: one diagram per strategy, with dataset versions labelled A, B and C]

Guidance for researchers publishing a paper
• Deposit any data you have collected and used as evidence.
• Ask for a persistent ID/URL for your deposited data.
• When your data collection paper is published, notify the repository holding the dataset.

Guidance for researchers citing a prior dataset
• Use the data citation style required by the editor/publisher.
• If no style is specified, use a standard data citation style, adapted to match the style for textual publications.
• Default to writing IDs in the form of URLs if possible.
• Include the citation in the reference list.
• Cite datasets at the finest level that is appropriate and for which an identifier is provided.
• If that is not fine enough, provide details of the subset of data you are using at the point in the text where you make the citation.
• Cite the exact version of the dataset you need.
• When your paper is published, notify the repository holding the dataset you used.
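The advice to write IDs in the form of URLs can be illustrated with a short Python sketch. It assumes the identifier is written with a `doi:` or `hdl:` prefix, as in the example citations, and it assumes the public resolver addresses `https://doi.org/` and `https://hdl.handle.net/`; neither address appears in the slides, and the function name is hypothetical.

```python
def identifier_to_url(identifier: str) -> str:
    """Rewrite a 'doi:' or 'hdl:' identifier as a resolvable URL.

    Identifiers in any other form are returned unchanged.
    """
    # Assumed resolver base URLs (not taken from the slides).
    resolvers = {
        "doi": "https://doi.org/",
        "hdl": "https://hdl.handle.net/",
    }
    scheme, sep, value = identifier.partition(":")
    prefix = resolvers.get(scheme.strip().lower())
    if not sep or prefix is None:
        return identifier  # unknown form: leave as written
    return prefix + value.strip()


print(identifier_to_url("doi:10.1594/PANGAEA.726855"))
print(identifier_to_url("hdl:1902.4/00754"))
```

The sketch does no percent-encoding, so identifiers containing characters that are unsafe in URLs would need extra handling.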
Guidance for data repositories
• Provide persistent IDs for the datasets you host.
  • The ID should remain unique.
  • The ID should always point to the same version.
  • The ID should resolve to a URL.
  • The URL should locate the dataset’s landing page.
• The explanatory metadata should not change for a dataset with a persistent ID.
• IDs should only be assigned once no further changes are expected.
• With dynamic datasets, provide IDs for snapshots or time slices.
• Provide sample citations on dataset landing pages.
• Link from landing pages to publications citing the dataset.

Putting it into practice

[Slide images, not reproduced: first pages of three articles]
• Matthew S. Mayernik, “Data Citation Initiatives and Issues”, Bulletin of the American Society for Information Science and Technology, June/July 2012, Volume 38, Number 5 (Special Section).
• Edmunds et al., “Adventures in data citation: sorghum genome data exemplifies the new gold standard”, BMC Research Notes 2012, 5:223, http://www.biomedcentral.com/1756-0500/5/223.
• Sneddon et al., “GigaDB: announcing the GigaScience database”, GigaScience 2012, 1:11 (Editorial).
ts se ducprers r [1], and and will last long -Lee has stated: a ntific pro essible to resedecade re od or ly co t amo Scie The DC en way or m . -t ho al r tial pro ac th C ents g a maj. Aut substan ee emdata data dly acc venientta experimfo rm prom teproduction ially despite the cha er than the syst "Data is a the and DataCite inng an d fo vidi dgfor Da hew S a con ion and rapi es for the as fr gnitle tent in area me ntlyto ci daion of ems bes llen pite co ope any data ingnize CORR ESPON * 1 2,5 2,3,4† 1*† M tentpo erve such t pra ges crea thems such as n-d ge be co y dup s reco ow istegin iplin urlicat that, des becopers there ted can ially supporting are it theyscdes most ope ata movement, ctice guidelin genomi s, providee ackn al to m artic hfast en re ssar s beion of ciesh cones that straint is l toles as erlicat than the cs growin due to data es. em the credss di s beunnece pein jour th sets attethmp way n roug cing ts must ntia agen sucse ability pub data of rvic redu and max CC0 waiver, data is also rele In promoting g at rate tenal arch , aped lyt receive rd,acaro nt her tion in formalotsyst ta ha ent One still be to stor rece much ndrenc da s pocutting ased und imizing r po . The rese ni refe rega tionducerslorare rnmon. strapro in. the, field munity sm pms.enIn this en d he andr ha venati arch dissemi any theioutput its pot data er the s. Ifpracectice me goals ofof these preciousmade to capture e and process it, BGI’s extensiv , data gi am data ha in effecoctiv sed re of theidar ta rese deral gomunity are cite estion ential re-u legal red deveurcefo re ee of or es e ot owledg and safe GigaScienc reso cita e com tape [4], th been pop e ine reso m proc fe comentifier ting thes itshacure mo ntific se. them n m esiehow perselvpro, on er data pa guard as causgooio y acknbe puting d ndata ntific seminat e journal urces as possibl ulat scie Several idnd creaelines arize nthem l us infrastr As GigaDB use tatioillustrat ion, and smple of ic Sum to vid at g so. 
Itcialso screpositories. to ns perl exa ch the ctspe ua Y it released ed with data ting e. to er e whi je . way id ntiv ucture, Wit tio c max ci in al AR , and tio tran l th m al ob s nd un h the gu ince erva citation di imize first d enti cita able fu ly M sets tu curate ra sparency, in a cita scie tal on ntifi dethe mmfor doin isara pot serv tione to Presof not form rmal data ck of inentiss SUM Releasin ble form produced by it has also omen its digi tiesto, w standard eent coired requ ing this leallfeof . por having som data reuse, disgrly pact ogy is a search cess la th l citaThe pro d ar ic sup g BGI, mu pre Biol attin OR’S nce of fo ining m curr re d data e data h im rese na aph g an -pu ewh IT tip ye re the og in Genalome of suc ting in this arch st ting Gig Mul on ere ch of ED atio n, thiogr w pres te an hola data to s and form rta is essentia ga roug r their indices. cess novel man blication. suppor ble tiobibl ta practica d l, and data and tools surrto host cing cesses to date, lack s.aDB (htt e pa slo om d dathe e sc sion of ta th impo recently s, DataCi ta, intern Data Ac identatifia rts litie subth cita of the gen the fo th p:// an or mis ner r, n lt fo es ) par giga hum the eve ct of da has resu iti Ef ke in ed The es oun ticu tio data db.o da ral sorg GigaSc er da how y se inves-ith had t the lintrat tion. d- bre ss TA rg) is key larly spu ta s as a ce dons mun only arch ortseve cition Again, w arch ting ta augmen ak (also from the dea rring the a number tacita sour tter asseitional to achievi ience database, linking Ma y an tifie ODA comics andncdem ns ors man Rese to supp ial le [7,8]. rscican em but is by rese tice of ci da dly 201 ta (C ssib ch ec tio om disc cro in syst &T re in ng da ty s ar iden auth add wds acce gen of usse ta this sp ing tex IS 1 E. coli this. 
n gy t ation pa ntctive ourng hintader on n g publically cied As d in Mik launch ac loto tivied tagg d the rese be be latio persiste 0104:H ergi has da of ns nocan 12 AS ng reas citatio e the pr e Schatz’ Acgniz iate effe use reco “open-sou issue [5]) resu ta tra chting tia in r da ic an tices rela e bein form to dals seen in mpi s commen 4 outot prec the 20 fuel versally Inbe stro lity. ntifGig andr depnsositisedemdata d Te icle l iner fo Gen coom tedent scan d[10] of a uni prom aSc ground rce genomics” lting in what eabi Prac an repig ne to Rec ns, ap Scie Human and nel at Despite cultura ience’s e an aticin tio itps[9]. rela rest theita a ane on has bee tary in ]. enomic and mec [6]. For the ands pipe a pa dem rs combi institutio d manag ned from the community [1], n. tors to ta gcidue ted inhmed ienc data cred D mak enc n term il fo raw Inte plea n [1 first issue, and data rds tatio pervasive growing to adtiga r Scetec eanee ed d sion ity an rc d to C daivin worksho cus ons lear elegans lable prior ro ou olde utio lineta[2], in add a research art- sear se see our rece hanisms surrounmore on the Disize rece t longfound n-andaavailabl in ta ci maiSt es gn t the l this to and da plickey less e from the C. B - from the backnt corresp daBI akeh R ldthat wenDatahighligh the ding data ition to ch Not rest data ofsim attribe dintoNC ly avai n all thedsup on da s an onnaand But of st from reco One the on also rd on and free the field of gen Intefalsified ve he ee[11] to adng a pag vali cita natio tepig ting data [SRP005934], having tion seri es Data Sharing ondence [7] itatio ntive ptance. asize n an repor eno mp riety late erdati e to atte itt ject, taki ing data broadly phPro Boa es, usin in the BM tion, aC poses of tio ince es hapsychomlogy also has Gig , Standar (totalin able to pipeline a va the micsestrac able spur -Intnot ce com c. 
at ta em e tried enciand d is e valu t for ie g hav m ks D ci ac pi ly g mak the le om an s C ur dly [3], data TA iti diza 84 us and aDB One m d al Ressibde ct release [3] GB), suc A . up on hosted the to ss to the nce and pub ngtoo project lpro was that da citetadfor Rules ag broa essure fro tation m and initi scieD strufoun .S. lyC acce of the sorgtion and Publica activ in Gig udyi e acceon in sma lication GigaDB genomics the Bermuda the tion nwas l Acatrust in infra e-he Ueasi na e CO ing ci use TA st . Thistitlelsd created for h as pr Groplet20 pap ts hum rteronthroelyaDB in 11 ehoIden publica atioSub sequent in sk com io publicwithesthof gain po dale agreT in rethe with ing data data se tifie tain ODA 11 datanset is link the of these currently compris Genome Biology genome by t a stor C s r), as tiv ugh a 20 laid out s pro n Ta serv t Lauder nity has e Nat ss te form[2]. d issu atio ot is a hep of of lica vidiac the ominics su es over last yea addtion ugesuswith io I ed and ctice as ch citable tpub ng stab th the For which mu itio en DO at A pra n en , dre con ra ar in atoc se nal 30 r prom teristics com ory gr [8]. this be d tio sists (Dig ed rm ns in data ellu in ility osit bo ad rep discove in M er te ital lar of 15 Tb sets follomwotiva ely ens science as outlined collaDry ability s nt s hrin a co ac also hop affilaiate publi-ho larg prabilityanand d In, and most imp Object individuals. l Info or of normal carcinoma data . The largest mataspect er biological ksctly dire rece char ha to be lt in isjour the trac nicadatasets n ks ortantly practices, International chler and ulti ral The traceab nt kedtio wid esnal or orta a w . part of lythis resu these sam Additional data and tumor raw set [9], which in the regardienc ility thro o , wtion imp > Te w similar [4]. 
Thank you for your attention
DCC Website:
Monica Duke, Alex Ball:
8th International Digital Curation Conference: “Infrastructure, Intelligence, Innovation: driving the Data Science agenda”, 14–16 January 2013, Amsterdam