Crowdsourcing at MNHN: label transcription and beyond

ISTC 2015 Joensuu
A "conveyor belt" operation


Building renovation is an opportunity for reconditioning,
sorting and digitizing collections
A dedicated industrial site operated from 2008 to 2012
Results


5.3 million images online
Only a minimum set of attributes, shown here with one example specimen:

scientific name: Coffea arabica L.
family: Rubiaceae
catalog number: P04010591
folder color ≈ continent of origin: AME
Labels


Information in the images
Available without physical access
Optical Character Recognition


Tesseract, an open-source solution
Months of computing
Don’t expect too much from OCR

Only 4 specimens mention ‘Joensuu’

P03653443, P04903872, P01721163, P04307778
Zooming in on the specimens "from Joensuu"
Far from Joensuu
So what use is OCR?

As a filter for human-driven processes


Reflora program
Crowdsourcing
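
The idea of using OCR only as a filter can be sketched as follows. This is a minimal illustration, not the pipeline actually used at MNHN: the function name `ocr_candidates`, the catalog numbers and the OCR strings are all made up for the example. Noisy OCR text is searched for a keyword, and the matching specimens are queued for human review rather than trusted as data.

```python
import re


def ocr_candidates(ocr_texts, keyword):
    """Return catalog numbers whose OCR text mentions `keyword` (case-insensitive).

    `ocr_texts` maps a catalog number to the raw OCR output for that sheet.
    The result is only a list of candidates for people to check.
    """
    pattern = re.compile(re.escape(keyword), re.IGNORECASE)
    return sorted(cat for cat, text in ocr_texts.items() if pattern.search(text))


# Hypothetical OCR output keyed by invented catalog numbers.
sample = {
    "X0001": "Herb. Mus. Paris   Rio de Janeiro   leg. Glaziou",
    "X0002": "FLORA FENNICA   joensuu   1902",
    "X0003": "Madagascar   Perrier de la Bathie",
}

queue = ocr_candidates(sample, "Joensuu")
print(queue)
```

A keyword hit says nothing about where the plant was actually collected, which is why the result feeds a human-driven step instead of the database.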
Crowdsourcing label transcription

From December 2012 to January 2015
1,859 volunteers
99,160 specimens seen
1,292,722 contributions
Compare with 95,528 transcriptions made at the herbarium
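
As a quick worked check on the figures quoted above, the averages they imply can be computed directly. The slides do not define what counts as one "contribution" (it may be a single field rather than a whole label), so the ratios below are only indicative.

```python
# Figures quoted on the slide (December 2012 to January 2015).
volunteers = 1859
specimens_seen = 99_160
contributions = 1_292_722

# Average number of contributions per specimen and per volunteer.
per_specimen = contributions / specimens_seen
per_volunteer = contributions / volunteers

print(f"{per_specimen:.1f} contributions per specimen seen")
print(f"{per_volunteer:.0f} contributions per volunteer on average")
```

That is roughly 13 contributions per specimen seen and roughly 695 per volunteer on average.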
Answers to be found

Where collected?
Contributing is exploring the world

Who collected it?
Contributing is meeting great people

When collected?
Contributing is travelling in time

What is it?
For the time being, not for the general public …
Propose tasks you can complete
Make it fun
Train the crowd
Ensuring quality by redundancy
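
One common way to get quality from redundancy is to ask several volunteers the same question and accept a value only when enough of them agree. The sketch below shows that idea with a simple majority vote. The slides do not state the rule actually used on the site, so the function name `consensus`, the normalization and the `min_agree` threshold are assumptions made for illustration.

```python
from collections import Counter


def consensus(answers, min_agree=2):
    """Return the most frequent normalized answer, or None without agreement.

    Answers are compared after collapsing whitespace and ignoring case.
    `min_agree` is an assumed threshold, not a documented setting.
    """
    normalized = [" ".join(a.split()).casefold() for a in answers if a.strip()]
    if not normalized:
        return None
    value, count = Counter(normalized).most_common(1)[0]
    return value if count >= min_agree else None


# Hypothetical answers from three volunteers to "Where collected?"
print(consensus(["Brazil", " brazil", "Brasil"]))
# Two volunteers who disagree: no value is accepted yet.
print(consensus(["Brazil", "Peru"]))
```

A higher `min_agree` costs more volunteer time per specimen but lowers the chance of accepting a wrong value.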
A social website
Don’t forget science
A lasting keen interest
[Chart: number of contributions (NbContrib), vertical axis from 0 to 90000]
Heavy contributors

50% of contributions come from a minority of 10 people
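
A concentration figure like this can be computed from per-volunteer counts as the share of all contributions made by the k most active people. The sketch below uses invented counts and a hypothetical helper `top_share`. It is not the project's data or code.

```python
def top_share(contributions, k):
    """Fraction of all contributions made by the k most active volunteers.

    `contributions` maps a volunteer identifier to a contribution count.
    """
    counts = sorted(contributions.values(), reverse=True)
    total = sum(counts)
    return sum(counts[:k]) / total if total else 0.0


# Invented counts for four volunteers, for illustration only.
example = {"a": 50, "b": 30, "c": 10, "d": 10}
print(top_share(example, 1))
```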
Good overall quality

The first missions included specimens that were already databased

Much more geolocation data
More accurate location descriptions
Occasional differences in collector name spelling
Stricter vocabulary for geography

Quality is better than expected

Too much redundancy?
Any questions?

Marc Pignal: pignal @ mnhn.fr
Simon Chagnoux: chagnoux @ mnhn.fr

http://www.webdoc-herbier.com/#!91
http://lesherbonautes.mnhn.fr
@LesHerbonautes
What’s next?
Crowdsourcing quality control

Another massive digitization program for "small" herbariums
  2.3 million images
  10,000/day

Checking
  Focus
  Scientific name
  Alignment
  Barcode label

Recruiting the LesHerbonautes community
Sister websites

The code is now open
  https://forge.mnhn.fr/svn/mnhn/dev/lesrecolteurs/tags/herbonautes_web/public/
  Login opensource/forall

Different websites for different linguistic communities
  Spain
  Portugal
  Sharing code
  Sharing "missions"
A global label transcription dashboard





Atlas of Living Australia
Notes from Nature
LesHerbonautes
Herbaria@Home
…