Production Manual

Project Acronym: OpenUp!
Grant Agreement No: 270890
Project Title: Opening up the Natural History Heritage for Europeana
C2.4.3 - OAI-PMH Interface final version - Production manual
Revision: Version 1
Authors: Astrid Höller (AIT Forschungsgesellschaft mbH), Gerda Koch (AIT Forschungsgesellschaft mbH), Odo Benda (AIT Forschungsgesellschaft mbH)
Project co-funded by the European Commission within the ICT Policy Support Programme
Dissemination Level:
P - Public
C - Confidential, only for members of the consortium and the Commission Services
Revision History
Draft, 17.6.2013, A. Höller (AIT): First Version (Draft)
Draft, 19.08.2013, A. Höller (AIT): Adding “The harvest does not start and/or there is no progress in the console window”
Draft, 20.08.2013, A. Höller (AIT): Adding “Quick Guide”
Draft, 28.11.2013, A. Höller (AIT): Adding “The gbif log file can not be opened (permission denied)”
Draft, 12.12.2013, A. Höller, O. Benda (AIT): Revision of Parameters and EDM
Version 1, 28.02.2014, A. Höller, G. Koch (AIT): Final version
Statement of Originality
This deliverable contains original unpublished work except where clearly
indicated otherwise. Acknowledgement of previously published material and of
the work of others has been made through appropriate citation, quotation or
both.
Distribution
Recipient
Date
Version
Accepted YES/NO
Table of Contents
Description of Work
The GBIF Harvesting and Indexing Toolkit (HIT)
    User Interface of HIT
    Adding a new bioDatasource and harvesting it
Pentaho Kettle (Data Transformation)
    Databases
    Creating a folder structure
    01-transform
    02-validate
    03-oai-import
The OAI-Provider
    Logging in
    Adding a new collection
    Advanced search
    Browse
Error handling
    Harvesting
        The bio datasource can not be saved
        After harvesting the metadata updater no operator is created
        The harvest does not start and/or there is no progress in the console window
        No inventoried list is created
        No name ranges file is created
        Why are less than 100 % of the target records harvested?
        Why are more than 100 % of the target records harvested?
        Error in the response document
        Installing the BioCASe Provider on IIS Server
        The gbif log file can not be opened (permission denied)
        An error concerning the title of the datasource occurs
    Transforming
        An error occurs when trying to execute a transformation
    OAI-Import
        The Transformation stops during the import
        The imported collection can not be found on the OAI-Provider platform
I. Quick Guide
    GBIF-HIT Harvester
        Existing data source
        New data source
    Pentaho
        Existing data source
        New data source
    OAI-Provider
        Existing data source
        New data source
    Conclusion
II. List of Figures
Description of Work
This document illustrates the complete procedure of harvesting, transforming and uploading data during the OpenUp!
project. This includes harvesting datasources from the data provider BioCASe with the GBIF Harvesting and Indexing
Toolkit (HIT), transforming the harvested ABCD records with Pentaho Kettle and finally uploading the created ESE
records on the OAI-Provider-platform with the Zebra information management system.
Figure 1 gives an overview of the whole process. As can be seen the data has to pass six steps before it is
finally delivered to Europeana.
1. Message from the Data Provider that a new datasource is available
2. Harvesting the datasource with the GBIF-HIT Harvester
3. Transforming the ABCD files into ESE records with Pentaho Kettle (Data Transformation)
4. Informing the OpenUp! Meta Data Management that the data is transformed (the Meta Data Management informs us if the data was correct)
5. Uploading the records on the OAI-Provider-platform
6. Delivering the data to Europeana
Figure 1 Diagram showing the infrastructure of the OpenUp! project with its main steps
In the next chapters an example datasource will be processed step by step. In this document we are concentrating on
the three Action steps (compare Figure 1): The HIT Harvester (step 2), Pentaho Kettle (step 3) and the OAI-PMH-Service (step 5).
The GBIF Harvesting and Indexing Toolkit (HIT)
User Interface of HIT [1]
When going to http://ait117:8080/hit/ the following window can be seen (see Figure 2).
Figure 2 Logging in the GBIF Harvesting and Indexing Toolkit (HIT)
After logging in (in the upper right corner) the interface of the HIT Harvester can be seen (see Figure 3).
[1] http://code.google.com/p/gbif-indexingtoolkit/wiki/UserManual (accessed 17 July 2013)
Figure 3 The HIT user interface
There are five main sections: Datasources, Jobs, Console, Registry and Report. The tab used at the moment is
always green (like Datasources in Figure 3), the others are grey.
In Datasources all datasources that are available can be seen. The orange datasources are metadata
updaters, the green ones operators. For both one of the following protocols can be chosen: DIGIR, BioCASE,
TAPIR or DwC Archive.
PLEASE NOTE: In this project ONLY the BioCASE protocol is used.
An operator is only created when a datasource has been created and the metadata updater has been
successfully harvested. This case will be described in the next chapter.
When clicking on the Jobs tab all jobs that have been started or jobs that are waiting for execution can be
seen. The jobs are listed with their Job ID, name, description, their creation and their starting date (see
Figure 4). If a job shall be stopped, its ID can be filled in and the “kill” button pressed. It is
also possible to check the “all” option or to reschedule a job.
Figure 4 The Jobs section
When a Job has been started its progress can be watched in the Console section. Every few seconds the log
messages of the application are being refreshed with date and time (see Figure 5).
Figure 5 The Console section with the Log Event List
The Registry tab is used to synchronise with the GBIF Registry. Before clicking on “schedule” the datasources
can be filtered by endorsing Node or organisation name (see Figure 6).
Figure 6 The Registry tab
Finally a report in the Report section can be written or generated (see Figure 7). Again there are different
options for filtering the result.
Figure 7 Writing or generating a report
Adding a new bioDatasource and harvesting it
First of all click on “add bioDatasource” in the lower right corner (see Figure 8).
Figure 8 Clicking on “add bioDatasource”
After doing this the datasource has to be configured (see Figure 9). The name of the bioDatasource, the
name of the provider, the URL and the factory class need to be filled in. It is very important to choose
BioCASe in the drop-down menu of “Factory class”. Typing in the name of the country is optional. In this
example it is done.
When everything has been filled in correctly click on “save” and the datasource should now appear in orange
in the datasource list (see Figure 10).
Figure 9 Adding a new datasource
Figure 10 The newly added datasource “Sahlberg”
PLEASE NOTE: It is easier to find the newly created datasource by clicking on “Recently Added” at the top
line.
Now tick the box in front of the datasource “Sahlberg” to select this metadata updater. Then click on
“schedule”. When switching to the Jobs tab the two Jobs can be seen waiting to be executed:
“issueMetadate” and “scheduleSynchronisation” (see Figure 11).
Figure 11 Job list after scheduling the metadata updater for “Sahlberg”
When switching to the Console tab (see Figure 12 ) not only the progress of the Jobs can be seen but also
error messages if something is missing (marked red).
When the Jobs are finished they no longer appear in the Job list. When going to Datasources
again you can see that a “Sahlberg” operator has been created (see Figure 13).
Figure 12 The Log Event List after scheduling the metadata updater “Sahlberg”
Figure 13 The newly created operator “Sahlberg – Sahlberg”
Now it is time to gather records from the data provider. To achieve this select the (green) operator “Sahlberg
– Sahlberg” (tick the box) and click on “schedule”. Right after this there should be six Jobs in the list:
Inventory, processInventoried, search, processHarvested, synchronise and extract (see Figure 14). The order
of these operations is essential for a correct harvesting process.
Figure 14 The Job list after scheduling the operator “Sahlberg – Sahlberg”
During the Inventory operation a list of all scientific names occurring in the datasource is generated. One can
follow this process in the Console section (see Figure 15).
Figure 15 Console section during the Inventory operation
As can be seen in Figure 16 an inventory_request and an inventory_response (see Figure 17) are created. Both
are saved in /opt/hit (…) – the harvest directory determined during the HIT installation.
Figure 16 The inventory_request of the Inventory operator
Figure 17 The inventory_response of the Inventory operator
Figure 18 shows the result of the processInventoried operation: the text document inventoried.txt with an
alphabetical list of all scientific names.
Figure 18 Alphabetical list of all scientific names
Another document containing all the name ranges that were constructed is created too: nameRanges.txt
(see Figure 19).
Figure 19 The nameRanges.txt document
After this it is time for the search operation. In this phase the ABCD records that will later be transformed
with Pentaho Kettle are created. There is always a search_request and a search_response (see Figure 20).
Figure 20 The search operation creates search_requests and search_responses
If the response was encoded using ABCD, there is one core file after the processHarvested operation:
unit_records.txt (see Figure 21). It contains a header line with column names, with each line representing a
single Unit (record) element.
Figure 21 The unit_records.txt file
In addition six files all relating back to the core file are created during this process:
- image_records.txt - a text file containing a header line with column names, with each line representing a multimedia record relating to a given Unit (record) element.
- identifier_records.txt - a text file containing a header line with column names, with each line representing an identifier record (i.e. GUID) relating to a given Unit (record).
- identification_records.txt - a text file containing a header line with column names, with each line representing an Identification element relating to a given Unit (record) element.
- higher_taxon_records.txt - a text file containing a header line with column names, with each line representing higher taxon elements relating to some Unit (record) element.
- link_records.txt - a text file containing a header line with column names, with each line representing a link record (i.e. URL) relating to a given Unit (record) element.
- typification_records.txt - a text file containing a header line with column names, with each line representing a typification record (i.e. type status) relating to a given Unit (record) element.
Finally there are the synchronisation and the extraction operations. During the synchronisation the data is
updated and old data is deleted (see Figure 22).
Figure 22 The synchronisation and the extraction operations in the Console section
The extraction operation creates the ABCD records as search_responses with consecutive numbers in .gz
format (see Figure 23).
Figure 23 The result of the extraction process
When there are no more Jobs in the Job list the Harvesting process with HIT is finished. A folder structure
with the root directory /opt/hit/ and the search_responses should have been created (compare Figure 23).
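If a quick check on the command line is preferred, the presence of the extracted responses can be verified as follows. This is only a sketch; the exact sub-directory layout below /opt/hit depends on the local HIT installation and on the datasource.

    # List the extracted search responses produced by the harvest
    find /opt/hit -name "search_response*.gz" | sort | head
    # Peek into the first compressed ABCD response that was found
    zcat "$(find /opt/hit -name 'search_response*.gz' | head -n 1)" | head -n 20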
Pentaho Kettle (Data Transformation)
Pentaho Data Integration (PDI, also called Kettle) is the component of Pentaho responsible for the Extract,
Transform and Load (ETL) processes. [2]
Pentaho Kettle is used to transform the ABCD records into correct ESE files. The complete process in Pentaho
is categorized in three steps:
1. transform
2. validate
3. oai-import
This structure is also represented in the Pentaho repository (see Figure 24).
Figure 24 Repository structure in Pentaho
Before starting Jobs and Transformations it is useful to understand the database structure behind Pentaho.
Databases
Figure 25 shows the database “etl” with the four tables “Biocase_Harvest_to_ESE”,
“Biocase_Harvest_to_ESE_result”, “Biocase_Harvest_to_ESE_tasks” and “BGBM_Media_URLS”. The fields for
each table are listed in Figure 25.
All the Jobs in Pentaho are based on the table “Biocase_Harvest_to_ESE”. The Job parameters need to be
adapted before starting to transform the data. These parameters are all saved in “Biocase_Harvest_to_ESE”.
All the correctly transformed ESE records are saved in the table “Biocase_Harvest_to_ESE_result”. It contains the
transformation results.
In the table “Biocase_Harvest_to_ESE_tasks” all tasks per Job (transform, validate, oai-import) are saved. It
also shows the error messages if something goes wrong during the transformation.
Finally there is the table “BGBM_Media_URLs” where all media data sources (images) are saved. It has the
function of a lookup table.
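For a quick look outside of Pentaho and phpMyAdmin these tables can also be queried directly. The following sketch assumes a MySQL server and a read-only user called etl_user; the ordering column id is likewise an assumption and may be named differently in the local schema.

    # Show the Job parameters kept for each datasource (user name is a placeholder)
    mysql -u etl_user -p etl -e 'SELECT * FROM Biocase_Harvest_to_ESE LIMIT 5;'
    # Show the most recent task entries, including error messages (ordering column is an assumption)
    mysql -u etl_user -p etl -e 'SELECT * FROM Biocase_Harvest_to_ESE_tasks ORDER BY id DESC LIMIT 20;'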
[2] http://wiki.pentaho.com/display/EAI/Pentaho+Data+Integration+%28Kettle%29+Tutorial (accessed 17 July 2013)
Figure 25 Structure of the “etl” database with four tables
Creating a folder structure
The three main folders have sub-folders representing the countries from the content providers. When a new
datasource has been harvested a hierarchically correct folder has to be created in each of the three main
folders.
Remember our example datasource we have harvested before – “Sahlberg” from the University of Finland.
First of all create a new country folder named “Finland” in the main folders “transform”, “validate” and
“oaiimport” – if there is not already one (see Figure 26).
Figure 26 Creating a “Finland” folder in every category
As can be seen there are already a few countries in every category. The Transformations or Jobs never
change, no matter which datasource is processed. So an existing Job can be copied and saved as a new one.
The only things that must be adapted before starting the Jobs are the Job Parameters.
Before that three Jobs must be created – one for every category. The names of the Jobs are consistent. The
“transform”-Job is named after the collection name (see Figure 27 for our example “Sahlberg”).
Figure 27 Two Finnish Jobs in the transform category
In the “validate” folder the Jobs are named after the collection plus the word “validate”. Between the
collection name and “validate” you have to type the symbol “#” (see Figure 28).
Important: The three Jobs for one datasource MUST have the same name (the collection name). Everything
after the “#” symbol is ignored by the system.
Figure 28 The Finnish Jobs in the validate category
Now one Job is missing for the oai-import. Again the name of the Job is the collection name followed
by “# oai import” (see Figure 29).
Figure 29 The Finnish Jobs in the oaiimport category
01-transform
In the “Sahlberg” example the Job in the “transform” directory looks like shown in Figure 30.
Figure 30 The Job “Sahlberg” in the transform category
The Job parameters can be opened by double-clicking on the orange Job icon in the middle and switching to
the last tab called “Parameters” (see Figure 31). Figure 31 shows the parameters for a Transformation in ESE.
Figure 31 Parameters for Transformation of “Sahlberg” in ESE
In Figure 31 the Parameters are already filled in correctly. First of all it has to be defined whether the
collection is “Restricted” or “Unrestricted” (Parameter number 4). This is done by typing Y (for Yes, it is
restricted) or N (for No, it is not restricted = unrestricted) in the correct value field.
Parameter number 5 is the collection identifier. It always has the same pattern:
COLLECTION_NAME:CONTENT_PROVIDER:COUNTRY
PLEASE NOTE: For the collection identifier only capital letters are used.
As can be seen in Figure 31 the collection identifier of the example collection “Sahlberg” is
SAHLBERG:UH:FINLAND. This collection identifier is also added on the OAI-Provider platform (see Adding a
new collection).
Parameter number 6 shows the base directory /opt/hit defined in the installation process of the HIT
harvester.
The “dataset_name” and the “dataset_uddi_key” (Parameter 7 and 8) are taken from the SQL database
“Biocase_Harvest_to_ESE” (see Figure 32, compare Figure 25).
Figure 32 The columns “dataset_name” and “dataset_uddi_key” in “Biocase_Harvest_to_ESE”
Parameter 12 is the variable ${Internal.Job.Name}. Therefore it is important that the three Jobs for one
collection have the same name. The Parameter “idzebra_dir” is the zebra directory. Parameter 20 indicates
whether the Transformation is done in ESE or EDM (compare Figure 33).
Figure 33 Parameters for Transformation of “Sahlberg” in EDM
In Figure 33 the differences to the ESE parameters have been highlighted. First the idzebra directory is
adapted to the “oai-provider-edm”. Second, the “vocabulary_service_uri” parameter is set to the URL
http://ait117:8080/Vocabulary/rest/~Mapping/NHMW_common_name/perform instead of “no”. Finally the
“EDM” parameter is switched to “Y” for yes.
When everything has been filled in correctly click “OK”. The Job is started by clicking on the “Play” symbol
(see Figure 34) and then on “Launch” (see Figure 35).
Figure 34 Starting the Job “Sahlberg”
Figure 35 Launching a Job
The “validate” Job must not be started before the “transform” Job is finished. It is very important to keep the
order 01-transform, 02-validate, 03-oaiimport.
The result of this first Job is a set of XML files in ABCD format in the folder “extracted” (see Figure 36).
Figure 36 ABCD records in the folder “extracted” after running the “Sahlberg” Job
02-validate
When the first Job is finished the Job “Sahlberg # validate” can be opened and started (see Figure 37).
Figure 37 The Job “Sahlberg # validate”
This Job simulates ESE validation by copying the records into the “ESEvalidated” directory.
PLEASE NOTE: The “validate” function is not used at the moment.
Before the oai-import is started – which can be quite time-consuming with link validation – the transformed
records can be checked under http://ait117/analyse/tasks.php. This application shows the executed
Pentaho Jobs (see Figure 38).
Figure 38 Analysis of Pentaho Jobs
When clicking on the “EA” link in the column “Job_ID” an error analysis is done (see Figure 39).
Figure 39 Error analysis of the “Sahlberg” transformation
Another possibility to check the data is phpMyAdmin. In the table “Biocase_Harvest_to_ESE_result”
every transformed record is listed (see Figure 40).
Figure 40 Controlling the transformed data with phpMyAdmin
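The number of transformed records can also be counted with a direct query instead of browsing the table; the user name below is a placeholder.

    # Count the transformed records in the result table
    mysql -u etl_user -p etl -e 'SELECT COUNT(*) FROM Biocase_Harvest_to_ESE_result;'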
03-oai-import
Finally open the Job “Sahlberg # oai import” in the 03-oaiimport directory (see Figure 41). Start this Job after
the “validate” Job is finished.
Figure 41 The Job “Sahlberg # oai import”
When this Job is finished the work with Pentaho Kettle is complete. Correct ESE records should now have
been created that can be checked on the OAI-Provider-platform.
The OAI-Provider
The OAI-Provider platform can be reached by typing http://ait117/oai-provider/index.php into the internet
browser. The ESE records that have been uploaded with the Pentaho Job “Sahlberg # oai import” can be
seen there.
PLEASE NOTE: It is exactly the same with the OAI-Provider-EDM which can be opened by typing
http://ait117/oai-provider-edm/index.php in the browser.
Logging in
To log in click on one of the “Login” options shown in Figure 42.
Figure 42 Logging in
When clicking on one of these links the following window appears (see Figure 43).
Figure 43 Login window
When “User account” and “Password” have been typed in there is the possibility to check the option
“Remember me” to avoid logging in every time the OAI-Provider platform is opened.
When finally clicking on “Login” the following message indicates that the login was successful (see Figure 44).
Figure 44 “You are now logged in as admin”
Adding a new collection
The OAI-Provider platform has an “Admin Area” (see Figure 45).
Figure 45 Entering the Admin Area
When clicking on “Admin Area” the following window opens (see Figure 46).
Figure 46 The Admin Area
To add a new collection select the second icon “Collections” (compare Figure 46). The option “Edit
Collection” appears (see Figure 47).
Figure 47 “Edit Collections”
When clicking on “Edit Collections” an alphabetical list of all collections can be seen (see Figure 48). Every
collection identifier consists of three parts and every part has to be created separately.
Figure 48 List Collections
In the top right corner (compare Figure 48) there is an icon for adding a new collection (the left one). When
clicking on this symbol the following form can be seen (see Figure 49).
Figure 49 Adding a new collection
The only field used is “Collection Identifier” on top. For the example SAHLBERG:UH:FINLAND three “new”
collections are created. The first one with the collection identifier SAHLBERG; the second one with
SAHLBERG:UH and the third one with SAHLBERG:UH:FINLAND. Every collection is saved separately with the
disc symbol in the top right corner (compare Figure 49).
When this is done a hierarchical structure has been created (see
Figure 50).
Figure 50 Newly added collection Sahlberg
To have a look at the records newly added via Pentaho, the “Advanced search” or the “Browse” function
can be used.
Advanced search
The “Advanced search” is started by clicking on the link on top of the page (see Figure 51).
Figure 51 Starting the “Advanced Search”
The query can be simply typed in the search box and started by clicking on “Go” (see Figure 52). Furthermore
it can be defined in which field the search term should appear (see Figure 53).
Figure 52 Searching for “Sahlberg”
Figure 53 Using the “in the field” search option
If help is needed during the search the “Lookup” function can be used (see Figure 54).
Figure 54 Looking up titles of the collection “Sahlberg”
When the query is finished the “Go” button can be clicked and the results are listed (see Figure 55).
Figure 55 Result of “Advanced Search”
When clicking on one of the result records the ESE record with the different fields can be seen (see Figure
56). When switching to the “Info” tab the collection information can be controlled (see Figure 57).
Figure 56 Displaying the ESE record
Figure 57 Displaying the collection information
Browse
The “Browse” function can be used to find records as well. As can be seen in Figure 58 the records can be
browsed by Europeana Data Provider, Collection, Europeana Type, Europeana Rights and OAI published. In
brackets the number of records is shown.
Figure 58 Browsing the records
To check which records of “Sahlberg” are valid two queries are combined in the “Advanced Search”. First the
name of the collection is chosen. Additionally the records must be “OAI published” (see Figure 59).
Figure 59 Looking for valid records
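The published records can also be checked over the OAI-PMH interface itself, for example with curl. The endpoint path and the metadata prefix used below are only assumptions and have to be replaced by the values of the actual provider installation.

    # Identify the repository and list the available sets (endpoint path is an assumption)
    curl 'http://ait117/oai-provider/oai/?verb=Identify'
    curl 'http://ait117/oai-provider/oai/?verb=ListSets'
    # List record identifiers for the ESE metadata format (metadata prefix is an assumption)
    curl 'http://ait117/oai-provider/oai/?verb=ListIdentifiers&metadataPrefix=ese'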
Error handling
The aim of this chapter is to show possible error scenarios that can occur during the OpenUp! process (see
Figure 1 for an overview of the process). In this section the process is divided into the three parts Harvesting,
Transforming, and OAI-import. On the next pages potential errors and solutions are illustrated.
Harvesting
The bio datasource can not be saved
When one tries to edit an already existing datasource and save it afterwards, the error message "Invalid
field value for field 'bioDatasource.lastHarvested'." could appear (see Figure 60).
Figure 60 Error message when editing a bio datasource
In this case the bio datasource needs to be deleted and a new one has to be created.
After harvesting the metadata updater no operator is created
After creating a new bio datasource the metadata updater (orange colour) has to be harvested. Only then
is an operator created (green colour) (see Figure 61, first and second line for example).
Figure 61 Metadata updaters and operators
First try clicking on “Recently Added” to make sure that the bio datasource has not been created yet (see
Figure 62).
Figure 62 Clicking on “Recently Added”
When the datasource is not in the list there is most likely an error concerning the access point URL or the
BioCASE protocol. The access URL should be checked again to exclude spelling mistakes. When copying and
pasting the URL into a browser a BioCASE protocol should appear (see Figure 63).
Figure 63 BioCASE protocol
The BioCASE protocol should be checked again. If there is a mistake the data provider needs to be contacted.
Furthermore the HIT log file can be checked to find out why the operator has not been created.
The harvest does not start and/or there is no progress in the console window
When the console window does not change even though a Job has been started, first make sure the dynamic
view is active. If it is not, click on “switch to Dynamic View” (see Figure 64).
Figure 64 Switching to Dynamic View
If this does not help, the Tomcat server may need a restart. This is done by typing “sudo service tomcat6
restart” into a terminal window (see Figure 65). This may take a few seconds.
Figure 65 Restarting the tomcat server
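On the command line this looks as follows; the status call is only an optional sanity check and its output may differ between distributions.

    # Restart the Tomcat instance that runs the HIT web application
    sudo service tomcat6 restart
    # Optional check that the service came back up
    sudo service tomcat6 status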
No inventoried list is created [3]
Often the reason no inventoried list could be generated is that the inventory response was empty. From
the "Console" tab, the XML requests and responses can be checked directly from the browser.
The integrity of the inventoried list is paramount to the success of subsequent harvesting operations. Ideally
the list of scientific names in the inventoried file will contain no duplicates, and arrange the scientific names
alphabetically. If the list does not have these characteristics, double check the inventory response(s) to
ensure that the names are in fact returned in order.
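Both properties can be checked quickly with standard command line tools, for example (run inside the harvest directory of the datasource):

    # No output from "sort -c" means inventoried.txt is already in sorted order
    sort -c inventoried.txt
    # Print any scientific names that occur more than once
    sort inventoried.txt | uniq -d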
No name ranges file is created [4]
The only reason that the name ranges file couldn't be generated is if the inventoried list of scientific names
was empty, or all scientific names were invalid. Note that scientific names containing SQL breaking
characters such as "&" are still included, but the breaking characters are replaced automatically. Therefore
at this level there is a data quality check on scientific names, and any errors are outputted as log messages to
the "Console".
Often the reason a harvest does not retrieve 100% of a dataset/resource's records is that not all records are
covered by the name ranges that have been generated. From the expanded BioDatasource in the
BioDatasources list in the "Datasources" tab, you can view the name ranges file directly from within the
browser. Compare this file against the inventoried.txt file for any inconsistencies.
Why are less than 100 % of the target records harvested? [5]
There are different reasons why records are dropped (see Figure 66).
[3] https://code.google.com/p/gbif-indexingtoolkit/wiki/UserManual#5.5.1.1_Inventory (accessed 20 August 2013)
[4] https://code.google.com/p/gbif-indexingtoolkit/wiki/UserManual#5.5.1.2_Process_inventoried (accessed 20 August 2013)
[5] http://code.google.com/p/gbif-indexingtoolkit/wiki/FAQ#Why_do_I_harvest_LESS_than_100%_of_the_target_records? (accessed 19 July 2013)
Figure 66 Dropped records are shown in red
It could be because of a problem constructing the name ranges file. To ensure that the proper name ranges
were constructed the name ranges file should be examined for any peculiarities.
There could also have been a problem parsing some of the XML responses. The logs can be examined to see
that there were no parsing errors. Moreover check the actual search response(s), to see that they in fact
contain records and that these correspond to the appropriate name range.
Why are more than 100 % of the target records harvested? [6]
One source of inflated record count in Darwin-Core archives can be illegal line terminating characters (see
Figure 67). A record containing such a character would break in two and appear to the parser as two lines
with an insufficient number of columns. Consequently these two lines would be replaced by blank lines but
still appear in the record count turning a single line into two. One could search for lines containing line
terminating characters ”inside” the records and remove these.
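A possible way to locate such lines with standard command line tools is sketched below; the file name occurrence.txt and the expected column count are placeholders that have to be adapted to the archive at hand.

    # Count lines that contain a stray carriage return character
    grep -c $'\r' occurrence.txt
    # Print tab-separated lines whose column count differs from the expected value (10 is a placeholder)
    awk -F'\t' 'NF != 10 {print NR": "NF" columns"}' occurrence.txt | head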
Figure 67 Additionally harvested records are shown in purple
[6] http://code.google.com/p/gbif-indexingtoolkit/wiki/FAQ#Why_do_I_harvest_MORE_than_100%_of_the_target_records? (accessed 19 July 2013)
Error in the response document
This can have a number of reasons: One of the table/column names that were set up in the configuration
does not exist (because it was renamed or removed) or the credentials used by the BPS do not have
sufficient privileges. Simply copy the SQL statement and execute it manually on the database with a regular
database client that will show you the detailed error message returned by the DBMS. [7]
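If the provider database happens to be MySQL, for example, the copied statement could be executed like this (user and database names are placeholders):

    # Run the SQL statement copied from the BioCASe log against the provider database
    mysql -u biocase_user -p provider_db -e '<paste the copied SQL statement here>'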
For each database you want to publish, you need to set up a BioCASe data source (do not mix that up with
ODBC data sources on Windows machines). The resulting BioCASe web service is uniquely identified by its
URL, which is a combination of the BioCASe installation’s URL and the name of the data source. So if you
made your installation available at http://www.foobar.org/biocase during the installation process and set
up a data source named Herbar, the URL of the BioCASe web service would be
http://www.foobar.org/biocase/pywrapper.cgi?dsa=Herbar (data source names can be case sensitive, depending
on your server's operating system). [8]
Figure 68 shows a possible error message:
Figure 68 SQL error in a response document
The most common reason that a search response is invalid is that it contains an XML breaking character.
When a name range representing 500 records fails, for example, it could be due to a single invalid record and
as a result the other 499 records do not get harvested. In an effort to harvest as many records as possible,
and help the user identify where the breaking characters are found, the system will break a request that fails
into several smaller requests. Keep a careful eye on the output log messages for which responses are invalid,
and provide feedback to the data publisher, which will help them improve the quality of their dataset. [9]
Installing the BioCASe Provider on IIS Server [10]
It is possible to run BioCASe on an IIS server (Microsoft HTTP server) but there are some important points to
remember:
1) On IIS version 7.5, the maximum length of the HTTP GET query string is limited to 2048 bytes,
while the length of the URL is limited to 4096. This may prevent the harvesting of a dataset. This problem
can be solved easily by updating the configuration of the IIS, but is hard to identify.
2) Some versions of the IIS disallow by default the submission of accented characters in HTTP GET queries,
especially if IIS is associated with one of the following services:
[7] http://wiki.bgbm.org/bps/index.php/Debugging (accessed 19 July 2013)
[8] http://wiki.bgbm.org/bps/index.php/DatasourceSetup (accessed 19 July 2013)
[9] https://code.google.com/p/gbif-indexingtoolkit/wiki/UserManual#5.5.1.3_Harvest (accessed 20 August 2013)
[10] http://open-up.cybertaxonomy.africamuseum.be/forum_topic/issues_when_installing_biocase_provider_iis_server (accessed 19 July 2013)
- Microsoft Exchange
- ISA Server
- Microsoft Forefront Threat Management Gateway
This setting may prevent the harvesting of your dataset as IIS generates an error message when a scientific
name with accented characters is being harvested, instead of publishing the data.
If the BPS provider is placed behind a Microsoft Exchange/ISA Server/Microsoft Forefront Threat
Management Gateway the problem can be solved by changing the following setting:
1. Start the ISA Server or Microsoft Forefront Threat Management Gateway, Medium Business Edition Management tool.
2. Expand ServerName, where ServerName is the name of your ISA Server or Microsoft Forefront Threat Management Gateway, Medium Business Edition computer.
3. Click Firewall Policy, click the Web publishing rule that you created to publish the Exchange Server computer for access by OWA users, and then click Edit Selected Rule.
4. Click the Traffic tab, click Filtering, and then click Configure HTTP.
5. Click to clear the Block high-bit characters check box, and then click OK two times.
6. Click Apply to update the firewall policy, and then click OK.
The gbif log file can not be opened (permission denied).
The error message can be seen in Figure 69.
Figure 69 Could not open gbif log event file
When this error occurs the rights for the directory concerned need to be changed.
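A typical fix, assuming the HIT application runs as the tomcat6 user and the directory in question lies below /opt/hit (both assumptions that have to be verified locally), would be:

    # Hand the harvest/log directory to the user running Tomcat
    sudo chown -R tomcat6:tomcat6 /opt/hit
    # ...or make sure the directory is readable and writable for that user
    sudo chmod -R u+rwX /opt/hit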
An error concerning the title of the datasource occurs
Figure 70 shows the error message in the HIT log.
Figure 70 Error message concerning the datasource title
One possible cause of the problem is the use of the “&” symbol in
DataSets/DataSet/Metadata/Description/Representation/Title. It can be replaced by “and” or, if a shorter
version is needed, by “+”.
Transforming
An error occurs when trying to execute a transformation
During the execution of a transformation in Pentaho the output can be seen by clicking on “Logging” (see
Figure 71).
Figure 71 Logging section with output
Marked in Figure 71 is the error message “Check if archive exists and is harvested (result = [false])”. The
most likely reason is that the parameter values dataset_name or dataset_uddi_key of the transformation are
not correct (see Figure 72).
Figure 72 Adapting the parameters
The bio datasource table should be checked again to make sure there are no mistakes (see Figure 73).
Figure 73 Checking the table “biodatasource“
OAI-Import
The Transformation stops during the import
When a great number of records have been imported the zebra index can run into trouble. From time to time it
is therefore necessary to rebuild the zebra index. This is done in Pentaho with the Transformation “rebuild-idzebra-index” (see Figure 74) that can be found under OpenUp>>programs>>tools.
Figure 74 Opening the Transformation “rebuild-idzebra-index”
This Transformation is started like any other with the “Play” symbol and the “Launch” button (see Figure 75).
Figure 75 Rebuilding the zebra index
Rebuilding the zebra index takes at least a few hours.
The imported collection can not be found on the OAI-Provider platform
One reason could be that the unique identifier created on the platform differs from the one filled in via the
parameters in Pentaho. It is very important when creating a new collection to use the same ID in Pentaho
(compare 01-transform and Adding a new collection).
When creating the ID on the OAI-Provider platform each part of it has to be created separately (see Figure
76).
Figure 76 Creating the ID for a collection
That means, for example, the collections “BATS”, “BATS:ETI” and “BATS:ETI:NETHERLANDS” are created and
saved separately (compare Figure 76). In Pentaho the ID has to be filled in completely
(BATS:ETI:NETHERLANDS, see Figure 77).
Figure 77 Parameter “collection_name“ with the complete ID
I. Quick Guide
GBIF-HIT Harvester
Existing data source
1) Harvesting Metadata Updater (orange colour)
2) Harvesting the Operator (green colour)
estimated time: 5 minutes
New data source
1) Adding a new bioDatasource
2) Harvesting Metadata Updater (orange colour)
3) Harvesting the Operator (green colour)
estimated time: 10 minutes
Pentaho
Existing data source
1) Execute Pentaho transform-Job
2) Execute Pentaho validate-Job
3) Execute Pentaho import-Job
estimated time: 5 minutes
New data source
1) Creating a new directory
2) Creating Pentaho transform-Job
3) Filling in the parameters
4) Executing the transform-Job
5) Creating Pentaho validate-Job
6) Executing Pentaho validate-Job
7) Creating Pentaho import-Job
8) Executing Pentaho import-Job
estimated time: 15 minutes
OAI-Provider
Existing data source
1) Using the Advanced Search or Browse function to control the data
estimated time: 5 minutes
New data source
1) Creating a new collection in the Admin Area
2) Using the Advanced Search or Browse function to control the data
estimated time: 10 minutes
Conclusion
total time existing data source: 15 minutes
total time new data source: 35 minutes
II. List of Figures
Figure 1 Diagram showing the infrastructure of the OpenUp! project with its main steps .............. 1
Figure 2 Logging in the GBIF Harvesting and Indexing Toolkit (HIT) ........................................... 2
Figure 3 The HIT user interface ............................................................................................. 3
Figure 4 The Jobs section ..................................................................................................... 4
Figure 5 The Console section with the Log Event List ................................................................ 4
Figure 6 The Registry tab ..................................................................................................... 5
Figure 7 Writing or generating a report .................................................................................. 5
Figure 8 Clicking on “add bioDatasource” ................................................................................ 6
Figure 9 Adding a new datasource ......................................................................................... 6
Figure 10 The newly added datasource “Sahlberg” ................................................................... 7
Figure 11 Job list after scheduling the metadata updater for “Sahlberg” ..................................... 8
Figure 12 The Log Event List after scheduling the metadata updater “Sahlberg” .......................... 9
Figure 13 The newly created operator “Sahlberg – Sahlberg” .................................................... 9
Figure 14 The Job list after scheduling the operator “Sahlberg – Sahlberg” ............................... 10
Figure 15 Console section during the Inventory operation ....................................................... 10
Figure 16 The inventory_request of the Inventory operator..................................................... 11
Figure 17 The inventory_response of the Inventory operator................................................... 11
Figure 18 Alphabetical list of all scientific names .................................................................... 12
Figure 19 The nameRanges.txt document ............................................................................. 13
Figure 20 The search operation creates search_requests and search_responses ........................ 13
Figure 21 The unit_records.txt file ....................................................................................... 14
Figure 22 The synchronisation and the extraction operations in the Console section ................. 15
Figure 23 The result of the extraction process ....................................................................... 15
Figure 24 Repository structure in Pentaho ............................................................................. 16
Figure 25 Structure of the “etl” database with four tables ....................................................... 17
Figure 26 Creating a “Finland” folder in every category .......................................................... 18
Figure 27 Two Finnish Jobs in the transform category ............................................................ 18
Figure 28 The Finnish Jobs in the validate category ................................................................ 19
Figure 29 The Finnish Jobs in the oaiimport category ............................................................. 19
Figure 30 The Job “Sahlberg” in the transform category ......................................................... 19
Figure 31 Parameters for Transformation of “Sahlberg” in ESE ................................................ 20
Figure 32 The columns “dataset_name” and “dataset_uddi_key” in “Biocase_Harvest_to_ESE” ... 20
Figure 33 Parameters for Transformation of “Sahlberg” in EDM ............................................... 21
Figure 34 Starting the Job “Sahlberg” .................................................................................. 21
Figure 35 Launching a Job .................................................................................................. 22
Figure 36 ABCD records in the folder “extracted” after running the “Sahlberg” Job ..................... 22
Figure 37 The Job “Sahlberg # validate” ............................................................................... 23
Figure 38 Analysis of Pentaho Jobs ...................................................................................... 23
Figure 39 Error analysis of the “Sahlberg” transformation ....................................................... 24
Figure 40 Controlling the transformed data with phpMyAdmin ................................................. 24
Figure 41 The Job “Sahlberg # oai import”............................................................................ 25
Figure 42 Logging in .......................................................................................................... 25
Figure 43 Login window ...................................................................................................... 26
Figure 44 “You are now logged in as admin” ......................................................................... 26
Figure 45 Entering the Admin Area ...................................................................................... 26
Figure 46 The Admin Area .................................................................................................. 27
Figure 47 “Edit Collections” ................................................................................................. 27
Figure 48 List Collections .................................................................................................... 27
Figure 49 Adding a new collection ........................................................................................ 28
Figure 50 Newly added collection Sahlberg ........................................................................... 28
Figure 51 Starting the “Advanced Search” ............................................................................ 29
Figure 52 Searching for “Sahlberg” ...................................................................................... 29
Figure 53 Using the “in the field” search option ..................................................................... 29
Figure 54 Looking up titles of the collection “Sahlberg” ........................................................... 30
Figure 55 Result of “Advanced Search” ................................................................................. 30
Figure 56 Displaying the ESE record ..................................................................................... 31
Figure 57 Displaying the collection information ...................................................................... 31
Figure 58 Browsing the records ........................................................................................... 31
Figure 59 Looking for valid records ...................................................................................... 32
Figure 60 Error message when editing a bio datasource ......................................................... 33
Figure 61 Metadata updaters and operators .......................................................................... 34
Figure 62 Clicking on “Recently Added” ................................................................................ 34
Figure 63 BioCASE protocol ................................................................................................. 35
Figure 64 Switching to Dynamic View ................................................................................... 35
Figure 65 Restarting the tomcat server ................................................................................ 36
Figure 66 Dropped records are shown in red ......................................................................... 37
Figure 67 Additionally harvested records are shown in purple .................................................. 37
Figure 68 SQL error in a response document ......................................................................... 38
Figure 69 Could not open gbif log event file ........................................................... 39
Figure 70 Error message concerning the datasource title ........................................................ 40
Figure 71 Logging section with output .................................................................................. 40
Figure 72 Adapting the parameters ...................................................................................... 41
Figure 73 Checking the table “biodatasource“ ........................................................................ 41
Figure 74 Opening the Transformation “rebuild-idzebra-index” ................................................ 42
Figure 75 Rebuilding the zebra index ................................................................................... 42
Figure 76 Creating the ID for a collection .............................................................................. 43
Figure 77 Parameter “collection_name“ with the complete ID .................................................. 43