How To Deliver My Data?
Ronan Ysebaert
UMS RIATE
Draft
Abstract
This technical report describes what is expected from the different ESPON projects for the integration of their data.
Table of Contents
1. Introduction
2. The Key Indicators
2.1. Concepts behind the key indicators delivery
2.1.1. RULE 1 – LIMITED NUMBER OF INDICATORS
2.1.2. RULE 2 – INNOVATIVE INDICATORS
2.1.3. RULE 3 – HIGH LEVEL OF METADATA
2.1.4. RULE 4 – PROMOTE THE CORE DATABASE STRATEGY
2.1.5. RULE 5 – A GOOD COMPLETENESS OF THE INDICATOR
2.2. Key Indicators Delivery
2.2.1. XLS Template With Examples
2.2.2. ESPON Data and Metadata Specifications
2.2.3. Frequently Asked Questions (FAQ)
2.2.4. Collected Data Estimation Methods
2.3. The Data Delivery Process
2.3.1. Data and metadata upload
2.3.2. Syntactic check
2.3.3. Semantic check
2.3.4. Quality control
2.3.5. Integration into the database
2.4. M4D Support to ESPON TPGs
2.4.1. Beginning of the project: guidance phase
2.4.2. During the project: help for data creation
2.4.3. End of the project: help for integration
3. The Zoom-in Delivery
3.1. The Zoom-in Delivery Strategy
3.2. Expected delivery
3.2.1. Data file
3.2.2. Geometry file
3.2.3. Documentation file
3.3. What happens to my data? The zoom-in data integration process
3.4. Support to TPG producing zoom-in data
4. The Background Data of the Database
4.1. Strategy for the Background Data
4.2. Expected delivery
4.2.1. Indicator description
4.2.2. Source description
4.3. What happens to my Data?
5. Conclusion and Advice
5.1. Advice for a perfect management of the data process
5.2. A good practice for filling data and metadata
A. Data Flow Process of the Key Indicators
B. References
C. About
List of Figures
1.1. Three possibilities to deliver my data
2.1. On-line availability of the xls templates
2.2. Header of the ESPON Data and Metadata Specification
2.3. Excel Data Model for Priority 1 projects
2.4. Example of the Label field description
2.5. Header of the FAQ
2.6. Example of an estimation method
2.7. Dataset Integration Tracking Details
2.8. Syntactic check: example of an invalid input
2.9. Syntactic check: example of a valid input despite warnings
2.10. Example of a Semantic Check Report
2.11. Semantic Check Example: Input Information
2.12. References for a Semantic Check Expertise
2.13. Semantic Check Example: Fixed Information
3.1. Overview page of Case Studies
3.2. Information Page of a Case Study
3.3. Data Template for Zoom-in Indicators and Project Database
3.4. Data Model Example for Zoom-in projects
3.5. Example of a Case Study Geometries Input
3.6. Mapping of Geometries Codes in Data
3.7. Case Study Dataset Sheet
3.8. Case Study Indicator Sheet
4.1. Project Database Metadata Sheet
4.2. Example of a Background Data Page
5.1. Starting point: a table with empty values
5.2. Resulting dataset with estimated values and associated labels
5.3. Description of the label 1 in the metadata
5.4. Description of the label 13 in the metadata
5.5. Description of the label TE6b in the metadata
A.1. Data Flow Process: Upload and Syntactic Check
A.2. Data Flow Process: Semantic Check
A.3. Data Flow Process: Semantic Check Approval
A.4. Data Flow Process: Outliers Check
A.5. Data Flow Process: Outliers Check Approval
A.6. Data Flow Process: ESPON CU Agreement
A.7. Data Flow Process: Integration
List of Tables
2.1. TPGs' M4D Contact Team
Chapter 1. Introduction
The ESPON Database 1 project (2008-2011) encountered many difficulties in overcoming the
heterogeneity of the information provided by ESPON Projects (integration of local data, integration of
sophisticated indicators with little metadata description…). To improve this unsustainable
situation, the M4D Project has tried to define more precisely what is expected from ESPON Projects in terms
of data deliveries.
The ESPON P1, P2 and P3 Projects are obliged to deliver all data collected and produced within their
project. These data should be delivered in three forms:
• Key indicators, covering the entire ESPON Space;
• Zoom-in data, covering the case-studies;
• Background data, covering all data produced by the project.
As you can see in Figure 1.1, the dataflow depends on the nature of the delivery.
Figure 1.1. Three possibilities to deliver my data
This figure shows the three possibilities for data to be integrated into the database, depending on
their nature.
This document provides useful information on these three types of delivery:
• THE KEY INDICATORS, further described in the chapter entitled The Key Indicators
The key indicators are innovative indicators highly relevant for policy making and should cover the
entire ESPON Space (EU27+4). These indicators will be the only ones searchable from the query
interface.
The ESPON projects deliver in principle the indicators related to the maps included in Part B of the
(Draft) Final Report - around 10 indicators. In case a typology or composite indicator is included,
the data and methodology used to build it should also be delivered.
The requirement in terms of data and metadata is high for this delivery and the ESPON Projects are
requested to upload the data via the Upload page [http://database.espon.eu/login]
The key indicators delivery has to follow the ESPON Data and Metadata specifications. To build a
strong and efficient query interface, these indicators will be checked in depth before integration.
This process includes three steps:
1. Syntax check, metadata format analysis (are all the mandatory fields filled?);
2. Semantic check, metadata content analysis (are the metadata understandable?);
3. Outlier detection (are there unusual values in the dataset?).
• ZOOM-IN DATA, further described in the chapter entitled The Zoom-in Delivery
Besides the key indicators delivery, some ESPON Projects (Targeted Analyses in particular,
but not only) analyze specific territories of the ESPON Area at local scale. To make this kind of
complementary and very interesting data easily accessible, a case-study interface will be developed.
To set up this interface, the projects are requested to deliver their most representative data, their
geometries (in shapefile format) and documentation describing the content of the data and
geometries (following a dedicated template).
For this delivery, the M4D project will only check whether it is possible to map the data and whether
all mandatory fields of the documentation file are correctly filled.
• BACKGROUND DATA, further described in the chapter entitled The Background Data of the
Database
To fulfil their contractual obligations and to make all data available as a coherent set, each
ESPON Project has to deliver a zip file that contains all data, metadata and geometries (if different
from the usual ones delivered via ESPON) used in the project.
This zip file is considered as an annex to the final report of the project and is stored on the ESPON
Website project page.
This document also offers useful advice in Conclusion and Advice and template files in Data
Flow Process of the Key Indicators.
Chapter 2. The Key Indicators
The delivery of the 10 best indicators is probably the most restrictive one. Because ESPON
is a community in which knowledge and material are shared, it needs to define some basic rules to ensure the
harmonization of the ESPON identity. These rules concern reports (50 pages maximum per report,
following the required typographic styles), maps (following the map-kit template) and reporting (inception report, interim report(s), draft final report, final report). They also concern data and metadata.
To be useful for ESPON projects and other end-users, data should always be accompanied by metadata,
including information about their quality and sources. It is also particularly important that the metadata
should be compliant with international (ISO) and European (INSPIRE) standards so as to ensure the
use of the database in the longer-run and to make it compatible with other national and international
database initiatives.
To ensure correct data processing and integration into the ESPON 2013 Database, the ESPON Metadata Specifications provided by the M4D project must be carefully respected by all the data providers
participating in the project and by the organizations/persons who intend to create new software implementations interacting with the ESPON Database.
The ESPON Metadata model is relatively complex, but quite complete. As a result, creating metadata
in ESPON is a substantial task, but it only concerns a limited number of indicators. TPGs
should therefore take it into consideration from the very beginning of the project.
In Section 2.1 of this chapter, we first describe the concepts behind the key indicators delivery, or
"What shall I deliver?". In Section 2.2, we detail the resources you can use to deliver your data, "How
shall I deliver my data?". Finally, Section 2.3 is dedicated to the data flow process, or "What happens
to my data?".
2.1. Concepts behind the key indicators delivery
Before delivering the key indicators, five basic rules should be kept in mind. The M4D Project has
defined these rules to give a common understanding of the future content of the ESPON Database
and to avoid the integration of overly heterogeneous information. It is the only way to offer
a database that can be managed in the future. The five rules are described below with concrete
examples of good and bad practices.
2.1.1. RULE 1 – LIMITED NUMBER OF INDICATORS
Each ESPON Project has to choose 10 key indicators covering all the ESPON Area at NUTS level. With this basic rule, we want to limit the discrepancy between projects that deliver hundreds of
indicators (residuals of statistical models, generally not very well explained in metadata) and other
projects that deliver only a few indicators, buried in a huge information flow. In general terms,
we prefer to include in the database a single indicator with real added value, rather than hundreds
of indicators which may never be queried by users of the database.
Good practices:
• If the main result of the project is a typology, please provide all the indicators used for
calculating it (e.g. if the typology is based on population, age group 20-39, age group 65+,
natural population increase and net migration, deliver all these indicators if they are not
already included in the database).
• Deliver indicators that could be helpful for the ESPON Community in the future, e.g. policy-makers, researchers and practitioners.
Bad practices:
• Deliver GDP per capita and all its statistical derivatives (which could be calculated automatically afterwards): GDP per capita EU27=100, GDP per capita ESPON Area=100 etc.
• Provide all the residuals of a complex statistical model.
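To illustrate why such derivatives do not need to be delivered, here is a minimal sketch in Python showing how an index of the type "reference = 100" can be recomputed from a single GDP per capita indicator. The unit codes, values and reference average are invented for the example, and the function name index_to_reference is hypothetical; nothing here is part of the ESPON tools.

```python
def index_to_reference(values, reference):
    """Express each value as an index where the reference value equals 100."""
    if reference == 0:
        raise ValueError("reference must be non-zero")
    return {unit: 100.0 * value / reference for unit, value in values.items()}


# Invented GDP per capita figures for three fictional territorial units.
gdp_per_capita = {"XX01": 20000.0, "XX02": 30000.0, "XX03": 25000.0}

# Assumed reference value (for example an area-wide average); not real data.
reference_average = 25000.0

index = index_to_reference(gdp_per_capita, reference_average)
for unit, value in sorted(index.items()):
    print(unit, value)
```

Since the index follows directly from the base indicator and one reference value, delivering only the base indicator keeps the database smaller without losing information.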
2.1.2. RULE 2 – INNOVATIVE INDICATORS
In the past, the ESPON M4D project has received ten indicators describing total population in 2006!
This kind of situation makes the database impossible to use (which indicator should be downloaded?). This is why,
in the key indicators delivery, we kindly ask the projects to propose innovative indicators that are not
yet in the ESPON Database.
Good practices:
• Before collecting data, check in the ESPON Database whether the indicators you are looking
for are already available.
• If mistakes are detected in the ESPON Database, please notify the ESPON M4D team and
propose a revision of the dataset.
Bad practices:
• Deliver an indicator already contained in the database without explaining the added value
of the indicator you propose (estimations of better quality, mistakes corrected).
2.1.3. RULE 3 – HIGH LEVEL OF METADATA
The metadata related to indicators must be very well explained. If you propose indicators derived from
statistical analysis or models, make sure your data is understandable by non-specialist users!
Good practices:
• Take time to correctly fill each field of the metadata model.
• Reference all the sources you use to create your dataset. In this way, the user will be able to
tell which data come from official data sources (Eurostat, national statistical institutes, ...) and which you have estimated. The total population 1990-2010 file, available
in the ESPON Database, is a good example of systematic description of the data source.
• Make sure that it is possible to rebuild the indicator you propose in the database. Use
the methodology property (part 1.7.2 of the specifications [1]) to describe your calculation
methodology.
• Attach methodological notes to your data delivery (field URI of the indicator description).
Bad practices:
• Put in the methodology field of the indicators: “cf Final Report for further explanations”…
• Deliver indicators that cannot be updated in the future without your TPG's knowledge (e.g.
composite indicators based on a data model which is your property and cannot be disseminated).
• In the source part of the metadata, mention your project as data provider (generally the
dataset is a combination of data coming from Eurostat, national sources and estimations).
2.1.4. RULE 4 – PROMOTE THE CORE DATABASE STRATEGY
Beyond the key indicators, each project can suggest the inclusion in the "Core Database" of indicators
of interest for territorial monitoring (time series, added value for the database), which could be updated
and maintained in the future, beyond the lifetime of your project.
Good practices:
• The M4D Project proposes total population at NUTS 0, 1, 2 and 3 levels for the period 1990-2010. A good practice could be to extend this temporal coverage to the period
1980-2010.
• The M4D Project proposes age structure data (5-year age classes) at NUTS 0, 1, 2 levels. A
good practice could be to extend the hierarchical coverage to the NUTS 3 level.
• The M4D Project has collected total area and population for the UMZ (Urban Morphological
Zones): extend the thematic coverage to other indicators (Land Use, etc).
Bad practices:
• Deliver a derived indicator (for example, unemployment rate) without delivering the count
data behind this indicator (e.g. unemployed population and active population).
• Deliver a dataset with a high number of missing values.
2.1.5. RULE 5 – A GOOD COMPLETENESS OF THE INDICATOR
At the moment, the ESPON Database supports several nomenclatures: NUTS division in the 1995,
1999, 2003, 2006 and 2010 revisions for the ESPON Area; United Nations division (for World countries). Whatever the nomenclature used, the degree of completeness of the indicator must be relatively
good. Ideally, most of the missing values should be estimated, with a description of the method used.
To this end, a guidance paper has been written by the M4D project, proposing a set of estimation
methods [2].
The key indicators concern Applied research projects (ESPON Priority 1) and projects from the Scientific Platform (Priority 3). For targeted analysis, most of the data will be integrated in the zoom-in
interface (cf The Zoom-in Delivery).
Good practices:
• If Eurostat (main data provider) does not provide data for some territorial units of the 10 best
indicators, check whether the data exists in external data sources (National Statistical Institutes).
• When no data is available, estimate it and systematically document in the metadata the methodology
used for the estimation.
Bad practices:
• Deliver data for only three countries of the ESPON Area: in this case, use the zoom-in
interface (cf The Zoom-in Delivery).
• No description of the estimation made.
2.2. Key Indicators Delivery
This section details the expected deliveries and the resources available to fill in ESPON data and metadata.
In order to ensure an efficient way to create data and metadata in the ESPON format, the M4D Project
has produced some useful guidance documents (available from the help menu of the ESPON Database
Web site at http://database.espon.eu [http://database.espon.eu/]).
As the M4D Project is still working on improving the Database interface, documentation and tools
will be improved in the next steps of our project (for example, the availability of an on-line metadata
editor). Updates and news will be sent regularly to the people concerned.
2.2.1. XLS Template With Examples
As shown in Figure 2.1, under the "Upload" menu of the ESPON Database Web site (login required),
an XLS template fully compatible with the ESPON Metadata specifications [1] is available to download. It contains all the required information described in the metadata specification (cf part 1b2) and
it is structured in four parts:
1. the dataset sheet (information related to the dataset)
2. the indicator sheet (information related to the indicator)
3. the source (information related to the data source)
4. the data (contains ID and data)
This XLS is the current solution to integrate data into the database. A metadata editor is under construction to ease the data integration process.
Figure 2.1. On-line availability of the xls templates
The data and metadata templates, available from the upload part of the ESPON Database Web site.
2.2.2. ESPON Data and Metadata Specifications
The document entitled ESPON Data and Metadata Specification [1], whose header is shown in Figure 2.2, is the reference document for the Priority 1 Projects datasets. It proposes a specification of the
metadata model. Firstly, it describes the generic conceptual model of the ESPON Metadata (called
the Abstract Metadata Model). Secondly, it presents the implementation of the abstract model using
the international standards (ISO-19115 and INSPIRE Directive). Finally, it explains the implementation of the abstract model in a tabular file format.
Figure 2.2. Header of the ESPON Data and Metadata Specification
This figure shows the header of the on-line HTML document, available on the ESPON Database Portal
[3].
Here is some advice on using these specifications:
• Do not be daunted by the 150 pages of the paper version of the document! From the user's point of view,
the first, second and third parts of the metadata model specifications explain the same topic in different
ways (conceptually, in an XML version, and in a tabular version, e.g. Excel): the description of
all the fields of the ESPON Metadata model.
• To begin with, we strongly advise you to carefully read the introduction of the Metadata specifications, which explains the main concepts, and also the third part, which shows the tabular model and all the
fields to be filled, with concrete examples.
• Download the metadata template (requires login) from the "Upload" menu (see Figure 2.1). On the
basis of this .xls document, fill your metadata. For example, Figure 2.3 shows how colors and
comments in this template help with filling in cells. When something is not clear, please refer to the
metadata specifications: as an example, Figure 2.4 shows the description of the Label field.
Figure 2.3 and Figure 2.4 below illustrate good practice in using the metadata
specifications.
Figure 2.3. Excel Data Model for Priority 1 projects
I want to reference my data. First of all, I want to know what kind of information is mandatory. To
the right of each cell, a description box (in red on the figure) helps me answer this question.
Each cell colored in green needs to be filled.
When going to the source part of the metadata template, I do not understand the meaning of the label
field (in orange). Looking to the right of the cell, I can see that this element is described
in part 1.6.1 of the Specifications. In the ESPON Metadata specifications, shown in
Figure 2.4, the label property gives a full description of the element.
Figure 2.4. Example of the Label field description
This figure is an extract of the specification. It shows the description of the Label field.
The stabilization of these Metadata Specifications is the first step towards feeding the ESPON
Database.
The ESPON M4D team is now working on the creation of a metadata editor to easily and dynamically
generate your metadata (without using the XLS template). Once it is available, this guidance paper will
be updated.
2.2.3. Frequently Asked Questions (FAQ)
In the past, the M4D project has had to answer many questions regarding data integration.
We have tried to capitalize on all these exchanges by writing a FAQ, available on-line from the help
menu of the Web application [3] since February 2012. As shown in Figure 2.5, questions are ordered
by topics:
1. What is M4D?
2. Content of the database
3. Access to the database
4. Data delivery
5. Metadata process
6. Support to data creation
7. Mapkit
8. Local/urban data
Figure 2.5. Header of the FAQ
This figure shows the header of the FAQ available on-line from the Help menu of the Web application
[3]
2.2.4. Collected Data Estimation Methods
One of the M4D Technical Reports entitled "The Core Database Strategy, a new paradigm for data
collection" (see annex in [2]), proposes a general strategy named ESTIM for data collection at regional
level.
An interesting added value of the document is a dictionary of estimation methods adapted for non-specialists (pp 70-103), inspired by the Data Navigator 2 framework produced within the ESPON
3.2 Project (2007). Among other uses, this dictionary has been used to estimate missing values of the total
population 1990-2010 dataset. This document will be updated depending on user feedback. The
aim of this document is twofold: first, to formalize data estimation procedures for
commonly encountered situations (we try to explain step by step the methodology employed for estimating
data by using the ESTI terminology); and secondly, to provide information in order to correctly fill
the ESPON metadata.
Each estimation method (an example is shown in Figure 2.6) is presented as a summary sheet, giving the conditions of use of the estimation method, a graphic illustration of the situation, a textual explanation (what is described in the methodology field of the metadata source), a mathematical formalization and an example of use.
A key point here is to let the user know how the data has been estimated.
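As an illustration of the principle of flagging estimated values, the following minimal Python sketch fills gaps in a time series by linear interpolation and attaches a label to each value. It is a generic technique chosen for the example, not one of the methods of the dictionary in [2]; the function name, the label names ("observed", "estimated", "missing") and the population figures are all assumptions.

```python
def fill_by_linear_interpolation(series):
    """Fill missing (None) values that lie between two observed years.

    Returns a dict year -> (value, label), where label is 'observed',
    'estimated' or 'missing'. Gaps at the start or end of the series
    are not extrapolated and stay labelled 'missing'.
    """
    years = sorted(series)
    known = [y for y in years if series[y] is not None]
    result = {}
    for y in years:
        if series[y] is not None:
            result[y] = (series[y], "observed")
            continue
        before = [k for k in known if k < y]
        after = [k for k in known if k > y]
        if not before or not after:
            result[y] = (None, "missing")
            continue
        y0, y1 = before[-1], after[0]
        v0, v1 = series[y0], series[y1]
        # Straight line between the two nearest observed years.
        estimate = v0 + (v1 - v0) * (y - y0) / (y1 - y0)
        result[y] = (estimate, "estimated")
    return result


# Invented population series for one fictional territorial unit.
population = {2000: 1000.0, 2001: None, 2002: 1100.0, 2003: None}
filled = fill_by_linear_interpolation(population)
for year in sorted(filled):
    print(year, filled[year])
```

Keeping the label next to each value is what allows the metadata to state, value by value, which figures are observed and which are estimated.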
Figure 2.6. Example of an estimation method
This figure is a screenshot extracted from [2], showing the output of an estimation method based on
time retropolation and space harmonization.
2.3. The Data Delivery Process
This section aims to answer the following question: "What happens to my data?"
The data integration process aims to apply rigorous quality control to the datasets delivered by
ESPON projects. This process is divided into 5 steps. When the TPG integrates its key indicators, it
activates a dedicated module in the ESPON Data Portal ("Upload" menu): the Tracking Tool.
The tracking tool is being developed to follow the progress of the data integration process
(Figure 2.7). Please note that this tool requires login. For further information about the
integration workflow (Who? When? etc), please consult Data Flow Process of the Key Indicators.
Figure 2.7. Dataset Integration Tracking Details
This screen (work in progress) allows users to consult details on the completed and pending activities concerning the dataset integration. The "Semantics" and "Outlier detection" reports of this dataset are
available here.
The data integration process is composed of the main steps described in the following subsections.
2.3.1. Data and metadata upload
When a project is ready to deliver its data and metadata, it activates a dedicated module in the metadata
editor. This means that the data integration process has started and the tracking tool (on the ESPON
Database Portal) has been activated. A notification is sent to the ESPON Coordination Unit and to the
M4D team in charge of the TPG.
2.3.2. Syntactic check
The syntactic check step aims at checking the compliance of delivered data and metadata with the
specification. In concrete terms, it checks whether all the mandatory fields of the ESPON data and metadata
are correctly filled. This control is done automatically when the project uploads its datasets from
the "Upload" menu of the Web application. This is the only compulsory step of the data integration
process. Once successfully checked, the dataset is saved on the server. A notification is sent to ESPON
CU and to the M4D team in charge of the next step.
The syntactic check step is performed on all uploaded datasets. As shown in Figure 2.8, the page
displays all the information needed to fix any syntactic errors or warnings. Three types of
messages are displayed in the log boxes:
• INF prefix indicates an information message, e.g. some information about the syntactic check
process.
• WRN prefix indicates a warning message. Warning messages are triggered for ambiguous values
that may be problematic during the next steps of the integration. Nevertheless, warning messages do
not make the syntactic check fail. As shown in Figure 2.9, the TPG is invited to review
its dataset if needed, though it can also submit it to the semantic check.
• ERR prefix indicates an error message. Error messages refer to missing values or errors in mandatory fields of the metadata. These errors oblige the user to revise the dataset, which cannot pass
this step or continue the integration process until they are fixed.
Figure 2.8. Syntactic check: example of an invalid input
This screen shows the information messages (prefixed with [INF]), warning ([WRN]) and error
messages ([ERR]) returned by the syntactic parser. Example:
WRN No value found for the indicator 'IXP'. Skipping data validation for this
ERR The 'Temporal Extent' property is null.
ERR The 'Dataset Information' element is not valid.
ERR The 'Temporal Reference' element is not valid.
ERR Unable to check the global temporal extent, because it is null.
ERR The 'Temporal Reference' property is not valid.
ERR The 'Temporal Reference' property is not valid.
ERR The 'Lineage' property is not valid.
Figure 2.9. Syntactic check: example of a valid input despite warnings
This screen shows that the uploaded file is valid (no errors) but still contains warnings. The user can
pass this step or fix the dataset by clicking the respective buttons at the bottom of the page.
2.3.3. Semantic check
After the syntactic check step, the dataset is transferred to the M4D contact team in charge of the TPG.
This step analyses the content of the data and metadata (in particular the free-text fields). Its
aim is to verify that all the indicators of the dataset are correctly described and understandable by a wide audience. The outcome of this expert check is a semantic report.
Note that this semantic report does not block the data integration process: the project
is invited to consult the report and to decide either to continue the integration process or to fix its
dataset according to this expert advice.
An example of such a semantic report, filled with annotations, warnings and remarks, is shown in
Figure 2.10.
Figure 2.10. Example of a Semantic Check Report
This extract of an example semantic check report shows annotations, remarks and suggestions next to
the problematic cells. Further details are given below.
Concretely, the semantic check produces two files: a report and a proposal of correction.
The report (as shown in Figure 2.10) contains the following information:
• First lines: who did the check and when.
• First column: description of the error(s) detected in each sheet (dataset, indicator, and source). For instance: “the indicators should be better described”, “the methodology of calculation of
the indicator should be specified more precisely”, “keywords are not adapted to the indicator”, etc.
• Second/third column: description of the location of the error in the metadata (name of the indicator/label etc.)
• Last column: action taken on the metadata. Three cases are possible:
• the information is deleted (e.g. a bad keyword);
• detected mistakes are corrected by the M4D contact team (e.g. the name of an indicator is made more precise);
• when it is impossible to correct the information, the following comment is displayed: the project
is strongly advised to specify the information.
The proposal of correction is a new metadata file. This step is an expert review, not a gate. In other words, if the
TPG is not able (or does not want) to correct its metadata, the dataset can still be submitted to the next
step of the integration process.
The following screenshots illustrate an example of the semantic check performed by the M4D
Team on a problematic dataset. Figure 2.11 shows the information initially received. Figure 2.12 shows
the documents consulted to help understand and fix the received information. Figure 2.13
shows the proposal of correction returned to the TPG.
The M4D contact team is not in charge of filling this kind of information! We support you in
the process but please make sure that your delivered indicators are understandable by external
users!
Figure 2.11. Semantic Check Example: Input Information
This figure shows a lack of information in the initially received metadata. This kind of description (4-digit classes) is not enough to understand how the indicator has been built.
Figure 2.12. References for a Semantic Check Expertise
This figure shows the material available (TGP report) to complete the missing information.
Figure 2.13. Semantic Check Example: Fixed Information
This figure shows the fixed information returned to the TGP in the proposal of correction document.
2.3.4. Quality control
At this stage, an outlier detection tool will run on the key indicators. The aim of this check is to
provide an expert assessment of unusual values contained in the dataset according to various statistical tests
(statistical outliers, spatial outliers). Like the semantic check, this is an expert review. The outcome of this step
is an outlier report.
ESPON TPGs can validate or reject the result of the check after consulting this report.
The design of this check is still in progress. Until its implementation, the uploaded data
passes directly to the next step.
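Since the design of the outlier check is still in progress, the following Python sketch only illustrates what a test for statistical outliers could look like. It uses the common interquartile-range rule, which is an assumption and not the method chosen by the M4D Project; the function name iqr_outliers, the threshold k and the sample values are hypothetical. Spatial outliers are not covered.

```python
import statistics


def iqr_outliers(values, k=1.5):
    """Return the values lying outside [Q1 - k*IQR, Q3 + k*IQR].

    Illustrative rule only; needs at least two values.
    """
    q1, _median, q3 = statistics.quantiles(values, n=4)
    spread = q3 - q1
    low = q1 - k * spread
    high = q3 + k * spread
    return [v for v in values if v < low or v > high]


# Hypothetical indicator values for seven territorial units.
sample = [10, 12, 11, 13, 12, 11, 95]
print(iqr_outliers(sample))
```

Values flagged in this way would only be candidates for review: as stated above, the TPG remains free to validate them.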
2.3.5. Integration into the database
The previous checks and steps of the dataflow give us a sound assessment of the quality of the datasets
delivered by projects. Before integrating a dataset into the database, the ESPON M4D Project first
needs the agreement of both the ESPON project and the ESPON Coordination Unit. This validation
agreement is mainly based on the provided reports (semantic/outlier).
After its integration into the database, it will be possible to dynamically query the database composed
of the 10 best indicators through the search interface. Well-described metadata give
real added value to the indicators.
2.4. M4D Support to ESPON TPGs
There are two critical phases during the lifetime of an ESPON Project:
• The beginning, when the project has to find the material and guidance for beginning its investigations: mapkit, basic data, ESPON metadata rules, understanding of the data process.
• The ending, when the project delivers its data and metadata in the specified ESPON format and
goes through the checks.
To ensure the proper integration of ESPON data in the expected format (e.g. key indicators strategy,
good quality of data and metadata), continuous exchanges with ESPON TPGs are also strongly needed.
The idea behind the follow-up of ESPON Projects is to help them as much as possible in their data
creation process, and not to wait for the end of the project to discover mistakes in the data or metadata.
Though the M4D Team can help with the integration, please remember that it is not in charge of
the creation of the ESPON TPGs' dataset files.
Consequently, the follow-up of ESPON TPGs requires defining some tasks, which can be described
according to the lifetime of the project. On top of that, it is necessary to distinguish:
• ESPON projects under Priorities 1 and 3, delivering data which will feed the web interface and
following the "key indicators" principle (delivered for the ESPON area with high quality metadata);
• Case Studies data that will feed a dedicated part of the Web application, and for which the requirements are different.
The following sub-sections propose some guidelines for the different phases of a project.
2.4.1. Beginning of the project: guidance phase
Main issues to be taken into account:
1. Ensure that each Priority 1 Project has access to the entire ESPON Database (public and private
parts). This implies giving a login and password pair to each project.
2. Inform the ESPON TPG about the resources available in the database:
• Guidelines concerning data and metadata
• Mapkits
• Technical reports
• Data available in the database, geodatabases coming from Eurogeographics, etc.
3. Present what is expected from the ESPON TPG at the end of the project, if it is still not clear
from this technical report. Please respect "The key indicators principle" described in Section 2.1.
4. Present the ESPON data and metadata templates and representative examples if needed.
5. Identify the persons in charge of the data collection and creation in the ESPON Priority 1
project. It is always more efficient to be in touch with the engineers, who generally create the data,
rather than with the scientific coordinator, who uses the data created in the project.
6. Explain the data process at the end of the project (syntactic check, semantic check and quality
control) and describe how the ESPON tracking tool manages the data flow, if it is still not clear
from this technical report.
7. Present the way to manage case-study data (cf. The Zoom-in Delivery).
2.4.2. During the project: help for data creation
Issues to be taken into account during the project mainly concern the data creation:
• Answer each question asked by the TPG regarding data creation.
• Meet the project during each ESPON Seminar.
• Be present at least at one meeting of the project.
• Make the link with the ESPON M4D project.
2.4.3. End of the project: help for integration
Main issues to be taken into account at the end of the project:
• In case of problems with the syntactic check, the contact team helps the project to solve
them.
• For each ESPON project, Table 2.1 shows the M4D contact team that is in charge of the semantic
check.
Table 2.1. TGPs' M4D Contact Team
ESPON Project | M4D Contact Team
ATTREG, TRACC, SGPTDE | UMR Géographie-cités (FR), Anne Bretagnolle <[email protected]>
EU LUPA, ESaTDOR | Universitat Autònoma de Barcelona (ES), Roger Milego <[email protected]>
KIT, TERCO, SeGI | National Center for Geocomputation (IE), Martin Charlton <[email protected]>
ARTS | University of Iasi TIGRIS (RO), Alexandru Rusu <[email protected]>
TIGER | UMS RIATE (FR), Ronan Ysebaert <[email protected]>
All ESPON Priority 3 Projects (monitoring, map updates…) | Laboratoire d'Informatique de Grenoble LIG STeamer (FR), Jérôme Gensel <[email protected]>
Chapter 3. The Zoom-in Delivery
This chapter focuses on the networking activities with the Priority 2 projects (and more generally case
study data). It is crucial for the ESPON database to integrate all the data and indicators provided in
the framework of these projects, even if the information does not cover the whole ESPON space.
However, the characteristics of data provided at local scale make a homogeneous integration of such information in the query interface described above impossible, for several reasons:
• Too precise information: One of the aims of the web interface is to provide datasets for the whole
ESPON area. Making indicators available for only a couple of NUTS2 regions at local scale would produce
noise in the database.
• Heterogeneous nomenclatures: Some datasets can be produced in heterogeneous geographical delineations, outside the NUTS or LAU nomenclatures (bassin de vie in France, Super Output Area
in the UK). It would be very difficult to store all the provided nomenclatures in a systematic way.
• Too specific indicators: When analyzing territorial dynamics at local scale, some indicators of high
interest may be collected for these case studies, but are of no use at the ESPON scale (for
instance, number of commuters going from Germany to Luxembourg in the Grande Région).
• Difficulty in identifying what is available: As case studies multiply at a very local scale,
providing an overview of what is available becomes a challenge. The query interface is clearly
not suited to this kind of request.
The storage of data coming from ESPON Priority 2 projects raised many conceptual and practical
problems, which have been solved by proposing an alternative way to enter the data.
3.1. The Zoom-in Delivery Strategy
The ESPON M4D considers as a “zoom-in delivery” a dataset that does not cover the entire ESPON
Area (EU27+4). This includes several cases:
• Local data for a region or a group of regions (e.g. Greater Manchester at LAU2 level, Ile-de-France
at employment basin level, which is not included in the LAU nomenclature, etc.)
• Non ESPON Area and non ESPON Neighbourhood data (e.g. data on American, Brazilian or Japanese regions).
The M4D proposal consists in building a specific interface for querying such data. The data will be
stored following a simple template (in a zip format, cf. Section 3.2 for further explanations) and will be
downloadable from the two proposed pages shown in Figure 3.1 (overview) and Figure 3.2 (details).
Figure 3.1. Overview page of Case Studies
This overview page of case studies is a proposal that will be further improved, but it presents some
clear advantages for the users:
• A clear overview of the location of case studies produced within the ESPON Program.
• Data integration is not limited to Europe and it is easily possible to integrate data coming from case
studies outside Europe (USA, China, etc)
• It is a simple solution for displaying in a homogeneous way the heterogeneity of the ESPON production.
Some options will be added in order to interact with the map (e.g. select only the location of
case studies coming from a given project; select only the case studies located in a given country).
Then, when selecting a project pin, the user is redirected to the case study information page shown
in Figure 3.2.
The pin-based display of case studies is certainly not the best way to show those located in
cross-border areas (Grande Région), large areas (North Calotte) etc. But taking into account
the heterogeneity of case studies data and the difficulty of predicting in advance what kind of
geometries could be proposed by ESPON Projects, the M4D Project has chosen this solution,
which may be improved in a future version of the interface.
Figure 3.2. Information Page of a Case Study
This figure shows the information page of a case study, previously selected from the list in the
Overview page (Figure 3.1).
Five main parts compose the page:
1. General information related to the ESPON TPG (aim of the data collection, contact, upload date
of the datasets).
2. Data information: a listing of the available indicators, temporal extent of the indicators.
3. Geometries: location and name of the case study, nomenclatures used to collect data.
4. Data source: name of the data provider(s), URL, precautions for use.
5. Downloads: this part of the page allows the user to download separately the data (.zip format), the
geometries (as a .zip), and the metadata page as a .pdf file. Note that the download rights
may be specified and restricted, particularly for geometries that are not free to use, for example the
Eurogeographics data.
3.2. Expected delivery
To feed the zoom-in interface and the metadata page, the M4D Project needs three main deliveries
from the ESPON Projects: data, geometries and documentation. The following sub-sections describe
each of these elements.
3.2.1. Data file
The format of the data file, shown in Figure 3.3, is not significantly different from the one proposed
for the key indicators. An example of the data file is also given in Figure 3.4. The elements that differ
from P1 projects are:
• The first two lines of the Excel sheet (the temporal extent and the ID of the indicator have been
concatenated in a single cell)
• The source column, to the right of each indicator column, has been deleted. It means that the source
description is made at the level of the dataset.
The main elements of the data file are:
1. Code (first column): A code for the territorial units contained in the database (which has to be the
same as the one used in the geometries).
2. Name (second column): Name of the territorial unit
3. Object (third column): Object type (LAU1, LAU2, River Basin…)
4. Version (fourth column): Object type version (like NUTS versions). If the version is not applicable
or not available for the dataset, put n/r (not relevant) or n/a (not available).
5. Indicator code (first line): Code of the indicators (concatenation of an identifier “POP” and the
year of reference “1990”)
6. Values: When data is not available, put n/a in the cell; when data is not relevant (e.g. location of
harbour for non-coastal territorial units), put n/r in the cell.
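The convention for value cells can be sketched in Python as follows. This is only an illustration of how a reader of the data file might interpret a cell: the function name parse_value and the returned labels are assumptions, and the sketch presumes that every cell other than n/a or n/r holds a number.

```python
def parse_value(cell):
    """Interpret one value cell of the data sheet.

    Returns a (status, number) pair; hypothetical helper for illustration.
    """
    text = str(cell).strip()
    marker = text.lower()
    if marker == "n/a":
        return ("not_available", None)
    if marker == "n/r":
        return ("not_relevant", None)
    # Anything else is expected to be numeric; float() raises ValueError otherwise.
    return ("value", float(text))


for cell in ["1250", " 12.5 ", "n/a", "N/R"]:
    print(repr(cell), "->", parse_value(cell))
```

Keeping the two markers distinct matters for mapping: a unit with no data is not the same as a unit for which the indicator makes no sense.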
Figure 3.3. Data Template for Zoom-in Indicators and Project Database
This figure shows the content of the values sheet expected for Case Studies project data file.
Figure 3.4. Data Model Example for Zoom-in projects
This figure shows an example of the expected data model for Case Studies projects.
3.2.2. Geometry file
In terms of geometries, the M4D Project expects georeferenced information (Figure 3.5) in the ESRI
Shapefile format [4]. The .dbf file linked to a shapefile has to contain at least
a code (ID) identical to the one contained in the data files (Figure 3.6). Thus, it is possible
for the user to:
1. Analyse the exact territorial coverage of each case study.
2. Build maps from the data gathered for each case study of the ESPON Community.
Geometries have to be delivered in a .zip archive whose filename is
name_of_the_project_geom.zip.
Figure 3.5. Example of a Case Study Geometries Input
This figure is an example of the ESPON TeDi Project Case Study, available at LAU 2 level.
Figure 3.6. Mapping of Geometries Codes in Data
This figure shows the full correspondence between the codes of the geometries and those of the data files.
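Before zipping the geometries, a data provider may want to verify that the archive content is complete. The Python sketch below assumes that the archive holds a single shapefile and relies on the fact that .shp, .shx and .dbf are the three mandatory components of the ESRI Shapefile format. The function name and the file names are hypothetical.

```python
import os

# Mandatory components of an ESRI shapefile.
REQUIRED_EXTENSIONS = {".shp", ".shx", ".dbf"}


def missing_shapefile_parts(member_names):
    """Return the mandatory extensions absent from a list of file names."""
    found = {os.path.splitext(name)[1].lower() for name in member_names}
    return sorted(REQUIRED_EXTENSIONS - found)


# Hypothetical content of a geometry archive.
members = ["tedi_lau2.shp", "tedi_lau2.dbf", "tedi_lau2.prj"]
print(missing_shapefile_parts(members))
```

The same list of names can be obtained from an existing archive with the namelist() method of the standard zipfile module.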
3.2.3. Documentation file
The documentation file provides the information that is ultimately made available to end-users on
the page shown in Figure 3.2. The file structure is inspired by the metadata specifications of the
key indicators, with some simplifications and adjustments linked to the specificities of such a project
delivery.
In the xls template, mandatory fields must be filled in two sheets. These mandatory cells are indicated
with a green background color in Figure 3.7 and Figure 3.8.
The following sub-sections describe each of the sheets.
3.2.3.1. The dataset sheet
Figure 3.7. Case Study Dataset Sheet
This figure shows the dataset sheet of the TeDi Case Study data file. The green color shows mandatory fields. The purple color shows optional fields.
The expected information in the dataset sheet is:
• Name: name of the delivery. It should give an idea of the dataset content. We encourage all dataset
providers to choose short and meaningful dataset names that directly reflect the data
semantics.
• Project: ESPON project in which the dataset was produced. This should be an acronym of one
of the existing ESPON projects. If this property is not specified, the default project "ESPON 2013
Database" will be applied.
• Abstract: Free-text description of the contents of the dataset, written so that the aim of the case
study (both geographical coverage and thematic scope of the delivery) is understandable.
• Access classification: Classification of the access rule applied to the dataset and to the geometries
separately. Three possibilities can be mentioned in this field:
1. unclassified - available for general disclosure (public access)
2. restricted - not for general disclosure (for registered users only, e.g. belonging to the
ESPON Program). This possibility has to be used when the geometries come from Eurogeographics, which cannot be distributed outside ESPON. As far as possible, try to create your own
geometries with no limitations of use.
3. confidential - available for someone who can be entrusted with information (for the administrator of the database only, e.g. ESPON Coordination Unit and the ESPON Database administrator)
• Use restriction: Information that future users of the dataset need to know. It may concern
inconsistencies between indicator definitions (e.g. “be careful with the unemployment rate definition for
Belgian territorial units”), the content of the dataset (e.g. data are not available for the same year), etc.
• Responsible party: Organization or person responsible for the entire dataset. Name, organization and email contact are required.
• Metadata contact: Organization or person who created the metadata for the dataset. Name,
organization and email contact are required.
• Spatial binding: Describes the spatial link between the data part of the dataset and the territorial units used. Four elements are required: the name of the case study and its country,
the latitude and the longitude of the case study (by convention, we propose to use the center
of the case study), and information related to the geographical level of analysis (nomenclature name
and/or version and/or level). The number of case studies per dataset is not limited.
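Some of the dataset sheet values can be verified before delivery. The Python sketch below checks the access classification against the three values listed above and the location of the case study. It assumes that latitude and longitude are expressed in decimal degrees, which this report does not state; the function name and the sample values are hypothetical.

```python
ACCESS_CLASSIFICATIONS = ("unclassified", "restricted", "confidential")


def check_dataset_sheet(access, latitude, longitude):
    """Return a list of problems found in three dataset-sheet values.

    Assumption: latitude and longitude are given in decimal degrees.
    """
    problems = []
    if access.strip().lower() not in ACCESS_CLASSIFICATIONS:
        problems.append(f"unknown access classification: {access!r}")
    if not -90.0 <= latitude <= 90.0:
        problems.append(f"latitude out of range: {latitude}")
    if not -180.0 <= longitude <= 180.0:
        problems.append(f"longitude out of range: {longitude}")
    return problems


print(check_dataset_sheet("restricted", 47.16, 27.59))  # no problem expected
print(check_dataset_sheet("public", 95.0, 27.59))       # two problems expected
```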
3.2.3.2. The indicator sheet
Figure 3.8. Case Study Indicator Sheet
This figure shows the indicator sheet of the TeDi Case Study data file. The green color shows
mandatory fields. The purple color shows optional fields.
The expected fields in the indicator sheet are:
• Code: A short acronym that reflects the meaning of the indicator
• Name: A short expression that reflects the meaning of the indicator
• Abstract: The abstract of the indicator. This property must describe the indicator in a more
extended way than the Name property does. The abstract must not merely repeat the name of
the indicator, but provide additional information that the Name does not give.
• Methodology description (optional): Describes the methodology used to produce indicator
values. This methodology can concern a particular indicator independently of data sources or be
specific to a particular source that provided indicator values (e.g. when a typology is produced,
explain the cluster method used and the meaning of values shown in the data file – 1 for decreasing;
2 for increasing).
• Methodology URI (optional): Reference to the resource where a detailed description of the
methodology is made. This may be a reference to an online/paper publication or to the name of a
file attached to the dataset. If this property specifies a file name, it must be present in the package
delivered to the data processors; otherwise the data provider will be requested to supply this file.
• Temporal extent: groups temporal references of periods or instances covered by the values
of an indicator in the dataset. When the indicator is available for several time periods (e.g. the DNS_1a
indicator in figure 15), add several temporal extents.
• Provider: Refers to the data provider of the indicator value. The provider may be an institution
or even a person who is the originator of the data. This property should not be confused with the
reference to the publication source: the data provider is the actor who contributed to the data
production or publication.
• Provider URI (optional): Official Uniform Resource Identifier (URI) of the data provider. In
most cases, this is the URL (Internet address) of the data provider's site. This property must not
represent a reference to the publication, but to the organization or the person who provided the
data. For example, this property can take the value "http://ec.europa.eu/eurostat", which refers to
the home page of Eurostat.
• Publication title (optional): Title of the publication or name of the source where data were
taken from, if it exists (for instance "Switzerland Statistics Public Database")
• Publication URI (optional): Official Uniform Resource Identifier (URI) of the publication.
In most cases, this is the URL (Internet address) where the data is available online or can be accessed or obtained. This can also be an ISBN if the source is a paper publication (for instance http://
www.espon.eu/reports/report001.pdf).
• Publication reference (optional): Indicates the element of the referenced publication
(page, part, chapter etc.) to refer to (for instance p. 50, chapter 2).
• Methodology description (optional): This property describes source-specific methodological details that make the data from this source distinct from the data coming from other sources
of the dataset (for instance “coming from heterogeneous data providers, the data has been harmonized using Eurostat data”). Cf. the Technical Report on Core indicators, which proposes some examples of estimation methods.
• Methodology URI (optional): Reference to the resource where a detailed description of the
methodology is made. This may be a reference to an online/paper publication or to the name of a
file attached to the dataset. If this property specifies a file name, it must be present in the package
delivered to the data processors, otherwise the data provider will be requested to supply this file.
• Copyright (optional): Text describing the copyright rules and/or restrictions applied to the data
associated with this source. The default value of this property is "(c) ESPON 2013 Database".
3.3. What happens to my data? The zoom-in data integration process
At the moment, zoom-in deliveries must be sent to your TPG Project officer, the ESPON M4D manager (<[email protected]>) and the TIGRIS team
(<[email protected]>). When zoom-in data is delivered, a compliance check is
carried out by the M4D Project (the TIGRIS team in particular) in order to check that:
1. The codes of the territorial units contained in the geometries and the dataset are the same (is it
possible to make a map?)
2. The geometries are georeferenced (is it possible to display the case-study on the ESPON Mapkit?)
3. All the mandatory fields of the documentation file are correctly filled.
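The first point of this compliance check can be anticipated by the data provider with a simple comparison of code lists. The Python sketch below is not the tool used by the TIGRIS team; the function names and the sample codes are hypothetical. It reports the territorial units present in one file but not in the other.

```python
def compare_codes(data_codes, geometry_codes):
    """Report codes present in the data file or the geometries only."""
    data_set = set(data_codes)
    geometry_set = set(geometry_codes)
    return {
        "without_geometry": sorted(data_set - geometry_set),
        "without_data": sorted(geometry_set - data_set),
    }


def codes_match(data_codes, geometry_codes):
    """True when both files describe exactly the same territorial units."""
    result = compare_codes(data_codes, geometry_codes)
    return not result["without_geometry"] and not result["without_data"]


# Hypothetical codes read from the data sheet and from the .dbf file.
data_codes = ["A1", "A2", "A3"]
geometry_codes = ["A1", "A2", "B9"]
print(compare_codes(data_codes, geometry_codes))
print(codes_match(data_codes, geometry_codes))
```

A map can only be built for the units whose code appears in both lists, so any unit reported here is worth fixing before delivery.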
At the end of the compliance check, a notification is sent both to the coordinator of the project and
to the project officer. It means that the zoom-in delivery will be available from the zoom-in interface,
shown in Figure 3.1 and Figure 3.2.
In the coming months, it will be possible to deliver zoom-in data in the upload part of the ESPON Data
Portal, making it possible to centralise all this material in a dedicated part of the Database for better dataflow
management.
3.4. Support to TPG producing zoom-in data
The TIGRIS Team (University of Iasi, Romania) is the M4D team in charge of the follow-up of
projects producing case-study data. The TIGRIS team has good experience of local data and has
produced several technical reports on that topic. ESPON TPGs are welcome to ask the TIGRIS team any
question regarding the case study data flow or the availability of data at local level.
For any question, please send an email to Alexandru Rusu (<[email protected]>).
Chapter 4. The Background Data of the Database
4.1. Strategy for the Background Data
ESPON TPGs may have produced a lot of data useful for specialists (e.g. residuals of a regression
model) but not for ordinary (i.e. non-expert) users, such as policy makers or practitioners. TPGs
may also produce intermediate data that has been used to produce a synthetic index delivered in the "key
indicators".
For such cases, the M4D Project has produced a simplified data and metadata template derived from
the Metadata Specifications of P1 projects. The aim of this template is to offer external users the
minimal information needed to understand the meaning of the indicator, the origin of the data and
some details about the data producer. In short, this template helps to define harmonised information
related to data.
4.2. Expected delivery
The XLS template developed for that purpose is easy and quick to fill in. It is structured
in two parts: one dedicated to data and the other to metadata.
The data template is structured like the one proposed for case-study data (cf. Section 3.2.1), and has to
be delivered as a .xls file including a single sheet entitled data.
The metadata file contains 10 compulsory fields (Figure 4.1) and has to be delivered as a .xls file
including a single sheet entitled metadata. This sheet is structured in columns (one for each indicator). The first part is dedicated to the indicator definition, the second part to the data sources.
Figure 4.1. Project Database Metadata Sheet
This figure shows the content of the metadata sheet expected for the Background project data file. A
description of the fields is given in the following sub-sections.
4.2.1. Indicator description
The indicator sheet content is described below:
• Code: A short acronym that reflects the meaning of the indicator.
• Name: A short expression that reflects the meaning of the indicator
• Abstract: The abstract of the indicator. This property must describe the indicator in a more
extended way than the Name property does. The abstract must not merely repeat the name of
the indicator, but provide additional information that the Name does not give.
• Temporal extent: groups temporal references of periods covered by the values of the indicator.
• Methodology description (optional): Describes the methodology used to produce indicator
values. This methodology can concern a particular indicator independently of data sources or be
specific to a particular source that provided indicator values (e.g. when a typology is produced,
explain the cluster method used and the meaning of values shown in the data file: 1 for decreasing;
2 for increasing).
• Keyword (optional): Groups a list of keywords and/or keyword expressions related to the indicators. Ideally, these keywords must refer to the GEMET Thesaurus (http://www.eionet.europa.eu/
gemet/).
• Upload/metadata date: Date of creation of the metadata file in the following format: DAY/
MONTH/YEAR
• Use constraint (optional): Access and use constraints applied to ensure the protection of
privacy or intellectual property, and any special restrictions or limitations on obtaining the resource.
• Point of Contact: Persons or organizations that may be contacted for different issues related
to the dataset/metadata. A name, an email and the name of a reference organization are required.
• Project: Name of the ESPON Project that created the file.
• How to source the indicator (optional): As on the ESPON Mapkit, the rule is the
following: name of the team, name of the ESPON Project, dataset date.
4.2.2. Source description
All the data providers (structured in columns) are listed below the indicator description. A source may
be described as follows:
• Provider Name: Refers to the data provider of the indicator value. The provider may be an
institution or even a person who is the originator of the data.
• Reference (optional): Official Uniform Resource Identifier (URI) of the data provider. In most
cases, this is the URL (Internet address) of the data provider's site. For example, this property can
take the value "http://ec.europa.eu/eurostat", which refers to the home page of Eurostat.
• Copyright (optional): Text describing the copyright rules and/or restrictions applied to the data
associated with this source.
• Publication title (optional): Title of the publication or name of the source where data were
taken from, if it exists (for instance "Switzerland Statistics Public Database")
• Methodology description (optional): This property describes source-specific methodological details that make the data from this source distinct from the data coming from other sources
of the dataset (for instance “coming from heterogeneous data providers, the data has been harmonized using Eurostat data”). Cf. the Technical Report on Core indicators, which proposes some examples of estimation methods.
Note that at least one source is needed for each indicator.
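Two of the rules given in this section lend themselves to a quick self-check before delivery: the DAY/MONTH/YEAR format of the metadata date and the need for at least one source per indicator. The Python sketch below assumes that the metadata sheet has already been read into a list of dictionaries; the dictionary keys, the function name and the sample records are hypothetical.

```python
from datetime import datetime


def check_background_metadata(indicators):
    """Return a list of problems for a list of indicator descriptions."""
    problems = []
    for indicator in indicators:
        code = indicator.get("code", "?")
        # At least one source is needed for each indicator.
        if not indicator.get("sources"):
            problems.append(f"{code}: at least one source is needed")
        # The metadata date must follow the DAY/MONTH/YEAR format.
        try:
            datetime.strptime(indicator.get("metadata_date", ""), "%d/%m/%Y")
        except ValueError:
            problems.append(f"{code}: metadata date is not DAY/MONTH/YEAR")
    return problems


records = [
    {"code": "POP", "sources": [{"provider": "Eurostat"}], "metadata_date": "05/03/2012"},
    {"code": "GDP", "sources": [], "metadata_date": "2012-03-05"},
]
for problem in check_background_metadata(records):
    print(problem)
```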
ESPON Projects are free to deliver one or several xls files. Though each ESPON Project may define
the structure of the database delivery, when several xls files are delivered, the M4D Project
suggests organising them into a coherent folder (contained in a .zip archive file) that should
be structured as follows:
1. By theme (demography, economy, policy indicators, environment etc.)
2. By geographical objects (flows, territorial data)
In this way, external users can easily retrieve the data they are looking for.
4.3. What happens to my Data?
At the moment, background data must be sent to your TPG Project officer, the ESPON M4D
manager (<[email protected]>) and the M4D contact team. After that, a compliance check
is done by the M4D contact team to verify that the template is followed and that all the mandatory fields
are filled.
Neither the semantic check nor the outlier check is run on this delivery.
In the end, these additional data will be available on the ESPON Web page of each project (together
with the inception report, final report etc.; see Figure 4.2).
In the coming months, it will be possible to deliver background data through the upload part of the ESPON
Data Portal, which will centralise all this material in a dedicated part of the Database for better
dataflow management.
Figure 4.2. Example of a Background Data Page
This figure shows a page on the ESPON Web Site dedicated to the Background Data Projects.
Chapter 5. Conclusion and Advice
To conclude, this chapter offers some advice on managing the data flow inside each ESPON
Project, together with complementary information.
5.1. Advice for a perfect management of the data process
The following advice results from the experience gained in following up previous ESPON Projects,
which encountered some difficulties in following and delivering the data and metadata specification.
1. A limited number of persons in charge of data/metadata creation in each TPG.
Ideally, each project should dedicate one member of its team to data and metadata creation. This
makes it possible to:
a. Centralise all the data of the project
b. Harmonize data and metadata creation
c. Give a single delivery at the end of the project (a bad practice would be for each partner of the
TPG to deliver its own key indicators without any control by the consortium).
2. Raise the question of data delivery very early in the lifetime of a project.
Regarding the expected deliveries, some basic questions need to be discussed inside each project
very early:
a. What key indicators will be delivered to the database?
b. How should the data delivery of our case study be organized?
c. What kind of innovative indicator, updatable in the future, could we propose to the ESPON
Community?
It is important to consider that waiting until the end of the project to take care of the data delivery
process may lead to integration problems and a significant loss of time.
3. Do not hesitate to contact the ESPON M4D team.
Each ESPON TPG is followed by at least one member of our team (see Table 2.1). The M4D consortium is
present at each ESPON Seminar and is open to any suggestion or question that could ease the life of ESPON
Projects.
4. Do not lose information; use the metadata templates as soon as possible!
In that way, you can be sure not to forget any mandatory field, and you will not have to
go through a tedious copy/paste of your datasets into the templates at the end of your project.
Reminder:
• Question: To whom should the data be delivered, and when?
Answer: data must be delivered through the Upload part of the ESPON Data Portal (one item
will be dedicated to key indicators, one to zoom-in data). This upload will trigger a notification that will be sent to the M4D Contact team responsible for the ESPON Project, to the
M4D manager, and to the ESPON CU Project Officer.
• Question: Where can I find the xls templates?
Answer: as shown in Figure 2.1, the three XLS templates are available under the Upload page
(restricted to members) of the ESPON Database Portal. An empty version and an applied
example are systematically provided.
5.2. A good practice for filling data and metadata
This example is derived from a concrete case encountered by the M4D project during the
data collection of one of the core indicators (total population 1990-2011, available through the search
interface). One of the aims of the core database strategy is to provide complete time series at NUTS
levels for the ESPON Area for a set of basic count data. Among other things, this implies estimating some
missing values and documenting precisely in the metadata the methodology used to fill the gaps
in the dataset.
Taking Denmark as a starting point, total population is available for 2007 and 2008 on the Eurostat website. These values
refer to the label "1", which is described in the metadata file, as shown in Figure 5.1.
Figure 5.1. Starting point: a table with empty values
This figure shows a common situation: a table with empty values which need to be estimated.
When looking at other data sources, this information is available for only two territorial units on the
national statistical website of Denmark (due to the change of NUTS definition). The only way to
obtain data for the rest of the territorial units is to estimate them (temporal
retropolation in this case).
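To make the idea of temporal retropolation concrete, the following Python sketch back-casts a value by assuming that the annual growth rate observed between the two earliest known years also applied in the preceding years. This is only one simple possibility chosen for illustration; it is not claimed to be the method actually applied by the M4D project, which is documented in the metadata and in the Technical Report on Core indicators. The function name and the figures are invented and are not real population counts.

```python
from typing import Dict


def retropolate_constant_growth(known: Dict[int, float], target_year: int) -> float:
    """Back-cast a value for a year before the first known year.

    Assumes the annual growth rate observed between the two earliest known
    years also applied in the preceding years (an illustrative assumption).
    """
    first, second = sorted(known)[:2]
    if target_year >= first:
        raise ValueError("target_year must precede the first known year")
    annual_rate = (known[second] / known[first]) ** (1.0 / (second - first))
    return known[first] / annual_rate ** (first - target_year)


# Invented figures for illustration only.
known_values = {2007: 1000.0, 2008: 1010.0}
estimate_2006 = retropolate_constant_growth(known_values, 2006)
estimate_2005 = retropolate_constant_growth(known_values, 2005)
print(round(estimate_2006, 2), round(estimate_2005, 2))
```

Whatever method is chosen, the values it produces are estimates and must be flagged as such in the metadata.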
The question is: how should this be referenced in the metadata file?
The only way to avoid a loss of information is to reference this estimation immediately
in the metadata source of the dataset! Figure 5.2, Figure 5.3, Figure 5.4, and Figure 5.5 propose a way
to proceed in order to ensure a high quality of metadata.
Figure 5.2. Resulting dataset with estimated values and associated labels
This figure shows the resulting table with estimated values. Each estimated value has a label (column
source of the total population 2005 and 2006) explaining the methodology used to create the estimation. Of course, these labels (TE6b, 13) differ from the one of the starting table (label
1, source of the total population 2007 and the total population 2008). In concrete terms, using
two labels (TE6b, 13) means that two different methods have been used to estimate the missing
values. These labels have to be described in the source part of the metadata immediately.
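The rule that every label used in the dataset must be described in the source part of the metadata lends itself to a simple cross-check. The Python sketch below assumes a hypothetical in-memory representation of the table (one tuple per value); the unit code and the values are placeholders, and only the labels 1, 13 and TE6b come from the example.

```python
from typing import List, Optional, Set, Tuple

# Each row: (territorial unit code, year, value, source label).
# Unit codes and values are placeholders, not real data.
Row = Tuple[str, int, Optional[float], Optional[str]]


def undescribed_labels(rows: List[Row], described: Set[str]) -> Set[str]:
    """Labels attached to a value in the dataset but absent from the metadata."""
    used = {label for _, _, value, label in rows if value is not None and label}
    return used - described


def values_without_label(rows: List[Row]) -> List[Tuple[str, int]]:
    """(unit, year) pairs that hold a value but no source label."""
    return [(unit, year) for unit, year, value, label in rows
            if value is not None and not label]


rows: List[Row] = [
    ("UNIT_A", 2005, 100.0, "TE6b"),
    ("UNIT_A", 2006, 101.0, "13"),
    ("UNIT_A", 2007, 102.0, "1"),
    ("UNIT_A", 2008, 103.0, None),   # value present, label forgotten
]
described_in_metadata = {"1", "13"}  # TE6b has not been described yet

missing_descriptions = undescribed_labels(rows, described_in_metadata)
unlabelled = values_without_label(rows)
print(missing_descriptions, unlabelled)
```

Running this kind of check each time a value is added helps to describe new labels immediately rather than at the end of the project.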
Figure 5.3. Description of the label 1 in the metadata
This figure shows the metadata associated with the label 1. The data source is Eurostat and the data have not
been estimated (false value in the estimation field). Since Eurostat tables are regularly updated,
a good practice is to specify the date of upload of the table (2011-07-26 in this case)
and its precise name (demo_r_gind3).
Figure 5.4. Description of the label 13 in the metadata
This figure shows the metadata associated with the label 13. These data come from the Danish National
Statistical Institute. As a consequence, the label must not be the same as the one related to Eurostat
data (label 1).
Figure 5.5. Description of the label TE6b in the metadata
This figure shows that the data related to the label TE6b have been estimated (true value in the estimation
field). When data are estimated, it is very important to describe in the methodology fields (description,
formula or URI) how the estimation was conducted.
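The link between the estimation field and the methodology fields can also be verified automatically. The following Python sketch assumes that the source metadata are held as plain dictionaries keyed by label; the key names (`estimation`, `description`, `formula`, `uri`) are hypothetical stand-ins for the corresponding template fields, and the description text is a placeholder.

```python
from typing import Dict, List

METHODOLOGY_FIELDS = ("description", "formula", "uri")  # hypothetical key names


def estimation_problems(sources: Dict[str, dict]) -> List[str]:
    """Labels flagged as estimated whose methodology fields are all empty."""
    problems = []
    for label, source in sources.items():
        has_methodology = any(
            str(source.get(field) or "").strip() for field in METHODOLOGY_FIELDS
        )
        if source.get("estimation") and not has_methodology:
            problems.append(label)
    return sorted(problems)


sources = {
    "1": {"estimation": False},
    "13": {"estimation": False},
    "TE6b": {"estimation": True, "description": ""},  # methodology still missing
}
problems_before = estimation_problems(sources)

# Placeholder text; the data producer writes the actual description.
sources["TE6b"]["description"] = "Temporal retropolation (method to be detailed)"
problems_after = estimation_problems(sources)
print(problems_before, problems_after)
```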
Appendix A. Data Flow Process of the Key Indicators
Figure A.1. Data Flow Process: Upload and Syntactic Check
The syntactic check runs automatically while the data file is uploaded to the portal.
Figure A.2. Data Flow Process: Semantic Check
The semantic check step is an expert review. It results in a report and, optionally, a
corrected data file proposing improvements and suggestions regarding the content of the delivery.
Figure A.3. Data Flow Process: Semantic Check Approval
When the M4D contact team has delivered the report on the semantic check, the TPG is notified.
The TPG is invited to consult the report and can then choose either to fix its delivery or to forward it to the next
step of the integration.
Figure A.4. Data Flow Process: Outliers Check
This step mainly consists of detecting outliers and checking the quality of the data. An outliers report is
delivered at the end of this expert review.
Figure A.5. Data Flow Process: Outliers Check Approval
When NCG has delivered the outliers report, the TPG is notified. The TPG is invited to consult the report,
then to decide whether to continue the integration process or to review its data.
Figure A.6. Data Flow Process: ESPON CU Agreement
At this step, ESPON CU is notified and invited to consult the delivery reports, in order to
decide whether or not to integrate the TPG delivery into the ESPON Database.
Figure A.7. Data Flow Process: Integration
Last step of the integration: once ESPON CU has approved the integration, one click integrates
the data into the database.
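The seven steps shown in Figures A.1 to A.7 can be summarised as a small state machine. The Python sketch below is a reading aid, not a description of how the ESPON Data Portal is implemented: the names `Step`, `TRANSITIONS` and `advance` are hypothetical, and sending a delivery back to the upload step when the TPG fixes or reviews its data is an assumption. What happens when ESPON CU declines the integration is not described in this appendix, so no transition is modelled for that case.

```python
from enum import Enum


class Step(Enum):
    UPLOAD_AND_SYNTACTIC_CHECK = "A.1"
    SEMANTIC_CHECK = "A.2"
    SEMANTIC_CHECK_APPROVAL = "A.3"
    OUTLIERS_CHECK = "A.4"
    OUTLIERS_CHECK_APPROVAL = "A.5"
    ESPON_CU_AGREEMENT = "A.6"
    INTEGRATION = "A.7"


# Allowed moves. Returning to the upload step after a fix or a review
# is an assumption of this sketch.
TRANSITIONS = {
    Step.UPLOAD_AND_SYNTACTIC_CHECK: {Step.SEMANTIC_CHECK},
    Step.SEMANTIC_CHECK: {Step.SEMANTIC_CHECK_APPROVAL},
    Step.SEMANTIC_CHECK_APPROVAL: {Step.OUTLIERS_CHECK, Step.UPLOAD_AND_SYNTACTIC_CHECK},
    Step.OUTLIERS_CHECK: {Step.OUTLIERS_CHECK_APPROVAL},
    Step.OUTLIERS_CHECK_APPROVAL: {Step.ESPON_CU_AGREEMENT, Step.UPLOAD_AND_SYNTACTIC_CHECK},
    Step.ESPON_CU_AGREEMENT: {Step.INTEGRATION},
    Step.INTEGRATION: set(),
}


def advance(current: Step, target: Step) -> Step:
    """Move to `target` if the process allows it, otherwise raise ValueError."""
    if target not in TRANSITIONS[current]:
        raise ValueError(f"cannot move from {current.name} to {target.name}")
    return target


# Walk through the path in which every check is accepted.
state = Step.UPLOAD_AND_SYNTACTIC_CHECK
for next_step in (
    Step.SEMANTIC_CHECK,
    Step.SEMANTIC_CHECK_APPROVAL,
    Step.OUTLIERS_CHECK,
    Step.OUTLIERS_CHECK_APPROVAL,
    Step.ESPON_CU_AGREEMENT,
    Step.INTEGRATION,
):
    state = advance(state, next_step)
print(state.name)
```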
Appendix B. References
[1] Anton Telechev and Benoit Le Rubrus. ESPON Data and Metadata Specification. Full text in HTML [http://database.espon.eu/metaspecifs] (last visit: 2012-05-20).
[2] Claude Grasland and Ronan Ysebaert. ESPON Technical Report - The Core Database Strategy – A new paradigm for data collection at regional level. December 2011.
[3] LIG STeamer. ESPON Database Web Application. Version February 2012. [http://database.espon.eu] (last visit: 2012-05-20).
[4] ESRI. ESRI Shape File Technical Description. An ESRI White Paper - July 1998. Full text in PDF [http://www.esri.com/library/whitepapers/pdfs/shapefile.pdf] (last visit: 2012-03-23).
Appendix C. About
This document is part of the ESPON 2013 Database Phase 2 project, also known as M4D (Multi
Dimension Database Design and Development). It was generated on 2012-06-25 at 15:27:38, from
the sources of the m4d forge imag project at svn rev 553.
This document has been written by UMS RIATE [http://www.ums-riate.fr] (Claude Grasland, Isabelle
Salmon, Ronan Ysebaert, Nicolas Lambert, Timothée Giraud, Antoine Laporte) and LIG STeamer [http://steamer.imag.fr] (Jérôme Gensel, Marlène Villanova-Oliver, Anton Telechev, Benoit Le
Rubrus) M4D Partners.
For any comment, question or suggestion, please contact <[email protected]>.
Colophon
Based on DocBook technology 1, this document is written in XML format. Sources are validated with
DocBook DTD 4.5CR3, then transformed to HTML and PDF formats using DocBook
xslt 1.73.2 stylesheets. The generation of the documents is automated thanks to the docbench
LIG STeamer project, which is based on Ant 2, Java 3, and the processors Xalan 4 and FOP 5. Note that the standard XSLT
stylesheets are customized in order to get a better image resolution in the generated PDF output
for admonition icons: the generated size of these icons was changed from 30 to 12 pt.
1 [on line] DocBook.org [http://www.docbook.org] (last visit: July 2011)
2 [on line] Apache Ant - Welcome. Version 1.7.1 [http://ant.apache.org] (last visit: July 2011)
3 [on line] Developer Resources For Java Technology [http://java.sun.com] (last visit: July 2011). Version 1.6.0_03-b05.
4 [on line] Xalan-Java Version 2.7.1 [http://xml.apache.org/xalan-j/] (last visit: 18 November 2009). Version 2.7.1.
5 [on line] Apache FOP [http://xmlgraphics.apache.org/fop/download.html] (last visit: July 2011). Version 0.94.