How To Deliver My Data?
Ronan Ysebaert
UMS RIATE
Draft

Abstract
This technical report describes what is expected for the integration of data from the different ESPON projects.

Table of Contents
1. Introduction
2. The Key Indicators
  2.1. Concepts behind the key indicators delivery
    2.1.1. RULE 1 – LIMITED NUMBER OF INDICATORS
    2.1.2. RULE 2 – INNOVATIVE INDICATORS
    2.1.3. RULE 3 – HIGH LEVEL OF METADATA
    2.1.4. RULE 4 - PROMOTE THE CORE DATABASE STRATEGY
    2.1.5. RULE 5 – A GOOD COMPLETENESS OF THE INDICATOR
  2.2. Key Indicators Delivery
    2.2.1. XLS Template With Examples
    2.2.2. ESPON Data and Metadata Specifications
    2.2.3. Frequently Asked Questions (FAQ)
    2.2.4. Collected Data Estimation Methods
  2.3. The Data Delivery Process
    2.3.1. Data and metadata upload
    2.3.2. Syntactic check
    2.3.3. Semantic check
    2.3.4. Quality control
    2.3.5. Integration into the database
  2.4. M4D Support to ESPON TPGs
    2.4.1. Beginning of the project: guidance phase
    2.4.2. During the project: help for data creation
    2.4.3. End of the project: help for integration
3. The Zoom-in Delivery
  3.1. The Zoom-in Delivery Strategy
  3.2. Expected delivery
    3.2.1. Data file
    3.2.2. Geometry file
    3.2.3. Documentation file
  3.3. What happens to my data? The zoom-in data integration process
  3.4. Support to TPG producing zoom-in data
4. The Background Data of the Database
  4.1. Strategy for the Background Data
  4.2. Expected delivery
    4.2.1. Indicator description
    4.2.2. Source description
  4.3. What happens to my Data?
5. Conclusion and Advice
  5.1. Advice for a perfect management of the data process
  5.2. A good practice for filling data and metadata
A. Data Flow Process of the Key Indicators
B. References
C. About

List of Figures
1.1. Three possibilities to deliver my data
2.1. On-line availability of the xls templates
2.2. Header of the ESPON Data and Metadata Specification
2.3. Excel Data Model for Priority 1 projects
2.4. Example of the Label field description
2.5. Header of the FAQ
2.6. Example of an estimation method
2.7. Dataset Integration Tracking Details
2.8. Syntactic check: example of an invalid input
2.9. Syntactic check: example of a valid input despite warnings
2.10. Example of a Semantic Check Report
2.11. Semantic Check Example: Input Information
2.12. References for a Semantic Check Expertise
2.13. Semantic Check Example: Fixed Information
3.1. Overview page of Case Studies
3.2. Information Page of a Case Study
3.3. Data Template for Zoom-in Indicators and Project Database
3.4. Data Model Example for Zoom-in projects
3.5. Example of a Case Study Geometries Input
3.6. Mapping of Geometries Codes in Data
3.7. Case Study Dataset Sheet
3.8. Case Study Indicator Sheet
4.1. Project Database Metadata Sheet
4.2. Example of a Background Data Page
5.1. Starting point: a table with empty values
5.2. Resulting dataset with estimated values and associated labels
5.3. Description of the label 1 in the metadata
5.4. Description of the label 13 in the metadata
5.5. Description of the label TE6b in the metadata
A.1. Data Flow Process: Upload and Syntactic Check
A.2. Data Flow Process: Semantic Check
A.3. Data Flow Process: Semantic Check Approval
A.4. Data Flow Process: Outliers Check
A.5. Data Flow Process: Outliers Check Approval
A.6. Data Flow Process: ESPON CU Agreement
A.7. Data Flow Process: Integration

List of Tables
2.1. TPGs' M4D Contact Team

Chapter 1. Introduction

The ESPON Database 1 project (2008-2011) had great difficulty overcoming the heterogeneity of the information provided by ESPON Projects (integration of local data, integration of sophisticated indicators with little metadata description…). To improve this unsustainable situation, the M4D Project has tried to define more precisely what is expected from ESPON Projects in terms of data deliveries.

The ESPON P1, P2 and P3 Projects are obliged to deliver all data collected and produced within their project.
These data should be delivered in three forms:
• Key indicators, covering the entire ESPON Space;
• Zoom-in data, covering the case studies;
• Background data, covering all data produced by the project.

As Figure 1.1 shows, the data flow depends on the nature of the delivery.

Figure 1.1. Three possibilities to deliver my data
This figure shows the three ways in which data can be integrated into the database, depending on its nature.

This document provides useful information on these three types of delivery:

• THE KEY INDICATORS, further described in the chapter entitled The Key Indicators
The key indicators are innovative indicators that are highly relevant for policy making and should cover the entire ESPON Space (EU27+4). These indicators will be the only ones searchable from the query interface. In principle, ESPON projects deliver the indicators related to the maps included in Part B of the (Draft) Final Report, i.e. around 10 indicators. If a typology or composite indicator is included, the data and methodology used to build it should also be delivered. The data and metadata requirements are high for this delivery, and ESPON Projects are requested to upload the data via the Upload page [http://database.espon.eu/login]. The key indicators delivery has to follow the ESPON Data and Metadata specifications. To build a strong and efficient query interface, these indicators will be checked in depth before integration. This process includes three steps:
1. Syntax check, metadata format analysis (are all the mandatory fields filled?);
2. Semantic check, metadata content analysis (are the metadata understandable?);
3. Outlier detection (are there unusual values in the dataset?).
• ZOOM-IN DATA, further described in the chapter entitled The Zoom-in Delivery
Besides the key indicators delivery, some ESPON Projects (in particular Targeted Analyses, but not only) analyze specific territories of the ESPON Area at local scale. To make this complementary and very interesting data easily accessible, a case-study interface will be developed. To set up this interface, the projects are requested to deliver their most representative data, their geometries (in shapefile format) and a documentation file describing the content of the data and geometries (following a dedicated template). For this delivery, the M4D project will only check whether the data can be mapped and whether all mandatory fields of the documentation file are correctly filled.

• BACKGROUND DATA, further described in the chapter entitled The Background Data of the Database
To fulfil their contractual obligations and to make all data available as a coherent set, each ESPON Project has to deliver a zip file that contains all data, metadata and geometries (if different from the usual ones delivered via ESPON) used in the project. This zip file is considered an annex to the final report of the project and is stored on the project page of the ESPON Website.

This document also offers useful advice in Conclusion and Advice and template files in Data Flow Process of the Key Indicators.

Chapter 2. The Key Indicators

The delivery of the 10 best indicators is probably the most restrictive one: since ESPON is a community where knowledge and material are shared, it needs to define some basic rules to ensure the harmonization of the ESPON identity. Of course, this concerns reports (50 pages maximum per report, following some required typographic styles), maps (following the map-kit template) and reporting (inception report, interim report(s), draft final report, final report). It also concerns data and metadata.
To be useful for ESPON projects and other end-users, data should always be accompanied by metadata, including information about their quality and sources. It is also particularly important that the metadata comply with international (ISO) and European (INSPIRE) standards, so as to ensure the use of the database in the longer run and to make it compatible with other national and international database initiatives.

To ensure correct data processing and integration into the ESPON 2013 Database, the ESPON Metadata Specifications provided by the M4D project must be carefully respected by all the data providers participating in the project and by the organizations or persons who intend to create new software implementations interacting with the ESPON Database.

The ESPON Metadata model is relatively complex, but quite complete. As a result, metadata creation in ESPON is a large amount of work BUT only concerns a limited number of indicators. This implies that TPGs should take it into consideration from the very beginning of the implementation of the project.

In Section 2.1 of this chapter, we first describe the concepts behind the key indicators delivery, or "What shall I deliver?". In Section 2.2, we detail the resources you can use to deliver your data, "How shall I deliver my data?". Finally, Section 2.3 is dedicated to the data flow process, or "What happens to my data?".

2.1. Concepts behind the key indicators delivery

Before delivering the key indicators, five basic rules are to be kept in mind. The M4D Project has defined these rules in order to give a common understanding of the future content of the ESPON Database and to avoid the integration of too much heterogeneous information. This is the only way to build a database that can be managed in the future. The five basic rules are described below with concrete examples of good and bad practices.

2.1.1. RULE 1 – LIMITED NUMBER OF INDICATORS

Each ESPON Project has to choose 10 key indicators covering all the ESPON Area at NUTS level.
With this basic rule, we want to limit the discrepancy between projects that deliver hundreds of indicators (residuals of statistical models, generally not very well explained in metadata) and other projects that deliver few indicators, buried in a monstrous information flow. In general terms, we prefer to include in the database a single indicator with real added value, rather than hundreds of indicators which may never be queried by users of the database.

Good practices:
• If the main result of the project is a typology, please provide all the indicators used for calculating it (e.g. if the typology is based on population, age group 20-39, age group 65+, natural population increase and net migration, deliver all these indicators if they are not already included in the database).
• Deliver indicators that could be helpful for the ESPON Community in the future, e.g. policy-makers, researchers and practitioners.

Bad practices:
• Deliver GDP per capita and all its statistical derivatives (which could be calculated automatically later): GDP per capita EU27=100, GDP per capita ESPON Area=100, etc.
• Provide all the residuals of a complex statistical model.

2.1.2. RULE 2 – INNOVATIVE INDICATORS

In the past, the ESPON M4D project has received ten indicators describing total population in 2006! This kind of situation makes the database impossible to use (which indicator should be downloaded?). This is why, in the key indicators delivery, we kindly ask the projects to propose innovative indicators that are not yet in the ESPON Database.

Good practices:
• Before collecting data, look into the ESPON Database to check whether the indicators you are looking for are already available.
• If mistakes are detected in the ESPON Database, please notify the ESPON M4D team and propose a revision of the dataset.
Bad practices:
• Deliver an indicator already contained in the database without explaining the added value of the indicator you propose (estimations of better quality, mistakes corrected).

2.1.3. RULE 3 – HIGH LEVEL OF METADATA

The metadata related to indicators must be very well explained. If you propose indicators derived from statistical analysis or models, make sure your data is understandable by non-specialist users!

Good practices:
• Take time to correctly fill each field of the metadata model.
• Reference all the sources you use to create your dataset. In that way, the user will be able to tell which data come from official data sources (Eurostat, national statistical institutes, ...) and which ones you have estimated. The total population 1990-2010 file, available in the ESPON Database, is a good example of a systematic description of the data source.
• Make sure that it is possible to rebuild the indicator you propose in the database. Use the methodology property (part 1.7.2 of the specifications [1]) to describe your calculation methodology.
• Attach methodological notes to your data delivery (field URI of the indicator description).

Bad practices:
• Put in the methodology field of the indicators: “cf Final Report for further explanations”…
• Deliver indicators that can never be updated in the future without the know-how of your TPG (e.g. composite indicators based on a data model which is your property and cannot be disseminated).
• In the source part of the metadata, mention your project as data provider (generally the dataset is a combination of data coming from Eurostat, national sources and estimations).

2.1.4. RULE 4 - PROMOTE THE CORE DATABASE STRATEGY

Beyond the key indicators, each project can suggest the inclusion in the "Core Database" of indicators of interest for territorial monitoring (time series, added value for the database), which could be updated and maintained in the future, outside your project.
Good practices:
• The M4D Project proposes total population at NUTS 0, 1, 2 and 3 levels for the period 1990-2010. A good practice could be to extend this temporal coverage to the period 1980-2010.
• The M4D Project proposes age structure data (5-year age classes) at NUTS 0, 1 and 2 levels. A good practice could be to extend the hierarchical coverage to the NUTS 3 level.
• The M4D Project has collected total area and population for the UMZ (Urban Morphological Zones): extend the thematic coverage to other indicators (land use, etc.).

Bad practices:
• Deliver a derived indicator (for example, unemployment rate) without delivering the count data behind this indicator (e.g. unemployed population and active population).
• Deliver a dataset with a high number of missing values.

2.1.5. RULE 5 – A GOOD COMPLETENESS OF THE INDICATOR

At the moment, the ESPON Database supports several nomenclatures: the NUTS division in its 1995, 1999, 2003, 2006 and 2010 revisions for the ESPON Area, and the United Nations division for world countries. Whatever the nomenclature used, the degree of completeness of the indicator must be relatively good. Ideally, most of the missing values should be estimated, with a description of the method used. To that end, the M4D project has written a guidance paper proposing a set of estimation methods [2].

The key indicators concern Applied Research projects (ESPON Priority 1) and projects from the Scientific Platform (Priority 3). For Targeted Analyses, most of the data will be integrated in the zoom-in interface (cf. The Zoom-in Delivery).

Good practices:
• If Eurostat (the main data provider) does not provide data for some territorial units of the 10 best indicators, check whether the data exist in external data sources (National Statistical Institutes).
• When no data is available, estimate it and systematically document in the metadata the methodology used for the estimation.
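As a rough illustration of the completeness requirement, the sketch below computes the share of territorial units that have a value and lists the units that still need an external source or an estimation. The function name, the use of None for a missing value and the sample figures are assumptions made for this example only; they are not part of the ESPON specifications.

```python
from typing import Dict, List, Optional, Tuple


def completeness(values: Dict[str, Optional[float]]) -> Tuple[float, List[str]]:
    """Return (share of units with a value, sorted list of units without one)."""
    if not values:
        raise ValueError("the indicator has no territorial units")
    # A unit counts as missing when its value is None (assumption of this sketch).
    missing = sorted(unit for unit, value in values.items() if value is None)
    share = (len(values) - len(missing)) / len(values)
    return share, missing


# Made-up values keyed by territorial unit code, for illustration only.
sample = {"BE10": 1089.5, "BE21": 1744.9, "BE22": None, "BE23": 1432.3}
share, missing = completeness(sample)
print(f"completeness: {share:.0%}, missing units: {missing}")
```

Running such a check before upload gives a quick list of the units to look for in national sources or to estimate, each estimation then being described in the metadata.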
Bad practices:
• Deliver data for three countries of the ESPON Area: in this case, use the zoom-in interface (part 2 of the Technical Report).
• Provide no description of the estimation made.

2.2. Key Indicators Delivery

This section details the expected deliveries and the resources available to fill in ESPON data and metadata.

To ensure an efficient way of creating data and metadata in the ESPON format, the M4D Project has produced some useful guidance documents (available from the help menu of the ESPON Database Web site at http://database.espon.eu [http://database.espon.eu/]). As the M4D Project is still working on improving the Database interface, documentation and tools will be improved in the next steps of our project (for example, with the availability of an on-line metadata editor). Updates and news will be sent regularly to the people concerned.

2.2.1. XLS Template With Examples

As shown in Figure 2.1, under the "Upload" menu of the ESPON Database Web site (login required), an XLS template fully compatible with the ESPON Metadata specifications [1] is available for download. It contains all the required information described in the metadata specification (cf. part 1b2) and is structured in four parts:
1. the dataset sheet (information related to the dataset)
2. the indicator sheet (information related to the indicator)
3. the source (information related to the data source)
4. the data (contains ID and data)

This XLS template is the current solution for integrating data into the database. A metadata editor is under construction to ease the data integration process.

Figure 2.1. On-line availability of the xls templates
The data and metadata templates, available from the upload part of the ESPON Database Web site.

2.2.2. ESPON Data and Metadata Specifications

The document entitled ESPON Data and Metadata Specification [1], whose header is shown in Figure 2.2, is the reference document for the Priority 1 Projects datasets. It specifies the metadata model. Firstly, it describes the generic conceptual model of the ESPON Metadata (called the Abstract Metadata Model). Secondly, it presents the implementation of the abstract model using the international standards (ISO-19115 and INSPIRE Directive). Finally, it explains the implementation of the abstract model in a tabular file format.

Figure 2.2. Header of the ESPON Data and Metadata Specification
This figure shows the header of the on-line HTML document, available on the ESPON Database Portal [3].

Please find below some advice on using these specifications:
• Do not be intimidated by the 150 pages of the paper version of the document! From the user's point of view, the first, second and third parts of the metadata model specifications explain the same topic in different ways (conceptually, in an XML version, and in a tabular version, e.g. Excel): the description of all the fields of the ESPON Metadata model.
• To begin with, we strongly advise you to read carefully the introduction of the Metadata specifications, which explains the main concepts, and the third part, which shows the tabular model and all the fields to be filled, with concrete examples.
• Download the metadata template (requires login) from the "Upload" menu (see Figure 2.1). Fill in your metadata on the basis of this .xls document. For example, Figure 2.3 shows how colors and comments in this template help with filling in the cells. When something is not clear, please refer to the metadata specifications: as an example, Figure 2.4 shows the description of the Label field.

Figure 2.3 and Figure 2.4 together illustrate good practice in using the metadata specifications.

Figure 2.3. Excel Data Model for Priority 1 projects
I want to reference my data. First of all, I want to know what kind of information is mandatory. To the right of each cell, a description box (in red on the figure) helps me answer this question. Each cell colored in green needs to be filled. When going to the source part of the metadata template, I do not understand the meaning of the label field (in orange). Looking to the right of the cell, I can see that this element is described in part 1.6.1 of the Specifications. In the ESPON Metadata specifications, shown in Figure 2.4, the label property gives a full description of the element.

Figure 2.4. Example of the Label field description
This figure is an extract of the specification. It shows the description of the Label field.

The stabilization of these Metadata Specifications is the first step towards feeding the ESPON Database. The ESPON M4D team is now working on a metadata editor that will let you generate your metadata easily and dynamically, without using the XLS template. Once it is available, this guidance paper will be updated.

2.2.3. Frequently Asked Questions (FAQ)

In the past, the M4D project has had to answer many questions regarding data integration. We have tried to capitalize on all these exchanges by writing a FAQ, available on-line from the help menu of the Web application [3] since February 2012. As shown in Figure 2.5, questions are ordered by topic:
1. What is M4D?
2. Content of the database
3. Access to the database
4. Data delivery
5. Metadata process
6. Support to data creation
7. Mapkit
8. Local/urban data

Figure 2.5. Header of the FAQ
This figure shows the header of the FAQ available on-line from the Help menu of the Web application [3].

2.2.4. Collected Data Estimation Methods

One of the M4D Technical Reports, entitled "The Core Database Strategy, a new paradigm for data collection" (see annex in [2]), proposes a general strategy named ESTIM for data collection at regional level.

An interesting added value of the document is a dictionary of estimation methods adapted for non-specialists (pp 70-103), inspired by the Data Navigator 2 framework produced within the ESPON 3.2 Project (2007). Among other uses, this dictionary has been used to estimate missing values of the total population 1990-2010 dataset. This document will be updated depending on user feedback.

The aim of this document is twofold: first, to formalize data estimation procedures for commonly encountered situations, explaining step by step the methodology employed for estimating data by using the ESTI terminology; and second, to provide the information needed to correctly fill the ESPON metadata.

Each estimation method (an example is shown in Figure 2.6) is organised as a synthetic sheet, giving the conditions of use of the estimation method, a graphic illustration of the situation, a textual explanation (what is described in the methodology field of the metadata source), a mathematical formalization and an example of use. One of the key points here is to let the user know how the data has been estimated.

Figure 2.6. Example of an estimation method
This figure is a screenshot extracted from [2], showing the output of an estimation method based on time retropolation and space harmonization.

2.3. The Data Delivery Process

This section aims to answer the following question: "What happens to my data?" The data integration process aims to apply consistent quality control to the datasets delivered by ESPON projects. This process is divided into 5 steps.
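To fix ideas, the five steps can be represented as an ordered sequence. The sketch below is only an illustration: the step names are taken from the titles of subsections 2.3.1 to 2.3.5, while the Step class, next_step and progress are hypothetical names and do not describe how the ESPON Tracking Tool is implemented.

```python
from enum import IntEnum
from typing import Optional


class Step(IntEnum):
    """The five steps, named after subsections 2.3.1 to 2.3.5."""
    UPLOAD = 1
    SYNTACTIC_CHECK = 2
    SEMANTIC_CHECK = 3
    QUALITY_CONTROL = 4
    INTEGRATION = 5


def next_step(current: Step) -> Optional[Step]:
    """Return the step that follows `current`, or None after integration."""
    if current is Step.INTEGRATION:
        return None
    return Step(current + 1)


def progress(current: Step) -> str:
    """One-line status: steps up to and including `current` are ticked."""
    return " > ".join(
        f"[x] {step.name}" if step <= current else f"[ ] {step.name}"
        for step in Step
    )


print(progress(Step.SYNTACTIC_CHECK))
```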
When the TPG integrates its key indicators, it activates a dedicated module in the ESPON Data Portal ("Upload" menu): the Tracking Tool. The tracking tool is being developed to follow the progress of the data integration process (Figure 2.7). Please note that this tool requires login. For further information about the integration workflow (Who? When? etc.), please consult Data Flow Process of the Key Indicators.

Figure 2.7. Dataset Integration Tracking Details
This screen (work in progress) shows details of the completed and pending activities concerning the dataset integration. The "Semantics" and "Outlier detection" reports of this dataset are available here.

The main steps of the data integration process are described in the following sub-sections.

2.3.1. Data and metadata upload

When a project is ready to deliver its data and metadata, it activates a dedicated module in the metadata editor. This means that the data integration process has started and that the tracking tool (on the ESPON Database Portal) has been activated. A notification is sent to the ESPON Coordination Unit and to the M4D team in charge of the TPG.

2.3.2. Syntactic check

The syntactic check step checks the compliance of the delivered data and metadata with the specification. In concrete terms, it checks whether all the mandatory fields of the ESPON data and metadata are correctly filled. This control is performed automatically when the project uploads its datasets from the "Upload" menu of the Web application. This is the only compulsory step of the data integration process. Once successfully checked, the dataset is saved on the server. A notification is sent to the ESPON CU and to the M4D team in charge of the next step.

The syntactic check step is performed on all uploaded datasets. As shown in Figure 2.8, the page displays all the information necessary to fix any syntactic errors or warnings.
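To make the idea of a mandatory-field check concrete, here is a minimal sketch. It is not the parser used by the ESPON Database: the field names in MANDATORY and OPTIONAL are hypothetical placeholders, and the rule that an empty optional field yields a warning is an assumption of this example. Only the ERR and WRN prefixes, and the fact that errors block this step while warnings do not, are taken from this section.

```python
from typing import Dict, List, Tuple

# Hypothetical field names; the real list is defined in the specifications [1].
MANDATORY = ("name", "temporal_extent", "lineage")
OPTIONAL = ("keywords",)


def syntactic_check(metadata: Dict[str, str]) -> Tuple[bool, List[str]]:
    """Return (is_valid, log lines); only ERR lines make the check fail."""
    log: List[str] = []
    for field in MANDATORY:
        # A mandatory field that is absent or blank is reported as an error.
        if not metadata.get(field, "").strip():
            log.append(f"ERR The '{field}' property is null.")
    for field in OPTIONAL:
        # Assumption of this sketch: a blank optional field is only a warning.
        if not metadata.get(field, "").strip():
            log.append(f"WRN No value found for '{field}'.")
    is_valid = not any(line.startswith("ERR") for line in log)
    return is_valid, log


ok, log = syntactic_check({"name": "Total population", "lineage": "Eurostat"})
print("valid" if ok else "invalid")
for line in log:
    print(line)
```

Keeping the verdict separate from the log lines mirrors the behaviour described here: a dataset with warnings only is still valid and can move on to the next step.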
Three types of messages are displayed in the log boxes:
• The INF prefix indicates an information message, e.g. some information about the syntactic check process.
• The WRN prefix indicates a warning message. Warning messages are triggered for ambiguous values that may be problematic during the next steps of the integration. Nevertheless, warning messages do not make the syntactic check fail. As shown in Figure 2.9, the TPG is invited to review its dataset if needed, but it can also submit it to the semantic check.
• The ERR prefix indicates an error message. Error messages refer to missing values or errors in mandatory fields of the metadata. These errors force the user to revise the dataset, which cannot pass this step and continue the integration process until they are fixed.

Figure 2.8. Syntactic check: example of an invalid input
This screen shows the information messages (prefixed with [INF]), warning messages ([WRN]) and error messages ([ERR]) returned by the syntactic parser.

Example:
WRN No value found for the indicator 'IXP'. Skipping data validation for this
ERR The 'Temporal Extent' property is null.
ERR The 'Dataset Information' element is not valid.
ERR The 'Temporal Reference' element is not valid.
ERR Unable to check the global temporal extent, because it is null.
ERR The 'Temporal Reference' property is not valid.
ERR The 'Temporal Reference' property is not valid.
ERR The 'Lineage' property is not valid.

Figure 2.9. Syntactic check: example of a valid input despite warnings
This screen shows that the uploaded file is valid (no errors) but still contains warnings. The user can pass this step or fix the dataset by clicking the respective buttons at the bottom of the page.

2.3.3. Semantic check

After the syntactic check step, the dataset is transferred to the M4D contact team in charge of the TPG. This step analyzes the content of the data and metadata (in particular the free-text fields).
The aim of this step is to analyze whether all the indicators of the dataset are correctly described and understandable by a wide audience. The outcome of this expert check is a semantic report. Note that this semantic report does not block the data integration process, but the project is invited to consult it and to decide either to continue the integration process or to fix its dataset according to this review. An example of such a semantic report, filled with annotations, warnings and remarks, is shown in Figure 2.10.

Figure 2.10. Example of a Semantic Check Report
This extract of an example semantic check report proposes annotations, remarks and suggestions beside problematic cells. Further details are given below.

Concretely, the semantic check is composed of two files: a report and a proposal of correction. The report (as shown in Figure 2.10) contains the following information:
• First lines: who did the check and when.
• First column: description of the error(s) detected, by sheet (dataset, indicator, and source). For instance it could be “the indicators should be better described”, “the methodology of calculation of the indicator should be specified more precisely”, “keywords are not adapted to the indicator”, etc.
• Second/third column: description of the location of the error in the metadata (name of the indicator/label etc.)
• Last column: action made on the metadata. Three cases are possible:
  • the deletion of the information (a bad keyword…);
  • detected mistakes are corrected by the M4D Contact team (clarification of the name of an indicator);
  • when it is impossible to correct the information, the following comment is displayed: the project is strongly advised to clarify the information.

The proposal of correction is a new metadata file. This step is an expert review.
In other words, if the TPG is not able (or does not want) to correct its metadata, the dataset can be submitted to the next step of the integration process. The following screenshots illustrate an example of the semantic check review performed by the M4D Team on a problematic dataset. Figure 2.11 shows the initially received information. Figure 2.12 shows the documents consulted to help understand and fix the received information. Figure 2.13 shows the proposal of correction returned to the TPG. The M4D contact team is not in charge of filling in this kind of information! We support you in the process but please make sure that your delivered indicators are understandable by external users!

Figure 2.11. Semantic Check Example: Input Information
This figure shows a lack of information in the initially received metadata. This kind of description (4-digit classes) is not enough to understand how the indicator has been built.

Figure 2.12. References for a Semantic Check Expertise
This figure shows the material available (TPG report) to complete the missing information.

Figure 2.13. Semantic Check Example: Fixed Information
This figure shows the fixed information returned to the TPG in the proposal of correction document.

2.3.4. Quality control

At this stage, an outlier detection tool will run on the key indicators. The aim of this check is to provide an expert review of unusual values contained in the dataset according to various statistical tests (statistical outliers, spatial outliers). Like the semantic check, this is an expert review. The outcome of this step is the outlier report. ESPON TPGs can validate or not the result of the check after consulting this report. The conceptualization of this check is still in progress. Until its implementation, the uploaded data passes to the next step.

2.3.5.
Integration into the database

The previous checks and steps of the dataflow give us a sound view of the quality of the datasets delivered by projects. Before integrating a dataset into the database, the ESPON M4D Project first needs the agreement of both the ESPON project and the ESPON Coordination Unit. This validation agreement is mainly based on the provided reports (semantic/outlier). After its integration into the database, it will be possible to dynamically query the database composed of the 10 best indicators through the search interface. Well-described metadata give a real added value to the indicators.

2.4. M4D Support to ESPON TPGs

There are two critical phases during the lifetime of an ESPON Project:
• The beginning, when the project has to find the material and guidance for starting its investigations: mapkit, basic data, ESPON metadata rules, understanding of the data process.
• The ending, when the project delivers its data and metadata in the specified ESPON format and goes through the checks.

To ensure the proper integration of ESPON Data in the expected format (e.g. key indicators strategy, good quality of data and metadata), continuous exchanges with ESPON TPGs are also strongly needed. The idea behind the follow-up of ESPON Projects is to help them as much as possible in their data creation process, and not to wait for the end of the project to discover mistakes in the data or metadata. Though the M4D Team can help with integration, please remember that it is not in charge of creating the ESPON TPGs' dataset files. Consequently, the follow-up of ESPON TPGs implies defining some tasks, which can be described according to the lifetime of the project.
On top of that, it is necessary to distinguish:
• ESPON projects under Priority 1 and 3, delivering data which will feed the web interface and following the "key indicators" principle (delivered for the ESPON area with high quality metadata);
• Case Studies data that will feed a dedicated part of the Web application, and for which the requirements are different.

The following sub-sections propose some guidelines for the different phases of a project.

2.4.1. Beginning of the project: guidance phase

Main issues to be taken into account:
1. Ensuring that each Priority 1 Project has access to the entire ESPON Database (public and private part). This implies giving a login and password pair to each project.
2. Informing the ESPON TPG about the resources available in the database:
  • Guidelines concerning data and metadata
  • Mapkits
  • Technical reports
  • Data available in the database, geodatabases coming from Eurogeographics, etc.
3. Presentation of what is expected from the ESPON TPG at the end of the project, if it is still not clear from this technical report. Please respect "The key indicators principle" described in Section 2.1.
4. Presentation of the ESPON data and metadata templates and representative examples if needed.
5. Identification of the persons in charge of the data collection and creation in the ESPON Priority 1 project. It is always more efficient to be in touch with the engineers, who generally create the data, rather than with the scientific coordinator, who uses the data created in the project.
6. Explaining the data process at the end of the project (syntactic check, semantic check and quality control) and describing how the ESPON tracking tool manages the data flow, if it is still not clear from this technical report.
7. Presentation of the way to manage case-study data (cf. The Zoom-in Delivery).

2.4.2.
During the project: help for data creation

Issues to be taken into account during the project mainly concern the data creation:
• Answer each question asked by the TPG regarding data creation.
• Meet the project during each ESPON Seminar.
• Be present at at least one meeting of the project.
• Make the link with the ESPON M4D project.

2.4.3. End of the project: help for integration

Main issues to be taken into account at the end of the project:
• In case of problems with the syntactic check, the contact team helps the project to solve them.
• For each ESPON project, Table 2.1 shows the M4D contact team that is in charge of the semantic check.

Table 2.1. TPGs' M4D Contact Team

ESPON Project | M4D Contact Team
ATTREG, TRACC, SGPTDE | UMR Géographie-cités (FR), Anne Bretagnolle <[email protected]>
EU LUPA, ESaTDOR | Universitat Autònoma de Barcelona (ES), Roger Milego <[email protected]>
KIT, TERCO, SeGI | National Center for Geocomputation (IE), Martin Charlton <[email protected]>
ARTS | University of Iasi TIGRIS (RO), Alexandru Rusu <[email protected]>
TIGER | UMS RIATE (FR), Ronan Ysebaert <[email protected]>
All ESPON Priority 3 Projects (monitoring, map updates…) | Laboratoire d'Informatique de Grenoble LIG STeamer (FR), Jérôme Gensel <[email protected]>

Chapter 3. The Zoom-in Delivery

This chapter focuses on the networking activities with the Priority 2 projects (and more generally case study data). It is crucial for the ESPON database to integrate all the data and indicators provided in the framework of these projects, even if the information does not cover the whole ESPON space. However, the characteristics of data provided at local scale make a homogeneous integration of such information in the query interface described above impossible, for several reasons:
• Too precise information: One of the aims of the web interface is to provide datasets for the whole ESPON area.
Making indicators available for only a couple of NUTS2 regions at local scale would produce noise in the database.
• Heterogeneous nomenclatures: Some datasets can be produced in heterogeneous geographical delineations, outside the NUTS or the LAU nomenclatures (bassin de vie in France, Super Output Area in the UK). It would be very difficult to store all the provided nomenclatures in a systematic way.
• Too specific indicators: When analyzing territorial dynamics at local scale, some indicators of high interest may be collected for these case studies, but are of no use at the ESPON scale (for instance, the number of commuters going from Germany to Luxemburg in the Grande Region).
• Difficulty to easily identify what is available: As case studies multiply at a very local scale, providing an overview of what is available becomes a challenge. The query interface is clearly not adapted to this kind of request.

The storage of data coming from ESPON Priority 2 projects raised a lot of conceptual and practical problems, which have been solved by proposing an alternative solution to enter the data.

3.1. The Zoom-in Delivery Strategy

The ESPON M4D considers as a “zoom-in delivery” a dataset that does not cover the entire ESPON Area (EU27+4). It includes several cases:
• Local data for a region or a group of regions (e.g. Greater Manchester at LAU2 level, Ile-de-France at employment basin level, which is not included in the LAU nomenclature, etc.)
• Non ESPON Area and non ESPON Neighbourhood data (e.g. data on American, Brazilian or Japanese regions).

The M4D proposal consists of building a specific interface for querying such data. The data will be stored following a simple template (in a zip format, cf. Section 3.2 for further explanations) and will be downloadable from the two proposed pages shown in Figure 3.1 (overview) and Figure 3.2 (details).

Figure 3.1.
Overview page of Case Studies
This overview page of case studies is a proposal that will be further improved, but it presents some clear advantages for the users:
• A clear overview of the location of case studies produced within the ESPON Program.
• Data integration is not limited to Europe, and it is easy to integrate data coming from case studies outside Europe (USA, China, etc.)
• It is a simple solution for displaying in a homogeneous way the heterogeneity of the ESPON production.

Some options will be added in order to interact with the map (e.g. select only the location of case studies coming from a given project; select only the case studies located in a given country). Then, when selecting a project pin, the user is redirected to the case study information page shown in Figure 3.2. The pin-based solution for viewing case study data is certainly not the best way to display case studies in cross-border areas (Grande Région), large areas (North Calotte), etc. But given the heterogeneity of case study data and the difficulty of predicting in advance what kind of geometries could be proposed by ESPON Projects, the M4D Project has chosen this solution, which may be improved in a future version of the interface.

Figure 3.2. Information Page of a Case Study
This figure shows the information page of a case study, previously selected from the list in the Overview page (Figure 3.1).

Five main parts compose the page:
1. General information related to the ESPON TPG (aim of the data collection, contact, upload date of the datasets).
2. Data information: a listing of the available indicators, temporal extent of the indicators.
3. Geometries: location and name of the case study, nomenclatures used to collect data.
4. Data source: name of the data provider(s), URL, precautions for use.
5.
Downloads: this part of the page proposes to download separately the data (.zip format), the geometries (as a .zip), and the metadata page as a .pdf file. Note that the download rights may be specified and restricted, particularly for geometries that are not free to use, for example the Eurogeographics data.

3.2. Expected delivery

To feed the zoom-in interface and the metadata page, the M4D Project needs three main deliveries from the ESPON Projects: data, geometries and documentation. The following sub-sections describe each of these elements.

3.2.1. Data file

The format of the data file, shown in Figure 3.3, is not significantly different from the one proposed for the key indicators. An example of the data file is also given in Figure 3.4. The elements that differ from P1 projects are:
• The first two lines of the Excel sheet (the temporal extent and the ID of the indicator have been concatenated in a single cell)
• The source column, to the right of each indicator, has been deleted. This means that the source description is made at the level of the dataset.

The main elements of the data file are:
1. Code (first column): A code for the territorial units contained in the database (which has to be the same as the one displayed in the geometries).
2. Name (second column): Name of the territorial unit.
3. Object (third column): Object type (LAU1, LAU2, River Basin…)
4. Version (fourth column): Object type version (like NUTS versions). If the version is not relevant or not available for the dataset, put n/r (not relevant) or n/a (not available).
5. Indicator code (first line): Code of the indicators (concatenation of an identification “POP” and the year of reference “1990”)
6. Values: When data is not available, put n/a in the cell; when data is not relevant (e.g. location of harbour for non-coastal territorial units), put n/r in the cell.

Figure 3.3.
Data Template for Zoom-in Indicators and Project Database
This figure shows the content of the values sheet expected for the Case Studies project data file.

Figure 3.4. Data Model Example for Zoom-in projects
This figure shows an example of the expected data model for Case Studies projects.

3.2.2. Geometry file

In terms of geometries, the M4D Project expects georeferenced information (Figure 3.5) in the ESRI Shapefile format [4]. The .dbf file linked to a shapefile must contain at least a code (ID) identical to the one contained in the data files (Figure 3.6). Thus, it is possible for the user to:
1. Analyse the exact territorial coverage of each case study.
2. Build some maps thanks to the data gathered for each case study of the ESPON Community.

Geometries have to be delivered in a .zip archive whose filename is name_of_the_project_geom.zip.

Figure 3.5. Example of a Case Study Geometries Input
This figure is an example of the ESPON TeDi Project Case Study, available at LAU 2 level.

Figure 3.6. Mapping of Geometries Codes in Data
This figure shows the full correspondence between the codes of the geometries and data files.

3.2.3. Documentation file

The documentation file aims at providing the information that is finally available to end-users on the page shown in Figure 3.2. The file structure is inspired by the metadata specifications of the key indicators, with some simplifications and adjustments linked to the specificities of such a project delivery. In the xls template, mandatory fields must be filled in two sheets; these mandatory cells are indicated with a green background color in Figure 3.7 and Figure 3.8. The following sub-sections describe each of the sheets.

3.2.3.1. The dataset sheet

Figure 3.7. Case Study Dataset Sheet
This figure shows the dataset sheet of the TeDi Case Study data file. The green color shows mandatory fields.
The purple color shows optional fields.

The expected information in the dataset sheet is:
• Name: name of the delivery. It should give an idea of the dataset content. We encourage all dataset providers to produce the shortest and most meaningful dataset names that directly reflect the data semantics.
• Project: ESPON project in which the dataset was produced. This should be an acronym of one of the existing ESPON projects. If this property is not specified, the default project "ESPON 2013 Database" will be applied.
• Abstract: Free-text description of the contents of the dataset, written so as to make the aim of the case study understandable (both geographical coverage and thematic scope of the delivery).
• Access classification: Classification of the access rule applied to the dataset and to the geometries separately. Three possibilities can be mentioned in this field:
  1. unclassified - available for general disclosure (public access)
  2. restricted - not for general disclosure (for registered users only, e.g. belonging to the ESPON Program). This possibility has to be used when the geometries come from Eurogeographics, which cannot be distributed outside ESPON. But as far as possible, try to create your own geometries with no limitations of use…
  3. confidential - available for someone who can be entrusted with information (for the administrator of the database only, e.g. ESPON Coordination Unit and the ESPON Database administrator)
• Use restriction: Information useful to know for the future user of the dataset. These might be inconsistencies between indicator definitions (e.g. “be careful with the unemployment rate definition for Belgian territorial units”), in the content of the dataset (e.g. data are not available for the same year), etc.
• Responsible party: Organization or person responsible for the entire dataset. Name, organization and email contact are required.
• Metadata contact: Organization or person who created the metadata for the dataset.
Name, organization and email contact are required.
• Spatial binding: Describes the spatial link between the data part of the dataset and the territorial units used. Four elements are required: the name of the case study and the country it belongs to; the latitude and the longitude of the case study (by convention, we propose to use the center of the case study); and information related to the geographical level of analysis (nomenclature name and/or version and/or level). The number of case studies per dataset is not limited.

3.2.3.2. The indicator sheet

Figure 3.8. Case Study Indicator Sheet
This figure shows the indicator sheet of the TeDi Case Study data file. The green color shows mandatory fields. The purple color shows optional fields.

The expected fields in the indicator sheet are:
• Code: A short acronym that reflects the meaning of the indicator.
• Name: A short expression that reflects the meaning of the indicator.
• Abstract: The abstract of the indicator. This property must describe the indicator in a more extended way than the Name property does. The abstract must not merely repeat the name of the indicator, but provide additional information that is not given by the Name.
• Methodology description (optional): Describes the methodology used to produce indicator values. This methodology can concern a particular indicator independently of data sources or be specific to a particular source that provided indicator values (e.g. when a typology is produced, explain the clustering method used and the meaning of the values shown in the data file: 1 for decreasing; 2 for increasing).
• Methodology URI (optional): Reference to the resource where a detailed description of the methodology is given. This may be a reference to an online/paper publication or to the name of a file attached to the dataset.
If this property specifies a file name, it must be present in the package delivered to the data processors; otherwise the data provider will be requested to supply this file.
• Temporal extent: Groups temporal references of periods or instants covered by the values of an indicator in the dataset. When the indicator is available for several time periods (e.g. the DNS_1a indicator in figure 15), add several temporal extents.
• Provider: Refers to the data provider of the indicator value. The provider may be an institution or even a person who is the originator of the data. This property should not be confused with the reference to the publication source: the data provider is the actor who contributed to the data production or publication.
• Provider URI (optional): Official Uniform Resource Identifier (URI) of the data provider. In most cases, this is the URL (Internet address) of the data provider's site. This property must not represent a reference to the publication, but to the organization or the person who provided the data. For example, this property can take the value "http://ec.europa.eu/eurostat", which refers to the home page of Eurostat.
• Publication title (optional): Title of the publication or name of the source where data were taken from, if it exists (for instance "Switzerland Statistics Public Database").
• Publication URI (optional): Official Uniform Resource Identifier (URI) of the publication. In most cases, this is the URL (Internet address) where the data is available online or can be accessed or obtained (for instance http://www.espon.eu/reports/report001.pdf). This can also be an ISBN if the source is a paper publication.
• Publication reference (optional): Indicates the element of the referenced publication (page, part, chapter etc.) to refer to (for instance p. 50, chapter 2).
• Methodology description (optional): This property describes source-specific methodological details that make the data from this source distinct from the data coming from other sources of the dataset (for instance “coming from heterogeneous data providers, the data has been harmonized using Eurostat data”). Cf. the Technical Report on Core indicators, which proposes some examples of estimation methods.
• Methodology URI (optional): Reference to the resource where a detailed description of the methodology is given. This may be a reference to an online/paper publication or to the name of a file attached to the dataset. If this property specifies a file name, it must be present in the package delivered to the data processors; otherwise the data provider will be requested to supply this file.
• Copyright (optional): Text describing the copyright rules and/or restrictions applied to the data associated with this source. The default value of this property is "(c) ESPON 2013 Database".

3.3. What happens to my data? The zoom-in data integration process

At the moment, a zoom-in delivery must be sent to your TPG Project officer, the ESPON M4D manager (<[email protected]>) and the TIGRIS team (<[email protected]>). When zoom-in data is delivered, a compliance check is organised by the M4D Project (TIGRIS team in particular) in order to check that:
1. The codes of the territorial units contained in the geometries and the dataset are the same (is it possible to make a map?)
2. The geometries are georeferenced (is it possible to display the case study on the ESPON Mapkit?)
3. All the mandatory fields of the documentation file are correctly filled.

At the end of the compliance check, a notification is sent both to the coordinator of the project and to the project officer. This means that the zoom-in delivery will be available from the zoom-in interface, shown in Figure 3.1 and Figure 3.2.
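The first and third points of the compliance check can be sketched as follows. This is only an illustration of the logic, not the tool used by the TIGRIS team: the function names are hypothetical, and the code assumes that the territorial unit codes have already been read from the .dbf file and from the data file into plain Python lists.

```python
def compare_unit_codes(geometry_codes, data_codes):
    """Compare the territorial unit codes of the geometries and of the data file.

    Both arguments are iterables of codes (hypothetical inputs, already
    extracted from the .dbf file and from the first column of the data sheet).
    """
    geometry_set = {str(code).strip() for code in geometry_codes}
    data_set = {str(code).strip() for code in data_codes}
    return {
        # Units with data but no geometry cannot be mapped.
        "missing_in_geometries": sorted(data_set - geometry_set),
        # Units with a geometry but no data would appear empty on a map.
        "missing_in_data": sorted(geometry_set - data_set),
    }


def missing_mandatory_fields(documentation, mandatory_fields):
    """Return the mandatory documentation fields that are absent or empty."""
    return [
        field
        for field in mandatory_fields
        if not str(documentation.get(field) or "").strip()
    ]
```

Two empty lists from `compare_unit_codes` mean that every unit can be mapped; the list of mandatory fields passed to `missing_mandatory_fields` has to be taken from the documentation template (the green cells of Figure 3.7 and Figure 3.8).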
In the coming months, it will be possible to deliver zoom-in data in the upload part of the ESPON Data Portal, which will centralise all this material in a dedicated part of the Database for better dataflow management.

3.4. Support to TPG producing zoom-in data

The TIGRIS Team (University of Iasi, Romania) is the team within M4D in charge of the follow-up of projects producing case-study data. The TIGRIS team has good experience of local data and has produced several technical reports on that topic. ESPON TPGs are welcome to ask the TIGRIS team any question regarding the case study data flow or the availability of data at local level. For any question, please send an email to Alexandru Rusu (<[email protected]>).

Chapter 4. The Background Data of the Database

4.1. Strategy for the Background Data

ESPON TPGs may have produced a lot of data useful for specialists (e.g. residuals of a regression model) but not for ordinary (i.e. non-expert) users, such as policy makers or practitioners. TPGs may also produce intermediate data that has been used to build a synthetic index delivered in the "key indicators". For such cases, the M4D Project has produced a simplified data and metadata template derived from the Metadata Specifications of P1 projects. The aim of this template is to give external users the minimal information needed to understand the meaning of the indicator, the origin of the data and some details about the data producer. In fact, this template helps to define harmonised information related to data.

4.2. Expected delivery

The XLS template developed for that purpose is quite easy and not time-consuming to fill in. It is structured in two parts: one is dedicated to data and the other to metadata. The data template is structured like the one proposed for case-study data (cf. Section 3.2.1), and has to be delivered as a .xls file including a single sheet entitled data.
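As an illustration of the data sheet conventions of Section 3.2.1 that this template reuses (n/a for not available, n/r for not relevant, and indicator codes built from an identifier and a year of reference), here is a minimal sketch. The function names are hypothetical, and the code assumes that the cells have already been read from the .xls file and that every indicator code ends with a four-digit year, as in “POP1990”.

```python
import re

NOT_AVAILABLE = "n/a"
NOT_RELEVANT = "n/r"


def parse_cell(cell):
    """Return (value, status), where status is 'ok', 'n/a' or 'n/r'."""
    text = str(cell).strip()
    if text.lower() in (NOT_AVAILABLE, NOT_RELEVANT):
        return None, text.lower()
    # Anything else is expected to be numeric; float() raises ValueError otherwise.
    return float(text), "ok"


def split_indicator_code(code):
    """Split a code such as 'POP1990' into its identifier and reference year."""
    match = re.fullmatch(r"(.+?)(\d{4})", code)
    if match is None:
        raise ValueError(f"Indicator code without a trailing year: {code!r}")
    return match.group(1), int(match.group(2))
```

Keeping the status separate from the value avoids confusing a missing observation with a zero when the data are later mapped or aggregated.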
The metadata file contains 10 compulsory fields (Figure 4.1) and has to be delivered as a .xls file including a single sheet entitled metadata. This sheet is structured in columns (one for each indicator). The first part is dedicated to the indicator definition, the second part to the data sources.

Figure 4.1. Project Database Metadata Sheet
This figure shows the content of the metadata sheet expected for the Background project data file. A description of the fields is given in the following sub-sections.

4.2.1. Indicator description

The indicator sheet content is described below:
• Code: A short acronym that reflects the meaning of the indicator.
• Name: A short expression that reflects the meaning of the indicator.
• Abstract: The abstract of the indicator. This property must describe the indicator in a more extended way than the Name property does. The abstract must not merely repeat the name of the indicator, but provide additional information that is not given by the Name.
• Temporal extent: Groups temporal references of periods covered by the values of the indicator.
• Methodology description (optional): Describes the methodology used to produce indicator values. This methodology can concern a particular indicator independently of data sources or be specific to a particular source that provided indicator values (e.g. when a typology is produced, explain the clustering method used and the meaning of the values shown in the data file: 1 for decreasing; 2 for increasing).
• Keyword (optional): Groups a list of keywords and/or keyword expressions related to the indicators. Ideally, these keywords must refer to the GEMET Thesaurus (http://www.eionet.europa.eu/gemet/).
• Upload/metadata date: Date of creation of the metadata file in the following format: DAY/MONTH/YEAR
• Use constraint (optional): Access and use constraints applied to ensure the protection of privacy or intellectual property, and any special restrictions or limitations on obtaining the resource.
• Point of Contact: Persons or organizations that may be contacted for different issues related to the dataset/metadata. A name, an email and the name of an organization of reference are required.
• Project: Name of the ESPON Project that created the file.
• How to source the indicator (optional): As on the ESPON Mapkit, the rule is the following: name of the team, name of the ESPON Project, dataset date.

4.2.2. Source description

All the data providers (structured in columns) are listed below the indicator description. A source may be described as follows:
• Provider Name: Refers to the data provider of the indicator value. The provider may be an institution or even a person who is the originator of the data.
• Reference (optional): Official Uniform Resource Identifier (URI) of the data provider. In most cases, this is the URL (Internet address) of the data provider's site. For example, this property can take the value "http://ec.europa.eu/eurostat", which refers to the home page of Eurostat.
• Copyright (optional): Text describing the copyright rules and/or restrictions applied to the data associated with this source.
• Publication title (optional): Title of the publication or name of the source where data were taken from, if it exists (for instance "Switzerland Statistics Public Database").
• Methodology description (optional): This property describes source-specific methodological details that make the data from this source distinct from the data coming from other sources of the dataset (for instance “coming from heterogeneous data providers, the data has been harmonized using Eurostat data”).
Cf. the Technical Report on Core indicators, which proposes some examples of estimation methods.

Note that at least one source is needed for each indicator.

ESPON Projects are free to propose one or several xls files. Though each ESPON Project may define the structure of the database delivery, when several xls files are delivered, the M4D Project suggests organising them into a coherent folder (contained in a .zip archive file) that should be structured as follows:
1. By theme (demography, economy, policy indicators, environment etc.)
2. By geographical objects (flows, territorial data)

In this way, external users can easily retrieve the data they are looking for.

4.3. What happens to my Data?

At the moment, background data must be sent to your TPG Project officer, the ESPON M4D manager (<[email protected]>) and the M4D contact team. After that, a compliance check is done by the M4D contact team to check whether the template is followed and whether all the mandatory fields are filled. No semantic check and no outlier check will run on this delivery.

In the end, these additional data will be available under the ESPON Web page of each project (together with inception report, final report etc., please see Figure 4.2). In the coming months, it will be possible to deliver background data in the upload part of the ESPON Data Portal, which will centralise all this material in a dedicated part of the Database for better dataflow management.

Figure 4.2. Example of a Background Data Page
This figure shows a page on the ESPON Web Site dedicated to the Background Data Projects.

Chapter 5.
Conclusion and Advice

In conclusion, this chapter offers some advice on managing the data flow inside each ESPON Project, together with complementary information.

5.1. Advice for a perfect management of the data process

The following advice results from the experience gained in following up previous ESPON Projects, which experienced some difficulties in following and delivering the data and metadata specification in the past.

1. A limited number of persons in charge of data/metadata creation in each TPG. Ideally, each project should dedicate one member of its team to data and metadata creation. This makes it possible to:
   a. Centralise all the data of the project
   b. Harmonize data and metadata creation
   c. Give a single delivery at the end of the project (a bad practice would be for each partner of the TPG to deliver its own key indicators without any control by the consortium).
2. Raise the question of data delivery very early in the lifetime of a project. With regard to the expected deliveries, some basic questions need to be discussed inside each project very early:
   a. What key indicators will be delivered to the database?
   b. How should the data delivery of our case study be organized?
   c. What kind of innovative indicator, updatable in the future, could we propose to the ESPON Community?
   Waiting until the end of the project to take care of the data delivery process may cause integration problems and waste a significant amount of time.
3. Do not hesitate to contact the ESPON M4D team. Each ESPON TPG is followed by at least one member of our team (see Table 2.1). The M4D consortium is present at each ESPON Seminar and is open to any suggestion or question that could ease the life of ESPON Projects.
4. Do not lose information; use the metadata templates as soon as possible! That way, you can be sure not to forget any mandatory field, and you will not have to go through a tedious copy/paste of your datasets into the templates at the end of your project.
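As a complement to advice 4, a project could run a small self-check on its metadata before delivery. The Python sketch below is only an illustration and is not the M4D compliance check: the dictionary keys (upload_date, point_of_contact, project, sources, provider_name) and the function name are assumptions made for this example, and the date is assumed to use a four-digit year. It treats as mandatory only the items that the background data template does not mark as optional.

```python
from datetime import datetime


def missing_fields(indicator):
    """Return the names of mandatory items that are absent or malformed.

    `indicator` is a plain dict; its keys are hypothetical and do not
    reproduce the column names of the official xls template.
    """
    problems = []

    # Upload/metadata date, expected as DAY/MONTH/YEAR (four-digit year assumed).
    try:
        datetime.strptime(str(indicator.get("upload_date") or ""), "%d/%m/%Y")
    except ValueError:
        problems.append("upload_date")

    # Point of contact: a name, an email and an organisation are required.
    contact = indicator.get("point_of_contact") or {}
    for key in ("name", "email", "organisation"):
        if not contact.get(key):
            problems.append(f"point_of_contact.{key}")

    if not indicator.get("project"):
        problems.append("project")

    # At least one source is needed, and each source needs a provider name.
    sources = indicator.get("sources") or []
    if not sources:
        problems.append("sources")
    for i, source in enumerate(sources):
        if not source.get("provider_name"):
            problems.append(f"sources[{i}].provider_name")

    return problems


# Hypothetical record, for illustration only.
example = {
    "upload_date": "25/06/2012",
    "point_of_contact": {
        "name": "Jane Doe",
        "email": "jane.doe@example.org",
        "organisation": "Example TPG",
    },
    "project": "Example Project",
    "sources": [{"provider_name": "Eurostat"}],
}
print(missing_fields(example))  # prints []
```

Running such a check each time the metadata is edited, rather than once at the end of the project, is in line with the advice above.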
Reminder:

• Question: To whom should the data be delivered, and when? Answer: data must be delivered through the Upload part of the ESPON Data Portal (one item is dedicated to key indicators, one to zoom-in data). The upload triggers a notification sent to the M4D Contact team responsible for the ESPON Project, to the M4D manager, and to the ESPON CU Project Officer.
• Question: Where can I find the xls templates? Answer: as shown in Figure 2.1, the three XLS templates are available on the Upload page (restricted to members) of the ESPON Database Portal. An empty version and a filled-in example are systematically provided.

5.2. A good practice for filling data and metadata

This example is derived from a concrete case encountered by the M4D project while collecting data for one of the core indicators (total population 1990-2011, available through the search interface). One of the aims of the core database strategy is to provide complete time series at NUTS levels for the ESPON Area for a set of basic count data. Among other things, this implies estimating some missing values and documenting precisely in the metadata the methodology used to fill the gaps in the dataset.

Taking Denmark as a starting point, total population is available for 2007 and 2008 on the Eurostat website. These values carry the label "1", which is described in the metadata file as shown in Figure 5.1.

Figure 5.1. Starting point: a table with empty values

This figure shows a common situation: a table with empty values that need to be estimated.

When looking at other data sources, this information is available for only two territorial units on the National Statistical Website of Denmark (due to the change of NUTS definition). The only way to obtain data for the rest of the territorial units is to estimate them (temporal retropolation in this case). The question is: how should this be referenced in the metadata file?
The only way to avoid a loss of information is to reference this estimation immediately in the source metadata of the dataset! Figure 5.2, Figure 5.3, Figure 5.4, and Figure 5.5 propose a way to proceed in order to ensure high-quality metadata.

Figure 5.2. Resulting dataset with estimated values and associated labels

This figure shows the resulting table with estimated values. Each estimated value has a label (source columns of total population 2005 and 2006) explaining the methodology used to create the estimation. Of course, these labels (TE6b, 13) differ from the one in the starting table (label 1, source of total population 2007 and 2008). In concrete terms, using two labels (TE6b, 13) means that two different methods have been used to fill the missing values. These labels have to be described immediately in the source part of the metadata.

Figure 5.3. Description of the label 1 in the metadata

This figure shows the metadata associated with label 1. The data source is Eurostat and the data have not been estimated (false value in the estimation field). Since Eurostat tables are regularly updated, it is good practice to specify the date of upload of the table (2011-07-26 in this case) and its precise name (demo_r_gind3).

Figure 5.4. Description of the label 13 in the metadata

This figure shows the metadata associated with label 13. These data come from the Danish National Statistical Institute. Consequently, the label must not be the same as the one used for Eurostat data (label 1).

Figure 5.5. Description of the label TE6b in the metadata

This figure shows that the data related to label TE6b have been estimated (true value in the estimation field).
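The labelling practice illustrated in Figures 5.2 to 5.5 can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the ESPON metadata model: the class and field names, the territorial unit codes, the provider and methodology text of label TE6b, and the estimation flag of label 13 are all hypothetical. Only the labels themselves (1, 13, TE6b), the Eurostat table name and upload date, and the estimation flags of labels 1 and TE6b come from the example above.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Source:
    """One entry of the source part of the metadata (field names are assumptions)."""
    provider: str
    estimation: bool
    methodology: Optional[str] = None  # description, formula or URI


sources = {
    # Label 1: Eurostat data, not estimated (Figure 5.3).
    "1": Source("Eurostat, table demo_r_gind3, uploaded 2011-07-26", estimation=False),
    # Label 13: Danish National Statistical Institute (Figure 5.4);
    # the estimation flag is assumed to be false here.
    "13": Source("Danish National Statistical Institute", estimation=False),
    # Label TE6b: estimated data (Figure 5.5); provider and text are placeholders.
    "TE6b": Source(
        "Hypothetical estimating team",
        estimation=True,
        methodology="Temporal retropolation (placeholder description)",
    ),
}

# Source label attached to each cell: (territorial unit, year) -> label.
# The unit codes are placeholders, not real NUTS codes.
cell_labels = {
    ("UNIT_A", 2005): "13",
    ("UNIT_B", 2005): "TE6b",
    ("UNIT_A", 2007): "1",
    ("UNIT_B", 2007): "1",
}


def check_labels(cell_labels, sources):
    """List labels that are undescribed, or estimated without a methodology."""
    problems = []
    for label in sorted(set(cell_labels.values())):
        source = sources.get(label)
        if source is None:
            problems.append(f"label {label!r} is used in the data but not described")
        elif source.estimation and not source.methodology:
            problems.append(f"label {label!r} is estimated but has no methodology")
    return problems


print(check_labels(cell_labels, sources))  # prints []
```

Keeping the label next to each value, and describing each label once in the source metadata, is what allows such a consistency check to be run at any time during the project.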
When data are estimated, it is very important to describe in the methodology fields (description, formula or URI) how the estimation was conducted.

Appendix A. Data Flow Process of the Key Indicators

Figure A.1. Data Flow Process: Upload and Syntactic Check

The syntactic check runs automatically when the data file is uploaded to the portal.

Figure A.2. Data Flow Process: Semantic Check

The semantic check is an expert review. This step triggers the delivery of a report and, optionally, a fixed data file proposing improvements and suggestions regarding its content.

Figure A.3. Data Flow Process: Semantic Check Approval

When the M4D contact team has delivered the report on the semantic check, the TPG is notified. It is invited to consult the report, and can then choose either to fix its delivery or to forward it to the next step of the integration.

Figure A.4. Data Flow Process: Outliers Check

This step mainly consists in detecting outliers and checking the quality of the data. An outliers report is delivered at the end of this expert review.

Figure A.5. Data Flow Process: Outliers Check Approval

When NCG has delivered the outliers report, the TPG is notified. It is invited to consult the report, and then to decide whether to continue the integration process or to review its data.

Figure A.6. Data Flow Process: ESPON CU Agreement

At this step, ESPON CU is notified and invited to consult the delivery reports in order to decide whether or not to integrate the TPG delivery into the ESPON Database.

Figure A.7. Data Flow Process: Integration

Last step of the integration: ESPON CU has approved the integration. The data can then be integrated into the database in one click.

Appendix B. References

[1] Anton Telechev and Benoit Le Rubrus. ESPON Data and Metadata Specification.
Full text in HTML [http://database.espon.eu/metaspecifs] (last visit: 2012-05-20).
[2] Claude Grasland and Ronan Ysebaert. ESPON Technical Report - The Core Database Strategy – A new paradigm for data collection at regional level. December 2011.
[3] LIG STeamer. ESPON Database Web Application. Version February 2012. http://database.espon.eu (last visit: 2012-05-20).
[4] ESRI. ESRI Shape File Technical Description. An ESRI White Paper - July 1998. Full text in PDF [http://www.esri.com/library/whitepapers/pdfs/shapefile.pdf] (last visit: 2012-03-23).

Appendix C. About

This document is part of the ESPON 2013 Database Phase 2 project, also known as M4D (Multi Dimension Database Design and Development). It was generated on 2012-06-25 at 15:27:38, from the sources of the m4d forge imag project at svn rev 553.

This document has been written by the M4D Partners UMS RIATE [http://www.ums-riate.fr] (Claude Grasland, Isabelle Salmon, Ronan Ysebaert, Nicolas Lambert, Timothée Giraud, Antoine Laporte) and LIG STeamer [http://steamer.imag.fr] (Jérôme Gensel, Marlène Villanova-Oliver, Anton Telechev, Benoit Le Rubrus). For any comment, question or suggestion, please contact <[email protected]>.

Colophon

Based on DocBook technology 1, this document is written in XML format. The sources are validated with DocBook DTD 4.5CR3, then transformed to HTML and PDF formats using the DocBook xslt 1.73.2 stylesheets. The generation of the documents is automated by the docbench LIG STeamer project, which is based on Ant 2, Java 3, and the processors Xalan 4 and FOP 5. Note that the standard XSLT stylesheets are customized in order to get a better image resolution for admonition icons in the generated PDF output: the generated size of these icons was changed from 30 to 12 pt.

1 [on line] DocBook.org [http://www.docbook.org] (last visit: July 2011)
2 [on line] Apache Ant - Welcome. Version 1.7.1 [http://ant.apache.org] (last visit: July 2011)
3 [on line] Developer Resources For Java Technology [http://java.sun.com] (last visit: July 2011). Version 1.6.0_03-b05.
4 [on line] Xalan-Java Version 2.7.1 [http://xml.apache.org/xalan-j/] (last visit: 18 november 2009). Version 2.7.1.
5 [on line] Apache FOP [http://xmlgraphics.apache.org/fop/download.html] (last visit: July 2011). Version 0.94.