How to Design a Clinical Data Warehouse

by Philip Puls and Dr. Niels Buch Leander

The implementation of a Clinical Data Warehouse (CDW) is, first and foremost, a drive towards standardisation in order for a company to reap the following benefits:

• Better use of internal resources
• Reduction in critical time path for statistical analysis
• Standard exchange of data with CROs, partners and regulatory agencies
• Cross-trial analysis and leveraged use of historic data
• Globalisation and knowledge sharing
• Compliance with regulations

The implementation of a CDW is, however, complex and critical, as it may threaten to restrain an organisation that is already struggling to get its products to market as quickly as possible. Therefore, this article describes how to design a CDW in a way that significantly eases its implementation by thinking through the entire standardisation process, right from the start.

Getting the Building Blocks of a CDW Right

The ‘foundation’ clinical data warehouse consists of data load programmes for the most common data sources (e.g. CDMS, EDC, SDTM/ODM, IVRS, Safety and CTMS), company-specific code-lists, and the transformation and enrichment of the data into a single standardised data model. The data load programmes must include a load of the study metadata so that, at a later stage, it will be possible to utilise features such as code-lists, dictionary versions and trial designs, including trial arms and visit schedules. The CDW also includes a number of ‘data marts’, special collections of data organised for a specific purpose, such as an SDTM data mart that can be exported or reported and a signal detection data mart.
[Figure 1: Elements of the in-process Clinical Data Warehouse. Labelled elements: Clinical Data Warehouse; Analysis platform; CDW operations applications; Std program library; Data set creation; Study metadata management; Source metadata management; Audit trail; Global Access; Administration and data transfer; Data Repository (Clinical Data Repository, Metadata Repository); Clinical Applications (CTMS, CDMS, SDTM) supplying trial metadata and clinical data; Data sources to CDW.]

The most obvious way to design the underlying Clinical Data Repository (CDR) is to look to the Janus model, a normalised data model that allows for cross-trial analysis and the creation of SDTM and ADaM data sets. This is an industry-wide data model that comes with a full SDTM mapping description. The benefit of using an enhanced Janus data model is that it provides maximum automation capabilities, an ‘FDA view’ of the data and exploratory analysis possibilities across studies, projects and compounds. Furthermore, the data model is easily adapted if company-specific needs are not covered, for example, in relation to study metadata.

However, since the CDW enables in-process data review, the clinical data repository data model cannot be a mere ‘copy’ of the Janus data model. Instead, it must be enhanced to include the necessary traceability back to the source as well as a data quality status that allows users and programmes to filter on ‘approved’ data. The conversion of the study-data stream to the normalised CDR does introduce some latency, which must be taken into account, especially if, for example, titration and safety are reviewed on the basis of data availability in the CDW.

The ‘foundation’ CDW includes a set of programmes that read SAS transport files (xpt) in SDTM standard together with Define.xml files and load the data into the Clinical Data Repository (CDR). The programmes should handle the current production version of SDTM (version 3.1.2) and be ready for future versions.
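The two enhancements to the Janus-style model named above, traceability back to the source and a data quality status, can be pictured as extra attributes carried by every repository record. The following is only a minimal sketch under that assumption; the class names, field names and status values are hypothetical and would in practice be defined by the company's own standards.

```python
from dataclasses import dataclass
from enum import Enum


class QualityStatus(Enum):
    """Hypothetical review states; real values would be company-defined."""
    RAW = "raw"
    IN_REVIEW = "in_review"
    APPROVED = "approved"


@dataclass(frozen=True)
class SourceRef:
    """Where a value came from (all field names are illustrative)."""
    system: str      # e.g. "CDMS" or "EDC"
    dataset: str     # source table or file name
    record_id: str   # key of the record in the source system


@dataclass(frozen=True)
class CdrRecord:
    """One normalised observation plus traceability and quality status."""
    study_id: str
    subject_id: str
    variable: str
    value: str
    source: SourceRef
    status: QualityStatus = QualityStatus.RAW


def approved_only(records):
    """Return only the records that have been marked as approved."""
    return [r for r in records if r.status is QualityStatus.APPROVED]


records = [
    CdrRecord("ST01", "001", "SYSBP", "120",
              SourceRef("EDC", "vitals", "r-17"), QualityStatus.APPROVED),
    CdrRecord("ST01", "002", "SYSBP", "135",
              SourceRef("EDC", "vitals", "r-18")),  # still raw, filtered out
]

for rec in approved_only(records):
    print(rec.subject_id, rec.variable, rec.value,
          "from", rec.source.system, rec.source.record_id)
```

Keeping the source reference on the record itself means a reviewer can always trace a value in the CDR back to the system and record it was loaded from, while the status field lets reports and programmes restrict themselves to approved data during in-process review.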
The programmes should use define.xml and the company-approved code-list and value-level metadata to verify that the data set can be loaded correctly and complies with approved company standards. This functionality is similar to what is found in the tool that the FDA uses when checking and loading applicant data files.

The SDTM load programmes handle new or proprietary domains by storing all events in one EVENTS table. All findings are stored in one FINDINGS table, and all interventions are stored in one INTERV table. The load programme stores all additional supplemental qualifiers in a single SUPPQUAL table. The define.xml metadata are also loaded into the data model in order to correctly store historic values such as dictionary versions as well as trial design definitions and to specify the link between code variables and codes.

Finally, the ‘foundation’ CDW should include at least an SDTM data mart that, in a standardised format, makes data available to all users and connecting systems, such as a business intelligence tool.

The Value of Dealing with Metadata in the CDW

These outlined features of the CDW are often not enough to satisfy the average user or justify the investment to the company, as they only maintain current standards and do very little in terms of automation. Furthermore, the ‘foundation’ CDW system only stores the trial metadata as they looked at the time of collection. It will be necessary to have different versions available online, for example for reporting or submission, and perhaps for exploratory purposes. Without the ability to dynamically shift clinical and study metadata during the trial life cycle, there is an enormous risk that the data will grow stale, thereby reducing the data warehouse to a storage facility with reduced value to the users and to the company.
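To illustrate what having different metadata versions available online could mean in practice, here is a minimal sketch of a metadata item that keeps every version with its effective date, so that both the version in force at collection time and the current version can be retrieved. It is an assumption-laden illustration only: the class `VersionedMetadata`, its methods and the dictionary labels and dates are hypothetical and not taken from any CDW product.

```python
from bisect import bisect_right
from dataclasses import dataclass, field
from datetime import date


@dataclass
class VersionedMetadata:
    """Keeps every version of one metadata item, ordered by effective date."""
    name: str
    _dates: list = field(default_factory=list)
    _values: list = field(default_factory=list)

    def add_version(self, effective: date, value: str) -> None:
        # Insert at the position that keeps both lists sorted by date.
        pos = bisect_right(self._dates, effective)
        self._dates.insert(pos, effective)
        self._values.insert(pos, value)

    def as_of(self, when: date) -> str:
        """Return the version that was in force on the given date."""
        pos = bisect_right(self._dates, when)
        if pos == 0:
            raise LookupError(f"no version of {self.name} on or before {when}")
        return self._values[pos - 1]

    def latest(self) -> str:
        """Return the most recent version."""
        return self._values[-1]


# Illustrative data: labels and dates are made up for the example.
coding_dictionary = VersionedMetadata("coding dictionary version")
coding_dictionary.add_version(date(2010, 3, 1), "DICT 1.0")
coding_dictionary.add_version(date(2011, 3, 1), "DICT 2.0")

# Version in force when the data were collected.
print(coding_dictionary.as_of(date(2010, 9, 15)))
# Version a current report or submission might use.
print(coding_dictionary.latest())
```

A store that only remembers the value seen at load time cannot answer the second question, which is the limitation of the ‘foundation’ CDW described above.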
Therefore, it is vital to expand the functionality of the ‘foundation’ CDW to include a metadata repository that can organise metadata for clinical study reporting in order to facilitate the creation of standard programme libraries and study designs and to drive data source mapping. The metadata facilitate re-usability of programming code, integration of data into a standardised data structure, optimisation of data preparation and reporting, and frontloading. As such, metadata play a pivotal role in the drive towards standardisation.

In the CDW, there are two types of metadata: clinical metadata and operational metadata. The former are defined as all data related to the subject in the trial and are thereby independent of the trial; the latter are defined as data that describe the trial and are therefore specific to a single trial. For design purposes, it is important to keep in mind that not all metadata are standards, whereas all standards are metadata. The consequence of this asymmetry is that, besides the metadata covered by SDTM and ADaM, it is necessary to include process metadata for transformation, transport, presentation, QC, study and submission, and business process and control.

In order to maintain a high level of standardisation and actively pursue frontloading of resources, maintenance of metadata and the preparation of metadata for new studies should be done as early as possible in the trial design process. Preferably, all new study protocols should be based on the metadata library, and any changes necessary to accommodate new trial designs in the metadata library should be made and approved along with the internal approval of the study protocol. This process ensures, first, that all activities that can be front-loaded will be performed and, second, that analysis and reporting can be executed automatically once the study data are loaded.

The Clinical Data Warehouse Operations application includes three modules:

1. A Metadata Module (MMA)
2. A Source Data Mapping Module (SDM)
3. An Administration Module

To Be Serious about the Drive towards Standardisation

The first of these modules, the Metadata Module, maintains the following metadata: clinical metadata, study metadata, study design, visit structure, study flow chart, clinical metadata versions and cross-study metadata.

[Figure 2: Process and Data standardisation. Labelled elements: Metadata governance; Study Metadata library; Protocol design; CRF design; Clinical Data base setup; SAP; DBR; CDW load; Stat. analysis; Medical Writing.]

The above figure shows where study metadata are applied during the clinical study process. It also illustrates that, throughout the study life cycle, it is necessary to establish a metadata governance process and define responsibilities clearly in order to preserve the integrity of the clinical data repository. Furthermore, the figure highlights that decisions made at the level of ‘Protocol Design’ impact the downstream task of ‘Statistical Analysis’. This is why the metadata repository implementation must be coordinated with a standardisation of protocol and CRF design. Ideally, the protocol authoring tool pulls its protocol components from the Metadata Repository, as this will guarantee consistency between the way data are collected and the way data are stored and reported.

The Source Data Mapping (SDM) module includes one mapping design (ETL) for each source from which the CDW is loading, thereby replacing several of the ‘foundation’ CDW features described above. How static or dynamic the ETL needs to be depends on the level of flexibility and variation in the data source. If data are sourced from a Clinical Data Management (CDM) system, there will be variations in how the trials are defined and structured. In this case, one should consider making the source data mapping dynamic and letting the SDM handle any necessary data conversion.

The amount of effort spent on the SDM is related to the strategy for migrating legacy data. If it has been decided to migrate some or all historic clinical study data, the company may, in the worst-case scenario, end up having to design one ETL for each study. A smartly designed SDM will, however, reduce this migration effort. The cost and benefit of the SDM should therefore also be weighed against the cost of migrating legacy data and the benefit of having legacy data available in the CDW.

In addition to the two modules described above, the CDW operations application also includes an administration module that maintains users, the security system and a centre for managing load processes. In this way, the Metadata Management Application can become the company’s global repository for clinical trial handling and reporting.

By thus comprehending the full extent of the drive towards standardisation and the critical role of metadata, the CDW can be designed in such a way that it will support the company’s business aims with a minimum of disruption during its implementation.

ABOUT THE AUTHORS: Philip Puls is a Senior Project Manager at NNIT in Zürich, Switzerland. Dr. Niels Buch Leander is a Business Consultant at NNIT in Copenhagen, Denmark. They both specialise in designing and implementing IT solutions for pharmaceutical companies.

Please contact Frederico Braga, Key Account Manager, on +41 794 395 865 to learn more about our services.

NNIT A/S, Lottenborgvej 24, DK-2800 Lyngby, tel: +45 4442 4242
NNIT Switzerland, Bandliweg 20, CH-8048 Zurich, tel: +41 44 405 9090
NNIT Czech Republic, Lazecka 568/53A, CZ-77900 Olomouc, tel: +420 585 204 821
NNIT China, 358 Nanjing Rd., CN-Tianjin 300100, tel: +86 (22) 5885 6666
NNIT Philippines, 24/F 88 Corporate Center, 141 Valero St., Makati City 1227, tel: +63 2 889 0999