Harmonisation

Data Harmonisation – Enabling the Reuse of Existing Data

Valuable data resources from health research exist around the world, and their reuse holds great potential for addressing new research questions. However, as the collection and documentation of research data have evolved over time and often differ between studies, these datasets are not always directly compatible. To enable cross-study analysis of existing research data, NFDI4Health provides a harmonisation strategy.


Background

There are two main approaches to making research data findable and usable across studies. The first is to record research data uniformly according to internationally recognised standards, which is particularly suitable for newly initiated studies. For this purpose, NFDI4Health, in collaboration with standardisation experts, has developed services that support users in describing their research data using established terminologies (Terminology Service) and in cataloguing them (Annotation Workbench). The second approach addresses data that already exist. The German research landscape includes a wide range of population-based studies, some of which began collecting data many years ago and have since concluded. These studies often differ significantly in how data were collected and documented. To make such data interoperable and reusable in future cross-study analyses, a retrospective harmonisation strategy is required.

Our Service

The service developed by NFDI4Health supports analysts and data-holding institutions in harmonising data efficiently and with minimal staff effort. Conceptually, the strategy is modelled on existing procedures of the Canadian Maelstrom Research Group. The templates used for collecting metadata, assessing harmonisation potential, and applying appropriate harmonisation rules have been adapted to the context of national cohort studies in Germany. The entire workflow, with detailed descriptions of each step, is documented in a dedicated harmonisation protocol.

To allow research data to be harmonised locally within the data-holding institutions, without the data having to leave the premises, we have integrated the functionalities of the R package Rmonize into a flexible and fully featured R project. This service minimises the workload for data-holding studies, and prior testing ensures a smooth harmonisation process. Only the creation of the original dataset and the execution of the harmonisation script remain the responsibility of the studies themselves.

Upon successful execution, the script generates a harmonisation report for verifying the correctness of the harmonised data, together with an Opal-compatible Data Dictionary and the harmonised dataset. These outputs can then be used in internal study projects as well as in cross-study research, for example using DataSHIELD.

Contact: Dr. Franziska Jannasch (DIfE) and Florian Schwarz (DIfE)

Workflow

  1. Analysts list the required research data based on their research question in a project-specific Target Data Schema. Existing standards are used to describe the data where available.
  2. Analysts contact the data holders to check whether the required data has been collected and, if so, request detailed descriptions regarding data collection methods, formats, units, etc. (The Target Data Schema may need to be revised depending on variable availability.)
  3. The analysts then create a study-specific Data Dictionary containing the requested variables, including descriptions, units, and categories.
  4. Based on the defined variables in the Target Data Schema and the study-specific variables, the analysts assess the harmonisation potential of each variable (complete, partial, impossible) and determine appropriate harmonisation rules.
  5. The R script is slightly adapted to reflect the study’s specific circumstances, keeping the workload for data holders as low as possible.
  6. The templates created in steps 1, 3, and 4, along with the R script (with embedded Rmonize functions), are compiled into an R project and sent to the study. The study is asked to provide a dataset with the study-specific research data.
  7. Upon successful execution of the R script, a harmonisation report and the harmonised dataset are produced. An additional file is generated for optional use with DataSHIELD.
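
The logic behind steps 1, 3, 4 and 7 can be illustrated with a small, language-neutral sketch. It is written in plain Python and does not use Rmonize or reproduce the NFDI4Health templates. All variable names (`height_cm`, `hgt_m`, `gender`, and so on), the `Rule` structure and the summary format are hypothetical and chosen only for illustration. The sketch assumes that each target variable is assigned a harmonisation status and, where harmonisation is possible, one source variable plus a transformation.

```python
from dataclasses import dataclass
from typing import Callable, Optional

# Step 1 (hypothetical): target variables requested by the analysts.
TARGET_SCHEMA = {
    "height_cm": "Body height in centimetres",
    "sex": "Sex (1 = male, 2 = female)",
    "smoking_pack_years": "Cumulative smoking exposure in pack-years",
}

# Step 3 (hypothetical): variables available in one study, with units/categories.
STUDY_DICTIONARY = {
    "hgt_m": "Body height in metres",
    "gender": "Sex, coded 'm' or 'f'",
}


@dataclass
class Rule:
    """One harmonisation rule: how a target variable is derived in this study."""
    target: str
    status: str  # "complete", "partial" or "impossible"
    source: Optional[str] = None
    transform: Optional[Callable] = None


# Step 4 (hypothetical): harmonisation potential and rule per target variable.
RULES = [
    Rule("height_cm", "complete", "hgt_m", lambda v: round(v * 100, 1)),
    Rule("sex", "complete", "gender", lambda v: {"m": 1, "f": 2}.get(v)),
    Rule("smoking_pack_years", "impossible"),  # not collected in this study
]


def harmonise(records, rules):
    """Apply every rule to every record; missing or impossible values become None."""
    harmonised = []
    for record in records:
        row = {}
        for rule in rules:
            value = record.get(rule.source) if rule.source else None
            if rule.status == "impossible" or value is None:
                row[rule.target] = None
            else:
                row[rule.target] = rule.transform(value)
        harmonised.append(row)
    return harmonised


def summarise(harmonised, rules):
    """Minimal stand-in for a harmonisation report: status and non-missing count."""
    return {
        rule.target: {
            "status": rule.status,
            "n_non_missing": sum(row[rule.target] is not None for row in harmonised),
        }
        for rule in rules
    }


# Step 7 (hypothetical data): run the rules on the study dataset.
study_data = [
    {"hgt_m": 1.72, "gender": "f"},
    {"hgt_m": None, "gender": "m"},
]
harmonised = harmonise(study_data, RULES)
summary = summarise(harmonised, RULES)
```

Keeping the rules in a table that is separate from the code that applies them mirrors the idea in the workflow: analysts prepare and test the rules in advance, and the study only has to supply its dataset and run the script.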