Synthetic Data

Simulated Datasets for Research and Development

Health research thrives on the exchange of high-quality data, yet it faces stringent data protection requirements. Synthetic data open up new possibilities by replicating real data structures. NFDI4Health supports researchers with methods and tools for the generation, evaluation, and visualisation of synthetic data.

Background

Sharing data in health research is often difficult and time-consuming because of strict data protection requirements. The result is data silos, in which research-relevant data can leave organisations only to a limited extent, if at all. Synthetic data offer a potential solution: as artificially generated datasets, they aim to reproduce the statistical and structural properties of real data without allowing conclusions to be drawn about sensitive information on individual patients. Synthetic data can therefore potentially be shared more easily, and they make it possible to simulate analyses and experiments where access to real data is impossible or very limited. NFDI4Health supports researchers with methods for synthetic data generation, as well as with tools for assessing realistic data protection risks and for visualising synthetic data.

VAMBN – Generation of Synthetic Data

VAMBN (Variational Autoencoder Modular Bayesian Networks) is a hybrid generative AI approach for synthetic data generation. It was specifically developed to realistically represent and synthetically generate heterogeneous and longitudinally collected study data. VAMBN enables researchers to model complex relationships between variables over time and to generate new datasets that are statistically similar to the original data.

Syndat – Evaluation of Synthetic Data

The quality of synthetic data depends heavily on the selected AI model as well as on the statistical properties of the underlying real data. The analysis and evaluation of synthetic data can be a complex process; moreover, there is currently no scientific consensus on the quantitative evaluation of synthetic data. Syndat is a tool designed to support researchers in the systematic evaluation of synthetic data with regard to their similarity to real data and potential data protection-related risks. Syndat is available both as a Python library for data scientists and, for other user groups, as an interactive web-based dashboard.
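To make "similarity to real data" concrete, the following is a minimal sketch of one commonly used per-feature fidelity measure, the Jensen-Shannon divergence between histograms of a real and a synthetic column. It does not use or reproduce Syndat's API: the function names (`histogram`, `js_divergence`, `feature_similarity`), the bin count, and the randomly generated example columns are all illustrative assumptions.

```python
import math
import random


def histogram(values, edges):
    """Return relative frequencies of `values` over the bins defined by `edges`."""
    counts = [0] * (len(edges) - 1)
    for v in values:
        # Use the first bin whose upper edge exceeds v; the last bin catches the maximum.
        for i in range(len(counts)):
            if v < edges[i + 1] or i == len(counts) - 1:
                counts[i] += 1
                break
    total = sum(counts)
    return [c / total for c in counts]


def js_divergence(p, q):
    """Jensen-Shannon divergence (log base 2) of two discrete distributions, in [0, 1]."""
    m = [(a + b) / 2 for a, b in zip(p, q)]

    def kl(x, y):
        # Terms with x == 0 contribute nothing; y > 0 whenever x > 0 because y is the mixture.
        return sum(a * math.log2(a / b) for a, b in zip(x, y) if a > 0)

    return 0.5 * kl(p, m) + 0.5 * kl(q, m)


def feature_similarity(real, synthetic, bins=10):
    """Compare one numeric feature on shared bins: 0 = identical histograms, 1 = disjoint."""
    lo = min(min(real), min(synthetic))
    hi = max(max(real), max(synthetic))
    width = (hi - lo) / bins or 1.0  # fall back to width 1.0 for a constant column
    edges = [lo + i * width for i in range(bins + 1)]
    return js_divergence(histogram(real, edges), histogram(synthetic, edges))


# Hypothetical example columns (randomly generated, not real study data).
rng = random.Random(0)
real = [rng.gauss(50, 10) for _ in range(1000)]
close = [rng.gauss(50, 10) for _ in range(1000)]    # same distribution as `real`
shifted = [rng.gauss(80, 10) for _ in range(1000)]  # clearly different distribution

js_close = feature_similarity(real, close)
js_shifted = feature_similarity(real, shifted)
print(f"close: {js_close:.3f}, shifted: {js_shifted:.3f}")
```

A score like this covers only the marginal distribution of a single feature. It says nothing about correlations between variables, longitudinal structure, or privacy risk, which is one reason a single number is not sufficient for evaluating synthetic data.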

Relevant publications

Gootjes-Dreesbach L, Sood M, Sahay A, Hofmann-Apitius M, Fröhlich H. Variational Autoencoder Modular Bayesian Networks (VAMBN) for Simulation of Heterogeneous Clinical Study Data. Front Big Data Med Public Health. 2020;3:16. https://doi.org/10.3389/fdata.2020.00016
Kühnel L, Schneider J, Perrar I, Moazemi S, Prasser F, Nöthlings U, Fröhlich H, Fluck J. Synthetic data generation for a longitudinal cohort study - Evaluation, method extension and reproduction of published data analysis results. Sci Rep. 2024;14:14412. https://doi.org/10.1038/s41598-024-62102-2
Adams T, Birkenbihl C, Otte K, Ng HG, Rieling JA, Näher AF, ... Fröhlich H. On the fidelity versus privacy and utility trade-off of synthetic patient data. iScience. 2025;28(5):112382. https://doi.org/10.1016/j.isci.2025.112382
Moazemi S, Adams T, Ng HG, Kühnel L, Schneider J, Näher AF, ... Fröhlich H. NFDI4Health workflow and service for synthetic data generation, assessment and risk management. Stud Health Technol Inform. 2024;317:21–29. https://doi.org/10.3233/SHTI240834