In this episode, I had the pleasure of speaking with Prof. Holger Fröhlich, who leads the AI and Data Science Group at the Fraunhofer Institute and is an honorary professor at the University of Bonn. We explored one of the hottest topics in healthcare data science right now: synthetic data.
Holger and I discussed how synthetic data is generated using AI, what role digital twins could play in the future of clinical trials, and how these innovations could fundamentally reshape how we design and conduct research. We dove into the SYNTHIA project, which is part of the Innovative Health Initiative (IHI) – the largest public-private partnership for health research in Europe.
What You’ll Learn in This Episode:
✔ What synthetic data actually is – and what it isn’t
✔ The benefits and challenges of using synthetic data in clinical research
✔ Real-world use cases, from early dementia diagnosis to trial simulation
✔ How digital twins could help predict patient outcomes
✔ Why synthetic data might be especially useful in areas with limited patient access
✔ The regulatory and ethical considerations we’ll need to navigate
Resources & Links:
🔗 Fraunhofer SCAI – AI in Data Analysis and Simulation (AI-DAS)
🔗 CERTAINTY Project (Virtual Twins)
🔗 The Effective Statistician Academy – I offer free and premium resources to help you become a more effective statistician.
🔗 Medical Data Leaders Community – Join my network of statisticians and data leaders to enhance your influencing skills.
🔗 My New Book: How to Be an Effective Statistician – Volume 1 – It’s packed with insights to help statisticians, data scientists, and quantitative professionals excel as leaders, collaborators, and change-makers in healthcare and medicine.
🔗 PSI (Statistical Community in Healthcare) – Access webinars, training, and networking opportunities.
Key Publications from Holger and Team:
🔗 A synthetic data generation framework for Alzheimer’s disease
🔗 Generative learning for deep biomarker discovery
🔗 Longitudinal digital twins in Parkinson’s Disease
🔗 Digital twins for clinical research – opportunities and limitations
Join the Conversation:
Did you find this episode helpful? Share it with your colleagues and let me know your thoughts! Connect with me on LinkedIn and be part of the discussion.
Subscribe & Stay Updated:
Never miss an episode! Subscribe to The Effective Statistician on your favorite podcast platform and continue growing your influence as a statistician.
Holger Fröhlich
Head of AI & Data Science at Fraunhofer SCAI
Professor at University of Bonn
Prof. Dr. Holger Fröhlich holds a diploma (focus area: Artificial Intelligence) and a PhD in Computer Science. After positions as a postdoc at the German Cancer Research Center and as a Senior Scientist at Cellzome AG (now an enterprise of Glaxo-Smith-Kline), he was appointed associate professor at the University of Bonn in 2010.
In 2015 he joined the global biopharmaceutical company UCB and became the Director of an AI and Data Science research team. Since December 2019, he has been Head of the AI & Data Science group and Deputy Head of the Department of Bioinformatics at the Fraunhofer Institute for Algorithms and Scientific Computing in Sankt Augustin.
In addition, he teaches as an honorary professor in the Master’s programs Life Science Informatics and Computer Science at the University of Bonn. Holger Fröhlich’s focus is on the development and application of data science (specifically AI/ML) methods in biomedicine, with a focus on early drug discovery, precision medicine, and clinical trials. In this context, he has developed over the last 20 years a broad spectrum of data science approaches, including techniques for multi-modal data integration, (generative) time series modeling, hybrid AI approaches (ODE/NN combinations, Graph Neural Networks), and causal machine learning. He is author and co-author of more than 170 scientific publications.
He has been coordinator of three EU projects (JPND ADIS, ERA PerMed DIGIPD, AIPD) and partner in multiple further national and international research consortia as well as industry collaborations. Furthermore, he is a scientific advisory board member of the international AI graduate school HIDSS4Health of the Helmholtz Association.

Transcript
Reimagining Clinical Trials with Synthetic Data and Digital Twins
[00:00:00] Alexander: You are listening to The Effective Statistician podcast, the weekly podcast with Alexander Schacht and Benjamin Piske, designed to help you reach your potential, lead great science, and serve patients while having a great [00:00:15] work-life balance.
[00:00:23] In addition to our premium courses on The Effective Statistician Academy, we [00:00:30] also have lots of free resources for you across all kinds of different topics within that academy. Head over to theeffectivestatistician.com and find the [00:00:45] Academy and much more for you to become an effective statistician. I’m producing this podcast in association with PSI, a community dedicated to leading and promoting the use of statistics within the health industry
[00:00:59] [00:01:00] for the benefit of patients. Join PSI today to further develop your statistical capabilities with access to the ever-growing video-on-demand content library, free registration to all PSI webinars, and much, much more. [00:01:15] Head over to the PSI website at psiweb.org to learn more about PSI activities and become a PSI member today. [00:01:30]
[00:01:30] Welcome to a new episode of The Effective Statistician. Today I am interviewing Holger. Hi Holger, how are you doing?
Holger: Hi, Alexander. I’m doing well. How are you?
Alexander: Very good, especially as I’m talking to [00:01:45] you about a very interesting initiative. But before we dive into that, maybe, Holger, you can briefly introduce yourself.
[00:01:53] What has brought you to the position where you are now?
[00:01:57] Holger: Yeah, so I’m heading the AI and Data [00:02:00] Science group at the Fraunhofer Institute for Algorithms and Scientific Computing in Sankt Augustin, close to Bonn, and I’m also an honorary professor at the University of Bonn. And yeah, so what brought me [00:02:15] into this position is, in essence, on the one hand, I would say, my interest in research in data science, but on the other hand, at the same time, an interest to [00:02:30] translate this research into application, specifically to benefit patients.
[00:02:35] I’ve gone through a number of career steps, which brought me to different positions in academia, but also, for example, in the pharma industry, [00:02:45] and I ended up somehow in between. So if you think about it, we are something like the bridge between the research that’s typically done at universities in Germany
[00:02:59] [00:03:00] and research and development in industry. Our mission is to be a translator between both sides. And this is the role we play in different projects, sometimes one-to-one with industry partners and sometimes as [00:03:15] part of larger consortia.
[00:03:16] Alexander: And that is exactly the topic we wanna talk about today. Here comes the first acronym: IHI.
[00:03:25] What does that stand for and where is it coming from?
[00:03:29] Holger: [00:03:30] IHI stands for Innovative Health Initiative, the largest public-private partnership in Europe, in the healthcare sector at least. And IHI has, in essence, the mission to bring [00:03:45] together academic partners as well as pharma companies and companies from the MedTech field.
[00:03:54] And so the idea is to do research projects [00:04:00] together in a joint manner. So there are really people from all these sectors sitting together, and they really try to tackle very challenging and innovative problems.
[00:04:12] Alexander: And this is [00:04:15] originating from what was formerly called IMI, now IHI, because it’s not just the pharma industry, but also the MedTech industry.
[00:04:26] You may have heard about IMI PREFER, [00:04:30] IMI PROTECT, or IMI GetReal. There were a lot of these initiatives in the past that yielded quite interesting findings and very good collaboration between pharma and academia, [00:04:45] supported by the European Union, too. Let’s go a little bit more specifically into this. That’s a pretty big consortium, and within that there are different, I’m not quite sure what they’re called, work streams, [00:05:00] specific groups working on that.
[00:05:03] Tell me about what you are working on.
[00:05:07] Holger: We are currently in three different IHI projects. One of them is focusing [00:05:15] on setting up a federated learning platform for Europe in order to connect health data. The other one is focusing on a platform for earlier dementia diagnosis. The third, called [00:05:30] SYNTHIA, focuses on synthetic data generation,
[00:05:34] evaluation, and setting up a platform.
[00:05:37] Alexander: This is the one we wanna dive a little bit deeper into today, because synthetic data is quite [00:05:45] in vogue. There are a lot of different terms floating around, and we wanna double-click on this. Let’s start with the basics. What is synthetic data for you?
[00:05:58] Holger: The term synthetic data is, [00:06:00] of course, not standardized in any way, and there are different understandings of it. In SYNTHIA, at least, the focus is on AI-generated synthetic data. So that means, in practice, you have [00:06:15] at the beginning real data, in this case real patient-level data, and you train a generative AI model on this data. The generative AI model learns the statistical distribution from which [00:06:30] the data
[00:06:32] probably was sampled. If you have learned the statistical distribution, you can sample from this distribution again.
[00:06:39] Alexander: Yeah.
[00:06:39] Holger: And this generates your synthetic data. So in some sense this is [00:06:45] similar to as if you had, in very simplistic terms, learned the parameters of a distribution and now you sample from it. This is the simplified notion.
[00:06:56] Alexander: Sounds like bootstrapping, isn’t it? [00:07:00]
[00:07:01] Holger: At the end of the day, we are not actually using bootstrap here, but other techniques, in our case, for example, variational inference, which is an approximate Bayesian inference technique. [00:07:15] But at the end of the day, the underlying idea is really that you have real data and you try to deduce characteristics of the underlying statistical distribution such that, at the end of the day, you can sample from it. [00:07:30]
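The contrast between bootstrapping and sampling from a learned distribution can be made concrete with a small sketch. It assumes a single Gaussian variable and made-up ages, which is far simpler than the variational inference models mentioned in the conversation; the function names and numbers are illustrative only.

```python
import random
import statistics

def fit_gaussian(values):
    """Estimate the mean and standard deviation of a one-dimensional sample."""
    return statistics.mean(values), statistics.stdev(values)

def sample_parametric(values, n, rng):
    """Fit a Gaussian to the real values, then draw n new synthetic values."""
    mu, sigma = fit_gaussian(values)
    return [rng.gauss(mu, sigma) for _ in range(n)]

def sample_bootstrap(values, n, rng):
    """Resample n of the observed values with replacement."""
    return rng.choices(values, k=n)

rng = random.Random(0)
# Hypothetical "real" ages; in practice this would be patient-level data.
real_ages = [rng.gauss(72.0, 8.0) for _ in range(200)]

synthetic = sample_parametric(real_ages, 500, rng)
resampled = sample_bootstrap(real_ages, 500, rng)

# The bootstrap can only repeat values that were observed,
# while the fitted model produces values that never occurred in the data.
print("bootstrap values all observed:", set(resampled) <= set(real_ages))
print("synthetic mean:", round(statistics.mean(synthetic), 1))
```

Both approaches depend entirely on the 200 real values: the bootstrap reuses them directly, while the fitted model reuses them through its estimated parameters.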
[00:07:30] Alexander: When it comes to this underlying data, let’s go into details there. So this underlying data is, for example, the typical baseline characteristics like age, gender, [00:07:45] severity of the disease. Would it also include data sets like pre-treatment, comorbidities, or ongoing safety events?
[00:07:55] Holger: In principle, from our algorithmic point of view, we [00:08:00] do not make particular assumptions about what the individual variables are that go into such a model.
[00:08:06] They could be of different nature. Of course, demographics play an important role, as do comorbidities, measures of disease [00:08:15] severity, et cetera. And of course, and this is something that we have particularly worked on during the last years, we are interested in learning this data not only at the level of [00:08:30] baseline characteristics, but really also longitudinally.
[00:08:33] Yeah. So at the end of the day, we generate synthetic patient trajectories.
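To illustrate what a synthetic patient trajectory could look like, here is a deliberately simple sketch. It assumes each patient follows a straight line with a personal intercept and slope, fits those per patient, and then samples new intercepts and slopes independently. The visit schedule, score scale, and all numbers are hypothetical, and this linear toy model is not the generative approach used in the project.

```python
import random
import statistics

VISITS = [0, 6, 12, 18, 24]  # months since baseline (hypothetical schedule)

def fit_line(times, scores):
    """Ordinary least-squares intercept and slope for one patient."""
    t_mean = statistics.mean(times)
    s_mean = statistics.mean(scores)
    sxx = sum((t - t_mean) ** 2 for t in times)
    sxy = sum((t - t_mean) * (s - s_mean) for t, s in zip(times, scores))
    slope = sxy / sxx
    return s_mean - slope * t_mean, slope

def fit_population(trajectories):
    """Summarize per-patient intercepts and slopes by mean and standard deviation."""
    fits = [fit_line(VISITS, traj) for traj in trajectories]
    intercepts = [f[0] for f in fits]
    slopes = [f[1] for f in fits]
    return {
        "intercept": (statistics.mean(intercepts), statistics.stdev(intercepts)),
        "slope": (statistics.mean(slopes), statistics.stdev(slopes)),
    }

def sample_trajectory(params, rng, noise_sd=1.0):
    """Draw a new intercept and slope, then generate one noisy trajectory."""
    intercept = rng.gauss(*params["intercept"])
    slope = rng.gauss(*params["slope"])
    return [intercept + slope * t + rng.gauss(0.0, noise_sd) for t in VISITS]

rng = random.Random(7)
# Hypothetical "real" cohort: a cognitive score that declines slowly over time.
true_params = {"intercept": (26.0, 2.0), "slope": (-0.15, 0.05)}
real = [sample_trajectory(true_params, rng) for _ in range(150)]

params = fit_population(real)
synthetic = [sample_trajectory(params, rng) for _ in range(100)]
print("estimated mean slope per month:", round(params["slope"][0], 3))
```

A real model would also have to capture correlations between variables, irregular visit times, and dropout, none of which this sketch attempts.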
[00:08:38] Alexander: Where is that patient level data coming from?
[00:08:42] Holger: This comes from different [00:08:45] studies. So in our case, for example, one of the collaborators in our project has access to a larger set of Alzheimer’s disease studies, and they are made available to us.
[00:08:59] [00:09:00] And now we are able to train our models on this data. And then another work stream, which starts in the near future, is to additionally leverage real-world data. This is, of course, [00:09:15] data of a very different nature. This is data from hospitals in OIA and Valencia. And there also, the idea is to leverage this type of data to build generative models. [00:09:30]
[00:09:30] Real-world data has different characteristics than data from clinical studies, because in real-world data, of course, you typically do not have this high depth of assessment of individual patients. [00:09:45] You have a much more shallow, let’s say, assessment. For example, in the case of Alzheimer’s disease, you may not see a lot of standardized questionnaires that assess cognitive [00:10:00] performance,
[00:10:00] just something at the diagnostic level: this person is, for example, cognitively impaired or demented. You may see imaging data, comorbidities, and prescriptions, but very irregular compared to a [00:10:15] study, much less systematic, which generates its own challenges.
[00:10:18] Alexander: Yes. These are two extremes. Besides hospital data,
[00:10:21] another extreme would be claims data. There’s yet another set of data: prospective observational studies [00:10:30] in Alzheimer’s. Do you have access to these as well?
[00:10:33] Holger: Yes, exactly. As pointed out, via one of our collaborators, we have access to a larger number of cohort studies in the Alzheimer’s field.
[00:10:42] It’s also worthwhile to mention that through [00:10:45] efforts over the last years, we have also collected a larger number of studies in this area at our own institute. We have recently counted the total aggregate number of patients in all of these studies, which now amounts to about [00:11:00] 65,000 or so. So we have a significant data stack.
[00:11:03] Alexander: Yeah, that’s definitely quite a sample size to work on. What are the ultimate goals and use cases of such synthetic data? [00:11:15]
[00:11:15] Holger: So there are, of course, different ideas why people are interested in synthetic data. One relatively practical one is the original idea, actually: facilitating data sharing.
[00:11:29] So [00:11:30] obviously, sharing patient-level data is incredibly difficult in Europe, because you run into all sorts of legal problems. Therefore, one of the original thoughts behind synthetic data [00:11:45] generation was to have a sneak preview for data scientists, something that mimics characteristics of real data and allows people to develop code while waiting for access to the real data.
[00:11:59] [00:12:00] This is also the reason why we are, in Germany, also partners in this national health data infrastructure. So we delivered an approach to generate synthetic data, but also, for example, an assessment tool [00:12:15] that says how good or bad such synthetic data is compared to the real data, and what the associated risks might be, the privacy risks that are still discussed with synthetic data.
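What such an assessment might check can be sketched with two commonly used measures: a two-sample Kolmogorov-Smirnov statistic as a rough fidelity check on one variable, and the distance to the closest real record as a rough privacy check. Choosing these two measures is an assumption made for illustration; the conversation does not say which metrics the group's tool uses, and the patient records here are simulated.

```python
import bisect
import math
import random

def ks_statistic(a, b):
    """Two-sample Kolmogorov-Smirnov statistic: largest gap between empirical CDFs."""
    a_sorted, b_sorted = sorted(a), sorted(b)
    gap = 0.0
    for x in a_sorted + b_sorted:
        cdf_a = bisect.bisect_right(a_sorted, x) / len(a_sorted)
        cdf_b = bisect.bisect_right(b_sorted, x) / len(b_sorted)
        gap = max(gap, abs(cdf_a - cdf_b))
    return gap

def distance_to_closest_record(synthetic_rows, real_rows):
    """For each synthetic row, the Euclidean distance to its nearest real row."""
    return [min(math.dist(s, r) for r in real_rows) for s in synthetic_rows]

def ages(rows):
    return [row[0] for row in rows]

rng = random.Random(0)

def draw_patients(n, age_mean):
    """Hypothetical records of (age, cognitive score)."""
    return [(rng.gauss(age_mean, 8.0), rng.gauss(24.0, 3.0)) for _ in range(n)]

real = draw_patients(300, 72.0)
good_synthetic = draw_patients(300, 72.0)     # same distribution, new records
shifted_synthetic = draw_patients(300, 88.0)  # wrong age distribution
copied_synthetic = real[:50]                  # leaks real records verbatim

ks_good = ks_statistic(ages(real), ages(good_synthetic))
ks_shifted = ks_statistic(ages(real), ages(shifted_synthetic))
dcr_good = min(distance_to_closest_record(good_synthetic, real))
dcr_copied = min(distance_to_closest_record(copied_synthetic, real))

print("KS, faithful vs shifted:", round(ks_good, 2), round(ks_shifted, 2))
print("closest-record distance, new vs copied:", round(dcr_good, 3), dcr_copied)
```

A small KS statistic suggests the synthetic variable resembles the real one, while a closest-record distance of zero flags a synthetic row that reproduces a real patient exactly. Neither measure on its own is enough to declare a data set safe or useful.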
[00:12:28] So this is the [00:12:30] one idea. Then there are also, and this has become the main motivator behind this SYNTHIA IHI project, further ideas, such as that we could augment real data with synthetic data to [00:12:45] expand sample sizes. This is already done a lot in the imaging field, where you have, for example, an object that you simply rotate,
[00:12:56] and you thereby generate new views. This has been shown to [00:13:00] add strong value when you train AI models. So now, of course, an open question is whether something similar is also possible for other types of data in medicine, for more structured data. [00:13:15]
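The rotation example from imaging can be shown in a few lines. This sketch treats an image as a plain grid of pixel values and produces the three additional 90-degree views; real augmentation pipelines use many more transformations, and the tiny grid here is made up.

```python
def rotate90(image):
    """Rotate a 2D grid of pixel values 90 degrees clockwise."""
    return [list(row) for row in zip(*image[::-1])]

def augment_with_rotations(image):
    """Return the original image plus its 90, 180, and 270 degree rotations."""
    views = [image]
    for _ in range(3):
        views.append(rotate90(views[-1]))
    return views

# A tiny hypothetical 2x2 "image".
image = [[1, 2],
         [3, 4]]

views = augment_with_rotations(image)
for view in views:
    print(view)
```

One real image becomes four training examples. The open question raised in the conversation is what the equivalent of a rotation would be for structured patient data, where there is no obvious transformation that leaves the clinical meaning unchanged.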
[00:13:15] Alexander: Extrapolation of data?
[00:13:18] Holger: In a way, yes. And a third motivator can be to facilitate
[00:13:22] clinical studies in the future. Can we generate synthetic control arms, for example, specifically in areas where [00:13:30] recruitment of patients might be difficult because the disease is rare, or there is a subgroup of patients that is difficult to get? Think, for example, about children or pregnant women.
[00:13:43] These are other [00:13:45] motivators. Depending on how you do it, synthetic data may allow us to interpolate between real data points, for example, longitudinally between visits. Synthetic controls could also help to design a future study, because [00:14:00] if you design a new trial, you have to think about when patients should come in.
[00:14:04] If you have an idea how this could look in the different scenarios, this could help.
[00:14:10] Alexander: I want to go back to what you mentioned about patients. Could it [00:14:15] be that, for example, in a typical phase three study, where we very often have a one-to-one ratio between active and placebo, instead of one-to-one we have a two-to-one, [00:14:30] three-to-one, or four-to-one ratio, with far fewer patients on placebo, and augment them with synthetic patients?
[00:14:40] Holger: That might be a possibility, yes. Of course, if you have [00:14:45] certain subgroups of patients that you think might be relevant, but they are underrepresented in your study, then you may be able to oversample this subgroup, if you want, through this synthetic data [00:15:00] generation.
[00:15:00] Alexander: Okay. So, for example, for very old patients in Alzheimer’s studies,
[00:15:05] very old or very young, or for patients that have a specific pretreatment or comorbidity. I think this is [00:15:15] only possible for placebo patients or comparator patients, not for active patients. Would you agree with that?
[00:15:22] Holger: Yes. Now, of course, there are thoughts to use generative AI models for developing [00:15:30] digital twins.
[00:15:31] With a digital twin, the idea is to ask, at a given point in time, the question: can we simulate a statistical distribution of possible future disease [00:15:45] trajectories? And now you can imagine, if I already had an idea what the effect of a certain given medication, a lifestyle intervention, or whatever was, if I knew the effect size
[00:15:58] [00:16:00] on average, then I would be able to also counterfactually simulate such an intervention. And that is also a different way synthetic data might help.
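The counterfactual idea can be sketched as follows: for one patient, simulate a distribution of possible future trajectories, once as they are and once with an assumed average treatment effect applied. The linear decline, the visit schedule, the 30% effect size, and the function name `simulate_trajectories` are all hypothetical choices for illustration, not a description of an actual digital twin model.

```python
import random
import statistics

MONTHS = [0, 6, 12, 18, 24]  # hypothetical visit schedule

def simulate_trajectories(baseline, n, seed, effect=0.0,
                          slope_mean=-0.15, slope_sd=0.05):
    """Simulate n possible future score trajectories for one patient.

    effect is the assumed average fractional slowing of decline
    (0.3 means the slope is 30% flatter). All numbers are illustrative.
    """
    rng = random.Random(seed)
    trajectories = []
    for _ in range(n):
        slope = rng.gauss(slope_mean, slope_sd) * (1.0 - effect)
        trajectories.append([baseline + slope * t for t in MONTHS])
    return trajectories

def mean_final_score(trajectories):
    """Average score at the last visit across all simulated trajectories."""
    return statistics.mean([traj[-1] for traj in trajectories])

# Using the same seed pairs each simulated future with its counterfactual.
untreated = simulate_trajectories(baseline=26.0, n=1000, seed=42)
treated = simulate_trajectories(baseline=26.0, n=1000, seed=42, effect=0.3)

print("mean score at 24 months, untreated:", round(mean_final_score(untreated), 2))
print("mean score at 24 months, treated:", round(mean_final_score(treated), 2))
```

The simulated benefit is only as trustworthy as the assumed effect size that goes in; the simulation cannot discover an effect on its own.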
[00:16:11] Alexander: Yeah. Because [00:16:15] synthetic patients come more or less free of charge, how can we think about sample sizes when we basically get patients for free?
[00:16:26] Does that work?
[00:16:27] Holger: I would say not for free. [00:16:30] These generative models are trained on real data. That means they are as good or bad as the real data.
[00:16:37] Alexander: Okay.
[00:16:38] Holger: And of course they are also as representative or not as the real data. You cannot circumvent these [00:16:45] limitations.
[00:16:46] Alexander: Yeah. But let’s say I have a clinical trial with four patients on active for
[00:16:55] every one on placebo, and I sample, let’s say, [00:17:00] the same number of placebo patients from generative AI, or triple or quadruple that number of patients. In what sense do I get [00:17:15] penalized for having more virtual patients?
[00:17:19] Holger: You are not penalized actively. Of course, from, for example, a study design perspective, you can also calculate through different scenarios in terms of sample size. [00:17:30]
[00:17:30] But again, you have to keep in mind at this point that the model and the synthetic data are based, at the end of the day, on real data, which has limitations in representativeness. There’s [00:17:45] no free lunch in that sense, because statistics still holds true.
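The "no free lunch" point can be checked with a small simulation. It assumes a Gaussian outcome with known true mean, a real sample of 40 patients, and 4,000 synthetic patients drawn from a Gaussian fitted to those 40; all numbers are hypothetical. The question is how far the synthetic mean lands from the truth.

```python
import math
import random
import statistics

TRUE_MEAN, TRUE_SD = 50.0, 10.0   # hypothetical population values
N_REAL, N_SYNTHETIC = 40, 4000    # small real sample, large synthetic sample

def synthetic_mean_error(rng):
    """Fit a Gaussian to a small real sample, draw many synthetic values from
    the fit, and return the error of the synthetic mean against the truth."""
    real = [rng.gauss(TRUE_MEAN, TRUE_SD) for _ in range(N_REAL)]
    mu_hat, sd_hat = statistics.mean(real), statistics.stdev(real)
    synthetic = [rng.gauss(mu_hat, sd_hat) for _ in range(N_SYNTHETIC)]
    return statistics.mean(synthetic) - TRUE_MEAN

rng = random.Random(1)
errors = [synthetic_mean_error(rng) for _ in range(200)]
rmse = math.sqrt(statistics.mean([e * e for e in errors]))

real_se = TRUE_SD / math.sqrt(N_REAL)         # precision of 40 real patients
naive_se = TRUE_SD / math.sqrt(N_SYNTHETIC)   # precision if all 4000 were real

print(f"error of the synthetic mean (RMSE): {rmse:.2f}")
print(f"standard error with {N_REAL} real patients: {real_se:.2f}")
print(f"standard error if all {N_SYNTHETIC} were real: {naive_se:.2f}")
```

The error of the synthetic mean stays close to the standard error of the 40 real patients, not to the much smaller value that 4,000 real patients would give. Treating synthetic patients as if they were independent new observations would therefore overstate the precision of the study.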
[00:17:51] Alexander: Yes. Can you describe a little bit about the cost of it?
[00:17:55] Of course, there is training the model and so on, but [00:18:00] once it is trained, is it free thereafter?
[00:18:03] Holger: In a way. Training such models costs time and compute, and you need to optimize these models. But yes, after the model has been trained, of course you [00:18:15] can sample as much as you want. Think about the situation where I simply learn, or estimate, the parameters of a Gaussian distribution.
[00:18:24] And then, of course, if I have that, and if these parameters are good estimators [00:18:30] of the population characteristics, I can sample from it as long as I want.
[00:18:35] Alexander: Yeah.
[00:18:36] Holger: That is, I think, a good analogy.
[00:18:38] Alexander: When do you think you will get some kind of concrete examples from this [00:18:45] IHI project?
[00:18:46] Holger: This project started in September and runs for five years. We are generating first results internally and will publish them. I should say that we have, in principle, been in this [00:19:00] field for quite a number of years now, so synthetic data generation, and we have also published approaches with real data in the Parkinson’s field as well as in the Alzheimer’s field.
[00:19:11] So we are not starting from scratch here with what [00:19:15] we are doing.
[00:19:16] Alexander: I think that is very often the case for these public-private partnerships. There’s usually some basis that you build on. Very often, even after the five years have ended, there’s some kind of further collaboration and [00:19:30] potentially follow-up IHI projects.
[00:19:34] Holger: Absolutely, as a sequel. And this is, of course, what we are interested in. At the end of the day, it’s a network. This is how I see it, one that we are part of and like to be part of, [00:19:45] and of course we are interested that this is long term. You know the people, they know us. There’s mutual trust.
[00:19:51] Alexander: We’ll definitely put the link to this
[00:19:54] IHI project into the show notes so you can see [00:20:00] what kind of organizations are involved and see whether potentially your organization, your company, your university is already part of that. These collaborations tend to be quite big, especially for such [00:20:15] areas that are of interest to many universities and many, many companies.
[00:20:20] Thanks. For the listener interested in learning more about creating data based on generative [00:20:30] AI, what would be a good step?
[00:20:33] Holger: You can contact me on that matter. There are publications from a scientific point of view, from us, but of course also from others. We are also at the moment, [00:20:45] out of this SYNTHIA project, preparing a review paper on existing approaches there,
[00:20:51] also for different data modalities. Feel free to contact me directly.
[00:20:57] Alexander: Thanks so much for that great offer, and thanks a [00:21:00] lot for this awesome discussion about synthetic data, about IHI, and about how that will change the way we think about running clinical trials in the future. I think that is a really super interesting [00:21:15] field.
[00:21:15] It will be interesting to see the regulatory view and the payer view on these topics, and how the whole scientific community will think about this.
[00:21:25] Holger: Absolutely. In SYNTHIA, there is a dedicated work stream on that. [00:21:30] Thank you for this first discussion.
[00:21:36] Alexander: This show was created in association with PSI. Thanks to Reine and her team at VVS in the background, [00:21:45] and thank you for listening. Reach your potential, lead great science, and serve patients. Just be an effective [00:22:00] statistician.
Join The Effective Statistician LinkedIn group
This group was set up to help each other to become more effective statisticians. We’ll run challenges in this group, e.g. around writing abstracts for conferences or other projects. I’ll also post into this group further content.
I want to help the community of statisticians, data scientists, programmers and other quantitative scientists to be more influential, innovative, and effective. I believe that as a community we can help our research, our regulatory and payer systems, and ultimately physicians and patients take better decisions based on better evidence.
I work to achieve a future in which everyone can access the right evidence in the right format at the right time to make sound decisions.
When my kids are sick, I want to have good evidence to discuss with the physician about the different therapy choices.
When my mother is sick, I want her to be able to understand the evidence.
When I get sick, I want to find evidence that I can trust and that helps me to have meaningful discussions with my healthcare professionals.
I want to live in a world, where the media reports correctly about medical evidence and in which society distinguishes between fake evidence and real evidence.
Let’s work together to achieve this.
