Adaptive designs let us learn earlier, stop smarter, and protect patients—but they also make communication tricky. In this episode, Kaspar Rufibach and I dig into what “still correct” looks like when you try to explain results from group-sequential and other adaptive trials to regulators, clinicians, and scientific audiences. We unpack conditional vs. unconditional bias, median-unbiased estimation, stage-wise ordering for p-values, confidence intervals in multi-stage settings, and what to do with secondary endpoints and multiplicity. We also touch on ICH E20 (the draft ICH guideline on adaptive designs for clinical trials) and why pre-specification isn’t just a box-tick—it’s what builds trust.

Why You Should Listen:

✔ You need clear, defensible language for papers, conferences, and labels when your study had interims and stopping rules.

✔ You’ll learn practical rules-of-thumb for when “naïve” estimates are okay—and when to adjust.

✔ You’ll hear what regulators typically focus on vs. what patients and clinicians actually want to know.

Episode Highlights:

02:00 – Why communicating adaptive results is hard (and how simple can still be correct)

04:14 – What bias are we actually interested in? Conditional vs. unconditional

07:20 – Consequences for point estimates and confidence intervals

09:15 – Ordering the sample space across stages; stage-wise ordering and p-values

12:23 – Median-unbiased estimation: what it is and when to use it

13:38 – Secondary endpoints, safety, and multiplicity strategies

16:13 – Estimation efficiency vs. unbiasedness: what should we optimize?

17:40 – Communicating to scientific vs. lay audiences

18:36 – Should we publish p-values for secondary endpoints in adaptive trials?

20:20 – No one-size-fits-all template—and why fairness matters across programs

20:30 – Pre-planning or bust: why post-hoc “fixes” don’t carry the properties we need

21:49 – Trust, reproducibility, and credible decision-making

23:16 – ICH E20: read it, comment, improve it

Links:

🔗 ICH E20 (Adaptive Designs for Clinical Trials) – draft guidance: worth reading for its perspective on estimation and communication.

🔗 The Effective Statistician Academy – I offer free and premium resources to help you become a more effective statistician.

🔗 Medical Data Leaders Community – Join my network of statisticians and data leaders to enhance your influencing skills.

🔗 My New Book: How to Be an Effective Statistician – Volume 1 – It’s packed with insights to help statisticians, data scientists, and quantitative professionals excel as leaders, collaborators, and change-makers in healthcare and medicine.

🔗 PSI (Statistical Community in Healthcare) – Access webinars, training, and networking opportunities.

Join the Conversation:
Did you find this episode helpful? Share it with your colleagues and let me know your thoughts! Connect with me on LinkedIn and be part of the discussion.

Subscribe & Stay Updated:
Never miss an episode! Subscribe to The Effective Statistician on your favorite podcast platform and continue growing your influence as a statistician.



Kaspar Rufibach

Expert Biostatistician at Merck

Kaspar is an Expert Statistical Scientist in Roche’s Methods, Collaboration, and Outreach group and is located in Basel.

He does methodological research, provides consulting to Roche statisticians and broader project teams, gives biostatistics training for statisticians and non-statisticians in- and externally, mentors students, and interacts with external partners in industry, regulatory agencies, and the academic community in various working groups and collaborations.

He has co-founded and co-leads the European special interest group “Estimands in oncology” (sponsored by PSI and EFSPI), which also has the status of an ASA scientific working group within the ASA Biopharmaceutical Section, and which currently has 39 members representing 23 companies, 3 continents, and several Health Authorities. The group works on various topics around estimands in oncology.

Kaspar’s research interests are methods to optimize study designs, advanced survival analysis, probability of success, estimands and causal inference, estimation of treatment effects in subgroups, and general nonparametric statistics. Before joining Roche, Kaspar received training and worked as a statistician at the Universities of Bern, Stanford, and Zurich.

More on the oncology estimand WG: http://www.oncoestimand.org
More on Kaspar: http://www.kasparrufibach.ch

Transcript

[00:00:00] Alexander: You are listening to the Effective Statistician Podcast, the weekly podcast with Alexander Schacht and Benjamin Piske, designed to help you reach your potential, lead great science, and serve patients while having a great [00:00:15] work-life balance.

[00:00:22] Alexander: In addition to our premium courses on the Effective Statistician Academy, we also have [00:00:30] lots of free resources for you across all kinds of different topics within that academy. Head over to theeffectivestatistician.com and find the Academy and much [00:00:45] more to help you become an effective statistician. I’m producing this podcast in association with PSI.

[00:00:53] Alexander: PSI is a community dedicated to leading and promoting the use of statistics within the health industry for the benefit of [00:01:00] patients. Join PSI today to further develop your statistical capabilities, with access to the ever-growing video-on-demand content library, free registration to all PSI webinars, and much, much more.

[00:01:13] Alexander: Head over to the [00:01:15] PSI website at psiweb.org to learn more about PSI activities and become a PSI member today.

[00:01:28] Alexander: Welcome to another [00:01:30] episode with Kaspar. Hi Kaspar. How are you doing? Hi, Alexander. I’m doing fine, thanks. How are you? Good. I’m doing fine as well. So in the middle of the summer, it doesn’t look like summer outside. But that’s actually quite okay if you’re [00:01:45] working inside anyway and without air conditioning.

[00:01:48] Alexander: Today I want to talk about a topic that has been on my mind for a very long time, because it’s not really straightforward and I have [00:02:00] seen lots of discussions about it and lots of different approaches to it. And I’m not even sure whether there’s a right way to do it. So today is really just [00:02:15] about how you can communicate things in such a way that they are still correct and also understood

[00:02:24] Alexander: by the audience. And we want to talk especially about the case [00:02:30] of adaptive studies and the communication of their results. I think a couple of these topics might also apply to things like multiplicity and the p-values around that. But [00:02:45] for the case today, let’s just imagine we have an adaptive design.

[00:02:52] Alexander: So, let’s say, a very straightforward group-sequential design with multiple stages, where [00:03:00] at every interim analysis you can stop both for futility as well as for efficacy. And now the first point is, you get to the reporting stage, and [00:03:15] let’s say you want to report all your results, the primary efficacy, the secondary efficacy data, as well as your safety data, in a manuscript.

[00:03:26] Alexander: The first challenge is: now [00:03:30] you can’t just report it as if it had been a non-adaptive design. So what do you suggest we keep in mind when we report that? What are some first, bigger pieces of guidance that [00:03:45] come to your mind?

[00:03:46] Kaspar: Thanks Alexander for bringing this up.

[00:03:48] Kaspar: You are opening a very big box here, which I have to say is maybe not fully satisfactorily solved in general. [00:04:00] One aspect is: if you stop after stage one, I think you can just use what you have, because then the inference is no different. So that’s the easy case. But assume you have a group-sequential design.

[00:04:14] Kaspar: You have [00:04:15] an interim analysis for, say, efficacy and futility after 50% of information, you don’t stop, and then you stop maybe at an efficacy interim after 80% of information. What are the issues? [00:04:30] The issue in general is this: in effect, you have tuned this design to have a hypothesis test with type I error protection.

[00:04:42] Kaspar: That’s how you design those things. [00:04:45] So the hypothesis test piece is taken care of, and this is also what a lot of regulatory guidances are about: you have to be able to reject the null under type I error [00:05:00] protection. But that’s only the first piece. Of course, in a drug label, a patient is not so much interested in whether you have rejected the null under type I error protection.

[00:05:11] Kaspar: What a patient, who is the ultimate stakeholder [00:05:15] of what we do, is interested in is: how large is the treatment effect if I take the drug, if I adhere to the drug, with all these caveats. However, estimation of that treatment effect might not be straightforward if you have a [00:05:30] design with multiple interim analyses, that’s true.

[00:05:33] Kaspar: The first thinking I would go through is: are there scenarios where you can simplify things, where you can say, I know in theory the effect estimate is biased, but maybe in this [00:05:45] or that scenario it doesn’t matter so much? And there is actually quite some literature about that for group-sequential designs, where you have some rules of thumb.

[00:05:54] Kaspar: If you have a reasonable error-spending function, if your interim is after [00:06:00] 50% of information, and if you then stop for efficacy, we know your treatment effect estimate is biased, but the bias may not be so big.

[00:06:10] Kaspar: So that already gives you some idea. There is literature around [00:06:15] that.

[00:06:15] Kaspar: The next step then is: if you want to be very precise, and typically as statisticians we want to be, the next question to ask is, what bias am I actually interested in? Because in these kinds of designs there are [00:06:30] different types of biases. The literature talks about conditional bias and unconditional bias.

[00:06:36] Kaspar: So what are they? The conditional bias looks at the expectation of your estimate conditional on stopping [00:06:45] at the very stage at which you stopped.

[00:06:47] Kaspar: And

[00:06:48] Kaspar: if you want to have an estimator that accounts for that conditional bias, it can actually shrink your effect estimate quite dramatically towards the null. [00:07:00] It always shrinks it, doesn’t it? Here, I think it does. But the point is: is this really what you are interested in? Because ultimately, when we talk about bias or statistical inference, what we are interested in, at least for [00:07:15] frequentist inference, is this repeated-sampling paradigm.

[00:07:20] Kaspar: If you repeat the trial many times, on average I want my estimate to be on point. This is what unbiasedness means. But if you [00:07:30] condition on the stage where you stopped, this property is lost. So that’s why people, and I am among them, advocate for looking at the unconditional bias. So that’s not conditioning on the stage where you stopped.

[00:07:44] Kaspar: But [00:07:45] this gives you proper inference in this “if the trial is repeated many times” paradigm. Still, the sampling distribution of your estimator is not just a normal, as you [00:08:00] have when you just have one-stage trials, but a mixture of truncated normals, because you always truncate at the rejection boundary.

[00:08:09] Kaspar: So inference is a bit more tricky. But typically, when we talk about that [00:08:15] unconditional bias, that bias, for reasonably designed trials and reasonable interims, is not so substantial. And you can correct for it, but that’s maybe one thing we can discuss later. Yeah. Sorry, I interrupted you.
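
To make the conditional vs. unconditional distinction concrete, here is a minimal simulation sketch (not from the episode): a two-stage group-sequential trial with efficacy stopping only, a known-variance z-test, and an approximate O'Brien-Fleming stage-1 boundary. The sample sizes, the true effect, and the boundary value are illustrative assumptions.

```python
# Illustrative only: naive estimation in a two-stage group-sequential design,
# comparing the unconditional bias with the bias conditional on an early stop.
import numpy as np

rng = np.random.default_rng(2024)

n_stage = 50        # patients per arm per stage (stage 2 is cumulative: 100/arm)
delta_true = 0.3    # assumed true standardized treatment effect
c1 = 2.797          # approximate O'Brien-Fleming stage-1 efficacy boundary (one-sided)
n_sim = 100_000

est_all, est_stage1 = [], []
for _ in range(n_sim):
    # stage-1 data and z-statistic
    trt1 = rng.normal(delta_true, 1.0, n_stage)
    ctl1 = rng.normal(0.0, 1.0, n_stage)
    diff1 = trt1.mean() - ctl1.mean()
    z1 = diff1 / np.sqrt(2.0 / n_stage)
    if z1 >= c1:        # early stop for efficacy: report the naive stage-1 estimate
        est_all.append(diff1)
        est_stage1.append(diff1)
    else:               # continue and report the naive cumulative estimate
        trt2 = rng.normal(delta_true, 1.0, n_stage)
        ctl2 = rng.normal(0.0, 1.0, n_stage)
        est_all.append(np.concatenate([trt1, trt2]).mean()
                       - np.concatenate([ctl1, ctl2]).mean())

print(f"true effect:                          {delta_true:.3f}")
print(f"unconditional mean of naive estimate: {np.mean(est_all):.3f}")
print(f"mean given an early stop:             {np.mean(est_stage1):.3f} "
      f"({len(est_stage1) / n_sim:.1%} of trials stop early)")
```

With these illustrative values, the unconditional mean of the naive estimate sits only slightly above the true effect, while the mean conditional on an early stop is substantially larger, which is exactly the distinction discussed above.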

[00:08:28] Alexander: So that’s the [00:08:30] first thing, the point estimate, yeah, for your treatment effect for the primary outcome. The same is also true for the confidence interval, isn’t it?

[00:08:40] Kaspar: Yes. You can account for that.

[00:08:43] Alexander: It’s a little, again, more [00:08:45] tricky because you don’t have this nice bell curve anymore. Yes. Yeah. You have a curve with ups and downs, so it’s multimodal, and your confidence interval is not, so to say, of the smallest length. It can be [00:09:00] actually longer because you have these, so to say, holes in it, yeah,

[00:09:04] Alexander: so that you cover those as well.

[00:09:05] Kaspar: Yeah, it’s true. Inference in general is more tricky. There is another complexity, because you have [00:09:15] a sample space that looks more tricky. Imagine you have a two-stage trial and you have an outcome. You can either stop at the first stage or at the second, and if you want to compute a p-value [00:09:30], for a p-value we need a concept of ordering, because the p-value is: what is

[00:09:37] Kaspar: My probability to observe what I have seen or something more extreme. 

[00:09:41] Alexander: Yeah. 

[00:09:42] Kaspar: But now, if you have a multi-stage [00:09:45] trial: if you have two results and both stopped at the same stage and one effect is larger than the other, the ordering is clear.

[00:09:53] Alexander: Yeah. But 

[00:09:54] Kaspar: how do I compare a trial, or a result, that has stopped [00:10:00] at the first stage and has some effect estimate with one that has stopped at the second stage and has some effect estimate?

[00:10:06] Kaspar: Yeah. And there are different approaches to how you can order that sample space. Typically, what people have zoomed in on is what we [00:10:15] call stage-wise ordering, which means you first look at the stage, and within the same stage you just compare the treatment effect. And something that is stopped earlier always counts as

[00:10:28] Kaspar: a larger [00:10:30] effect than something that is stopped later. So that’s what people call stage-wise ordering. And this has some useful properties. For example, when you reject the null hypothesis, your p-value is always smaller than the respective alpha at that stage. And also, loosely speaking, [00:10:45] your p-value does not depend on the future.

[00:10:47] Kaspar: It does not depend on information levels of later stages, beyond the stage at which you stopped. So these are all these technical complexities, but if you say, I buy into this stage-wise ordering, you can actually do inference, and you can compute something like median-unbiased estimators of a treatment effect.
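
As a concrete illustration of stage-wise ordering (not from the episode), here is a minimal sketch of the one-sided stage-wise p-value for a two-stage design, using only the joint normality of the stage-wise test statistics. The information fractions and the stage-1 boundary are assumed values; a real analysis would use validated group-sequential software.

```python
# Illustrative only: stage-wise-ordered p-value for a two-stage design.
import numpy as np
from scipy.stats import norm, multivariate_normal

t1, t2 = 0.5, 1.0   # information fractions at the interim and the final analysis
c1 = 2.797          # stage-1 efficacy boundary (approximate O'Brien-Fleming, one-sided)
rho = np.sqrt(t1 / t2)
bvn_h0 = multivariate_normal(mean=[0.0, 0.0], cov=[[1.0, rho], [rho, 1.0]])

def stagewise_p(stage, z_obs):
    """One-sided p-value under stage-wise ordering: an earlier stop counts as more extreme."""
    if stage == 1:                 # stopped at the interim with z1 = z_obs
        return norm.sf(z_obs)
    # reached the final analysis with z2 = z_obs:
    # P(cross the stage-1 boundary) + P(continue and still see Z2 >= z_obs)
    return norm.sf(c1) + (norm.cdf(c1) - bvn_h0.cdf([c1, z_obs]))

print(stagewise_p(1, 3.0))   # early efficacy stop with z1 = 3.0
print(stagewise_p(2, 2.1))   # final analysis with z2 = 2.1
```

Because an early stop is always counted as more extreme, a trial that reaches the final analysis can never obtain a p-value below P(Z1 ≥ c1), and nothing beyond the stage at which you stop enters the calculation, which reflects the "does not depend on the future" property mentioned above.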

[00:11:09] Kaspar: And that is maybe quite a straightforward way to assess whether you have bias: if your naive estimate, the one [00:11:15] ignoring the conditional sampling that you have once you pass an interim analysis, is maybe not too different from a median-unbiased [00:11:30] estimator, which you can compute with any reasonable software,

[00:11:34] Kaspar: yeah, then maybe there’s not much to worry about. If these two are very discrepant, maybe there’s some reason to look into things more carefully.

[00:11:41] Alexander: So median-unbiased is basically the same as [00:11:45] unbiased, just applying the median instead of the mean. Yeah.

[00:11:49] Kaspar: So median-unbiased means you look at a confidence interval, basically at the upper limit of a one-sided 50% [00:12:00] confidence interval. Once you have agreed on an ordering of the sample space, you can not only compute p-values, you can also compute confidence intervals. So you basically look at a one-sided 50% confidence interval, and [00:12:15] this upper limit is your median-unbiased estimator, and you then also have a concept of a confidence interval for that immediately.
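
Staying with the same illustrative two-stage setup, a median-unbiased estimate can be sketched as the effect size under which the observed result has a stage-wise exceedance probability of exactly 50%, i.e. the upper limit of a one-sided 50% confidence interval. The information level, boundary, and observed statistic below are assumed values.

```python
# Illustrative only: stage-wise-ordering median-unbiased estimate via root finding.
import numpy as np
from scipy.stats import norm, multivariate_normal
from scipy.optimize import brentq

t1, t2 = 0.5, 1.0   # information fractions
I_max = 100.0       # maximum statistical information (assumed known for illustration)
c1 = 2.797          # stage-1 efficacy boundary (approximate O'Brien-Fleming, one-sided)
rho = np.sqrt(t1 / t2)

def exceed_prob(delta, stage, z_obs):
    """P_delta(result at least as extreme as the observed one, stage-wise ordering)."""
    m1, m2 = delta * np.sqrt(t1 * I_max), delta * np.sqrt(t2 * I_max)
    if stage == 1:
        return norm.sf(z_obs, loc=m1)
    bvn = multivariate_normal(mean=[m1, m2], cov=[[1.0, rho], [rho, 1.0]])
    return norm.sf(c1, loc=m1) + (norm.cdf(c1, loc=m1) - bvn.cdf([c1, z_obs]))

# Suppose the trial ran to the final analysis and we observed z2 = 2.3.
stage, z_obs = 2, 2.3
naive_estimate = z_obs / np.sqrt(t2 * I_max)
# Median-unbiased estimate: the effect under which the observed result sits exactly
# at the median of the stage-wise ordering (exceedance probability = 0.5).
median_unbiased = brentq(lambda d: exceed_prob(d, stage, z_obs) - 0.5, -1.0, 1.0)
print(f"naive estimate:           {naive_estimate:.4f}")
print(f"median-unbiased estimate: {median_unbiased:.4f}")
```

With these numbers, the adjusted estimate comes out somewhat below the naive one, which is the direction you would typically expect after a design with an efficacy interim.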

[00:12:23] Alexander: Okay. Yeah, that’s another thing because for the median, you just need the ordering and you don’t need to [00:12:30] calculate any distances, which becomes really difficult in that space. Yeah. Yeah. Okay. So now we have

[00:12:38] Alexander: we have these means, we have medians, we have confidence intervals, [00:12:45] we have p-values for the primary analysis. Now, what does that look like? Because all the studies I have worked on, and I haven’t worked so much in oncology, had [00:13:00] lots of secondary endpoints, lots of different questionnaires, quality of life, functioning, all kinds of different things.

[00:13:08] Alexander: How do we actually then report on these? Because on these, there’s no decision [00:13:15] made. However, of course, very often they are correlated to the primary endpoint

[00:13:20] Kaspar: And yet another good, very good question, and something that is often also coming up in discussions with regulators: we may design [00:13:30] a very clever adaptive design, and we design that around the primary endpoint. And then we account for the selective nature of the sampling, for example, through bias-corrected estimators. [00:13:45]

[00:13:45] Kaspar: And we run simulations to evaluate the bias that we potentially have, and then we correct for that bias, et cetera. But we only do that for the primary endpoint. There are, as you rightly say, secondary endpoints. There is safety; for safety, we typically look at what we [00:14:00] have. We don’t compute p-values or do inference

[00:14:03] Kaspar: that accounts for the adaptive nature of the design. And I think this may be one of the challenges of adaptive designs. And if you look into ICH E20, the draft, [00:14:15] maybe there are some hints that we should start to look into this a bit more. But that’s not the end of the story, because it’s not just that for secondary efficacy endpoints, for example, we often [00:14:30] ignore the selective sampling nature of an adaptive design, or a group-sequential design for that matter.

[00:14:38] Kaspar: Often they also have quite complicated graphical alpha-recycling methods for these secondary [00:14:45] endpoints. They also stand in the way of just naive inference being valid.
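
The graphical alpha-recycling idea mentioned here can be sketched generically. Below is a minimal, illustrative implementation of a sequentially rejective graphical procedure in the spirit of Bretz et al.; the p-values, weights, and two-hypothesis graph are made-up examples, and a real submission would rely on validated software.

```python
# Illustrative only: a generic sequentially rejective graphical testing procedure.
import numpy as np

def graphical_test(p, alpha, G):
    """p: p-values; alpha: initial local levels (summing to the overall alpha);
    G: transition matrix, G[i, j] = fraction of H_i's level passed to H_j on rejection."""
    p, alpha, G = np.asarray(p, float), np.asarray(alpha, float), np.asarray(G, float)
    m = len(p)
    rejected = np.zeros(m, dtype=bool)
    while True:
        # any not-yet-rejected hypothesis significant at its current local level?
        candidates = [i for i in range(m) if not rejected[i] and p[i] <= alpha[i]]
        if not candidates:
            return rejected
        j = candidates[0]
        rejected[j] = True
        # recycle H_j's level along its outgoing edges
        for l in range(m):
            if not rejected[l]:
                alpha[l] += alpha[j] * G[j, l]
        alpha[j] = 0.0
        # update the transition weights among the remaining hypotheses
        G_new = np.zeros_like(G)
        for l in range(m):
            for k in range(m):
                if rejected[l] or rejected[k] or l == k:
                    continue
                denom = 1.0 - G[l, j] * G[j, l]
                if denom > 0:
                    G_new[l, k] = (G[l, k] + G[l, j] * G[j, k]) / denom
        G = G_new

# Hierarchical testing as a two-node graph: the primary endpoint gets the full alpha
# and, once rejected, passes it on to the secondary endpoint.
print(graphical_test(p=[0.010, 0.020], alpha=[0.025, 0.0], G=[[0, 1], [1, 0]]))
```

The same machinery covers larger graphs with several secondary endpoints and partial weights; the point in the discussion is that once such recycling is in play, naive p-values and confidence intervals for the secondary endpoints no longer automatically carry their nominal interpretation.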

[00:14:53] Alexander: Yeah. 

[00:14:53] Kaspar: So what do we do there? We have actually two aspects that complicate matters. One [00:15:00] is that sometimes you see very complicated graphical procedures.

[00:15:05] Kaspar: And again, the primary focus is on what regulators push industry on. Basically, and this is somehow [00:15:15] common wisdom, and it might be true or not true in one or the other instance, but in general I think it’s fair to say: what you have type I error protected, and where you get significance with rejection of the null for that specific endpoint,

[00:15:28] Kaspar: you have a good chance that it ends [00:15:30] up in the label. However, somehow the statistical theory and the statistical pre-specification when designing trials very often stops there. We don’t say what the inference is then, and [00:15:45] then there are multiple questions: should the inference take into account, first,

[00:15:50] Kaspar: the fact that you have an adaptive design, and second, the fact that your endpoints are part of a complicated type I error strategy? [00:16:00] Then the next question, which I think we sometimes underappreciate: is bias the criterion we should optimize for, or should it not rather be estimation efficiency? Also, ICH E20, the draft, actually says something about that.

[00:16:13] Kaspar: These are all questions [00:16:15] where we, as a drug development community, have some room for improvement: to be clear about what we want, and, as statisticians who are key to designing trials, to [00:16:30] also think not just up to the rejection of certain hypotheses pertaining to certain endpoints, but even further, because you have to write something in the label about that.

[00:16:42] Kaspar: You can argue, and some do, that this should be [00:16:45] accurate: if you say this is a 95% confidence interval, it should have that property.

[00:16:51] Alexander: Yeah. The one thing is the discussion with the regulators. And the thing there is, you are usually talking to experts [00:17:00] on the other side, and the benefit is you’re pretty much talking more or less directly to them.

[00:17:08] Alexander: And then you can explain all the technical details and so forth. And also, the label in itself [00:17:15] is a very condensed way of how results are displayed, especially the part that is really the indication statement and all these kinds of different things. It becomes more challenging when you [00:17:30] present it, for example, to a scientific audience at a conference or in a publication, or even more when it goes into kind of more layman’s terms.

[00:17:40] Alexander: Let’s say the average physician or patient. We could go to [00:17:45] these levels, but let’s stay with the scientific community. So for the secondary endpoints, I don’t know why, but people always want to see p-values. Yeah. [00:18:00] And very often you only get it published if you have p-values also for your secondary endpoints. Now, say we had the case where you have an adaptive, group-sequential design and you stop at the second or third [00:18:15] readout.

[00:18:15] Alexander: Would you then report the nominal p-values for the secondary endpoints as if it had been a standard design without adaptive features? Would you create something that is more [00:18:30] adjusted, and if so, what and how would you adjust for it? Or would you say no p-values at all?

[00:18:36] Kaspar: Maybe, well, these are multiple questions in one.

[00:18:39] Kaspar: Acknowledging that many people don’t appropriately interpret p-values, [00:18:45] maybe we should cite them less often. But that’s just a band-aid, because it doesn’t really solve the underlying problem. Even if we say we should report more confidence intervals, for many designs it is not [00:19:00] obvious how to compute them.

[00:19:01] Kaspar: If it were simple, maybe we would do it more often. But, as I said, do you then want unbiasedness? Do you want estimation efficiency? Do you want to account for the adaptive nature of the [00:19:15] design? Do you want to account for the complicated multiple testing strategy that you had? What do confidence intervals need to reflect?

[00:19:23] Kaspar: For a very simple hypothesis test, you have this one-to-one correspondence: if the confidence [00:19:30] interval does not include your null, this corresponds to a rejection of the null. Is this a feature you want to have from confidence intervals, that this correspondence remains if you compute a confidence interval after a [00:19:45] complicated multiple testing strategy?

[00:19:47] Kaspar: I don’t have answers to these questions. These are trade-offs. For some designs, for some methods, this adjusted, so to say, [00:20:00] inference exists; for others, maybe less so. I think regulators also have an eye on, in some sense, fairness. What if somebody adjusts for all these things and another company, same drug class, same comparator,

[00:20:14] Kaspar: does [00:20:15] not? These are very difficult questions that I don’t have template answers to.

[00:20:20] Alexander: Yeah. One last thing: all of these only work pre-planned, don’t they? You can’t do any of these kinds of [00:20:30] adjustments post hoc. You need to have all of that pre-specified, isn’t that right?

[00:20:33] Kaspar: In general, I think in our environment, I would say yes. I think a p-value after the fact, after you have looked at a lot of data and then you just compute it for one thing that you [00:20:45] have seen, is maybe interesting, but I don’t think it has the properties that we would wish a p-value to have. Maybe there are methods, post-selection inference and these kinds of things, but this is not something we routinely do.

[00:20:58] Kaspar: I think pre-specification [00:21:00] serves more than one purpose. I think one is to establish the statistical properties of the inference that we want to have. Pre-specification also avoids cherry-picking. At [00:21:15] least, maybe these are somewhat related in certain instances, but I think this is why regulators want to see pre-specification, and they rightly do.

[00:21:22] Alexander: It also builds trust generally: you do what you say and you say what you do. [00:21:30] And this is what, for me, pre-specification also means. It helps you so that you can trust in your data, and if someone else would run the analysis, they would come to pretty much the same conclusion.

[00:21:44] Alexander: [00:21:45] Yeah. And I think that is really vital.

[00:21:49] Kaspar: That’s yet another thing. When you say, if somebody else runs the analysis, they will come to the same conclusion: there is a lot of research and meta-research on precisely that aspect. If you give the [00:22:00] same data and the same statistical analysis plan to 20 teams of statisticians, you will get 20 different results, because there are still so many ambiguities along the way.

[00:22:08] Kaspar: But I agree in general with the principle that pre-specification builds trust. And you [00:22:15] can argue it’s quite amazing that we decide on the introduction of a drug, yes or no, based on a primary endpoint. So we spend 150 million on a trial, the p-value is 0.07, and very often there’s no drug. For people working [00:22:30] in an academic environment,

[00:22:31] Kaspar: that’s very difficult to grasp. But I think it’s actually, as you say, building the credibility. This is just, very often, a reasonable hurdle, and it’s not that if the p-value is smaller than 5% [00:22:45] you immediately get approval. I keep saying this is an entry ticket to negotiations with a regulator, and then it’s still about the comprehensiveness of the evidence.

[00:22:55] Kaspar: How was the trial conducted? What were the assumptions? All these aspects [00:23:00] still play a role. There is no automatism. But you need to clear a first hurdle, which is a properly designed experiment for a pre-specified, prospectively [00:23:15] specified hypothesis. Yeah.

[00:23:16] Alexander: Thanks so much, Kaspar, for this really good discussion. You already mentioned ICH E20: please have a look into this guideline. You mentioned a couple of points about [00:23:30] it, and there’s still time to provide comments on it, so please do that as well. Thanks so much, Kaspar. Another great discussion, and I always feel honored to have these discussions [00:23:45] with you, because I always learn something new and it’s quite enjoyable.

[00:23:50] Alexander: Thanks so much. 

[00:23:51] Kaspar: Thank you, Alexander.

[00:23:57] Alexander: This show was created in [00:24:00] association with PSI. Thanks to Rain and her team at VVS, who support the show in the background, and thank you for listening. Reach your potential, lead great science, and serve patients. Just be an effective [00:24:15] statistician.

Join The Effective Statistician LinkedIn group

I want to help the community of statisticians, data scientists, programmers and other quantitative scientists to be more influential, innovative, and effective. I believe that as a community we can help our research, our regulatory and payer systems, and ultimately physicians and patients take better decisions based on better evidence.

I work to achieve a future in which everyone can access the right evidence in the right format at the right time to make sound decisions.

When my kids are sick, I want to have good evidence to discuss with the physician about the different therapy choices.

When my mother is sick, I want her to have access to the evidence and to be able to understand it.

When I get sick, I want to find evidence that I can trust and that helps me to have meaningful discussions with my healthcare professionals.

I want to live in a world, where the media reports correctly about medical evidence and in which society distinguishes between fake evidence and real evidence.

Let’s work together to achieve this.