How to analyse subgroups effectively using data visualisation

Interview with Paolo Eusebi

Subgroup analysis in combination with data visualisation is one of the hottest topics I can think of. And it hits us as statisticians again and again during our careers.

We need to understand subgroups for efficacy reasons and safety reasons, and it’s a common question how consistently your drug works across the different subgroups. It gets even more complicated if you want to review this across multiple studies.

In this episode, Paolo and I discuss the importance of visualisation in understanding subgroups. Specifically we speak about:

  • Why visualisation is important
  • How to create effective visualisation
  • Graphical display of subgroups
  • Subgroup explorer
  • Exploring robustness of subgroups between studies 
  • Three step approach
  • Meta-analysis of interaction
  • Graphical Display of Study Heterogeneity (GOSH) plot
  • Shiny App

Listen to this episode and share this with your friends and colleagues!

Paolo Eusebi
Contract Statistician

Statistician with broad experience in all aspects of biostatistics, epidemiology and health services evaluation.
Interested in consulting offers.

Specialties: Data management, data analysis, research projects.
Knowledge of main statistical software packages such as SAS, STATA and R. 


Do you want to boost your career as a statistician in the health sector? Our podcast helps you to achieve this by teaching you relevant knowledge about all the different aspects of becoming a more effective statistician.


Alexander: You’re listening to The Effective Statistician podcast, the weekly podcast with Alexander Schacht, Benjamin Piske and Sam Gardner, designed to help you reach your potential, lead great science and serve patients without becoming overwhelmed by work. Today, I’m talking with Paolo about subgroup analysis, a never-ending story and always something interesting to learn about. So stay tuned. I’m producing this podcast in association with PSI, a community dedicated to leading and promoting the use of statistics within the healthcare industry for the benefit of patients. Join PSI today to further develop your statistical capabilities with access to the video-on-demand content library, free registration to all PSI webinars (and we are actually talking about one of these today) and much, much more. Head over to the PSI website to learn more about PSI activities and become a PSI member today.

Welcome to a new episode of The Effective Statistician. Today I’m talking with Paolo about one of my favorite topics, subgroups in combination with data visualization. Hi Paolo, how are you doing?

Paolo: Thanks for having me. 

Alexander: Very good. So subgroup analysis is a really important topic. It comes up again and again and again. You want to understand subgroups for safety reasons. You want to understand subgroups for efficacy reasons. It’s a common question how consistently your drug works across different subgroups, or whether there’s a specific subgroup that stands out in terms of efficacy or safety. And it gets even more complicated if you want to review it across multiple studies, for example. So with subgroups, I think data visualization plays a very, very important role. What’s your experience regarding this, Paolo?

Paolo: Yeah, I think that presenting exploratory subgroup analyses with graphical displays is really, really important. I think it’s often the case that you can deliver a better message if you use a graphical display properly. For example, one year ago, I was working with one of my colleagues, doing some exploratory work for subgroup identification, because we had one study with an unclear message. And we wanted to have a fruitful discussion with the clinical team, with nice tools and even interactive ways of discussing what we have in the data. And we presented an interactive application based on the subscreen package developed by a team at Bayer. And it was really successful. The physicians liked it because you can basically split the sample into the subgroups you want. You put all the subgroups and their combinations in one display, a funnel plot, and you can interact with it. You click on a dot and you get a specific subgroup. Maybe it’s male subjects of a certain age and with a certain disease duration at baseline. And you see how much the effect size deviates from the overall effect size.

Alexander: Yeah. Yeah. I really like this tool as well because it really supports the usual workflow that you have, looking into subgroups more and more closely and also into more and more complex subgroups. You start maybe with subgroups that are defined by one variable. And then you look into subgroups defined by a combination of two, three or more variables. And with this funnel plot, you can really see the relationship between the size of the subgroup and the effect within the subgroup. You can think of the funnel plot like this: the horizontal axis is the size of the group and the vertical axis is the effect within the group. Upwards more effect, downwards less effect. And of course, with smaller subgroups you get larger and smaller effects just by random chance. So the funnel really shows that the smaller the subgroups, the wider the distribution of the effects. With a funnel plot you can also see whether there are certain subgroups that don’t follow this kind of random pattern but really stand out, and these are the ones you really want to look for because these are the ones where there’s potentially really something happening.
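
The funnel plot described here needs only two numbers per subgroup: its size and its treatment effect. Below is a minimal Python sketch of how such points could be computed. It is not the code of the tool discussed in the episode; the toy patients, the variable names (`sex`, `age`, `arm`, `y`) and the choice of a simple difference in means as the effect are all assumptions made for illustration.

```python
from itertools import combinations, product
from statistics import mean

# Hypothetical toy data: one dict per patient; "y" is the outcome (higher is better).
PATIENTS = [
    {"sex": "M", "age": "<65", "arm": "treatment", "y": 5.0},
    {"sex": "M", "age": "<65", "arm": "control", "y": 3.0},
    {"sex": "M", "age": ">=65", "arm": "treatment", "y": 4.0},
    {"sex": "M", "age": ">=65", "arm": "control", "y": 3.0},
    {"sex": "F", "age": "<65", "arm": "treatment", "y": 6.0},
    {"sex": "F", "age": "<65", "arm": "control", "y": 2.0},
    {"sex": "F", "age": ">=65", "arm": "treatment", "y": 3.0},
    {"sex": "F", "age": ">=65", "arm": "control", "y": 3.0},
]


def treatment_effect(members):
    """Difference in mean outcome, treatment minus control (None if an arm is empty)."""
    treated = [p["y"] for p in members if p["arm"] == "treatment"]
    control = [p["y"] for p in members if p["arm"] == "control"]
    if not treated or not control:
        return None
    return mean(treated) - mean(control)


def funnel_points(patients, factors, max_factors=2):
    """Return (label, size, effect) for every subgroup defined by 1..max_factors factors."""
    points = []
    for k in range(1, max_factors + 1):
        for combo in combinations(factors, k):
            levels = [sorted({p[f] for p in patients}) for f in combo]
            for values in product(*levels):
                members = [p for p in patients
                           if all(p[f] == v for f, v in zip(combo, values))]
                effect = treatment_effect(members)
                if effect is None:
                    continue  # no effect can be computed without both arms
                label = " & ".join(f"{f}={v}" for f, v in zip(combo, values))
                points.append((label, len(members), effect))
    return points


overall = treatment_effect(PATIENTS)
for label, size, effect in funnel_points(PATIENTS, ["sex", "age"]):
    # x-axis: subgroup size, y-axis: effect; the deviation from overall is what stands out
    print(f"{label:20s} n={size}  effect={effect:+.2f}  deviation={effect - overall:+.2f}")
```

Plotting size against effect for these points gives the funnel. With real data the number of subgroups grows quickly with `max_factors`, which is why an interactive display is attractive.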

Paolo: Yeah, I really like the idea because, for example, the developers didn’t want to put confidence intervals or p-values in the tool, because it was meant just for exploratory purposes. I think that it’s also an educational tool for clinicians. They can understand that if you cut the sample smaller, you have more chance of getting larger deviations from the overall effect. And you can also find the subgroups that really deviate from this expected pattern. And of course clinicians are used to funnel plots. If you think about the reporting of meta-analyses, there is an expected pattern, and there are some deviations from this expected pattern in the plot.

Alexander: Yes. This combination of different subgroups is a really, really interesting topic. Of course, if you have just two variables that you look into, you can think of a Venn diagram with the two areas that overlap, and you get a nice description of it. But as soon as you have more than two, it gets really, really complex. And for that, in your presentation in the subgroups webinar that PSI ran recently, on Wednesday the 17th of November, you showed a different plot. And to be honest, until someone mentioned it in the Wonderful Wednesday webinar, I didn’t know about this plot. So can you tell us a little bit more about this UpSet plot?

Paolo: Yeah, I think it’s a really nice device. I discussed the basic UpSet plot, in which you have the intersections between groups and the sample size, for example for the control arm and the treatment arm, so the sample size in each arm. So you can see the number of patients you have for each of these intersections. And of course, you can have an extended version of this plot that includes the effect size. So maybe you have a forest plot in the middle, with the intersections on the left side of the forest plot and, on the right side, the number of patients in each arm. So you get a comprehensive overview of what you have in terms of the effect size, in terms of the number of people, and in terms of the certainty of your effect size for each specific subgroup. The original version of the UpSet plot only showed the intersections between having or not having the particular conditions that define the subgroups. And then you have this extended version that also considers the case where, for age, for example, you don’t want to consider only the split between older or younger than 65. Maybe you want to consider all ages as well, and then you have the different combinations with the other characteristics. Of course, it’s difficult to scale this graphical display when you have four, five, six, ten variables that define subgroups.

Alexander: Yep. Yeah. But the ordering helps really nicely. As you order these subgroups in terms of size, you don’t need to show everything. You just cut at a certain point, and any subgroups that are smaller than that you just ignore, which I think is quite reasonable. And that makes the UpSet plot quite manageable. I think it’s a plot that everybody should be aware of because it’s really nice and helpful and yet very much underused. You touched on another interesting point, and that is continuous covariates and looking into subgroups there. The dichotomization or categorization of your covariate is a little bit tricky because we know that, just because you turn 65, your biology doesn’t dramatically change. So you presented the STEPP (subpopulation treatment effect pattern plot) approach for this. Can you tell us a little bit about what that does?
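
The counts behind an UpSet plot, including the ordering by size and the cut-off mentioned above, can be sketched in a few lines of Python. This is only an illustration under assumptions: the flag names (`male`, `age_65_plus`, `diabetic`), the toy patients and the `min_size` parameter are hypothetical, and the drawing itself is left to a plotting library.

```python
from collections import Counter

# Hypothetical subgroup-defining flags and toy patients (flag values plus treatment arm).
FLAGS = ["male", "age_65_plus", "diabetic"]
RAW = [
    (True, False, False, "treatment"),
    (True, False, False, "control"),
    (True, False, False, "treatment"),
    (True, True, False, "control"),
    (True, True, False, "treatment"),
    (False, True, True, "control"),
    (False, False, False, "treatment"),
    (False, False, False, "control"),
    (False, False, False, "control"),
    (False, False, False, "treatment"),
]
PATIENTS = [dict(zip(FLAGS + ["arm"], row)) for row in RAW]


def intersection_sizes(patients, flags, min_size=1):
    """Count patients per exclusive intersection, overall and by arm.

    Each patient belongs to exactly one intersection: the tuple of flags that are
    true for that patient. Rows are ordered from largest to smallest, and
    intersections with fewer than min_size patients are dropped.
    """
    totals = Counter()
    by_arm = {}
    for p in patients:
        combo = tuple(f for f in flags if p[f])
        totals[combo] += 1
        by_arm.setdefault(combo, Counter())[p["arm"]] += 1
    rows = [(combo, n, dict(by_arm[combo]))
            for combo, n in totals.items() if n >= min_size]
    rows.sort(key=lambda row: (-row[1], row[0]))
    return rows


for combo, n, arms in intersection_sizes(PATIENTS, FLAGS, min_size=2):
    label = " & ".join(combo) if combo else "(none of the flags)"
    print(f"{label:25s} n={n}  by arm: {arms}")
```

Adding an effect estimate per row, as in the funnel sketch, would give the extended version with a forest plot in the middle.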

Paolo: It’s a nice way, I would say, to display how you can have different effect sizes along the continuum of a biomarker or other characteristic. And of course, you need to have some kind of moving window, in the sense that you need to adapt your analysis to the level of the covariate, trying to avoid a basic linear trend that maybe doesn’t tell you the whole story. So basically, you cut the sample into different subgroups. For example, you have 400 patients and you start by considering 50 patients in each subgroup, and then you also allow for an overlap between neighboring subgroups. You start from, for example, the younger people, a subgroup with a median age of 45, and then you have another subgroup with a median age of 50, for example, and these two subgroups overlap by a maximum of 20 patients, for example. And you have the effect size for each of these subgroups. And of course, you have a confidence interval, which is calculated as always. And then you also have a simultaneous confidence band, in the sense that you are taking into account that you are reusing the same data to some extent.
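
The overlapping windows Paolo describes (400 patients, 50 per subgroup, 20 shared between neighbours) can be sketched as follows. This is a simplified illustration, not a full STEPP implementation: the toy data are deterministic and invented, the effect is a plain difference in means, leftover patients are simply folded into the last window, and neither the pointwise intervals nor the simultaneous confidence band are computed.

```python
from statistics import mean, median


def stepp_windows(n_total, size, overlap):
    """Index ranges (start, stop) of sliding windows over patients sorted by covariate."""
    if not 0 <= overlap < size <= n_total:
        raise ValueError("need 0 <= overlap < size <= n_total")
    step = size - overlap
    windows = [(start, start + size)
               for start in range(0, n_total - size + 1, step)]
    # Simplification: patients left over at the top end join the last window.
    windows[-1] = (windows[-1][0], n_total)
    return windows


def stepp_curve(patients, covariate, size, overlap):
    """Return (median covariate, effect) per window; effect = treatment minus control mean.

    Assumes every window contains patients from both arms.
    """
    ordered = sorted(patients, key=lambda p: p[covariate])
    curve = []
    for start, stop in stepp_windows(len(ordered), size, overlap):
        window = ordered[start:stop]
        treated = [p["y"] for p in window if p["arm"] == "treatment"]
        control = [p["y"] for p in window if p["arm"] == "control"]
        mid = median([p[covariate] for p in window])
        curve.append((mid, mean(treated) - mean(control)))
    return curve


# Hypothetical data: 400 patients, alternating arms, benefit growing with age.
PATIENTS = [
    {"age": 30 + 0.1 * i,
     "arm": "treatment" if i % 2 == 0 else "control",
     "y": 0.005 * i if i % 2 == 0 else 0.0}
    for i in range(400)
]

for mid_age, effect in stepp_curve(PATIENTS, "age", size=50, overlap=20):
    print(f"median age {mid_age:5.1f}: effect {effect:.2f}")
```

Plotting the effect against the median covariate value of each window gives the curve; because neighbouring windows share patients, the estimates are correlated, which is what the simultaneous band has to account for.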

Alexander: Yeah. Yeah, so you get a really nice smooth curve over the complete range of your covariate. And you can then very easily see whether it’s, for example, some kind of hockey-stick-shaped curve or some U-shaped curve or any other relationship, or whether it’s really pretty stable and some kind of linear regression really makes a lot of sense. I think that is a really nice graphical approach. The downside is, I think, that you need a decent number of patients to assess these things because otherwise you can’t really smooth things out. So I think it’s surely not a nice approach if you only have 20 patients.

Paolo: Yeah, you can see that if you have a smaller sample size, then you lose power, and fitting a straight line is the way to go.

Alexander: Yeah. Yeah. 

Paolo: You cut along the continuum.

Alexander: Yep. Completely agree. The next problem that you talked about in your presentation is about exploring subgroups between studies. So what’s the biggest problem there when you compare across different studies? 

Paolo: For example, if we think about one study, we speak about a heterogeneous treatment effect. And when you have different studies, maybe you have different results. What you find in one study may not hold in another one. And I think it is a situation in which clinicians, for example, struggle to understand why you have different treatment effects from one study to another. At some point, you may want to combine the evidence that you have in your pool of studies and try to figure out how heterogeneous the treatment effect is across your studies. And if you have five, six or seven studies, as can be the case if you have a development program that has progressed a lot, you may also have real-world evidence data, and Phase 4 studies and Phase 2 studies in the same bucket. Then you can use a three-step meta-analysis approach, basically. You fit the model and find the subgroup, then you combine the interactions across studies, and then you look at the heterogeneity of this interaction across studies.

Alexander: Okay. Okay. So you basically first identify the subgroups. Then, within the studies, you look into the interaction effect between treatment and the subgroup, and you show that across the different studies using a forest plot. And in the third step, what’s happening there?

Paolo: There is one nice graphical device that has been used a few times in the meta-analysis literature, which is the GOSH plot. Basically, you fit the meta-analysis for all the possible combinations of the studies. For example, you can put together studies one and two; studies one, two and three; all the studies from one to seven, if you had seven studies. So you have a bunch of these combinations. And the GOSH plot is basically a scatter plot in which you have the I-squared on the y-axis and the effect size on the x-axis. So you see how much heterogeneity you have for each combination, depending on the effect size. And, of course, you can spot some patterns. You can, for example, fit a Gaussian mixture model, identify some clusters, and then see that one of the studies is responsible for the formation of such clusters. You can spot this study as the driver of the heterogeneity across your studies, and then you can maybe have a closer look at the data of this study to see what was going on there.

Alexander: So, what you’re doing is you have this plot of the effect size versus the heterogeneity score, the I-squared, and that gives you the scatter plot. And then on top of that, you apply unsupervised learning, which is a cluster analysis, and then you get a number of clusters and you can understand which studies are in these clusters, which are most represented, so to say. Okay.
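
The points of a GOSH plot can be computed by pooling every combination of studies and recording the pooled effect and I-squared. The Python sketch below assumes a fixed-effect (inverse-variance) model, subsets of at least two studies, and seven invented interaction estimates with variances, with S7 deliberately set apart. It does not include the clustering step, and it is not the code behind the Shiny app mentioned in the episode.

```python
from itertools import combinations

# Hypothetical per-study interaction estimates: name -> (estimate, variance).
STUDIES = {
    "S1": (0.10, 0.01),
    "S2": (0.12, 0.01),
    "S3": (0.08, 0.01),
    "S4": (0.11, 0.01),
    "S5": (0.09, 0.01),
    "S6": (0.10, 0.01),
    "S7": (0.90, 0.01),  # deliberately different from the rest
}


def pool(estimates):
    """Fixed-effect pooled estimate and I-squared (in percent) for (estimate, variance) pairs."""
    weights = [1.0 / var for _, var in estimates]
    pooled = sum(w * y for w, (y, _) in zip(weights, estimates)) / sum(weights)
    q = sum(w * (y - pooled) ** 2 for w, (y, _) in zip(weights, estimates))  # Cochran's Q
    df = len(estimates) - 1
    i2 = 0.0 if q <= df else 100.0 * (q - df) / q
    return pooled, i2


def gosh_points(studies, min_size=2):
    """Return (subset, pooled effect, I-squared) for every subset with at least min_size studies."""
    names = sorted(studies)
    points = []
    for k in range(min_size, len(names) + 1):
        for subset in combinations(names, k):
            pooled, i2 = pool([studies[name] for name in subset])
            points.append((subset, pooled, i2))
    return points


points = gosh_points(STUDIES)
with_s7 = [i2 for subset, _, i2 in points if "S7" in subset]
without_s7 = [i2 for subset, _, i2 in points if "S7" not in subset]
print(f"{len(points)} subsets")
print(f"mean I-squared with S7:    {sum(with_s7) / len(with_s7):.1f}%")
print(f"mean I-squared without S7: {sum(without_s7) / len(without_s7):.1f}%")
```

Plotting the pooled effect on the x-axis against I-squared on the y-axis gives the scatter plot. In this toy example the subsets split into two clouds depending on whether S7 is included, which is the kind of pattern the cluster analysis is meant to pick up.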

Paolo: Say, for example, you have three clusters. Maybe the proportion of each study is more or less equal in all these clusters, and then you have one study which is more present in one of the clusters, for example, and almost absent in the other clusters. So it’s more unevenly distributed across the clusters. It’s like having covariates that vary a lot between clusters in a cluster analysis. Think about customers, for example: when you have a cluster with a high proportion of younger people and a low proportion of older people, you can characterise this cluster of customers, like, the young people who like this kind of stuff. And it’s the same here: in this particular cluster, where you have maybe a lot of heterogeneity and a higher or lower effect size, you have these two studies more present.

Alexander: Yeah. Yeah, and you also talked about yet another nice application, a Shiny tool that was presented at a Wonderful Wednesday in December 2020, where you can interact much better with the data. It gives you this GOSH plot combined with a forest plot. And it’s really, really nice for exploring the heterogeneity within subgroups across different studies. So, lots of graphical stuff that we talked about here. And of course, we only talked about it. So I really urge you to have a look at our corresponding blog post, where you will find all the links to these different graphics, the code, and a link to the video on demand, where you can watch the complete webinar, also with the other interesting discussions about subgroups and cool stuff like this. I think it’s a hugely nice thing to look into these things in a visual way, because it’s so much faster than looking across many, many different tables, for example. So Paolo, what’s the key takeaway that you want the listener to walk away with?

Paolo: Yeah. I think that we have the opportunity to evolve a little bit with the technology we have here. It’s really difficult to get a sense of what’s going on just by scrolling through thousands of tables, and it’s much more elegant and pleasing to look at one interactive application.

Alexander: Yeah. Yeah. I think there will be much more of that in the future, and I think people are getting more used to these interactive data visualizations. Also, as you know, there are so many dashboards around the pandemic. Why not use that within the companies, and maybe also, at some point in the future, with regulators and other customers that we have? Thanks so much, Paolo. That was a great discussion. And again, you can check out our homepage, where you’ll find all these different links, and stay tuned for much more in the future from Paolo and myself.

Paolo: Thanks. Bye. 

Alexander: This show was created in association with PSI. Thanks to Reine, who helps the show in the background. And thank you for listening. Reach your potential, lead great science and serve patients. Just be an Effective Statistician.

