In our standard graphics, we usually present summary statistics only. The exception is the Kaplan-Meier plot, which also shows individual patients. Showing individual patients tells us so much more about, for example, clusters, outliers, and extreme values. Join Benjamin and me while we dive deep into the following points:
Why it’s important to:
- Show uncertainty
- Show actual number of patients
- Give a feeling for the evidence
- Connect to the physicians seeing the individual patients
- Increase transparency
How to create:
- Spaghetti plot
- Slope graph
- Using jittering for categorical data or thickness based on numbers of patients
- Animations over time
- Heat maps
- Bar charts with symbols for the patients
- Cumulative distribution function
- Waterfall plots
- Flowing data example of categorical data over time (find it here: https://flowingdata.com/2015/12/15/a-day-in-the-life-of-americans/)
Additional features:
- Highlight the group means
- Making it interactive with hover-over, filters, or selections
- Sorting is crucial
- Combination with charts displaying group differences
Challenges:
- Unbalanced treatment groups
- Avoid overcluttered graphs
- Time to read and explain complex graphs
- Time needed to create complex graphs
- Problems obscuring small but meaningful differences
Listen to this interesting episode and share this with your friends and colleagues!
Alexander: You’re listening to the Effective Statistician podcast. The weekly podcast with Alexander Schacht, Benjamin Piske, and Sam Gardner is designed to help you reach your potential, lead great science and serve patients without becoming overwhelmed by work. Today, we are talking about figures and how to show individual patients in them. So stay tuned. And now, some music.
Figures, data visualizations, graphs, we just need more of them. I’m very, very certain. And I’ve talked about figures quite often in this podcast; there are a couple of different episodes about data visualization. So if you want to learn more about it, just scroll back. Also, there’s a lot of stuff about data visualization on the homepage.
So head over to theeffectivestatistician.com and check all the different resources we have. We have quite a lot. I’m producing this podcast in association with PSI, the community dedicated to leading and promoting the use of statistics within the healthcare industry for the benefit of patients. And there’s also a really great data visualization SIG. Join PSI today to further develop your statistical capabilities with access to the video-on-demand content library, free registration to all PSI webinars, and much more. Head over to PSIweb.org and become a PSI member today.
Welcome to another episode of The Effective Statistician. And like last week it is with Benjamin.
Alexander: Hi Benjamin. How are you doing?
Benjamin: Hi Alexander. Well, let’s see if it was last week because the agenda is usually changing.
Alexander: As recorded, it’s planned to follow another discussion we had, but you’re right, sometimes these things change. Just to show a little of what happens behind the scenes: we usually record episodes at least a couple of weeks, if not months, in advance to take the stress out of it. You know, we both didn’t want to get into a mode where ‘Oh, it’s Monday, we need to record the episode, so we cannot go out tonight’. That would be really...
Benjamin: Yeah, but you’re usually very good at working in advance. So there’s quite a number currently. I don’t even know exactly, but I think this episode, I mean, today is still the last day of August for recording, but it’s planned for November.
Alexander: Yeah, it’s even planned for mid-December.
Benjamin: So Christmas at least.
Benjamin: So individual patients, what’s the story about it?
Alexander: In the years when I was working a lot on psychiatry studies, yeah, one of the things that we were always looking into was symptom reduction over time. Let’s say you have a patient with schizophrenia and you measure his symptoms with some kind of questionnaire that captures all these different questions about the symptoms of patients with schizophrenia. And then you have some kind of baseline score and you take a score at each visit, let’s say 1, 2, 4, 6 and 8 weeks after the start of treatment. And usually what people would show was averages, maybe with confidence intervals, group differences, things like this, or responders. So how many patients had at least a 25% reduction or a 50% reduction or whatever the relevant endpoint was. And I was always wondering, ‘Okay, the percent reduction really depends on the baseline value. So if you have a score that, let’s say, ranges from 20 to 50 in your baseline characteristics, because you don’t have anyone below that, because you want to have patients in your study that you can actually treat and improve, then a ten-point difference is a 50% reduction at the lower end and only a 20% reduction at the upper end. How can you have a measure where a meaningful change, you know, is the same?’ So I was thinking I need to get a little bit more grip on this. I was really, really dissatisfied with how we were commonly displaying data, because I couldn’t really feel the data. Do you know what I mean? Yeah, and so I was discussing this for quite a long time with different colleagues. We looked into different ways of doing it. And one of the things that we very often looked into were scatter plots, where we had the baseline data on the horizontal axis and the follow-up data on the vertical axis. And then you can plot the data at week one, week two, week four, six, and eight versus the baseline data. And that was kind of nice.
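The arithmetic in this example can be checked with a few lines of Python. This is only a sketch of the point being made: the function name `percent_reduction` and the scores are made up for illustration, and the 20 to 50 range is the one mentioned in the conversation, not a specific scale.

```python
def percent_reduction(baseline: float, follow_up: float) -> float:
    """Percent reduction from baseline; positive values mean improvement."""
    if baseline == 0:
        raise ValueError("baseline must be non-zero")
    return 100.0 * (baseline - follow_up) / baseline


# The same absolute improvement of 10 points at both ends of the range.
low = percent_reduction(baseline=20, follow_up=10)
high = percent_reduction(baseline=50, follow_up=40)
print(low, high)  # 50.0 20.0
```

A responder definition based on percent reduction therefore treats the same ten-point change very differently depending on where the patient started, which is one reason to look at follow-up versus baseline for each patient.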
We got a little bit better feeling for it. As you know, later in my career I used a similar approach with psoriasis data. And there we made it even more sophisticated and interpolated the individual patients. So instead of having separate scatter plots for each follow-up visit, you would have one scatter plot of each visit’s results versus the baseline: week 1 versus baseline, week 2 versus baseline, week 4 versus baseline, and so on. You would have a continuous series of scatter plots, and we showed them one after another, so we got a little kind of comic movie, so to say. And when we had that, that was really the moment where you could see how the individual patients were performing. Because if you had seen separate scatter plots, you couldn’t really see, ‘Okay, is that point moving from here to here, or is that a different patient?’ But because of this animation, the interpolation, we were able to see how the individual patients were going down and up and so on. And that was really the first time I thought, ‘Now I understand what’s happening’. And it was one of the big successes of my career. This example probably helped me with two or three promotions, actually. And we applied it, and people outside of the company copied it. So other companies use similar things, and it was really interesting. And the background is really to show the individual patients. Showing the individual patients is also something that is coming up again and again in the discussions of the Wonderful Wednesday, the monthly webinar that the PSI Visualization Special Interest Group is running. I can highly recommend having a look there. And I think Zachary Skrivanek is usually the person that comes up with, ‘Oh, if we were to show individual patients, what would that look like?’ Yeah. It’s nice because you get a sense of the uncertainty. The variation between patients over time, you get a much better sense of it. You get a feeling of how much evidence there is.
Yeah, because you directly see, ‘Oh, here we have ten patients, here we have 1,000 patients’. You get physicians to connect much more with the data, because physicians see individual patients. They don’t see the average patient. And so they see all kinds of different people, with low severity of symptoms, with high severity of symptoms. They don’t usually see the exact average patient. And that really helps them to connect more emotionally with the data. You also get more transparency about what’s going on. Sometimes people see statistics a little bit like a black box, a magic thing. I have probably heard a variation of the ‘and then you do your stats magic’ sentence quite a lot in my career.
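The interpolation behind the animated scatter plot described above can be sketched as follows. This is a minimal illustration, not the method actually used in that work: it assumes plain linear interpolation between observed visits, and the function name `interpolate_frames`, the parameter `frames_per_week` and the example scores are all hypothetical.

```python
def interpolate_frames(weeks, scores, frames_per_week=4):
    """Linearly interpolate one patient's scores between observed visits.

    Returns a list of (week, score) tuples, one per animation frame,
    ending with the last observed visit.
    """
    frames = []
    visits = list(zip(weeks, scores))
    for (w0, s0), (w1, s1) in zip(visits, visits[1:]):
        steps = int((w1 - w0) * frames_per_week)  # frames in this interval
        for k in range(steps):
            t = k / steps  # fraction of the way from visit w0 to visit w1
            frames.append((w0 + t * (w1 - w0), s0 + t * (s1 - s0)))
    frames.append(visits[-1])
    return frames


# One hypothetical patient: baseline 40, then visits at weeks 1, 2 and 4.
frames = interpolate_frames(weeks=[0, 1, 2, 4], scores=[40, 30, 28, 20],
                            frames_per_week=2)
print(len(frames))  # 9
print(frames[1])    # (0.5, 35.0)
```

In the animated scatter plot, each patient’s dot would keep its baseline value on the horizontal axis and move along the vertical axis through these interpolated scores, so the eye can follow the same patient from frame to frame.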
Benjamin: Because people don’t really understand it.
Alexander: And when they see the individual patients, they can much better understand, ‘this is why the mean is changing’. Or ‘this is why there is no statistical difference, because the variation is so big’. You can also see more details, like a certain cluster of patients. So for example, going back to the example with the animated scatter plot, we could see there were certain patients that were moving down and then up again. So most patients went down, but certain patients had only this initial drop and then the efficacy decreased. That was interesting to have a look into. You can see much better whether there are any extreme values or even outliers. Yeah, what’s happening in, let’s say, the upper quartile of the patients, the most severe patients? Do they behave the same as all the other patients? Or the very low patients. Maybe you see that most of the treatment effect is in the more severe or in the less severe patients. You see much more of these details when you look into individual patients in your figures.
Benjamin: Yeah, absolutely. I understand the point that the question is about the individual patient and not about the average. So that does definitely help. I’m just trying to imagine, if you face three or 300 patients, how would you possibly do this? So I think we still have to consider the whole setting for whatever you would like to present, because even if you have 300 points jumping up and down in an animated scatter plot, you might pick one or two, but they may give you just an idea; it’s still not about the individuals. We have to dig into the data, but I fully understand, especially when you present the results to the medics. If it’s for a statistician, they can probably at first look better into the descriptive statistics with means, variances, etc., and understand where this is coming from, especially when you see it all in longitudinal settings over time. The medics, I mean, they can understand it. It’s not that they don’t understand descriptive statistics. It’s just that they are not reading it like this. So as you said, it’s about the patient. So if they find somebody interesting, they would like to see where that patient is going. And so I think animation is a very good idea. We still, you know, from being in the industry for 20 years, have our original mindset on 2D, printed-paper graphics, and there the printer quality isn’t that good. That’s why, if you have 300 patients, it’s just one big black, you know, blob of ink on the page. So it doesn’t give you anything. But it’s difficult, especially when I now think about the zero setting, because, you know, when we plan ICP and plan the analysis, we are not talking about animation. We are talking about whatever you can put in figures and describe statistically, and then have some idea or some, you know, plan behind it.
So in the end, when it goes to the FDA, it’s not a question of the individual patients or 300 individuals. It’s about the mean, it’s about the confidence interval, the p-value, whatsoever. So it’s descriptive. But for really digging into the data and understanding what’s behind it, and then planning the future, you know, wherever we go with this, however we present it, especially when you are working in the later phases, animation is an enormously useful tool. So use more than a paper printout.
Alexander: Let’s talk a little bit about a couple of different graph types where you can use that. And that’s also to see how some of the objections that Benjamin raised, quite rightly, can potentially be addressed. So one of the most common ways of showing data, you know, especially if you have continuous data over time or something like that, is the spaghetti plot. That works well if you don’t have too many patients. If you have a lot of patients, it can get very cluttered. There is a trick you can use if you are not so much interested in identifying the individual patients. You can, for example, have two spaghetti plots, one for each treatment if you have two treatments, and put the individual patients as gray lines in the background and the average treatment effect as a dark line in the foreground. Yeah, that way you can see how the overall average changes as well as the variation behind it. Are there any outliers or any other kind of weird things going on? The downside is, of course, that you mostly see what happens within the groups; it’s not so well suited for comparing between groups. There, of course, having both averages in the same graph is really nice. But it serves, you know, a different purpose.
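The gray-lines-plus-mean trick could look roughly like this in Python. It is a sketch under several assumptions: every patient has a score at every visit (no missing data), matplotlib is installed for the plotting part, and the names `visit_means`, `plot_spaghetti` and the example data are invented for illustration.

```python
from statistics import mean


def visit_means(profiles):
    """Mean score per visit across patients (each profile has one score per visit)."""
    return [mean(visit_scores) for visit_scores in zip(*profiles)]


def plot_spaghetti(weeks, profiles_by_arm):
    """One panel per treatment arm: patients in light gray, arm mean in black."""
    import matplotlib.pyplot as plt  # imported here so visit_means works without it

    fig, axes = plt.subplots(1, len(profiles_by_arm), sharey=True, squeeze=False)
    for ax, (arm, profiles) in zip(axes[0], profiles_by_arm.items()):
        for profile in profiles:
            ax.plot(weeks, profile, color="lightgray", linewidth=0.8)
        ax.plot(weeks, visit_means(profiles), color="black", linewidth=2.5)
        ax.set_title(arm)
        ax.set_xlabel("Week")
    axes[0][0].set_ylabel("Symptom score")
    return fig


# Hypothetical data: two patients per arm, scores at weeks 0, 4 and 8.
weeks = [0, 4, 8]
profiles_by_arm = {
    "Active": [[40, 30, 25], [36, 34, 30]],
    "Placebo": [[38, 36, 35], [42, 40, 41]],
}
print(visit_means(profiles_by_arm["Active"]))  # [38, 32, 27.5]
```

Calling `plot_spaghetti(weeks, profiles_by_arm)` returns a figure with the two panels side by side on a shared vertical axis; the shared axis is a deliberate choice so that the within-arm variation can at least be compared by eye.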
Benjamin: Yeah, and you can also work with colors, more than just having your gray and one dark line. So if you have subgroups within the group, I don’t know, gender or whatsoever, make it a little bit colorful. Not heavily colorful, not one color for each patient; it’s really about underlining some of the general questions so you get an idea.
Alexander: For sure, don’t make a rainbow plot where you have 30 different colors. By the way, if you can make it interactive, it’s really nice if you have these kinds of hover-over functions or things like this. Or if you can select patients there, where you highlight all the female patients, highlight all the pre-treated patients, or things like that. So you can see, is it kind of, you know, equally distributed across all patients, or is there some kind of directional effect in all the pre-treated patients, the less severe patients, the more severe patients, or whatsoever. So that’s nice to have. A more simplified version of the spaghetti plot is the slope graph. So basically, it’s just two vertical lines, and for each patient a line connects the two in terms of before and after, things like that. And you can very easily then see whether there is a general increase or a general decrease.
Benjamin: And you can also see outliers, you know, the extremes; you can identify them.
Alexander: Yeah, one of the problems here is if you have, let’s say, categorical data, let’s say CGI, with values 1 to 7. It gets a little bit more difficult. There, one of the things to have in mind is, for example, to jitter the data, so that you can better identify the individual patients. Or sometimes, you know, doing things like maybe a Sankey chart is helpful. Of course, with the Sankey chart, if you have lots of episodes, you can’t see the individual patients anymore, how each one is flowing, but at least for each of these differences you can get a better understanding. It goes in the same direction as, you know, if you have some kind of slope graph and then you have the thickness of the line based on the number of patients in there.
Benjamin: Yeah, that’s another point, really. I mean, we said colors already as one identifier, which is just easy to see and to identify, and the other one is thickness, you know, bold or not, a big circle or a small point, whatsoever. So thickness of the line is another very quick identifier of individual or grouped data within these plots.
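Both ideas from this exchange, jittering categorical values and making line thickness depend on the number of patients, can be sketched in a few lines of Python. This is an illustrative assumption of how one might prepare the data, not a description of any specific tool; the names `jitter`, `transition_counts`, the jitter width and the example CGI values are all made up.

```python
import random
from collections import Counter


def jitter(values, width=0.15, seed=1):
    """Add a small random offset so patients in the same category do not overlap."""
    rng = random.Random(seed)  # fixed seed keeps the figure reproducible
    return [v + rng.uniform(-width, width) for v in values]


def transition_counts(before, after):
    """Count patients per (before, after) category pair, e.g. for line thickness."""
    return Counter(zip(before, after))


# Hypothetical CGI scores (1 to 7) for six patients at baseline and follow-up.
before = [4, 4, 4, 5, 5, 3]
after = [3, 3, 4, 3, 5, 3]

jittered_before = jitter(before)
counts = transition_counts(before, after)
# One possible mapping from patient count to line width in a slope graph.
line_widths = {pair: 0.5 + n for pair, n in counts.items()}
print(counts[(4, 3)])  # 2
```

Jittering keeps one mark per patient, while the counts collapse patients with the same transition into one thicker line; which of the two is better depends on how many patients share each category.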
Alexander: Yeah, another way to show how things develop over time for individual patients is a heat map. Yeah, so you have a matrix where the individual rows are, let’s say, the patients and the different columns are, let’s say, the visits. And then the color of the cells is determined by the severity of the symptoms. Sometimes that works quite nicely. You see this, for example, in some COVID-related graphs, where you see time as the columns and age groups as the rows. So it starts with the top row, which includes the kids, then the teenagers, and so on. And then you can see, in terms of the color intensity, the incidence rate for the specific time. And that way you can see where the incidence first goes up, for example, in the elderly population and then goes to the younger population, or maybe, you know, when there’s a lot of vaccination of the older patients happening, that most of the incidence is actually in the younger population or the kids. So this is, you know, another way to show individual patients, if each row is basically one patient.
Benjamin: Just one comment on the heat map, because I think it’s an excellent idea; I have seen nice graphs, interestingly, in quite, you know, general newspapers. So whenever you use these kinds of heat maps, the only advice I would like to give is to use the right colors.
Benjamin: Even though you can say, whatever, you know, dark red means something good, actually, it doesn’t. If you read a figure with dark red, this is an alarm. And that is something we should consider. So if you work with green and with red, this kind of gives a good-versus-bad judgment to the whole figure, even though you may not want it. So that’s why, use yellow, blue, whatever. Just think about the colors. So that’s important advice, really, to not use a dark red.
Alexander: And think about maybe your audience is colorblind.
Benjamin: Yeah, that’s true. So radically.
Alexander: Especially we as men are highly affected so take that into account.
Then one graph that we use very, very often and that shows individual patients is the Kaplan-Meier plot. Yeah, if you think of it, it’s a little bit like, you know, each row is an individual patient. And then it’s followed up over time, and here the sorting comes in. Of course, it is sorted by the time to response, the time to event, or the time to censoring. And so here you can see how really important the ordering of the patients is. You get this nice survival curve when you have the Kaplan-Meier plot, and I think it’s really a nice way to visualize this type of data.
Benjamin: I agree. And this is quite a good point you bring up, because you don’t usually see the Kaplan-Meier plot as individual patient data. I mean, it is obvious, but you don’t name the patients, so you don’t identify them unless you specifically do.
Alexander: In the same way, you know, you can also show the cumulative distribution function. Yeah. And if you have a couple of different treatments, you can show how these different cumulative distribution functions behave over time. Yeah, that’s another way. Kind of related to this is the waterfall plot, where you basically have each patient’s change over time. Yeah, and you sort the patients by how much they change. Yeah, some may increase, some may decrease. And you start with the patient that has the biggest increase and end with the patient with the biggest decrease. And that way you can directly see how many patients had a decrease of at least a certain amount, or how many patients had a decrease overall, or whether, you know, most of the patients are decreasing but there is a certain fraction of patients that is increasing quite a lot.
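The sorting behind a waterfall plot, and the kind of percentages one can read off it, can be sketched like this in Python. The helper names `waterfall_order` and `fraction_decreasing` and the change scores are hypothetical, and the sketch assumes that a negative change means improvement.

```python
def waterfall_order(changes):
    """Sort changes from the largest increase to the largest decrease."""
    return sorted(changes, reverse=True)


def fraction_decreasing(changes, by_at_least=None):
    """Share of patients with any decrease, or with a decrease of at least a given size."""
    if by_at_least is None:
        hits = [c for c in changes if c < 0]
    else:
        hits = [c for c in changes if c <= -by_at_least]
    return len(hits) / len(changes)


# Hypothetical changes from baseline for eight patients (negative = improvement).
changes = [-12, 5, -3, -20, 8, -7, 0, -15]

print(waterfall_order(changes))           # [8, 5, 0, -3, -7, -12, -15, -20]
print(fraction_decreasing(changes))       # 0.625
print(fraction_decreasing(changes, 10))   # 0.375
```

Each sorted value would become one bar, so the point where the bars cross zero shows the share of patients who improved, and any threshold can be drawn as a horizontal reference line.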
Benjamin: You can really put in a nice animation, actually, for this one as well, if it’s over time.
Benjamin: And you highlight, in colors, let’s say, either different pre-treatments or any subgroups, and then you just go with the animation over time, and you see how the weight of the waterfall plot is moving over time on the one hand, and then how the different colors are moving from the left side to the right side and so on. So waterfall plots and animations are powerful tools for showing the individual course of the patients, the decrease and increase of the measure.
Alexander: Do you know these racing bar charts?
Alexander: They have come into vogue quite a lot on the internet. So, for example, let’s say, I have seen the most popular programming languages over time. And then you see it’s ranked 1, 2, 3, 4, 5 and so on. And you can see how these change in terms of the ranks. Or the most successful musicians in terms of sales. Yeah, and then you see how that changed over time in terms of The Beatles and then The Rolling Stones and then Elton John and then Rihanna, and whoever goes in there. It’s kind of related to that. It’s an alternative way to show individual data over time. I want to go into a couple of additional features that you can add. So we already talked about highlighting things, like group means, like, you know, maybe specific patients or groups of patients, and making it interactive. This hover-over effect is really nice, where maybe you can see, for this individual patient, did he have any kind of AE? Especially for AE data, yeah, if you show individual patients and their AE data or lab data, you know. Here’s the lab value going up. Did the patient have any concomitant medication? Did they have any comorbidity? Did they have an adverse event? Which one was he on? All this kind of additional information you can then put into the hover-over information. That, of course, makes it really nice to walk through the data. Of course, it’s time-consuming to build it up. But it depends on how often you can use it. For example, for AE data, maybe you can use it for any phase one study that you do. Then maybe it makes sense to invest the time once, and then you use it again and again and again.
Benjamin: I mean, we can deal with that later. So we should talk about how to create it. So what software are you using? But let’s first think over additional features.
Alexander: Yeah. The other thing that we already mentioned is that sorting is very crucial. And sometimes it’s nice if you can also have interactive sorting. Do you want to sort by treatment group? Do you want to sort by baseline value? Do you want to sort by something that is meaningful? Sort by something really meaningful. I’ve seen plots where they just sorted by the patient number, which mostly nobody really cares about. The other thing is to try to combine it with your mean statistics or the summary statistics that you’re interested in. Yeah. So like we said, the spaghetti plot includes the average. Or do you want to include the percentage? So when we were talking about the cumulative distribution function, do you want to highlight certain thresholds within that? If you have a waterfall plot, do you want to highlight certain areas in it? You can always think about this and align these different graphics so that you show the data in a meaningful way. For example, if you have a waterfall plot and it is horizontally aligned, then, if you have several treatment groups, put them below each other, not next to each other, because then you can see much more easily how a certain percentage changes. You then directly see, ‘Okay, here 80% of the patients improved, here 70% of the patients improved, here 60% of the patients improved’. If you have that below each other, you can compare these percentages much more easily, whereas it’s much more difficult to compare them if you have them next to each other.
Benjamin: Yeah. Actually you can see the weight like how this is kind of shifting over from one side to the other or at least moving in one direction if it’s below each other definitely.
Alexander: Yeah, another important feature to have in mind is unbalanced treatment groups. That was one of the bigger challenges, actually, with this animated scatter plot, because, let’s say, if you have a 2 to 1 ratio in terms of randomization, you have double the number of dots in one plot compared to the other, and that can look weird. I also had heat maps where one heat map was triple the size of another heat map. So think about whether you can correct for that somehow, so that you can more easily compare things visually. You have mentioned that things can get very cluttered; take care of that in terms of gray, hover, whatsoever. In a scatter plot, how big are the dots that you are showing? A lot of fine-tuning is needed; that’s why having it 100% pre-specified is very often very difficult.
Benjamin: Yeah. Otherwise, it would have been there anyway already.
Alexander: Yeah, there’s another drawback to it: usually these graphs are more complex. Consider where you want to show them. It’s probably not the right place if you have just 20 seconds to explain it. Yeah, like if you’re giving a presentation at a conference and your whole presentation is not longer than seven minutes. Or if you give it to someone like a sales representative, who has maybe five minutes with a physician. Take it to those occasions where you actually have time.
Benjamin: Well, actually, you see that creating the plots was a successful exercise if it doesn’t take more time to understand them.
Benjamin: But I mean, if you dig into individual data, it obviously takes time. But you get a grip on the data if you use the right colors, if you use the right settings, if it’s not too cluttered, if it’s easy to read. And if you think it’s difficult to understand, give it to your neighbor, to your colleague, and just get a sense of how quickly the person understands it. But the good thing about the figures, and even about the individual patients in the visualization, is that you are using colors, thickness and whatsoever, so the differences in appearance are much easier for people to assess right away, if it’s done well.
Alexander: Yeah. And there’s another thing: of course, it takes more time. Doing a line graph of averages is straightforward. Here it takes more time, but I think very often it’s worth the effort. Yeah, because of what people see in the end. All the work, the protocol, the SAP, the programming, the execution of the study, all these different things ultimately come down to some figures. Make them really stand out. These studies sometimes cost hundreds of millions of euros. Yeah, and then we have this rubbish Excel figure in the end that summarizes the key points. And I say, couldn’t you invest a little bit more time and do this well? Make sure that people understand it. And as we said in the beginning, individual patients are really important; in the end, we do it for individuals. Yeah, so connect to them. Okay, that was a pretty technical episode where we talked about all kinds of different advantages, disadvantages, and challenges of showing individual patients, and we went through a lot of different examples of charts that you can use. If you want to learn more about how to actually do this, I strongly encourage you to go to the Data Visualization Special Interest Group. There are a lot of examples of that on their homepage. And it comes with data and with code. So check it out there. Most of the code is actually done in R, so use it, adopt it, and try it out. Have fun with it.
Benjamin: Yeah, have fun.
Alexander: Thanks so much. Stay tuned.
Alexander: This show was created in association with PSI. Thanks to Reine, who helps the show in the background. Thank you for listening. Of course, this was an episode about data visualization and it’s a podcast. I know you can’t see these things so head over to theeffectivestatistician.com to check out the show notes, the links for these podcast episodes and then you will be able to see lots of different things that we talked about. Reach your potential, lead great science and serve patients. Just be an effective statistician.