Are you curious about how predictive models forecast future outcomes?

Or how do experts build and test these models to ensure they’re reliable?

In this episode, I sit down with Chantelle to explore these questions and dive deep into the world of predictive modeling. We break down the process, from analyzing the data to validating the final model, and discuss how these techniques are used, especially in healthcare.

If you’re eager to understand how predictive models work and why they matter, this episode is for you.

**Key Points:**

- **Predictive Models**: Forecast future outcomes based on historical data.
- **Model Building**: Analyzing data, training, and testing models.
- **Validation**: Ensures model reliability and accuracy.

- **Healthcare Applications**: Survival rates, treatment responses.
- **Technical Discussion**: Pruning models, interpretability, data considerations.
- **Tools**: R, Python, SAS for building predictive models.
- **Collaboration**: Interdisciplinary work, combining traditional statistics with AI/ML techniques.

Chantelle shares valuable insights into building, testing, and validating these models, showing just how powerful and essential they are.

If you found this discussion insightful, listen to the full episode to dive deeper into the technical details and practical applications of predictive modeling. And if you know others who would benefit from this knowledge, share the episode with them. Let’s spread these valuable insights together!

**Transform Your Career at The Effective Statistician Conference 2024!**

- Exceptional Speakers: Insights from leaders in statistics.
- Networking: Connect with peers and experts.
- Interactive Workshops: Hands-on learning experiences with Q&A.
- Free Access: Selected presentations and networking.
- All Access Pass: Comprehensive experience with recordings and workshops.

### Never miss an episode!

Join thousands of your peers and subscribe to get our latest updates by email!


### Learn on demand

Click on the button to see our Teachable Inc. courses.

### Featured courses

Click on the button to see our Teachable Inc. courses.

**Chantelle Cornett**

**Health Informatics PhD Candidate at the University of Manchester**

She is a PhD candidate in health informatics at the University of Manchester and a statistician at The Effective Statistician. With a background in data analytics, statistics, and statistical programming, she holds a BSc in Statistics from University College London and an MSc in Medical Statistics from the London School of Hygiene and Tropical Medicine. Chantelle has published work in “Research Synthesis Methods,” highlighting her contributions to health informatics.

Committed to improving healthcare outcomes through data, she has volunteered as a data analyst at The Neurological Alliance, supporting individuals with neurological conditions. Her career goals focus on developing innovative statistical methods to address complex health informatics challenges, particularly in women’s health. Chantelle’s work integrates academic research with practical applications, advancing the fields of health informatics and statistics.

**Transcript**

**Prediction Modelling**

**Alexander:** [00:00:00] Welcome to another episode of The Effective Statistician. Today it’s awesome to have Chantelle back on the podcast. Last time we talked about the conference, and if you haven’t listened to that episode, scroll back, it’s really important. Today we will talk about a more technical topic, and that is predictive modeling. Now, that appears to me to be a pretty broad term, Chantelle.

**Chantelle:** Yes, you’re very right, Alexander.

**Alexander:** What is it actually?

**Chantelle:** So, predictive modeling is a very general term for any statistical technique that can be used to forecast future outcomes based on historical data. It involves creating a mathematical model that could predict future events or behaviors by analyzing the patterns in the data.[00:01:00]
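Chantelle’s definition can be made concrete with the simplest possible predictive model: a straight line fitted to historical data by least squares, then used to forecast the next value. A minimal Python sketch with made-up toy numbers:

```python
# Toy predictive model: fit y = a + b*x to historical data by
# ordinary least squares, then forecast a future time point.
def fit_line(xs, ys):
    """Closed-form least-squares fit for a straight line."""
    n = len(xs)
    mx = sum(xs) / n
    my = sum(ys) / n
    b = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
         / sum((x - mx) ** 2 for x in xs))
    a = my - b * mx
    return a, b

# Hypothetical historical observations at times 1..5
xs = [1, 2, 3, 4, 5]
ys = [2.1, 3.9, 6.2, 7.8, 10.1]
a, b = fit_line(xs, ys)
forecast = a + b * 6  # predicted outcome at time 6, roughly 12
```

Real predictive models are far richer, but the pattern is the same: learn parameters from historical data, then apply them to new inputs.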

**Alexander:** Yeah. Yeah. So what are typical use cases, for such predictive modeling?

**Chantelle:** Well, it spans a load of industries, not just healthcare, but myself in particular, I’m interested in survival modeling. And what this typically involves is estimating or predicting one’s chance of survival, and that could mean time until death or time until an event.

But other uses of modeling could be to estimate someone’s response to a treatment, for example. There are so, so many.

**Alexander:** So in terms of these models, do I need to think about it like, okay, if there’s this model built, I can basically put in data for a specific patient, like myself or someone that [00:02:00] is of interest to me, or, you know, as a physician I can put these in, and then I get some kind of probability, things like that, out of it?

**Chantelle:** So that’s a very good question. The answer is no, every predictive model is made with a specific population in mind. If I were to create a predictive model for people living in the UK, for example, that’s where I am, and it was a model estimating how someone would react to a treatment based on physiology and a bunch of other factors.

If I took that same model but predicted people from, I don’t know, China, for example, you wouldn’t expect, or maybe you could, I’m not the expert here on the treatment, you wouldn’t expect the people from China or India, for example, to react the same way. So you’d need another treatment [00:03:00] model.

**Alexander:** So yeah, it’s really important to think about: are you extrapolating here or interpolating in terms of the model? So if you think about all the data that the model was trained on, the data that it should predict should in a way be similar to the data that it was trained on.

**Chantelle:** Yeah, this is really important. One of the key points is that we have a standard way of developing prediction models. The typical split is you take the data set that you’re developing the model for and say, okay, I’m going to use 50 percent of this data to train the model. And then you’d set aside a further 20 or 30 percent to test the model.

And this is completely independent of the data you trained it on. And then you have another 20 to 30 percent, obviously it has to add to 100, to validate your model on. But all of these would come from the same cohort.
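The train/test/validation split just described can be sketched in a few lines of Python. The 50/25/25 fractions and the seed here are illustrative choices, not a prescription:

```python
import random

def split_indices(n, train=0.5, test=0.25, seed=42):
    """Randomly partition n row indices into train/test/validation sets.
    Whatever is left after the train and test fractions goes to validation,
    so the three parts always add up to the full cohort."""
    rng = random.Random(seed)
    idx = list(range(n))
    rng.shuffle(idx)  # randomize, in case the data has some order
    n_train = int(n * train)
    n_test = int(n * test)
    return (idx[:n_train],
            idx[n_train:n_train + n_test],
            idx[n_train + n_test:])

# e.g. a cohort of 1000 patients -> 500 / 250 / 250
train_idx, test_idx, valid_idx = split_indices(1000)
```

In practice you would use an established package for this, but the key property is visible here: the three sets are disjoint and together cover the whole cohort.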

**Alexander:** Okay. So let’s go through these [00:04:00] three steps. So training a model, what does that mean actually?

**Chantelle:** Okay, so we can describe a prediction model like a white box. We have a bunch of inputs. So say if I wanted to make ice cream, I’d have my milk, sugar, flavorings and so on. And you’d put these into the box, and the box, this method, this prediction model, would be the ice cream machine, and then your output would be the ice cream.

So the way you’d go about developing this model is you’d typically do some exploratory analysis first. And this is looking at the data you have and figuring out, okay, can I see any patterns in the data? And you’d look at your outcome and check how it relates to each of these inputs.

Maybe there is a linear pattern, and this would typically show a linear [00:05:00] model is the most appropriate. And there are loads of other relationships that these variables can have with the outcome, logistic, for example. And we would look at the data, check for outliers, maybe do some imputation for missing data if that’s appropriate. And it’s an iterative process from there.

So you’d use your software, and using the software, you’d then choose which of these variables or characteristics of your data are most appropriate for predicting your outcome. And you could try different models. For example, in survival there are quite a few of them, notably the Weibull model and the Cox model, which can be interchangeable based on assumptions.

Also, I should have said previously, in your exploratory analysis, check that any assumptions for the models you may want to use actually hold, because otherwise the model that you end up with won’t [00:06:00] be much use if they don’t.

**Alexander:** Cool. So, the training part is basically to set up this model. Yeah.

So that you have all the parameters of the model itself, kind of, you know, what kind of mathematical function you have, what kind of error distribution, all these kinds of different things that are set. That’s the training part.

**Chantelle:** Yes, and this will be a working draft. Okay, I guess that takes us to the next bit, which is the testing.

And for me, testing is a bit of an iterative process. Again, I wouldn’t use the testing data to then go and redo my model. But say the metrics for my tests come back and say, oh, this model is actually pretty rubbish at predicting our test set of patients. Then I would go back and go, okay, maybe I need to try another kind of [00:07:00] model.

Maybe one of my assumptions didn’t really match, or maybe I’ve missed a huge variable that was actually a key variable for predicting the outcome. Then you’d go back and forth. Creating predictive models, particularly ones that are useful, is a very long process, and it’s something that definitely should not be rushed.

**Alexander:** Okay. So I have now my model, yeah, that I got from the test cases, and now I put the input data into this model and get out the predictions. And of course I have the actual cases, so I know what is reality and what is predicted. And now I can look into how much these align with each other.

Yeah. So for example, I can look at the predicted survival curve and the [00:08:00] actual curve and look at the correlation of these. Yeah, so I think the really important thing is that you look at the correlation, or concordance, and not so much whether you get the same picture, because, well, it’s for the individual patients that you want to make a prediction, not for the population, isn’t it?

**Chantelle:** Yes. So typically you do use metrics such as calibration or discrimination. This is just for survival modeling, if anyone’s not guessed yet, that’s my thing. It’s what gets me out of bed in the morning. And typically, as you said, you would take your predictions, like your Kaplan Meier curve, plot it next to the actual Kaplan Meier curve, and obviously there’s room for a bit of error, no prediction model is perfect.

But you want to make sure that it’s not useless. You want to make sure that if someone is incredibly sick and you’ve tried every treatment, that [00:09:00] it’s not going to predict that they’re going to be perfectly healthy tomorrow. Right.
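The actual-data side of the comparison Chantelle describes is a Kaplan Meier curve. As a minimal sketch, here is the estimator in plain Python on toy data; real analyses would use a dedicated package such as R’s `survival` or Python’s `lifelines`:

```python
def kaplan_meier(times, events):
    """Kaplan-Meier survival estimate.
    times: observed follow-up times; events: 1 = event occurred, 0 = censored.
    Returns (time, S(t)) pairs at each time where an event happened."""
    order = sorted(range(len(times)), key=lambda i: times[i])
    at_risk = len(times)
    surv = 1.0
    curve = []
    i = 0
    while i < len(order):
        t = times[order[i]]
        d = 0  # events at time t
        n = 0  # subjects leaving the risk set at time t (events + censored)
        while i < len(order) and times[order[i]] == t:
            d += events[order[i]]
            n += 1
            i += 1
        if d:
            surv *= 1 - d / at_risk  # multiply in the conditional survival
            curve.append((t, surv))
        at_risk -= n  # censored subjects drop out of the risk set here
    return curve

# Toy cohort of 5 patients; the third and fifth are censored
curve = kaplan_meier([1, 2, 3, 4, 5], [1, 1, 0, 1, 0])
```

Note how the censored patient at time 3 does not step the curve down, but does shrink the risk set for the event at time 4.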

**Alexander:** Yeah. So putting both Kaplan Meier plots on top of each other. So from a prediction of survival you get kind of a time to event; would you also get the probability of getting censored in that case?

**Chantelle:** No, not specifically. I’d need to check on that one.

**Alexander:** Okay, so that’s really interesting. And of course I’m looking into, okay, are the patients at the start of the Kaplan Meier curve the same on the predicted and on the actual ones? Yeah. So that’s super interesting.

Would [00:10:00] you, so one other question I have is dividing the test and the training data set. Do you do that at random, or how do you select?

**Chantelle:** Yep, typically at random. There’s a bunch of different methods we can use. I typically just use a built-in R package to randomize the split.

Some people will do, okay, starting at the first, second or third observation in the data set, I want every fifth one, and then pull those together, and then the next load would go into the second set. There’s a lot of different randomized sampling methods that people use, but you want to make sure that it’s randomized, particularly if there’s some order to the data that you’re using.

**Alexander:** Yeah, yeah, completely agree. You mentioned the word calibration. What does calibration mean? Can you explain that without a formula?

**Chantelle:** Yes, it’s typically the alignment [00:11:00] of what we expect versus what actually happens. So typically you would have a curve, and you’d want that curve to be a straight line.
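One common way to get that expected-versus-observed curve is to group patients into bins by predicted risk and compare each bin’s mean prediction with its observed event rate. A rough Python sketch for a binary outcome (equal-size bins are an illustrative choice; packages like scikit-learn provide a ready-made `calibration_curve`):

```python
def calibration_bins(predicted, observed, n_bins=4):
    """Sort patients by predicted risk, cut into n_bins equal-size groups,
    and return (mean predicted risk, observed event rate) per group.
    A well-calibrated model gives pairs near the 45-degree line."""
    pairs = sorted(zip(predicted, observed))
    size = len(pairs) // n_bins
    out = []
    for b in range(n_bins):
        # last bin absorbs any leftover patients
        chunk = pairs[b * size:(b + 1) * size] if b < n_bins - 1 else pairs[b * size:]
        mean_pred = sum(p for p, _ in chunk) / len(chunk)
        obs_rate = sum(o for _, o in chunk) / len(chunk)
        out.append((mean_pred, obs_rate))
    return out

# Toy perfectly calibrated example: predictions match outcomes exactly
points = calibration_bins([0.0, 0.0, 1.0, 1.0], [0, 0, 1, 1], n_bins=2)
```

Plotting these pairs gives the calibration curve; deviation from the diagonal shows where the model over- or under-predicts risk.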

**Alexander:** Okay. Yeah. So you basically look, for each patient, at the predicted versus the actual value, and if that is on a straight 45-degree line, it’s perfect. Okay. Yeah. So this is as if you had binary data and perfect concordance. Yeah. So that’s cool.

What other characteristics would you look for in a good prediction model in this case?

**Chantelle:** So I like to say that the simpler the model, the better. You can have a model with 30-plus variables in it, [00:12:00] and in most cases that’s going to perform quite well, because the more information you have, the better.

The issue with those models is interpretability. A model is completely useless if the clinician can’t interpret the result and doesn’t understand what the model is doing. So as statisticians, we have a responsibility to ensure that our methods will be used to benefit the patient. So I like to make sure that my models are as simple as possible while also taking into account as much information as is useful.

We don’t want to add variables just for the sake of increasing the prediction a tiny amount, because that leads to a lot of issues further on.

**Alexander:** Yeah. So there are a couple of considerations, and collecting data is, by the way, not [00:13:00] free. Yeah. So the more data you need for the model, the more expensive it becomes to run it. And also there’s a very high likelihood that, as you said, some variables will not add a lot of meaningfulness. So how do you prune a model so that you only have those variables in it that really contribute a lot?

**Chantelle:** Yeah, so I’ll answer your question in just two seconds.

One thing I want to emphasize with the whole more-data element is that for prediction models there was a really great paper by Richard Riley et al. on sample size criteria for prediction modeling, which said that as a rule of thumb you need to have more than 10 observations per variable [00:14:00] you add to the model, and that’s events. So, taking that aside, obviously, if you have 20 people in your data set, you wouldn’t want to have more than one or two variables. 20 would be a rubbish number; I would freak out if I were asked to make a predictive model based on 20 people. Going back to your question about sort of feature selection, choosing the variables.

If you had your variables and it wasn’t clear based on p-values, for example, which ones would be better than others, and you had loads of them, one method, and sometimes people refer to it as overkill, but I think it can be useful in certain circumstances, and it’s something I’m working on at the moment, actually, is penalization techniques.

And these are a mathematical tool that sort of makes sure that the model is as sparse as can be. There are different versions, for example, lasso, [00:15:00] ridge and horseshoe penalization. And they look at the scale of the coefficients for each of these covariates and try to shrink it as much as possible if the variable isn’t as useful.

So I’m looking at it at the moment in the context of multi-state modeling. There’s been a great paper on clinical prediction models by Richard Riley and Glenn Martin on penalization techniques in small-sample-size sort of environments, and it’s a great read.

But yeah, you typically use feature selection techniques including penalization, or there are simpler methods such as stepwise regression, where it adds one variable at a time to a model and goes, okay, does it add any more value? And then you could reverse it and see how much they align.

But typically you’d have some background information from prior research before building the model that would tell you, okay, [00:16:00] maybe this variable is more useful than others. So I’d strongly suggest using the reasoning approach before going for anything more technical, especially if you don’t fully understand the methods.
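The shrinkage idea behind lasso-style penalization can be illustrated by its core operation, soft-thresholding: coefficients are pulled toward zero by the penalty, and coefficients smaller than the penalty are set exactly to zero, which is what drops weak variables from the model. This is a toy sketch of that single operation, not a full fitting routine, and the penalty value 0.5 is arbitrary:

```python
def soft_threshold(beta, lam):
    """Lasso-style shrinkage of one coefficient: shift it toward zero
    by lam, and zero it out entirely if it is within lam of zero."""
    if beta > lam:
        return beta - lam
    if beta < -lam:
        return beta + lam
    return 0.0

# Hypothetical coefficients; small ones vanish, large ones shrink
shrunk = [soft_threshold(b, 0.5) for b in [2.0, 0.3, -1.2, -0.1]]
```

Ridge penalization, by contrast, shrinks coefficients proportionally but never sets them exactly to zero, which is why lasso is the one usually used for variable selection.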

**Alexander:** Yeah, yeah. Completely agree. Really understand all the different variables in it. I have earlier in my career done that and I didn’t completely understand it.

**Chantelle:** We’ve all been there.

**Alexander:** One of the variables was just a log transformation of another variable. Yeah. And so, yeah, you definitely want to avoid having both the original and the log transformation in it, usually.

So, now we have the training and the test; now we get to the validation. When do we make that step from test to validation?

**Chantelle:** So I would recommend doing it once you’ve done [00:17:00] this iterative process of, okay, test, train, test, train, and you are happy with your model. And this validation is a last final check just to make sure you haven’t done anything radically wrong before saying, okay, this model is useful and can go into practice.

**Alexander:** Cool. So we just talked about lots of traditional methods. Yeah. If you look into some of the classical, well, they could be called data science, books, like The Elements of Statistical Learning, you can probably find lots of these things in there. Now there’s this big hype about artificial intelligence.

Will that take over completely from these traditional methods?

**Chantelle:** So I might be echoing Kaspar from the hype episode of this podcast. I don’t have a straight answer [00:18:00] for you. Probably not. But AI methods can also be quite useful. So the way I like to think of this is, as statisticians, we have a toolbox.

If we were electricians, every electrician would take the right tool for the job. You wouldn’t take a spanner when a screwdriver would be needed, for example. So we have a responsibility to educate ourselves and pick the right tool. There are many differences between statistical versus machine learning and AI methods, and just one of them would be interpretability.

Yeah.

If I needed my model to be interpretable, I wouldn’t go for AI and ML, I would go for standard statistical methods. And one thing that machine learning techniques can handle very well is small sample sizes, and we know that standard statistical techniques aren’t great at that. So it’s really about picking the right tool [00:19:00] and keeping up to date with the methods.

And that can be quite difficult, especially now that it seems every other day there’s a new AI method, and it’s really spiraling at quite an impressive rate. But yeah, just to echo myself a bit, you do need to make sure that you are educating yourself and that you’re picking the right tool at the right time to do the right job.

**Alexander:** Yeah. Yeah. Absolutely love that. So it’s definitely an area to explore, it’s definitely an area to learn about, and to work together with people that come from that space. Yeah, I think the biggest advantage we can get is if we work together, learn from each other, embrace the opportunity and do the best with it.

And yeah, I also loved the episode with Kaspar, that was super fun. [00:20:00] So there’s a Kaspar Rufibach episode published some time ago; just scroll back a little bit in your podcast player and you will find it. Now one last question. If I want to do all of these things, what type of tools do I actually use?

**Chantelle:** For clinical prediction modeling, it depends on what you’re most confident with, using the toolbox analogy. So for me personally, that would be either R or Python. I have some experience with SAS. But you really just need to make sure that you are using a tool that you’re comfortable with and know how to use. So yeah, it’s a very general answer.

**Alexander:** Yeah. And I’ve definitely used SAS for a couple of these things. Yeah. So there are lots of built-in things, whatever, PROC LOGISTIC or other procedures, whatever you want. So have a look into what’s possible [00:21:00] there.

Yeah. And of course, if you’re really into programming and you’re multilingual, that always helps. So speaking about this, if you really love SAS, we also have a SAS2R course that helps you learn more R, and it is specifically designed for people working in clinical trials, health data, these kinds of things.

And it comes with an awesome presenter, Thomas Neitmann, who has helped lots and lots of statisticians, programmers, and others in the pharma and CRO space transition to R. It’s a really, really nice course, and it’s both pre-recorded and interactive. So just check out the academy on The Effective Statistician to see when the next live training will be; you can [00:22:00] always look into the recordings. Thanks so much, Chantelle, for this awesome discussion about predictive modeling. We talked about a couple of different use cases, we talked about testing, training, validating, and went into quite some detail about pruning and all these kinds of different things.

I think it’s an absolutely fascinating area, and it is definitely something you can explore and have a look into. There are a lot of applications within healthcare.

**Chantelle:** Thank you for having me, Alexander. It’s been a pleasure talking about something that I could talk about for hours.

**Alexander:** Yeah, that’s kind of the nerdy part of this podcast.

I love it.

### Join The Effective Statistician LinkedIn group

This group was set up to help each other become more effective statisticians. We’ll run challenges in this group, e.g. around writing abstracts for conferences or other projects. I’ll also post further content into this group.

I want to help the community of statisticians, data scientists, programmers and other quantitative scientists to be more influential, innovative, and effective. I believe that as a community we can help our research, our regulatory and payer systems, and ultimately physicians and patients take better decisions based on better evidence.

I work to achieve a future in which everyone can access the right evidence in the right format at the right time to make sound decisions.

When my kids are sick, I want to have good evidence to discuss with the physician about the different therapy choices.

When my mother is sick, I want her to be able to understand the evidence.

When I get sick, I want to find evidence that I can trust and that helps me to have meaningful discussions with my healthcare professionals.

I want to live in a world, where the media reports correctly about medical evidence and in which society distinguishes between fake evidence and real evidence.

Let’s work together to achieve this.