This episode originally struck a chord with statisticians around the world—and for good reason. Whether you’re just starting with real-world evidence (RWE) or mentoring someone who is, this conversation is packed with practical lessons that will help you navigate the complexities of observational data with more confidence.
In this special replay, guest Rachel Tham and I reflect on the real-world analysis mistakes, misconceptions, and growing pains we wish someone had warned us about earlier in our careers.
From ambiguous index dates to messy exposure definitions and unexpected data quirks—this episode will save you hours of rework and help you better manage timelines and expectations.
What You’ll Learn:
✔ Why “index date” is more complicated than it sounds
✔ Common mistakes around exposure definitions
✔ The importance of understanding how RWE data is generated
✔ What programmers should know about timing, variables, and algorithms
✔ Why project management in RWE must be iterative and stakeholder-driven
✔ Key terminology pitfalls that can trip up even experienced professionals
✔ How data issues like duplicates, implausible values, and partial dates impact analyses
✔ Best practices for communication, timelines, and managing expectations in RWE projects
Why You Should Listen:
If you’re working in real-world evidence or thinking of transitioning from clinical trials, this episode is packed with practical advice to help you avoid common mistakes and set your projects up for success. Rachel shares her own hands-on experience, from data management and programming to leading statistical analyses in RWE. Alexander and Rachel also highlight real-world data quirks that no textbook prepares you for, making this episode an essential resource for statisticians, data scientists, and healthcare researchers alike.
Links:
🔗 The Effective Statistician Academy – I offer free and premium resources to help you become a more effective statistician.
🔗 Medical Data Leaders Community – Join my network of statisticians and data leaders to enhance your influencing skills.
🔗 My New Book: How to Be an Effective Statistician – Volume 1 – It’s packed with insights to help statisticians, data scientists, and quantitative professionals excel as leaders, collaborators, and change-makers in healthcare and medicine.
🔗 PSI (Statistical Community in Healthcare) – Access webinars, training, and networking opportunities.
If you’re working on evidence generation plans or preparing for joint clinical advice, this episode is packed with insights you don’t want to miss.
Join the Conversation:
Did you find this episode helpful? Share it with your colleagues and let me know your thoughts! Connect with me on LinkedIn and be part of the discussion.
Subscribe & Stay Updated:
Never miss an episode! Subscribe to The Effective Statistician on your favorite podcast platform and continue growing your influence as a statistician.
Rachel Tham
Associate Director, Statistics & Real World Data Science at Astellas Pharma

Rachel Tham is an experienced statistician and programmer, who is passionate about improving patient lives and healthcare experiences. She leverages 10 years of experience in the pharmaceutical industry: 8 years dedicated to Real-World Evidence studies, and 2 years in Data Management. Prior to that, she was a pharmacy technician in community and hospital pharmacies.
Rachel is currently a Senior Biostatistician at Veramed and holds an MSc in Medical Statistics from the London School of Hygiene and Tropical Medicine along with a bachelor’s degree in Psychology from the University of Wisconsin-Eau Claire.
Transcript
00:00
You are listening to the Effective Statistician Podcast, the weekly podcast with Alexander Schacht and Benjamin Piske designed to help you reach your potential, lead great science and serve patients while having a great work-life balance.
00:23
In addition to our premium courses on the Effective Statistician Academy, we also have lots of free resources for you across all kinds of different topics within that academy. Head over to theeffectivestatistician.com and find the academy and much more for you to become an effective statistician.
00:50
I’m producing this podcast in association with PSI, a community dedicated to leading and promoting the use of statistics within the healthcare industry for the benefit of patients. Join PSI today to further develop your statistical capabilities with access to the ever-growing video on demand content library, free registration to all PSI webinars and much much more. Head over to the PSI website at PSIweb.org
01:19
to learn more about PSI activities and become a PSI member today.
01:30
Welcome to another episode of the Effective Statistician. And today I’m talking with Rachel. Hi Rachel, how are you doing? Hi, I’m doing very well. Thank you. Very good. It’s great to record this episode together. We have now worked for quite some time together and yeah, actually in different companies. And now we’re both together at the same company. So that’s pretty cool.
01:56
But before we dive into our topic of today, maybe you can do a short introduction of yourself. I had an interesting path that has led me to become a biostatistician. I knew I wanted to do something in healthcare when I was younger. Seeing my parents’ work in healthcare influenced me to pursue something similar. But after I worked in a hospital and a community pharmacy, I realized that I was more interested in the research aspect of healthcare and less so much of the…
02:23
application of healthcare. So this journey then led me through different roles as a data manager. And then I moved to real world evidence and I did some things in SQL. I was a data extractor and then I taught myself how to program in SAS. I did a part-time master’s and that led me to my current role as a statistician. You taught yourself, or you trained yourself, in SAS? Yes, I was originally a data manager and I saw that programming really helped people move
02:52
around and have more skill sets within the industry. So I taught myself SQL and started to do some extractions for different databases. And then they said, Oh, you know what, it would be really nice if we had some more SAS programmers. So I said, Hey, I’m interested. And thankfully the company I was at at the time sponsored me and I learned a little bit by myself. They sent me on a course and nurtured me to become a SAS programmer. And that further inspired me to become a statistician.
03:23
That’s such a nice story. Careers usually don’t have this straight path; you learn along the way, and, yeah, the more you dive into things, the more you learn about your own interests. And that can lead to new areas, completely new areas. Yeah. That’s pretty cool. Okay. So today we want to talk about
03:50
things we would like to have known earlier in our career about real-world evidence. Let’s start with the first topic, the index date. I think that is a really interesting one because if you come from a clinical trial setting, you may think dates are pretty easy.
04:10
You randomize and that sets your baseline, and you can have a little bit of a discussion of whether baseline is the start of treatment or the day of randomization and things like this. But usually it’s pretty much the same. Yeah. Then from there on, everything’s clear. You have how many days you had before randomization, how many days you have after randomization, and, yeah, everything is also highly regulated and you have really
04:38
nice quality, so you know exactly the day, sometimes even the time when the treatment was taken. However, in real world we don’t have that. No, it’s very different. It’s one of the main challenges, I think, that you can have subjects that enter the database, they can also leave the database, and their time within the database may not even overlap at all.
05:02
There could be different levels of severity of disease and you can also have different start dates for when the drug was administered. And you could also possibly even have different days where guidelines have changed or the new restrictions for something or even COVID comes in and those could also become index dates as well. Let’s first go back. What actually is an index date? An index date is
05:27
I guess you could mimic a randomization and say that would be your day, where you specify a baseline period before it and then maybe follow up with them afterwards. And yeah, it’s challenging. So oftentimes, as you’ll see later, they use the term first. So it could be the first disease date. It could be the first date that they received a drug prescription. It could be the day that new guidelines come in for a medication. So it could be
05:57
individual dates. Yes, it could be that for a certain patient it is, whatever, December 3rd and for another it is January 25th. But it could also be the same date for all patients, because that’s the date of the guideline change. Yes, that is also a possibility. Or, for example, if there was an issue that they found with pregnancy, and they advised medics not to give this drug, then they see the amount of
06:27
women that are pregnant that take the drug before the guideline and after the guideline. So then the guideline date could be an index date as well. Yeah. Yeah. Or it could be the start of hospitalization, these kinds of things. Yeah. Exactly. So it’s very flexible. It can change quite a lot. And that is certainly a challenge when designing, conducting, and just being interested in real world studies. So in terms of
06:55
this kind of first, if you have patients that are going in and out of the registry or the database that you’re looking into, what would you then look into? Would you look into the first entry or the second entry or would you consider both and just highlight that you’re looking into the same patient twice? How do you handle these kinds of things?
07:22
So many times it is the first time they encounter the experience, whether it’s a disease or a drug. There are some studies where you want to do some due diligence to make sure that it’s indeed that disease, or in a sensitivity analysis, and therefore it might be the second mention of, I don’t know, the disease code or the drug. Or if there’s a titration, you might say once they’re on a stable dose of the medication. So
07:49
It could be not the first experience, and it could identify a different period of time within the patient’s record. If you have some kind of, let’s say, recurring events, let’s say you’re interested in pregnancies, yeah. And you have the first, second, third, whatever pregnancy of a woman. And then you could have multiple index dates for an individual patient. And then you need to
08:18
have a look into your covariance matrix to account for these kinds of clusters. Let’s see, yes, it is possible in some studies for a patient to have multiple index dates. And in that case, I think they might contribute to separate rows. So it could be pregnancy one for this subject, pregnancy two for that subject, depending on how many pregnancies your cohort of women have. Yeah, yeah. So that’s pretty interesting, something that…
08:45
Sometimes may happen, but it’s more rare, I guess, in clinical trials. Yeah. Okay. That is the index date. Yeah. I want to point out, I think there was a lot of confusion for me when I was working with index dates in particular, because if the databases are longitudinal, they often don’t capture data from the subject’s date of birth. Instead, it’s the first mention within the database. And so there is also an element that sometimes
09:11
real-world studies use, and it’s called the washout period. And it’s just that time to make sure that when they encounter it in the database, it’s most likely to be their first experience, so they can see any outcomes based on that exposure. But I also want to mention that the word first can also be confusing, because it’s very natural in the English language to say the first thing. So for example, when I say, this is my first coffee.
09:39
It could be my first coffee of the day, or it could be my very first coffee ever. And so when in real world evidence studies, when the term first or new or incident or initiate is used, I often try to make sure that there’s a time period specified for this as well. And that has helped me to alleviate a lot of silly questions that I’ve asked. Which first or how do I know that this is actually the first one and things like that, which kind of.
10:08
The first one within years 20 to 22 or something like this, for example. Yeah. Or it could be the first mention of disease X in the database, or the first drug Y after the washout period, or something. Okay. Okay. Oh, okay. Very good. That’s interesting. Yeah. It shows how imprecise
10:37
our language sometimes is. That’s probably why in mathematics we don’t really use the English language.
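To make the “which first?” discussion concrete, here is a minimal editorial sketch (not from the episode) of an incident-event definition with a washout window. The enrollment date, the 365-day window, and the function name are all assumptions for illustration; a real study pins this down in the protocol.

```python
from datetime import date

def index_date(enroll_start, event_dates, washout_days=365):
    """Return the first event date that has at least `washout_days` of
    observable history before it, with no earlier event inside that
    window -- one possible sketch of an 'incident/first' definition."""
    for d in sorted(event_dates):
        enough_history = (d - enroll_start).days >= washout_days
        prior_in_window = any(
            0 < (d - p).days <= washout_days for p in event_dates if p < d
        )
        if enough_history and not prior_in_window:
            return d
    return None  # patient never qualifies as 'incident'
```

For example, a single event only five months after database entry would not qualify, because the patient lacks a full washout period of observable history before it.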
10:47
Okay, the next topic, and this is associated with the index date, is exposure. And that can already be difficult in clinical trials, because you often never know if and when the drug was really taken, but at least you know they received the drug. Which is very often not the case in real-world evidence, because in, for
11:15
example, claims databases, you can maybe only see the prescription, maybe see that they actually went to the pharmacy, but there are all kinds of different topics there. So tell us a little bit more about how you define exposure in real-world evidence and what the common problems are there. Classically, in the studies that I’ve done, but also just in epidemiology, you have the
11:44
exposure, which is often the treatment or the mention of disease. It’s mostly linked to the index date in my experience. They have a similar topic. So if it’s the index disease, then it’s usually the index date or the disease that is the exposure. And then we’re also looking to see if that is associated with an outcome or an endpoint. So classically you have the
12:11
exposure on the left, and then you’ve got an arrow pointing to the right and you have the outcome there. And that’s ultimately what the study is designed to do. Naturally, because in real world studies, you can design your own, I guess, sandbox where you’re going to be performing your analysis. You could sometimes have it go the other way, as in a case control study. So you could look at an outcome and then try to see and identify the exposures.
12:40
So you basically look backwards, yeah, which is also a really different thing to clinical trials, where you always look forward. So here you basically look for, okay, how many patients have died? And then you look backwards: did they have the treatment or didn’t they have the treatment, for example. Yeah. So you have that ability to flip it on its head and look in reverse in
13:06
in real world studies, but you can also look at it the classical way of finding the exposures and then looking for the outcomes. So of course, if individuals given an exposure are found to have a greater probability of developing the particular outcome, it suggests there’s an association. However, if the groups have the same probability of developing the outcome, then it suggests that there may not be an associated risk. However, I think
13:35
we still need to think about that critically within real-world studies, because we’ve got other elements such as confounders and things like that that could play a role in this association between exposure and outcome. Because that is one of the big challenges: in clinical trials, you more often have the precise measurement of the exposure, whereas in real-world evidence that’s much more difficult.
14:04
And it is also not given by the protocol. Yeah. So you can have many more different dosing schemes, and the prescription intervals may not really make sense when you first look into them. The amount of drug prescribed can vary over time. You can have big boxes and small boxes and all kinds of different things. Or maybe they even change the treatment, but it’s from one generic to another generic. Yeah.
14:34
And of course the way medics operate and prescribe things can also differ from practice to practice or person to person. So all of those things can be influencing. Yeah. Yeah. You really need to look very closely into how the data happens here and why the data was collected in the first place, because that can drive your understanding of why certain data is not collected. Yeah.
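One common way to tame messy prescription intervals like those just described is to collapse fills into continuous exposure episodes with an allowable gap. This is a hypothetical editorial sketch; the grace period, and the handling of stockpiling, switches, and overlaps, would need to be pre-specified per study.

```python
from datetime import date, timedelta

def exposure_episodes(fills, grace_days=30):
    """Collapse prescription fills into continuous exposure episodes.
    `fills` is a list of (fill_date, days_supply) tuples; a new episode
    starts when the gap after the previous fill's supply runs out
    exceeds `grace_days`. Returns a list of (start, end) tuples."""
    episodes = []
    for start, supply in sorted(fills):
        end = start + timedelta(days=supply)
        if episodes and (start - episodes[-1][1]).days <= grace_days:
            # refill close enough: extend the current episode
            episodes[-1] = (episodes[-1][0], max(episodes[-1][1], end))
        else:
            episodes.append((start, end))
    return episodes
```

Two fills a few days apart merge into one episode; a fill months later starts a new one, which is exactly the kind of rule a definition sheet should record.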
15:03
Exactly. Tell me a little bit more about things like stratification and other approaches that you can use to potentially adjust for confounders. How does that work here? Sure. So when I was a programmer, I wasn’t statistically trained quite yet. So a lot of these things I figured out by talking to a statistician. And also then, when I had done my masters, which I did part-time, I had a light bulb moment and realized
15:32
how I could have streamlined and done my programming a bit better. For example, usually things are stratified by the things that occur in the baseline period. So that’s why you would want to identify, for example, the baseline age, and we’re not so interested in the outcome age, because that could be very different and I don’t know if it would really influence our interpretation of the results. What is baseline age and the outcome age? So as a programmer, you can calculate age
16:02
at any point in time. So you could calculate the age at the index date. You could calculate the age when they experienced the outcome, if they experienced the outcome. You could also calculate the age when they leave the database or enter the database. And when I was a programmer, I was always wondering why this one age was most important and none of the other ages were of interest. When I started to realize that we’re actually interested in the exposure-outcome association, it made so much sense why we’re really only interested
16:32
or place a high focus on the exposure age and maybe none of the other opportunities where age can be calculated. Yeah. What other kinds of things have you stumbled over while doing the programming that you wish you would have known earlier? Sure. This helped me understand an order of operations and how I like to program my code.
16:57
So I usually would start with defining the cohort, performing the different attrition steps, and then structuring the different study periods, like the baseline and the follow-up. If they were interested in any other different windows, those would be the next step that I would take. Only then would the index or exposure be created and programmed, followed by the outcomes. I know that sometimes I would create the index exposure and then do the baseline demographics.
17:26
But I overcame a big challenge, because there was one study where we were looking at mortality. And, as you mentioned, you need to know your data and the real-world-evidence nature of things. Sometimes people have typos and whatnot, and you can’t really go back and try to correct those mistakes. But we found that there were some death dates that had occurred before the index date as we started to program the outcomes. And so that kind of threw the entire study into a bit of a whirlwind and we had to start from scratch again
17:56
and remove the outliers that we couldn’t really account for. So then I would program the outcomes. And finally, I would then do the baseline characteristics, any confounders, covariates, and then the statistical analysis or output tables after that. Actually, that’s another really important point. If you work with real data, there will always be some weird patients in it.
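A sanity check for the death-before-index problem described above can be automated early in the pipeline, before the outcomes are programmed. A minimal editorial sketch, with a hypothetical record structure:

```python
from datetime import date

def implausible_date_flags(records):
    """Flag patients whose death date precedes the index date -- the
    kind of data quirk best caught before any outcome programming.
    `records` maps patient id -> (index_date, death_date or None)."""
    flags = []
    for pid, (index_dt, death_dt) in records.items():
        if death_dt is not None and death_dt < index_dt:
            flags.append(pid)
    return flags
```

Running checks like this right after cohort creation means an implausible record triggers a query or an exclusion decision, instead of a restart from scratch.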
18:25
Yeah. And having some robust analysis and robust programming techniques that take account of these extreme outliers makes a lot of sense, because otherwise they can completely derail your analysis. I was once working on a study in bipolar disorder. Yeah. And within bipolar, you have these so-called rapid cyclers as well. So they switch very fast between depression
18:55
and mania. And there was this one patient that had 20,000 cycles lifetime. And you were thinking, how can that be? Yeah. And that data point was actually queried and confirmed by the physician. Yes, that is correct. But of course, if you do some kind of linear regression and most of your patients have less than 100, and then you have this one with 20,000. Yeah.
19:25
Linear regression can be pretty much dominated by that one individual. Yeah. Yeah. They can have a big influence. Yeah. And I think this is a big opportunity: real world studies can borrow some of the standards that programmers are required to follow within clinical trials, with the creation of the SDTM and ADaM datasets and things like that. Yeah.
19:54
But it’s really good to check for these outliers and any extreme values as soon as possible, and have a discussion about whether you want to exclude them, for example, from the analysis. Because, yeah, you don’t want to have one or two patients driving the complete analysis. Yes. And it helps you build trust with your stakeholders when you find those errors first, before they are delivered.
20:21
That’s the other point. If you really understand what reasonable values are, that will make a big difference. Is that association expected to be positive or negative? Are these values expected to go up or down? That helps you to avoid lots of uncomfortable discussions.
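For the outlier screening discussed here, a quartile-based fence is one robust first pass, since quartiles are barely moved by a single extreme patient. An editorial sketch; the fence multiplier is an assumption to be agreed with the team:

```python
def iqr_outliers(values, k=3.0):
    """Flag values outside median-based fences, a common first screen
    before deciding (with the study team) whether to exclude them.
    Uses the interquartile range, so one huge value (e.g. 20,000
    cycles) cannot mask itself by inflating the spread."""
    xs = sorted(values)
    n = len(xs)
    q1 = xs[n // 4]          # approximate first quartile
    q3 = xs[(3 * n) // 4]    # approximate third quartile
    iqr = q3 - q1
    lo, hi = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if v < lo or v > hi]
```

With 99 patients under 100 cycles and one at 20,000, only the extreme patient is flagged, ready for the exclusion discussion rather than silently dominating a regression.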
20:50
Yes, indeed. Okay, very good. Let’s step to the next point. So again, the English language a little bit. If we talk about specifications, what are the kind of things we should, yeah, be careful of? So oftentimes when you say the words prior to or before or after when you’re referring to dates, these are very clear to the…
21:19
most English speakers, like, okay, just before that or after that. But you need to decide if you’re going to include the equal sign or not. That is the question. Does it include that before or prior-to date, or does it exclude that before or after date, for example? So I think when I was programming, that would often be a question that I would be asking the statistician or the study team: whether the date is included or not. So.
21:49
I guess, to avoid a little bit of extra to-ing and fro-ing, it’s really helpful to be explicit about whether it’s included or not. Yeah. Especially if these dates are incomplete. Yeah. And you have maybe just a year and month in there, but not a day. Yeah. Then, when you compare it with an event date that is complete, you may not always know exactly whether it’s
22:17
before or after it is occurring. Exactly. Yeah. It’s always a big challenge when you’ve got partial dates. Another word that kind of perks my ears up when I hear it is type of, when it’s referring to categories. It always helps to specify if the groups are mutually exclusive, or if they’re not mutually exclusive and subjects can fall into more than one category.
22:46
because there have been times where we think groups are mutually exclusive, but then a duplicated experience or record in a patient’s history puts them in more than one category. In those instances, you might need to assess which record is going to be categorized into these mutually exclusive groups. Is it the most recent, or is it the most common? Perhaps you’re going to apply some hierarchical rules to it, or even pursue a data-driven approach.
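The hierarchical-rule approach just mentioned can be sketched as follows; this is an editorial illustration, and the category names and the "other" fallback are hypothetical:

```python
def assign_category(patient_categories, hierarchy):
    """Collapse a patient whose records fall into several categories
    into one mutually exclusive group using a pre-agreed hierarchy:
    the first matching entry in `hierarchy` wins."""
    found = set(patient_categories)
    for cat in hierarchy:
        if cat in found:
            return cat
    return "other"  # fallback for patients matching no listed category
```

The point is that the tie-break lives in one pre-specified list the whole study team can review, instead of being an implicit side effect of sort order in the data step.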
23:15
And that is exactly where I very often have a problem with “you need to have everything pre-specified”. Because yes, pre-specification is really great, but my experience is that as soon as the pre-specification hits the data, you need to have a second thought about it. Because you just can’t anticipate everything that might happen in the data.
23:43
Yeah. If you’re more experienced, surely you can take care of lots of different things that might happen. And especially if you have been working with a specific data set for a longer period of time, maybe you already have some kind of standard definitions and paradigms that you’re working with all the time. But there’s always this one patient that, you know, you need to take care of.
24:11
Isn’t it? Yes. Hopefully it’s not the same outlier as before, but yes, there are always things where you have no idea how they arrived in the data set, but you need to account for them. And if you’re going to include them, decide how they’re going to be included within the structure of your analysis. Yeah. Very good. Yes. That’s a good point. What else? Treatment patterns or lines of therapy are also a really common topic to explore. However,
24:39
with the first one, the English language has very broad and loose terms for some words, such as dose or dosage. It doesn’t really have a time period associated with it. So it could just be the strength of one pill, but it could also be the strength that somebody takes in one day. Or then you also have different routes of administration, which can complicate things, like IVs. Like, how long of a duration does this IV last for? Yeah. Yep. Yes. That’s good.
25:09
Yeah. And if you combine different stuff there and you have overlaps, in terms of filling the prescription early or late, these kinds of things can make it really complex. Yes. Those are some of the more challenging studies that I think I’ve performed. But yeah, I think if you’re going to be doing a treatment pattern or line of therapy study, for your study team’s sake it would be really nice to have just a definition sheet of these different words
25:38
and the definition of how you are going to interpret it and implement it within the study. Yeah. One other thing there from a statistical point of view, of course, is if you have these errors in your covariates, yeah, they have an influence on your analysis. And the interesting thing is they always dilute the effect.
26:08
You can very easily see it: the bigger your error in the covariate is, the smaller the measured effect will be. Because if you go to the extreme and the error is so big that the covariate is completely random, then of course the regression coefficient should be zero. If you understand the variability, if you can find some kind of
26:38
measurement for how much error you have associated with it, then you can actually at least adjust for it, or get a feeling for how much you have underestimated the effect of the covariate you’re looking into. And sometimes specific sub-studies might be helpful. Yeah. If you have some kind of gold standard somewhere, yeah, you can look into these, or
27:08
Maybe you can at least kind of assess how much variability you might add due to this error in the covariates. It’s a typical thing that we rarely look into in clinical trials, but in observational data it can be really important. Yes, measurement error and misclassification are a big issue. Yep. Yeah. By the way, it’s also a really interesting thing when you think about PROs,
27:36
because inherently we always have measurement error. And so that’s yet another topic. Yes. OK. What else are common pitfalls that you would step into when doing real-world evidence data analysis? So I think you alluded to it before, but: know your data. Know how the data was collected, why it was collected, and what gaps possibly exist. Some real-world data, if you’re lucky, is collected for research purposes.
28:05
Well, quite a few others are repurposed administrative data. So it’s good to know how the data got there and possibly how reliably it’s collected. So for example, for reimbursement databases, some fields might get reimbursed while others do not. And this could contribute to the missing data that you see. And so if you’re creating variables based off the parts that are not reimbursed.
28:32
the likelihood of it being unreliable or having error in it could be large. Whereas if you’re trying to investigate a variable that is reliably collected and is reimbursed, then you’re going to have a much higher chance of it being complete, without too many missing values. Yeah. What else are typical data issues that you step over? So I think the most common ones are duplicates, zeros,
29:01
and missing data, and then implausible values. So duplicates, I think, are just a natural phenomenon of real world studies. You sometimes have, for whatever reason, two rows that are identical, except for maybe an identifier of some sort. And you need to understand then: is this the same row, or are these two different experiences that just got recorded like this? And so they also sneak in and sometimes add
29:31
issues like the categorization one we spoke about earlier. Then, for missing values, sometimes a zero could also be a missing value, which is sort of weird. So when you see a missing value, or you see a variable that has zeros, it’s good to think about the other side of the coin. Is it indeed missing, or is it just a zero, and does zero mean missing? Then you can also have implausible values, things like
29:59
values with errors. So do the totals add up? Are there outliers? This could also be inconsistent values, so different recordings of the same variable. Yeah. You could have out-of-date variables. You could have uncommon characters finding their way into places they shouldn’t be. And then there are also formatting issues. So, you know, you’ve got your US spelling of things and you’ve got a UK version of spelling.
30:26
They also use slightly different date systems. So those are also challenging things to overcome, depending on the data set you’re using. And then the one that threw me one time is a trend over time. So in one of the databases that I used, they started in one year to reimburse the reporting of diseases, because I think the government was trying to understand and support it. Let’s say it was diabetes.
30:56
And so we noticed that there was a real big spike in this year where this requirement to report diabetes better occurred. But then of course, that meant if our study spanned that spike period, then the covariates that we would capture for the comorbidity of diabetes would be completely different for the two different periods of the spike. So it’s really important to look for trends over time as well.
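The duplicate, zero-as-missing, and implausible-value checks discussed in this section could be bundled into one small screening routine, the kind of checklist mentioned later in the episode. An editorial sketch with hypothetical field names:

```python
def quality_report(rows, zero_means_missing=("weight",)):
    """Minimal screen for three issues discussed above: exact duplicate
    rows, zeros that may actually encode missingness, and implausible
    negative values. `rows` is a list of dicts."""
    seen, duplicates = set(), 0
    suspect_zeros, negatives = 0, 0
    for row in rows:
        key = tuple(sorted(row.items()))  # whole-row fingerprint
        if key in seen:
            duplicates += 1
        seen.add(key)
        for field in zero_means_missing:
            if row.get(field) == 0:
                suspect_zeros += 1  # may mean 'missing', query it
        if any(isinstance(v, (int, float)) and v < 0 for v in row.values()):
            negatives += 1
    return {"duplicates": duplicates, "suspect_zeros": suspect_zeros,
            "negatives": negatives}
```

Each count feeds a discussion rather than an automatic fix: a duplicate might be a second real encounter, and a zero weight is almost certainly a missing value rather than a measurement.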
31:25
It’s coming back to the question of how does the data happen? Yeah. If before it was reimbursed, and afterwards it wasn’t reimbursed, or the other way around. Yeah. Of course you have these kinds of triggering external events where you then see, wow, what’s happening here? And there’s some kind of interesting thing in the data. And when you then show it to physicians that actually have worked in the field, they say, yes, of course, because
31:52
we had the guideline change, or there was this external incident, and then everybody was looking into it. Yeah. So yeah, these are just things to look into, to try to understand the data a little bit more. Because I think if I would have had this little checklist of things to go through, it would have prevented many errors from being delivered in my past. And by the way, we will put lots of this into the show notes, so you can
32:17
go back and then see, okay, what are all the different things you should watch out for, so that you don’t make the same mistakes, or at least capture them before you report them. Yes. Something else? Maybe just set expectations. Yeah. That you say: here, don’t expect this to be the same quality as we usually have with clinical trials. Yeah.
32:45
I had said in the past, very often, when stakeholders would work on an observational study for the first time, they would expect the same data quality as in a clinical trial. And then: how can it be that we don’t have a gender for some patients? Welcome to real life.
33:10
Yeah. Yeah. Exactly. Yeah. Okay. Speaking about managing expectations, let’s talk a little bit about managing real world evidence projects, because that is also a little bit different from clinical trial projects. What are your thoughts about this? When I join a project, I try to keep an eye out for what I call the critical success factors.
33:34
These are the building blocks or milestones that determine whether a project will succeed or will face some challenges. Some examples are: is the outcome variable available? Or do the important variables have a lot of missing data? Or if a segment will be data-driven, I know that part is going to require a lot of focus and attention. Or if an algorithm you’re designing will feed into your results and have a large impact on them. And also the study objectives.
34:05
For any of these critical success factors, I try to link them to a deliverable, even if it’s not requested or one of the objectives. It can be as simple as a two-by-two table or a histogram, just something that will help facilitate stakeholder engagement and trust in the results that you’re going to deliver. Yeah. So let’s go through them step by step. So first: are these outcomes
34:31
actually available, and what is the quality of these data? I think this is a really important first step. It’s what I often call some kind of feasibility testing. So whether we actually can do what we anticipate doing, and whether the data is good enough for it. So how do you tie that to some kind of deliverable, as you said? So it could be something like listing out all of the variables of interest.
35:01
And, I don’t know, let’s say that they’re comorbidities, for example. You could report the proportion of patients that have them recorded in their history, and then you could compare that to clinical trials or to what’s published in the literature to see if it’s comparable, or if you’re noticing big gaps that you can’t account for in the data set. Maybe those medical codes aren’t very well reported, or they’re reported at a more
35:28
broad level and not so granular. So yeah, I think it’s important to see that. It’s also important to see how big your study population is, because there have been times where we do a preliminary feasibility check to see how many subjects in the database have both the drug and the disease. But then of course, on top of that, you might also do some extra data cleaning to remove those strange values.
35:57
You have patients with outliers that will really influence things, and you don’t know how to proceed with analyzing them. Or it could also be adding age restrictions, or stratifying further. And then you could end up with really small numbers, and then maybe that data set isn’t so feasible, or you want to broaden your cohort and relax some of the exclusion or inclusion criteria. And then… That’s a really important thing. Yeah.
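A variable-completeness deliverable like the one described here might be sketched as follows (the comorbidity flags and the literature benchmarks are invented for illustration):

```python
import pandas as pd

# Hypothetical baseline table: one row per patient, 0/1 comorbidity flags.
cohort = pd.DataFrame({
    "patient_id": range(1, 11),
    "diabetes":     [1, 0, 1, 0, 0, 1, 0, 0, 1, 0],
    "hypertension": [1, 1, 1, 0, 1, 1, 0, 1, 1, 1],
})

# Proportion of patients with each comorbidity recorded in their history.
observed = cohort[["diabetes", "hypertension"]].mean()

# Benchmarks from trials or the published literature (illustrative numbers).
expected = pd.Series({"diabetes": 0.35, "hypertension": 0.75})

# Side-by-side comparison: large gaps may mean the codes are
# under-reported or only captured at a broader level.
feasibility = pd.DataFrame({"observed": observed, "expected": expected})
feasibility["gap"] = feasibility["observed"] - feasibility["expected"]
print(feasibility)
```

Even this tiny table can be shared with stakeholders as an early feasibility deliverable.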
36:27
Maybe initially you were thinking too strictly, yeah, saying, okay, we only want to include patients, you know, who have been in the database no longer than five years. And then you see, if we do that, we end up with so few patients that it’s not feasible anymore to get to any conclusions. Yeah. So can we relax that? These discussions are really important to have, because it’s a little bit of a bias-variance trade-off discussion. Yeah.
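A running attrition log makes these trade-off discussions concrete. A minimal sketch in pandas, with invented column names and an enrollment-length criterion like the one just mentioned:

```python
import pandas as pd

# Hypothetical source population; columns are invented for illustration.
patients = pd.DataFrame({
    "patient_id": range(1, 9),
    "years_in_database": [1, 2, 3, 4, 5, 6, 7, 8],
    "has_drug_and_disease": [1, 1, 1, 0, 1, 1, 0, 1],
})

# Record the cohort size after each criterion, so the impact of a
# strict criterion is visible and can be discussed before relaxing it.
attrition = {"source population": len(patients)}

step1 = patients[patients["has_drug_and_disease"] == 1]
attrition["drug + disease recorded"] = len(step1)

step2 = step1[step1["years_in_database"] <= 5]
attrition["<= 5 years in database"] = len(step2)

for criterion, n in attrition.items():
    print(f"{criterion}: {n}")
```

If the final count is too small to support any conclusions, the table itself is the evidence for revisiting the criteria.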
36:55
How clean do you want your data to be, trading that off against having less data? Exactly. Yeah. So yeah, it could be attrition. It could be all the variables, as I mentioned, and the missingness and completeness reports. It could also be if you are developing a variable. There was one time I developed an algorithm that fed into the outcome for the study.
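When a derived algorithm feeds into the study outcome, a quick reliability check on a subset where the outcome is directly recorded can be as simple as a cross-tabulation. A hedged sketch with invented data:

```python
import pandas as pd

# Validation subset: the outcome is directly recorded here, so the
# algorithm's derived outcome can be checked against it.
validation = pd.DataFrame({
    "recorded_outcome":  [1, 1, 1, 0, 0, 0, 1, 0],
    "algorithm_outcome": [1, 1, 0, 0, 0, 1, 1, 0],
})

# A simple two-by-two table of recorded vs. derived outcomes.
table = pd.crosstab(validation["recorded_outcome"],
                    validation["algorithm_outcome"])
print(table)

# Overall agreement as a quick summary of reliability.
agreement = (validation["recorded_outcome"]
             == validation["algorithm_outcome"]).mean()
print(f"agreement: {agreement:.2f}")
```

The off-diagonal cells show where the algorithm disagrees with the recorded outcome, which is where to focus any rework.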
37:20
We wanted to make sure that the algorithm was reliable. So we had a subset where we didn’t need to apply the algorithm, and a subset where we did. And so I just did a simple two-by-two kind of table comparing them where I was able to apply the algorithm, and we could then see if it was a reliable algorithm or if maybe we needed to think about how we were going to proceed with it. Yeah, that’s very good. The other really interesting thing is
37:50
the data-driven parts of your analysis. Can you double-click on this one? What does that mean for you? Yeah, so I’m pretty sure if you’re coming from a clinical trial background, this might not make any sense. There are many times where, when you’re setting out to design the study, before you really start analyzing things, there are aspects of the database that you might not be sure about. I don’t know if this variable is available, or
38:18
if we can create this many categories for occupation, or so on. Sometimes you need to actually do a little bit of analysis to find out what the different categories are and how many patients fall into each category, so you can see if they need to be aggregated into bigger categories. That would maybe be more of a simple data-driven approach, but it could also be that a finding or a conclusion unlocks
38:46
the ability to proceed with option A or the ability to proceed with option B. So you sometimes will have go/no-go decisions that depend on what happened with the analysis prior to that as well. Yeah. And it is really important to set expectations around these and say, here and here we will need to have meetings to make decisions and discuss the data, and to be sure that you can interact with your stakeholders
39:15
at these time points. In clinical trials it’s very often a much more linear process; here it can become quite fuzzy and iterative. So having a little bit of an agile mindset here, where, you know, you do something, you test it, you show it, you discuss it, you consider the next steps. Yeah. Makes a lot of sense. But of course that requires that you have a
39:45
very close collaboration with your stakeholders. And if you can talk to them only every two months, that can have a huge impact on the timeline. Yes, and the timelines.
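Coming back to the data-driven occupation example from a moment ago: the counting-and-pooling step can be sketched like this (the occupation values and the minimum cell size are invented):

```python
import pandas as pd

# Hypothetical occupation variable with some sparse categories.
occupation = pd.Series(
    ["nurse"] * 40 + ["teacher"] * 35 + ["farmer"] * 3
    + ["pilot"] * 1 + ["miner"] * 1
)

# First look: how many patients fall into each category?
counts = occupation.value_counts()
print(counts)

# Data-driven rule: pool categories below a minimum cell size into
# an "other" bucket before using the variable in the analysis.
MIN_N = 5
small = counts[counts < MIN_N].index
collapsed = occupation.where(~occupation.isin(small), "other")
print(collapsed.value_counts())
```

Showing both tables to stakeholders documents why the final categorization looks the way it does.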
40:01
Yeah, timelines are maybe yet another project management topic. So tell me a little bit more about your experience with timelines. So particularly in these situations where you can have these go/no-go decisions, it can impact the timelines. But what I usually do is, as I started to get more confident with the databases that I was programming with or performing statistics on, I got a feel for how long
40:29
creating different aspects takes. So for example, in the UK we have the HES database, which is Hospital Episode Statistics. And this one is unique because each row represents an experience that one doctor has with the patient. So if a patient sees multiple doctors within their visit, there can be multiple rows that contribute to their full hospitalization. So building that takes
40:59
some time. Of course you can automate it if you would like to, but there are also many different ways to create it; there’s not one certain way. So it’s really challenging, but I got a feel for how long that took to create each time, depending on which method we were going to approach it with. And then also just understanding that it’s really important to do the exposure and the outcome before creating the baseline variables. So how long will it take to create those before you can deliver your first table to showcase
41:29
your cohorts. So I think getting to know the data and how long things take is a big element of being able to estimate and add in the buffers that you need to make sure that you’re able to meet your milestones on time. Yeah. Adding buffer, I think, is really important, because it’s not a question of if something weird will happen. It’s just a question of when it will happen. Something that you just haven’t foreseen
41:59
will show up. Yeah. So always plan with buffer. It’s good guidance anyway, because even in clinical trials these kinds of things happen, but in observational research, always assume that something will be weird, something will not be as expected. So don’t plan, you know, as if everything will be going smoothly. So it’s planning for failure. Yeah. There’s always going to be
42:27
some trend that maybe you didn’t think about or include, or some dates that are wonky, or some missing data, or things that don’t occur in the order that you’re interested in. Yeah. The buffers really help to make sure you can identify those and control for them before they become a deliverable. Yeah. And always, as soon as you see something coming up, as soon as you see, oh, this may have a
42:56
considerable impact on timelines, raise it directly. That will help you build trust with your stakeholders, and everybody would rather know earlier than later about shifting timelines, because then you can still manage it. Yeah. Don’t say the day before the delivery, oh, by the way, we will be two weeks late. That doesn’t come across really nicely.
43:24
It’s not nice for anyone involved. Okay, what else on the project management side would you have liked to know sooner? So I guess going back to these critical success factors and including them as deliverables: it gives a great opportunity to communicate and document the decisions that you make. And if timelines need to flex and change a bit, this
43:51
is a good tool to come back to, to highlight why extra time is needed, or how things are going to change and why this new analysis that is now going to be included is going to require additional time. So that really helps you to communicate clearly, but also to document things in case the project gets handed off to somebody else, and to identify areas where analyses could be added or the scope is changing, but also opportunities for future studies that could
44:18
build off of one of the ideas that you had, maybe tweak it a little bit and improve it and see how that changes the outcome or the analysis. So I used to think the fewer the deliverables, the fewer things a stakeholder or the study team had to pick apart. But I realized that fewer deliverables can also equate to more opportunities for a stakeholder to feel let down.
44:45
Now I see them as part of effective communication, as a way to make sure that expectations are met and properly communicated. Yeah. It’s a really good opportunity for honing your communication skills and keeping people updated. Maybe there’s even some kind of rhythm you have with it, so that people always feel that things are under control and that they are informed. Everybody is a little bit different there.
45:14
And so, yeah, understand what the needs are. I think it’s also really important to understand what will happen with these analyses. Are there certain analyses that are more time-critical than others? Are there certain external timelines that drive things, like an abstract deadline or a submission timeline, or anything like that which you can’t easily move around?
45:43
It’s really important to understand how your study fits into the bigger picture. That will help you to ask the right questions and potentially prioritize things. It’s generally an important thing, but especially with real-world evidence, where you have so many moving parts and you need to be a little bit more agile, having the bigger picture is really vital. Yes, because even if you are
46:09
changing your timelines and they’re moving a little bit, perhaps there’s one aspect that, I don’t know, is going to be fed into an HTA submission or to a regulatory agency. Maybe that one you don’t have to move the timeline for. So that’s always a win as well. Yeah. And if you have this kind of bigger picture, you can also tailor your communication much better. Thanks so much. That was an awesome discussion. Actually, it turned
46:33
out to be a little bit longer than I expected, but lots of gold, because we talked about learnings about data in real world evidence: index date, exposure, typical problems with language, as in what “prior” really means and what “at the same time” means, and how this can have an impact. We talked about a couple of common pitfalls in terms of data:
47:01
data not being available, missing data, implausible data, inconsistent data, all these kinds of different things. And finally, we touched on managing these projects. Yeah. As we said, there’s much more need for communication, much more need for adding buffer into the plans. So overall, I think that gave you a lot of insight into how real world evidence projects, analyses, and data
47:31
are different from clinical trial analyses, if you’re coming from that side. Or if you’re coming from the real world evidence side, you can see how much easier it is on the other side. Okay, thanks so much, Rachel. Any final thoughts you want the listener to take away from this discussion? No, thank you so much for having me. And I look forward to hearing about the real world studies that others do and how we can, as a broader
48:01
group proceeding down this new trail, improve and make sure that patient lives are at the forefront. Thanks so much. Have a nice time and listen to the podcast next week.
48:19
This show was created in association with PSI. Thanks to Reine and her team at VVS who helped with the show in the background, and thank you for listening. Reach your potential, lead great science, and serve patients. Just be an effective statistician.
Join The Effective Statistician LinkedIn group
This group was set up to help each other become more effective statisticians. We’ll run challenges in this group, e.g. around writing abstracts for conferences or other projects. I’ll also post further content into this group.
I want to help the community of statisticians, data scientists, programmers and other quantitative scientists to be more influential, innovative, and effective. I believe that as a community we can help our research, our regulatory and payer systems, and ultimately physicians and patients take better decisions based on better evidence.
I work to achieve a future in which everyone can access the right evidence in the right format at the right time to make sound decisions.
When my kids are sick, I want to have good evidence to discuss with the physician about the different therapy choices.
When my mother is sick, I want her to be able to understand the evidence.
When I get sick, I want to find evidence that I can trust and that helps me to have meaningful discussions with my healthcare professionals.
I want to live in a world, where the media reports correctly about medical evidence and in which society distinguishes between fake evidence and real evidence.
Let’s work together to achieve this.




