The FAIRification Of Research In Real-World Evidence: A Practical Introduction To Reproducible Analytic Workflows Using Git And R

Dr. Alexander Schacht

How can you ensure your data and analytic workflows are reproducible and transparent?

What are the FAIR principles, and why are they crucial for real-world evidence research?

How did a pharmacist and epidemiologist become an expert in real-world data analytics?

In this episode, we explore the practicalities of creating reproducible analytic workflows using Git and R with our special guest, Janick Weberpals. As an instructor in medicine at Brigham and Women’s Hospital and Harvard Medical School, Janick shares his journey from pharmacist and epidemiologist to an expert in real-world data analytics and methodology.

He highlights the critical importance of reproducibility in statistical programming and explains how the FAIR principles—making data and code Findable, Accessible, Interoperable, and Reproducible—can transform research practices.

This episode is a must-listen for anyone involved in real-world evidence research, offering hands-on insights and step-by-step guidance to ensure your work is robust and transparent.

Tune in to learn how to harness the power of Git and R for your own projects, ensuring that your data and results are both reliable and reproducible.

Key points:

Reproducible Analytic Workflows
FAIR Principles
Challenges and opportunities in Real-World Evidence (RWE)
Git and R Integration
Janick Weberpals’ Journey
Best Practices in Coding
Collaboration and Accessibility
Implementation Steps

This episode offers invaluable insights into the importance of reproducible analytic workflows and the application of the FAIR principles in real-world evidence research. By integrating tools like Git and R, researchers can enhance transparency, collaboration, and the reliability of their findings.

If you believe your friends and colleagues can benefit from these practical tips and expert guidance, please share this episode with them. Let’s spread the knowledge and help advance the field of data science together!

Janick Weberpals

Instructor in Medicine at Harvard Medical School

Janick Weberpals is a passionate Health Data Scientist, pharmacist, epidemiologist, and R enthusiast with over 9 years of experience in both industry and academia. His expertise lies in evaluating the real-world evidence of medical interventions, with a particular emphasis on utilizing machine and deep learning techniques. Janick’s work focuses on combining and leveraging multimodal, high-dimensional routine-care healthcare databases (“real-world data”) to enhance causal inference in oncology and other disease areas.

For more information, visit his website: https://janickweberpals.github.io

Transcript

[00:00:00] Alexander: Welcome to another episode of The Effective Statistician. Today I’m super happy to actually have a listener on the show. It’s always great when people who benefit from the podcast also give something back to the community. With that, I would like to introduce you to Janick Weberpals. So Janick, tell us a little bit about your career and what you’re up to now.

[00:00:32] Janick: Hi, Alex. It’s great to be here. So my name is Janick. I’m an instructor in medicine at Brigham and Women’s Hospital and Harvard Medical School. By training, I’m a pharmacist and epidemiologist with a particular interest in real-world data analytics and methodology. So it’s probably unusual for you to have someone on the podcast who’s a pharmacist.

[00:00:54] Janick: My interest in statistics, data science, and statistical programming actually came from working in practice. At some point, when I was at the end of my pharmacy training, I was really wondering: how can I quantify treatment effects, beyond the pharmacology and biological basics I learned? And how can I determine whether drug A or drug B is the better and safer choice for the patient in front of me?

[00:01:20] Janick: So at some point I started teaching myself R programming and also did a research stay in the US, and this is how I got in touch with the field of pharmacoepidemiology, which is basically the academic discipline behind real-world evidence and real-world data. So I decided to do a PhD at the German Cancer Research Center, then did a postdoctoral fellowship in data science and early drug development at Roche, and then also worked for some time in industry.

[00:01:49] Janick: And then in 2022 I decided to come back to academia, and here I’m now focused on real-world evidence questions and methodological questions in the area of oncology.

[00:02:03] Alexander: Yeah. And we already had a colleague of yours on the show a couple of months ago, no, maybe already a year ago: Shirley Wang, who talked about RCT Duplicate. That project is about how you can potentially recreate clinical trials and replicate them in real-world evidence data. And you’re now doing something similar in the oncology space, which is very, very interesting.

[00:02:34] Alexander: All these different therapeutic indications definitely have big differences in terms of how they replicate in claims databases. I’ve seen that already from the other RCT Duplicate papers. So that is really, really interesting work to help us better understand what is possible in which indications, and where the strengths and limitations are.

[00:03:02] Alexander: So please continue to do these kind of things. It’s absolutely fascinating.

[00:03:07] Janick: Yeah, absolutely, I fully agree. And as I said, in these different therapeutic indications and areas you get all sorts of different problems. The first RCT Duplicate was more about cardiovascular disease and diabetes, which you can study pretty well in claims data.

[00:03:29] Janick: But when you move to oncology, you really need access to special information like biomarkers and electronic health records. And then you have all the problems like missing data. So it’s another type of problem set that you have here.

[00:03:43] Alexander: Yep, definitely. So I look forward to maybe speaking about that in a future episode, because today we want to talk about much more fundamental problems, all around programming, transparency, and reproducibility. I have already had a couple of different guests on the show on this, and today I want to talk with Janick because he actually provides some real hands-on experience.

[00:04:17] Alexander: So when it comes to reproducibility, why is that actually important?

[00:04:28] Janick: Yeah, I mean, that’s a very big question that you’re asking. So Shirley Wang and I published a paper together that introduces a specific software called Git, which is really the Swiss Army knife for reproducibility when you do hands-on coding and implementation of statistical programming.

[00:04:52] Janick: And we had several motivations for that paper. First of all, there are guidelines for study pre-specification, like protocol templates, which are very important. And then there are also reporting guidelines like STROBE and RECORD and so on. But there is not really anything about implementation best practices for when you implement a study, even though there are so many things that can go wrong.

[00:05:21] Janick: And there’s really a need for better transparency and reproducibility of code. So in the most ideal setting, you either provide, for example, a software package, like an R package, that comes as a packaged tool to reproduce certain analyses all over again. Or, if you did analytics on a specific study, the most ideal way is that you could basically just use the same code and redo the analysis all over again without changing anything.

[00:05:50] Janick: And that’s really the problem set. And I think, especially in pharmacoepidemiology and real-world evidence, we’re still far away from that. So I was at a conference last year, and a group there did a systematic review on code sharing, and they found that just 5 percent of all real-world evidence studies actually share code.

[00:06:10] Janick: So we really don’t know what’s going on behind the scenes. You can write a lot of things in the protocol, and the protocol is very important, but what it really comes down to is how you program it. You can write it one way in the protocol, but whether you really implement it the same way, that’s the big question that we’re all after.

[00:06:36] Alexander: Yeah, and especially with real-world evidence. Real-world data is so messy, and there are so many different ways you can approach it. And then you want to replicate your analysis because, especially with real-world evidence, you have new data, you have more data, you have updated data.

[00:06:58] Alexander: Then you want to do it again. And maybe you have left the institution, and now someone else needs to pick that up. Or there is another institution that works on the same data set and wants to make some modifications. And in order to modify something, you need to start with your baseline.

[00:07:28] Alexander: So you first start by replicating the same things. And if you have ever tried to do that on a clinical trial, that is already hard. On real-world evidence, it’s way more complicated, because very often you have much messier data, and not only messier, but simply more data. So that also makes it much more complex.

[00:07:55] Alexander: So in the paper that you wrote with Shirley, you call it the FAIRification of research in real-world evidence, with FAIR in capital letters. So can you quickly speak about what FAIR stands for?

[00:08:16] Janick: Yeah, definitely. So FAIR actually comes from data: data should be findable, accessible, interoperable, and reproducible. And we thought that we would extend that not only to data, but also to the code that you use to produce a certain statistical output. Because all the code should be findable, accessible, interoperable, and reproducible, so that the intention of the study that you wrote in the protocol is really reproducible.

[00:08:46] Janick: And that actually starts with how you, for example, label and name the programs where you coded up your analysis. As a PhD student, I often found that things were labeled like “main analysis one”, “main analysis two”, “main analysis final”. And then the question is really: which is the actual script that produced the output?

[00:09:11] Janick: I don’t know. And this is just where it starts. This is not even going into the topic of Git and having a track record, or an audit trail, of the changes that you made, but just: which is the actual program that produces the output that I want?

[00:09:29] Alexander: Yeah, so findable is the first thing. Most statisticians in pharma will say, well, that’s easy: you have the tables and the footnotes, and the footnotes point to the source, the path, and so on.

[00:09:52] Alexander: The problem, of course, is that this path is not readily available to lots of people. And these tables will then be used further, for example in publications and slides and all kinds of different things. Then people will use these slides to create other slides. And from these other slides, they will create further material, like promotional material, and from the promotional material another publication is done, or whatever.

[00:10:34] Alexander: So very often these things travel through multiple layers. And later on, even if you don’t have a footnote that tells you that this is in folder XYZ, you still want to find this stuff. So just relying on “oh yeah, we have these tables with the footnotes” is not enough. And if you work with CROs, it becomes even harder. So making sure that your code is really findable makes a huge difference. Then your results are findable too, because with the code usually come the results. And that’s yet another thing.

[00:11:35] Alexander: So that is the F. Can you repeat what the A stands for?

[00:11:42] Janick: Yeah. So that’s accessible. That means that the code is not siloed, maybe on a personal drive or somewhere where no one else has access to it, but that all of the things that are important to reproduce your study are compiled into one place, what we often call a repository.

[00:12:04] Janick: And that repository can then be found in a remote, for example in the cloud. So there are cloud remote repositories that you can connect your code to and synchronize it with. Often this is GitHub, GitLab, or Bitbucket. These are the three big players that most people use.

[00:12:25] Janick: And those are the cloud instances that you can synchronize your code with, so that everyone really has access to this one central repository where all of the important things are stored.

[00:12:37] Alexander: So that means, for example, in the case of R, also any macros and packages, and please forgive me if I’m not using the right names because I’m not an R programmer, all of these other things that help you to run the code are also in there.

[00:12:56] Janick: Yeah, exactly. So when I implement a study, for example, something I also have there is the study protocol. I have all of the scripts that I use to clean the data and to do the descriptives, the main analyses, and the sensitivity analyses. I often also have all of the table outputs and, for example, the figures stored there.

[00:13:15] Janick: I have general information about the project, like the overall scope, the IRB information, and the instructions on how to install the software, for example the R packages and the corresponding versions that I used to run the analyses. Everything that is required so that someone else could very easily grab that code and reproduce my analyses.
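
To make the list above concrete, here is a minimal shell sketch of what such a study repository might look like. The folder and file names (protocol, scripts, output, the numbered script names) are illustrative assumptions and are not taken from the paper; the R files are empty placeholders.

```shell
#!/bin/sh
# Hypothetical layout for a study repository; all names are illustrative.
set -eu

project="$(mktemp -d)/rwe-study"
mkdir -p "$project"
cd "$project"
git init -q

# One folder per kind of artifact mentioned in the episode.
mkdir -p protocol scripts output/tables output/figures

# Numbered script names make the run order explicit.
for script in 01_clean_data 02_descriptives 03_main_analysis 04_sensitivity_analyses; do
  printf '# %s\n' "$script" > "scripts/$script.R"
done

# The README holds scope, IRB information, and setup notes (placeholders here).
cat > README.md <<'EOF'
# Study title (placeholder)

- Scope: to be filled in
- IRB information: to be filled in
- Setup: R version and package versions used for the analysis
EOF

# Git does not track empty folders, so add placeholder files.
touch protocol/.gitkeep output/tables/.gitkeep output/figures/.gitkeep

# List every file outside the .git folder.
find . -path ./.git -prune -o -type f -print | sort
```

A single descriptive name per script, with the history kept in Git, avoids the “main analysis final” problem mentioned earlier in the episode.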

[00:13:44] Alexander: So that is really cool that you have all the metadata there, the data about the data, so that it actually makes sense and you can understand it. Awesome. Let’s go to the I.

[00:13:58] Janick: Right. Interoperable. That is something that is maybe a bit harder to do in terms of data analysis, but how we defined it in the paper was that the code is written in a way that is easily understandable, and that someone else can really do the analysis without adjusting much of the code, or any of the code. For example, something that is often a first hurdle is that one defines absolute paths.

[00:14:32] Janick: For example, if you want to grab some data from an SQL server or from your hard drive or wherever the data is stored, and then you have the path to the data that is specific to your computer. That should not be the case. So in the paper we describe that this path should be a relative path, so that it can adapt to the user’s environment and the code doesn’t break at this very point.
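
As a rough illustration of the point about paths, the sketch below builds a file path from the repository root instead of hard-coding a location on one computer. It uses `git rev-parse --show-toplevel`, which prints the top-level folder of the current repository. The file `data/cohort.csv` and the commented-out absolute path are made up for the example; this is one possible approach, not necessarily the one described in the paper. R users often reach for the `here` package for the same purpose.

```shell
#!/bin/sh
# Sketch: resolve paths from the repository root (file names are hypothetical).
set -eu

project="$(mktemp -d)/rwe-study"
mkdir -p "$project/data" "$project/scripts"
cd "$project"
git init -q
printf 'id,age\n1,64\n' > data/cohort.csv

# Fragile: an absolute path that only exists on one person's computer.
# data_file="/Users/alice/projects/rwe-study/data/cohort.csv"

# Portable: ask Git where the repository root is and build the path from it.
# This works from any subfolder of the repository.
cd scripts
root="$(git rev-parse --show-toplevel)"
data_file="$root/data/cohort.csv"

head -n 1 "$data_file"
```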

[00:15:01] Alexander: Yep, that is awesome. I think it also means that, for example, the results are stored in such a way that you can easily work further with them. For me, that means, for example, that your summary statistics are not just stored in a PDF, but also in a data set. So if you want to use these for, I don’t know, comparing them to your next, updated analysis, you don’t need to scrape them or manually type them in again or something like that.

[00:15:41] Alexander: Exactly. So it’s a really bad habit in our industry that we store summary statistics in PDFs or RTFs or anything like that, and not in well-documented data sets that are much easier to use. Okay. Let’s close with the R of FAIR.

[00:16:04] Janick: Yeah. Well, the R is basically everything that this is about, which is the reproducibility.

[00:16:09] Janick: That brings everything together, so that you can go to the remote repository, you can clone the repository onto your local machine, you can run the code, and then you hopefully get the exact same numerical results as someone else. That is pretty cool.
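
The clone-and-rerun idea can be sketched end to end on one machine. In the hedged example below, a local folder plays the role of the remote repository, and a tiny shell script with made-up numbers stands in for the real analysis (in practice this would be R code); the comparison uses a checksum of the results file.

```shell
#!/bin/sh
# Sketch: clone a repository, rerun the code unchanged, compare the results.
set -eu

work="$(mktemp -d)"

# "Author" side: a repository with data and a deterministic analysis script.
mkdir "$work/original"
cd "$work/original"
git init -q
git config user.name "Example Author"
git config user.email "author@example.org"
printf '3\n5\n8\n' > values.txt
cat > analysis.sh <<'EOF'
#!/bin/sh
# Stand-in analysis: mean of the values, written to a results file.
awk '{ s += $1 } END { printf "mean=%.2f\n", s / NR }' values.txt > result.txt
EOF
git add values.txt analysis.sh
git commit -q -m "Add data and analysis script"
sh analysis.sh
original_sum="$(cksum < result.txt)"

# "Reader" side: clone the repository and rerun the same code.
# result.txt was never committed, so the clone has to regenerate it.
git clone -q "$work/original" "$work/clone"
cd "$work/clone"
sh analysis.sh
clone_sum="$(cksum < result.txt)"

if [ "$original_sum" = "$clone_sum" ]; then
  echo "results match"
else
  echo "results differ"
fi
```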

[00:16:26] Alexander: Yeah. And I already had Heidi Seibold on the show, who talked about a couple of different things around reproducibility, so you can also scroll back a little bit. And what you did in your paper is provide step-by-step guidance through, I think, something like 10 steps, so that you can use Git and R on a real-world study effectively.

[00:17:00] Alexander: So it starts with installing Git, which is probably a pretty straightforward thing. And then it goes through a couple of further steps. Which steps do you think make the biggest difference, if you don’t miss them?

[00:17:27] Janick: Yeah, that’s a pretty good question. So in the paper, we really just covered the very basics needed to create a baseline reproducible workflow, because Git is such a powerful software. I work with Git on a daily basis, and I would not dare to say I know everything about Git. I think you can go really, really deep into Git and what you can actually do with it.

[00:17:55] Janick: Because it was set up, first of all, as software to manage the development of Linux, the operating system. That was the birth of Git, what it was initially developed for, and then it expanded to other areas. And I think before going into the different steps, it’s also good to know what Git actually is. Git is a distributed version control system, which is basically a time machine, if you will, for the code that you’ve developed and the changes that you made to it.

[00:18:25] Janick: At every single time point you can take snapshots, and the frequency of the snapshots, which in Git we call commits, is up to you. There are different philosophies or approaches to it.

[00:18:46] Janick: I usually handle it so that I commit early and often, so that I have a very granular audit trail of the changes that I made to my code. And if there’s a very fundamental version that I want to snapshot, you can do additional things like create branches or add a tag to it.

[00:19:08] Janick: So you know exactly that this is the version that I used, for example, to produce the results that I submitted as an abstract to a conference, or that I used for the first version of the manuscript that I sent to a journal. And in industry, you could say this is the snapshot, or the tag, that I used as the version that I sent to FDA or EMA, things like that.

[00:19:34] Janick: And so this is the basic workflow: you make changes to your files and you stage them. Staging means you determine which changes should be included in this snapshot or commit. Then, in the commit, you create this local snapshot of the changes, with a short informative message so you know what was changed. And the beautiful thing about Git is that afterwards, for example on GitHub or GitLab, you can really see the additions and deletions that you made to the code and everything that changed correspondingly. And then you can push this to a remote repository, be it GitHub, GitLab, or Bitbucket.
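
The workflow just described (change, stage, commit, tag, push) maps onto a handful of Git commands. The sketch below is a minimal, self-contained version: a bare repository in a temporary folder stands in for GitHub, GitLab, or Bitbucket, and the file name, commit message, and tag name are illustrative assumptions.

```shell
#!/bin/sh
# Sketch of the basic Git workflow; names and messages are illustrative.
set -eu

work="$(mktemp -d)"

# A bare repository on disk stands in for a remote such as GitHub or GitLab.
git init -q --bare "$work/remote.git"

mkdir "$work/study"
cd "$work/study"
git init -q
git config user.name "Example Analyst"
git config user.email "analyst@example.org"
git remote add origin "$work/remote.git"

# 1. Change files in the working directory.
printf 'fit <- "placeholder for the main model"\n' > main_analysis.R

# 2. Stage: choose which changes go into the next snapshot.
git add main_analysis.R

# 3. Commit: record the snapshot locally with a short, informative message.
git commit -q -m "Add main analysis script"

# 4. Tag: mark the version behind a specific deliverable.
git tag -a abstract-submission -m "Results submitted as conference abstract"

# 5. Push: send the commit and the tag to the remote.
git push -q origin HEAD
git push -q origin abstract-submission

git --no-pager log --oneline --decorate
```

Tagging the commit behind each deliverable means the exact code for an abstract, manuscript version, or submission can be checked out again later by name.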

[00:20:20] Janick: And then if you work collaboratively with someone on the code, you also want to make sure that you often pull or fetch the changes that someone else may have made to the code, so everything is in sync. So you can say that Git is like the track changes feature that you have in Word, just that you have it longitudinally, so all the changes over time and not just one-time changes. And GitHub or GitLab is basically like a Dropbox or Google Drive where the tracked changes are stored.
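
A small sketch of the collaboration loop, assuming two hypothetical collaborators (“alice” and “bob”) and a bare repository on disk in place of a hosted remote. The branch name `main` is set explicitly so the example does not depend on local Git defaults.

```shell
#!/bin/sh
# Sketch: two hypothetical collaborators sharing one remote repository.
set -eu

work="$(mktemp -d)"

# A bare repository on disk stands in for GitHub, GitLab, or Bitbucket.
git init -q --bare "$work/remote.git"
git --git-dir="$work/remote.git" symbolic-ref HEAD refs/heads/main

# Alice starts the project and pushes the first commit.
mkdir "$work/alice"
cd "$work/alice"
git init -q
git symbolic-ref HEAD refs/heads/main
git config user.name "Alice Example"
git config user.email "alice@example.org"
git remote add origin "$work/remote.git"
printf 'step 1\n' > analysis.R
git add analysis.R
git commit -q -m "Start analysis"
git push -q origin main

# Bob clones the shared repository, adds a step, and pushes it.
git clone -q "$work/remote.git" "$work/bob"
cd "$work/bob"
git config user.name "Bob Example"
git config user.email "bob@example.org"
printf 'step 2\n' >> analysis.R
git commit -q -a -m "Add second step"
git push -q origin main

# Alice pulls before continuing, so her copy includes Bob's change.
cd "$work/alice"
git pull -q --ff-only origin main
cat analysis.R
```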

[00:20:57] Alexander: Yeah, and you can work with a couple of different people at the same time on the same areas. You can also do this with teams and things like that. So that’s pretty awesome, and pretty fundamental to coding, because you can see every step. You can also see who did it, can’t you?

[00:21:23] Janick: Yeah, exactly. Who and when the changes were done.
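
The “who and when” lives in the commit history and can be listed with `git log`. The sketch below creates two commits under two made-up author names and prints one line per commit; the file content and messages are illustrative.

```shell
#!/bin/sh
# Sketch: list who changed what and when (authors and content are made up).
set -eu

cd "$(mktemp -d)"
git init -q
git config user.email "team@example.org"

printf 'cohort <- "adults"\n' > cohort.R
git add cohort.R
git -c user.name="Alice Example" commit -q -m "Define cohort"

printf 'cohort <- "adults with at least one year of follow-up"\n' > cohort.R
git -c user.name="Bob Example" commit -q -a -m "Require one year of follow-up"

# One line per commit, newest first: short hash, author, date, message.
git --no-pager log --pretty=format:'%h | %an | %ad | %s' --date=short
```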

[00:21:27] Alexander: Yeah. So that is how it works with Git, and I think this is a huge benefit. By the way, in the paper there are a couple of really nice graphics, screenshots, and other ways to show what this actually looks like in practice.

[00:21:48] Alexander: So if you don’t have any experience in that regard, there is this very detailed step-by-step guidance. So this is pretty awesome. Now, to learn about this you can, of course, read the paper. What else could you do to get better at these kinds of things?

[00:22:10] Janick: And I think that’s also a very good question, because at universities, or wherever you usually get your education, there are not really courses dedicated to Git or version control.

[00:22:23] Janick: So I learned it more by myself. I was more or less self-taught, with different things that you can find online and also while I was on the job. There are a couple of courses, papers, and tutorials that you can find all over the internet, and we actually compiled a non-exhaustive but still comprehensive list of, for example, Coursera courses, edX courses, and other courses that explain Git from scratch up to a very advanced level.

[00:23:03] Janick: And you can do that at your own pace. You can also find a lot of tutorials online. Whatever works best for you, whether you’re more the visual type or the reading type, I think there’s something online that works for everyone. And in the supplementary part of our paper, we compiled a comprehensive list of courses that we would recommend.

[00:23:24] Alexander: That is awesome. And this supplementary part is hopefully available for free, or is that behind a paywall?

[00:23:33] Janick: Ah, that’s a good question. I would need to check myself, but we have also made the repository that we actually used to write the manuscript openly accessible, and you will also find the supplementary material in there.

[00:23:46] Alexander: Okay, very good. Then we will definitely put a link to this in the show notes so that you can easily find it. Thanks so much, Janick, for talking about reproducibility, about the specific problems of creating something that is FAIR, and about what FAIR actually means. And I highly recommend that you go and check out this paper, which goes through all these different steps in great detail. As Janick mentioned, you can very easily access it via the link in the show notes. So, Janick, that is awesome. What is the biggest benefit that you have seen from people using these kinds of reproducible workflows?

[00:24:39] Janick: For myself, when I think about my development in statistical programming, a benefit that I really had was looking at other people’s code. Especially if you work in the statistical or data science area, you read about a new methodology that you would really like to implement in your own study.

[00:24:59] Janick: The question is just: how do you do it? You can, of course, just do it based on the theory that you read in the paper, but then you never know if you really implemented it correctly. If people provide code to implement the new methodology, then you can basically just grab the code and implement it in your own study.

[00:25:21] Janick: And that is the largest benefit that I found for myself: if someone publishes code along with the methodology, it’s much easier to apply that methodology to another research question or topic. So that would be one part of the answer.

[00:25:47] Janick: The other thing that I find quite beneficial is that sometimes you may also mess up your own code while you develop the code for your statistical analysis, and then you are pretty safe if you have everything in an audit trail using Git. You can always go back to the previous version and see what changed, and you can basically reconcile everything again.

[00:26:04] Janick: Also, if you made a change, for example after the review of a manuscript comes back, or comments from a regulatory authority, and you implement the changes, you can very easily see how the changes in your code are reflected in changes in the results. And you can also make sure that every time you rerun your code, you get the same results.
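
Going back to a previous version and seeing what changed can be done with `git diff` and `git checkout`. The sketch below is a minimal example with a made-up settings file: it records a mistaken edit, shows the difference between the two snapshots, and then restores the earlier version as a new commit, so the audit trail keeps both the mistake and the fix.

```shell
#!/bin/sh
# Sketch: inspect a change and restore an earlier version (file is made up).
set -eu

cd "$(mktemp -d)"
git init -q
git config user.name "Example Analyst"
git config user.email "analyst@example.org"

printf 'alpha <- 0.05\n' > settings.R
git add settings.R
git commit -q -m "Set significance level"

# An edit that turns out to be a mistake.
printf 'alpha <- 0.5\n' > settings.R
git commit -q -a -m "Update settings"

# See exactly what changed between the two snapshots.
git --no-pager diff HEAD~1 HEAD -- settings.R

# Bring back the earlier version of the file and record the fix.
git checkout HEAD~1 -- settings.R
git commit -q -m "Restore significance level"

cat settings.R
```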

[00:26:27] Alexander: These are awesome benefits just for yourself, and that’s not even talking about the benefits for everyone else who will need to work with your code. So thanks so much, Janick. All the best.

[00:26:39] Janick: Thank you so much, Alex. It was a pleasure to be here.

I want to help the community of statisticians, data scientists, programmers and other quantitative scientists to be more influential, innovative, and effective. I believe that as a community we can help our research, our regulatory and payer systems, and ultimately physicians and patients take better decisions based on better evidence.

I work to achieve a future in which everyone can access the right evidence in the right format at the right time to make sound decisions.

When my kids are sick, I want to have good evidence to discuss with the physician about the different therapy choices.

When my mother is sick, I want her to have access to the evidence and be able to understand it.

When I get sick, I want to find evidence that I can trust and that helps me to have meaningful discussions with my healthcare professionals.

I want to live in a world, where the media reports correctly about medical evidence and in which society distinguishes between fake evidence and real evidence.

Let’s work together to achieve this.