Moving from SAS to R – Practical Tips Part 3

Welcome to the third and last part of Moving from SAS to R  (here is Part 2) where Thomas Neitmann joins us again to talk about transitioning from SAS to R.

We also discuss The Effective Statistician's new SAS to R course, and you'll learn whether it is the right course for you.

We provide a couple of learnings from the course for you to get an impression of what we cover in the course.

Click here to get to the course overview!

We also discuss the following points:

  • R is an open-source statistical programming language whose open nature enables complex collaborations between pharmaceutical companies.
  • It is possible to transition from SAS to R as long as the user is familiar with mathematical concepts and thinking required to work with data.
  • Tools like Quarto and R Markdown, and the broader ecosystem around reproducible research, can make the transition smoother and make users more efficient programmers.
  • A SAS to R course tailored for healthcare and pharmaceutical professionals may be a good resource instead of generic courses.

Interested in learning more? Check out the links and the course:

Share this link with your friends and colleagues who can benefit from this episode!

Subscribe to our Newsletter!

Do you want to boost your career as a statistician in the health sector? Our podcast helps you to achieve this by teaching you relevant knowledge about all the different aspects of becoming a more effective statistician.

Thomas Neitmann

He is an R enthusiast currently working for Swiss pharmaceutical company Roche as a Statistical Programmer Analyst for late-phase clinical trials in neuroscience indications.

His R journey began in 2014 when a coworker told him to run an R script to “analyze some data”. Having never programmed before at that time, he was overwhelmed. But he took on the challenge and soon realized the power and joy of programming.

Since then, he has learned a couple of other programming languages, including MATLAB, Python, and SAS. But his favorite is still by far R.

He enjoys sharing his knowledge and started doing so publicly on LinkedIn in late 2019. Since then, he has gone from around 300 to 7,000+ followers. Many of those encouraged him to create a blog to have a central place for all his posts. So that's what he did.

As the name suggests, this blog focuses predominantly on R, though he will occasionally cover other related topics such as Git. He also enjoys data visualization a lot, so you'll likely find some posts on that, too.

If you have a specific topic you would like him to write about, please feel free to reach out. The best option to do so is via his LinkedIn. If you are not yet connected with him, make sure to send him a request!

Transcript

Moving SAS to R – Part 3

Alexander (Host) | 00:00:01 to 00:00:16

Welcome to another episode of The Effective Statistician. And this is episode number three with Thomas about moving from SAS to R. Hi Thomas, great to have you again.

Thomas (Guest) | 00:00:16 to 00:00:17

Pleasure to be back.

Alexander (Host) | 00:00:17 to 00:00:56

Awesome. So if this is the first episode that you dial into, then I would really encourage you to go a little bit back and have a listen to the other episodes as well. In the episode today, we’ll talk more about statistical modeling and all these kind of different things. So, as you mentioned in the last episode, R was really invented by statisticians. So there’s a lot of emphasis on statistics, which I think is a little bit of a difference, for example, to Python, isn’t it?

Thomas (Guest) | 00:00:56 to 00:01:38

Absolutely. I mean, Python is now very popular for data science as well, but that is very much an add-on and not baked into the language. It was conceived as a general-purpose scripting language, and then some people liked it so much that they kind of put the statistics on top of it. I would argue, though, if you are a serious statistician, the number one language to use is still R, because it was conceived by statisticians and everything, in terms of statistics, is actually baked into the core of the language. So if you start R, you already have access to so many statistics functions, whereas in Python you still have to install some libraries, load them, and whatnot to even get to a simple correlation coefficient.

Alexander (Host) | 00:01:38 to 00:02:02

That is really good. Yeah. And so the most basic thing, so to say is doing linear regressions. How do you do that in R?

Thomas (Guest) | 00:02:02 to 00:02:28

Yeah, so in R there is a function called lm, which is short for linear model. And the first thing that you need to input when you use that function is what is called a formula. So a formula describes your model. Let's say you have a continuous variable, y, and you want to model that as a function of another continuous variable, x. The way you would write that is y, then a tilde, which is an interesting character on your keyboard, hard to find if you've never used it. But it's sort of like a little waveform.

Thomas (Guest) | 00:02:28 to 00:03:00

So you will find it if you look hard enough. And then your x, and you read that as y as a function of x. And then if you, for example, want to have an interaction of x with another predictor, you would use a star for that. So in a way, it's what is called a DSL (gosh, now I forgot the acronym), a domain-specific language to describe these kinds of statistical models in a very precise way. But yeah, the simplest example calls the lm function.

The first thing you put in is y tilde x. And then you also need to put in some data where you can find both y and x to fit the model.
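As a minimal sketch of what Thomas describes (using R's built-in mtcars data set as stand-in data, with mpg and wt playing the roles of y and x):

```r
# Fit a linear model with the formula interface: y as a function of x.
# mtcars ships with R; mpg and wt are stand-ins for y and x.
fit <- lm(mpg ~ wt, data = mtcars)
summary(fit)  # coefficients, standard errors, R-squared

# An interaction between two predictors uses the star:
fit2 <- lm(mpg ~ wt * hp, data = mtcars)  # expands to wt + hp + wt:hp
```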

Alexander (Host) | 00:03:00 to 00:03:35

Awesome. That sounds very simple. If I now want to fit something like a categorical variable into this as well, because I have a group with, let's say, treatment A and treatment B, how do I do that?

Thomas (Guest) | 00:03:36 to 00:04:05

Off the top of my head, it actually doesn't change the formula. So you would still have y as your dependent variable, then that tilde, and then you just have, for example, c as the categorical variable. If you then look into the model results, you will see that it is displayed differently, because it's not a continuous variable but a categorical one, and you get coefficients for the different levels of that categorical variable.

Alexander (Host) | 00:04:05 to 00:04:20

So you basically directly get a t-test very easily from this. Very cool.

Thomas (Guest) | 00:04:05 to 00:04:20

You can, however, also use the t.test function if that is really what you're aiming for. But yeah, I mean, a linear model is very general.
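A minimal sketch of both routes, again using mtcars as stand-in data (its am column serving as a hypothetical two-level treatment group):

```r
# A categorical (factor) predictor uses the same formula syntax.
# mtcars$am (0/1) stands in for a hypothetical treatment group A/B.
mtcars$group <- factor(mtcars$am, labels = c("A", "B"))

fit <- lm(mpg ~ group, data = mtcars)
summary(fit)  # one coefficient per non-reference level of 'group'

# Equivalent two-group comparison with the dedicated t-test function;
# var.equal = TRUE matches the pooled-variance assumption of lm().
t.test(mpg ~ group, data = mtcars, var.equal = TRUE)
```

With `var.equal = TRUE`, the p-value for the group coefficient in `summary(fit)` coincides with the one from `t.test()`, which illustrates the point that the linear model is the more general tool.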

Alexander (Host) | 00:04:21 to 00:04:58

Yeah. That builds into all the different ANOVA things that you can do with it. Now, there is one type of analysis we do quite a lot, and it is also very specific to the pharmaceutical industry: these mixed models for repeated measures, or MMRM for short. And there has been a lot of development within SAS over the years. How does that look in R?

Thomas (Guest) | 00:04:59 to 00:05:18

Yeah, that’s actually a very interesting point. So I think even one or two years ago, if people had asked you how do you do an MMRM in R, there would be somewhat of A. There is this package which does some aspects of it. There’s this other package which does another. So it was a bit all over the place.

Thomas (Guest) | 00:05:18 to 00:05:56

There was no one comprehensive, cohesive package for that, which is very different from SAS, where you have PROC MIXED and it does it all for you. And that is kind of the gold standard. But as you mentioned, this is a very popular statistical model within pharma. So being open source in nature allows for a lot of collaboration. And this is a great example of one, where different pharmaceutical companies came together and said, well, maybe we should get all these little pieces which are in different packages and put them together into one comprehensive package for mixed models, which is now this package called mmrm, which, for example, Roche is part of.

Thomas (Guest) | 00:05:56 to 00:06:42

As are many other pharmaceutical companies. And I think as of right now, it is not as wide in scope as PROC MIXED in SAS. But it is already very comprehensive, and my understanding is that it covers probably 90 or 95% of all use cases. It is also designed in a manner that makes it easily extensible. So if there's some kind of covariance structure missing that you need for your model, it would be very easy to add that to the package because, again, it's open source, it's on GitHub. And as long as you follow the package guidelines, which the authors are very happy to help you with, you can add on top.
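As a hedged sketch of what such a fit looks like, following the example shipped with the mmrm package (fev_data is its bundled demo data set, and us(AVISIT | USUBJID) requests an unstructured covariance matrix across visits within subject):

```r
# Sketch of an MMRM fit with the open-source mmrm package.
# Assumes install.packages("mmrm"); fev_data ships with the package.
library(mmrm)

fit <- mmrm(
  FEV1 ~ ARMCD * AVISIT + us(AVISIT | USUBJID),  # treatment-by-visit model
  data = fev_data                                # with unstructured covariance
)
summary(fit)  # coefficients and covariance parameter estimates
```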

Alexander (Host) | 00:06:43 to 00:07:41

Yeah, that is actually a really interesting thing. There's so much new development in terms of statistical approaches, and here with R you can react to these pretty fast. Lots of the new publications nowadays come directly with an R package, and there are more and more of these cross-company collaborations happening that establish these packages. So I, for example, know that there are areas that look into matching-adjusted indirect comparisons, or network meta-analysis, or all kinds of Bayesian things, or multiple testing procedures. All these different areas get adopted really fast.

Alexander (Host) | 00:07:41 to 00:08:34

Whereas my perception is that SAS has actually moved its focus far away from pharma, and I think you can wait for a very long time for something to happen there, and you don't really have an influence on it. Whereas here you can actually participate and drive it forward yourself. And lots and lots of the big pharma companies are joining this effort and creating things together with this community approach, which I really love. And MMRM is a great example of that. Another analysis technique that of course we need all the time is survival analysis.

And this is not just an oncology thing; you have these time-to-event analyses in all kinds of different areas. I don't know, my experience with SAS is actually not the best in that regard. What can I do in R?

Thomas (Guest) | 00:08:55 to 00:09:37

Yes, so when you freshly install R, there is actually already a package that ships with it, created by one of the R language maintainers, called the survival package. So if you want to do your proportional hazards kind of analyses, do a log-rank test, these kinds of things, that is actually very easy using that package. And I would say this is probably the single most battle-tested survival package out there. If you look at the number of citations in academic journals which cite that they use this particular package, it is an astonishing number. So I would feel very comfortable that this is a great package and gives you exactly the results you need.
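A minimal sketch with the bundled survival package (its lung data set ships with it and serves as example data here):

```r
# Common survival analyses with the survival package that ships with R.
library(survival)

# Kaplan-Meier estimate by sex on the package's example lung cancer data
km <- survfit(Surv(time, status) ~ sex, data = lung)
summary(km)

# Log-rank test comparing the two groups
survdiff(Surv(time, status) ~ sex, data = lung)

# Cox proportional hazards regression
cox <- coxph(Surv(time, status) ~ sex + age, data = lung)
summary(cox)  # hazard ratios appear as exp(coef)
```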

Alexander (Host) | 00:09:37 to 00:10:15

You mentioned one really interesting area: the mentioning of packages in publications. How does that actually work? So I hardly see any mention of a PROC whatsoever in any kind of publication. But with R it seems to be much more frequent, isn't it?

Thomas (Guest) | 00:10:16 to 00:10:59

Yes, but I think that is also something that took a while to get established, but definitely nowadays they are mentioned a lot. I think what you typically had in the past is you have your methods section and you have the subsection statistics. And then the first sentence is something along the lines of "we used software X, version Y, to do statistical analyses". And sometimes that may have been SAS, other times it was R, and oftentimes that's where people stopped. And to be honest, if you use SAS, that probably makes sense, because if you know the version of SAS, you know everything; it is a monolith where everything comes with it. With open-source languages, that actually tells you very little, to be honest, because you need much more information.

Thomas (Guest) | 00:10:59 to 00:12:08

So what people nowadays typically do is go on after they've stated, for example, "we used R version 4.2", and then state "we used package survival in version, I don't know, 3.2" or something like that, which is much more precise in terms of what tools you actually used. And also, and I think this is very important given the open-source nature of the R language, it gives credit to the authors who put in the investment to actually create the package in the first place. And they never have a commercial intent with that, because you don't charge for R packages; they're just available, open and free to use for everyone. So I think it's really good practice to acknowledge the work that these authors have done, because, yeah, they may not have invented that particular method, but implementing it is, I can tell you, hard enough, and it's not actually easy to translate the kind of formulas you typically see when a new method is developed into code.
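As a side note, R makes this easy: the built-in citation() function prints the recommended citation for a package (or for R itself), so authors don't have to guess how to credit the package maintainers:

```r
# Print the recommended citation for a package, e.g. the survival package
citation("survival")

# And for the R language itself
citation()
```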

Thomas (Guest) | 00:12:08 to 00:12:42

Because if you just naively implement it like that, it is numerically only approximate, to say the least, and also many times very slow. So implementing statistical software is actually very hard. It’s kind of a job in and of itself, which, just as a side note, is also a reason why the American Statistical Association actually has a working group on statistical engineering. So not talking about how to develop new methods or apply them, but actually how to put them into code, how to engineer that.

Alexander (Host) | 00:12:43 to 00:13:02

Yeah. There's also the AIMS special interest group that looks into this on the European side, within PSI and EFSPI. And you can find that on the PSI homepage as well.

Alexander (Host) | 00:13:08 to 00:13:46

Now, one topic here is that things can be much more transparent, easy, and, I would say, reproducible. Now, when I've done all these kinds of analyses using R, is there an easy way to know all the different packages that are used, and in which environment, and these kinds of things? How do I get all of that? Do I need to screen through all the code and search for all of that, or what do I do?

Thomas (Guest) | 00:13:48 to 00:14:24

So, if on your local machine you start a particular analysis, you follow best practices and create a new project, and then you install a couple of packages that you need, then the question is: how do I get someone else to reproduce that? Which is an extremely important question. So there is, again, a really nice R package which can help you with that, called renv, which is short for R environment. Given the code in your project, it takes a snapshot: first of all, it determines all the packages that you actually use, because you may have installed a lot more on your computer.

Thomas (Guest) | 00:14:25 to 00:14:45

Then it looks into the actual versions that are installed, and then it looks into where you actually installed them from, because that can also make a difference. And then it creates basically a JSON file, so it's plain text, which lists out all of that. And then I can give you that file. You also install this renv package. And then there's a single line you can run.

Thomas (Guest) | 00:14:45 to 00:15:20

It's called restore. And what it does is, for each entry in this JSON file, it actually installs these packages, such that, given only this single file, and provided that you do have R installed (ideally the same version as I did, but this is actually also captured in that file), it reproduces the complete environment. So this already brings you probably 95 or possibly even 99% of the way towards reproducibility. There are still some little differences: if I work on a Mac and you work on a Windows machine, unfortunately, these kinds of things do matter as well.

Thomas (Guest) | 00:15:20 to 00:15:52

So if you want to get that additional notch, you have to look into something like Docker, for example. But yeah, I would say this is a bit more on the advanced side. Just making sure that you can reproduce all the packages that you've used already brings you 95% plus of the way.
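As a quick sketch, the renv workflow described above boils down to three function calls (assuming renv has been installed with install.packages("renv")):

```r
# Sketch of the renv workflow described in this episode.
renv::init()      # set up a project-local package library
# ... install and use the packages your analysis needs ...
renv::snapshot()  # write renv.lock (a plain-text JSON file) recording
                  # each package, its version, and where it came from

# A collaborator who receives the project folder then runs:
renv::restore()   # reinstall the exact versions listed in renv.lock
```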

Alexander (Host) | 00:15:52 to 00:16:18

Yeah, and by the way, in terms of reproducibility, there's another course that we are currently developing with reproducibility expert Heidi Seibold on these kinds of things. And we'll go into Docker and all the things that you just mentioned as well.

Thomas (Guest) | 00:16:18 to 00:16:31

Yeah, and let me just say that I think if you are serious about doing any kind of data science activity, this is a subject you need to be an expert in. Because fundamentally, if you come up with a result which someone else is then not able to reproduce, I would say you failed at your job. So really, yeah, I would say take a look at that course. If Docker and such is covered, I think it's very interesting.

Alexander (Host) | 00:16:31 to 00:16:44

Yeah. And this is really kind of where trust comes from. Absolutely. This is what regulators are looking for. This is what the general public is looking for. This is what academia wants. Well, academia can be actually much better in that regard. But that’s just a side note.

Alexander (Host) | 00:16:47 to 00:17:19

When we can show kind of what we have done in a very easy and transparent way, we’ll earn a lot more trust. And of course, it makes collaboration much easier. You can see they did it this way, not that way. I find it always really hard when I read kind of statistical sections of manuscripts to really understand what was done, especially when it comes to the details. It’s kind of a nightmare.

Alexander (Host) | 00:17:19 to 00:18:13

So to reproduce exactly what was done is nearly impossible. And here, with these kinds of ways of moving forward, I think we can do much, much better. And we covered a couple of those in the first episode already, where we talked about how you can actually set up your coding environment so that you have a report, and all that you have done, directly built into your work process. And there are some really nice tools like Quarto and R Markdown which can help you with that. So this was the last episode where we cover the SAS to R course.

And in this course we go over a couple of different things. We start with an introduction to R. We go into data manipulation. We look at functions and unit testing (if you don't know what that is, listen to the previous episode), data visualization, and typical tables about demographics, AEs, lab parameters, and all these kinds of things.

Alexander (Host) | 00:18:40 to 00:19:19

And of course, what we talked about today, all these statistical modeling techniques. And this course is really designed specifically for you. Takes into account your needs as a statistician, as a data scientist, as a programmer, in healthcare, in the pharmaceutical industry. And it takes into account that you already are a SAS programmer. So we don’t kind of start with complete basics and we have very practical focus.

Alexander (Host) | 00:19:19 to 00:20:07

So it's not all theory; you'll actually see what's going on, and we cover pretty much everything that you need. Well, not everything, of course, but I think if you have gone through this course, you will have a very, very comprehensive understanding. And as you probably learned from this and the previous two episodes, Thomas is an outstanding instructor. He has instructed lots and lots of your peers, and so I can highly recommend it. Okay, Thomas, any final advice you would give to someone that jumps into R for the first time?

Thomas (Guest) | 00:20:09 to 00:20:54

Well, the most important advice, I think, is: if you are someone within our industry and this is something you're interested in, do sign up for the course. That being said, outside of that, I think it's very important that whenever you make a commitment to actually pick up R and learn it, make sure to then apply it to your day-to-day work, because otherwise you will learn it, and a couple of weeks later you will have forgotten it. And that is not the way to make learning stick. So if you know that right now you're working on a particular data visualization, maybe skip the data manipulation part and jump straight into that, learn how you can create it, and then actually take what you learned, apply it to your project, and produce that scatter plot or whatever it is.

That way your learning gains will be 10x, probably 100x. It is the best thing you can do: actually apply what you learn to your day-to-day job as a statistician, statistical programmer, or data scientist.

Alexander (Host) | 00:21:20 to 00:21:44

Awesome advice. Also, it is basically embedded in your day to day working activities, which makes it much easier.

And also you really have a reason. It’s not kind of a theoretical in the evening or on the weekend learning. It’s actually something where you directly have vested interest. Awesome. Thanks so much, Thomas, for these three episodes and for all those who sign up in the course. You’ll learn a lot more from Thomas there.

Thomas (Guest) | 00:21:44 to 00:21:48

Thank you so much, Alexander.

Never miss an episode!

Join thousands of your peers and subscribe to get our latest updates by email!

Get the shownotes of our podcast episodes plus tips and tricks to increase your impact at work to boost your career!

We won't send you spam. Unsubscribe at any time. Powered by ConvertKit