This episode is the second part of Moving from SAS to R with Thomas. (Here is Part 1)
We also discuss the new SAS to R course of The Effective Statistician, and you’ll learn whether this is the right course for you.
We provide a couple of learnings from the course for you to get an impression of what we cover in the course.
We also discuss the following points:
- Installing R and R Studio
- Data Exploration Techniques
- Data Manipulation Packages
- Macros vs. Functions
- Use of Unit Testing
Thomas is an R enthusiast currently working for the Swiss pharmaceutical company Roche as a Statistical Programmer Analyst for late-phase clinical trials in neuroscience indications.
His R journey began in 2014 when a coworker told him to run an R script to “analyze some data”. Having never programmed before at that time, he was overwhelmed. But he took on the challenge and soon realized the power and joy of programming.
Since then, he learned a couple of other programming languages including Matlab, Python, and SAS. But his favorite is still by far R.
He enjoys sharing his knowledge and started doing so publicly on LinkedIn in late 2019. Since then he went from around 300 to 7000+ followers. Many of those encouraged him to create a blog to have a central place for all his posts. So that’s what he did.
As the name suggests this blog focuses predominantly on R. He will occasionally cover other, related topics such as git, though. He also enjoys data visualization a lot so you’ll likely find some posts on that, too.
If you have a specific topic you would like him to write about, please feel free to reach out. The best option to do so is via his LinkedIn. If you are not yet connected with him, make sure to send him a request!
Moving from SAS to R, Part 2
[00:00:00] Alexander: Welcome to another episode of The Effective Statistician. This is the second episode with Thomas about Moving from SAS to R. Welcome again, Thomas.
[00:00:13] Thomas: Great to be back. Thanks for having me.
[00:00:15] Alexander: So in the last episode, we talked a little bit about why to actually dive into R, what some core differences between R and SAS are, how to get going with R, all these kinds of different things. And if you have missed that, well, just skip back a little bit in your player and dive into that episode first.
We have a course, the SAS to R course, that we have specifically designed for you as statisticians, programmers, and data scientists in the pharmaceutical industry, because there are a lot of specific things in terms of clinical trials that really apply to us and our processes, and it’s really valuable to learn directly about these specific things rather than looking into some more generic stuff.
So, when I started with SAS, one of the things that I really loved doing was setting up macros, because I know something will happen, something will change, and if you have copied and pasted your code five times and you then need to change it five times, that is so tedious and error-prone. And so I learned that whenever you do something more than once, you should probably write a macro for that. Now, in R these are called functions. What are the differences?
[00:01:59] Thomas: That is a very good question, and I don’t want to get too technical at this point, but actually there is a fundamental difference, because what a SAS macro does is, in a way, it writes code for you. So when the macro gets expanded, it just puts out a lot of text, basically.
Whereas an R function is different in the sense that whatever you feed in is then in its own little bubble, the environment of the function. So it doesn’t actually affect your global state. And the only thing that gets out of the function is the result at the end of the day.
Whereas in a SAS macro, if you create 10 intermediate datasets, you will see them at the end of executing the macro unless you have explicitly taken care of removing them, which you don’t have to do with a function. So you could create a hundred datasets inside a function. If you only return a single dataset at the end, that is the only result you will get.
So I think this fundamental difference is something many people need a bit of time to actually understand, and we will learn about that in the course. But again, the fundamental idea that you mentioned, that this is about trying to avoid repetition, very much is true for both cases.
So, you said if you do it more than once, put it in a macro. You could say the same about functions: if you do it more than once, put it in a function. It is definitely all about making you more efficient and also making it easier to change code, because if you imagine that you have the same block of maybe 10 lines multiple times in the script, well, if you change a little bit in one of those blocks, you have to actually remember that there are other blocks and then make sure to update them too, so it’s very error-prone.
Whereas if it’s a single function, it has this one definition. If you make the change there, it will automatically propagate to wherever you need it. Really functions or macros for that matter are the superpower of every programmer. I would say it’s something you really, really want to get deeply familiar with. And we will make sure that after the course, you will have the skills to write your own functions.
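To make the "own little bubble" point concrete, here is a minimal sketch in base R. It is not taken from the course; the function name `summarise_change`, the column names, and the numbers are made up for illustration. The function builds two intermediate data frames, but only the returned one is visible afterwards.

```r
# Hypothetical helper: name, columns, and data are assumptions for illustration.
summarise_change <- function(data) {
  # Intermediate objects live only in the function's own environment.
  with_change <- transform(data, chg = aval - base)
  by_arm <- aggregate(chg ~ arm, data = with_change, FUN = mean)
  names(by_arm)[2] <- "mean_chg"
  by_arm  # the last expression is the value returned to the caller
}

# Made-up example data: baseline and post-baseline values in two arms.
adlb <- data.frame(
  arm  = c("A", "A", "B", "B"),
  base = c(10, 12, 11, 13),
  aval = c(12, 13, 11, 17)
)

result <- summarise_change(adlb)
print(result)

# The intermediates never reached the calling environment:
print(exists("with_change"))  # FALSE
print(exists("by_arm"))       # FALSE
```

In a SAS macro, the equivalents of `with_change` and `by_arm` would typically remain in the WORK library until they are deleted explicitly.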
[00:03:56] Alexander: That brings me back to a story from about 20 years ago, when I was in my first job in the industry and I was working with this very, very talented programmer on one team.
On the other teams there were lots of programmers, but none of them were as good as this expert programmer. And so we had this discussion about, oh, we need to change something. And the programmer was in the call as well. And I was saying, oh, we need to change this and this and this.
And he was doing it as we were talking about it. And then there was a question at the end from the physician who was responsible for these studies: How long do you need to implement that? And we were silent for a second, because, what do you mean? Well, how long does it take to change the analysis?
We already did this. And this was because it was nicely set up, and these hundreds of different tables were all calling the same macros. So when we changed something in one place, it was changed in all places. Whereas the other team was working very, very differently.
Every programmer had his own setup and was working more or less independently from the other programmers on their tables. So they needed to change things in a thousand places. And of course, something always got lost. And so where we changed it instantly, for them it took ages compared to our approach.
So yes, functions are really, really great. The other thing is, if you have ever tried to simulate something using macros, then you directly get into this problem that Thomas just described, because SAS will actually generate all these datasets. And more than once I actually crashed something because some kind of tensile got too big or whatever.
This thing so easily happens with simulations. And so that was always a pain to take care of. And you really need to start small, with 10 or 100 iterations, just to check whether you have somewhere forgotten to delete some datasets.
And also, of course, it takes a lot of time and space. So that’s one of the nice things about functions in R. My understanding is there are basically two kinds of functions: there are the functions that are published, and you can write your own functions. Now if you write your own functions, what are all the different things that you should take care of? What are the different levels, so to say, of function validation and these kinds of things?
[00:07:22] Thomas: Yes, so maybe we can just start with the initial skeleton you would need to have. Any function, if you wanna reuse it later, needs to have a name. Then there’s a special keyword to indicate that this is now a function, and then you start listing the input parameters that you would like to use.
In the simplest case, you could have a function which takes no parameters, which basically means it always returns the same result. But that’s rarely what you need. Typically, you have some kind of input parameters. You should strive for actually having few of those; that makes the function easier to reason about.
But we all know at the end of the day, it needs to grow as complex as it needs to, to simulate the real world in a way. And then inside what is called the body of the function, you take those parameters that have been put in and do some kind of calculations on them to get whatever it is that you need.
It can be very different things: creating a new dataset, creating a simulation as you had, creating a data visualization, writing out some files to disk, whatever. And then the last step is that you return a meaningful result. So, if that is a data visualization or a data frame, that is what you would return.
If you write out something to a file, it’s not actually so clear what you should return. But for example, you could return the path to the file you just created. So that’s the structural part. But then I think you made a point about how you actually go about writing these functions.
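The skeleton described above (a name, the `function` keyword, input parameters, a body, and a returned result) could look roughly like this in base R. The function names, columns, and data are hypothetical; the second function shows one way to follow the suggestion of returning the path of a file that was written.

```r
# name          keyword    input parameters (digits has a default value)
mean_by_arm <- function(data, digits = 1) {
  # body: compute with the inputs
  means <- tapply(data$aval, data$arm, mean)
  # result: a data frame, returned as the last expression
  data.frame(arm = names(means), mean_aval = round(as.numeric(means), digits))
}

# A function whose main job is a side effect (writing a file) can still
# return something useful: here, the path of the file it just created.
write_summary <- function(data, path) {
  write.csv(mean_by_arm(data), path, row.names = FALSE)
  invisible(path)
}

# Made-up example data.
adlb <- data.frame(
  arm  = c("A", "A", "B", "B"),
  aval = c(12, 13, 11, 17)
)

out_file <- write_summary(adlb, tempfile(fileext = ".csv"))
print(read.csv(out_file))
```

Returning the path invisibly keeps the console quiet while still letting the caller pass the file location on to the next step.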
So what people typically like to do is they just jump straight in head first, write some functions, then go into the R console and do some ad hoc testing: does this produce the result I want? And that works, for sure. The problem is, though, you don’t save these tests, so as soon as you make a change, you have to remember all the ad hoc tests you did and see if it still works.
So a better approach is actually to say, hey, let’s formalize the requirements, so to speak, of the function: given X as input, Y should be the output. This is called unit testing. You call the function with a given input, you say what your expected output is, and you compare those two.
Does the function, given this input, actually produce the expected output? You write the tests once and then you can reuse them as many times as you want. So whenever you make an update to a function, you can actually test: do my old test cases still work? And oftentimes, as it happens, they don’t, because you changed some if condition or whatnot, but it immediately tells you that, hey, you messed up in a way.
So really unit testing is a great safety net for you when you’re developing. And also at the end of the day, especially in our industry, we better be sure that whatever we output is actually correct because the decisions we make are whether people get a drug which has the potential to have great benefit, but also harm in some cases if we talk about adverse events.
And imagine the case where you screw up an analysis because you didn’t test your function properly. I mean, you wanna avoid that at all costs. Which is actually why I think anyone working in this particular industry should really be a guru in unit testing and use it very frequently.
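The "given X as input, Y should be the output" idea can be sketched without any add-on package. The example below uses only base R's `stopifnot()` and `identical()`; dedicated testing packages such as testthat offer richer tooling for the same pattern. The function `flag_range` and its reference limits are hypothetical.

```r
# Hypothetical function under test: flags a lab value against reference limits.
flag_range <- function(value, lower, upper) {
  ifelse(value < lower, "LOW",
         ifelse(value > upper, "HIGH", "NORMAL"))
}

# Saved test cases: known input, expected output, compared explicitly.
test_flag_range <- function() {
  stopifnot(
    identical(flag_range(5, lower = 10, upper = 20), "LOW"),
    identical(flag_range(25, lower = 10, upper = 20), "HIGH"),
    # Boundary values count as normal in this sketch.
    identical(flag_range(10, lower = 10, upper = 20), "NORMAL"),
    identical(flag_range(c(9, 15, 21), 10, 20), c("LOW", "NORMAL", "HIGH"))
  )
  invisible(TRUE)
}

# Re-run after every change to flag_range(); an error means a case broke.
test_flag_range()
```

If someone later changes `<` to `<=` in the first condition, the boundary test fails immediately, which is exactly the safety net described above.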
[00:10:24] Alexander: Awesome. That is actually something that I’ve never done in SAS before. I’m not sure whether there’s kind of SAS or macros for doing that.
[00:10:34] Thomas: So it is a good point, because it is not baked into SAS, the language. There is nothing from SAS to do unit testing of SAS macros. There is something that some people have written, similar to an R package, and you can get it on GitHub, I think.
And I’ve used it in the past. But I think that already tells you a lot about it, that it’s a bit of a second-class citizen in a way. To be honest, in R it’s also an add-on package that we use. But out of all these 20,000 packages that are out there, probably 95% of those use it. So it’s really heavily battle-tested. And I think anyone who’s somewhat familiar with R knows how to use that. So it’s a great tool.
[00:11:16] Alexander: Yeah, and we’ll go into this kind of unit testing in the course. So…
[00:11:23] Thomas: Absolutely.
[00:11:24] Alexander: That will be a very, very fundamental part of the SAS to R course. Another topic that is really, really great with R is data visualization, and I mentioned that in the previous episode already. Nearly all submissions to the Wonderful Wednesday webinar, where we do lots of data visualization on all kinds of clinical trials, observational studies, and all these typical datasets that we work with, come with R, and there’s a rich amount of packages, examples, all kinds of different things. Why do you think R is so great at that?
[00:12:12] Thomas: It’s a good question. So I think when R was initially conceived, the authors were actually statisticians, so that already tells you a lot, I think. I think they had two big things in mind. They wanted to make it easy to somehow wrangle data.
Then obviously to do statistics, so it’s actually three things, but more importantly also to make it easy to visualize data, because this is such a powerful thing to do. You know, displaying a couple of numbers from a statistical model to stakeholders, you can try it, but it’s oftentimes not meaningful.
Whereas if you have a data visualization to show them, even if they don’t understand the statistical measure behind it, if you have two Kaplan-Meier curves which are that far apart, they will immediately see it and say, hey, there is something there. This drug may actually work, which is great.
So yeah, it’s somewhat baked into the language. If you install R, there’s something called the graphics package, which allows you to create, I would say, pretty decent plots very, very easily. But R has evolved over time. And nowadays what most people use is something called ggplot2, which builds on some theoretical work around data visualization called the grammar of graphics, which is, I have to say, a very intellectually interesting approach, because you cut the visualization into the different layers that you see on the plot, and ggplot2 implements that in code in a way. And the amount of customization you can do with that, I think, is really second to none.
So you can obviously just stick to all the defaults and you will create a plot which looks, I would say, very decent, but everyone who knows will be able to say, yes, this was created with ggplot2. But then you can customize it to a degree that people would never even think that that’s the case. So it’s extremely powerful, and for that reason this is also the tool we will be teaching in the course, because if there’s one tool for data visualization I think people should know about, it’s this one.
[00:14:07] Alexander: Yes, absolutely. And for data exploration, it’s awesome. You can have interactive data, you can have animated data, you can have great dashboards.
You can create great data visualizations for data explanation, so for the things that go into your manuscript, into your dossier, all kinds of really, really cool things. And all the different data viz experts that I’ve worked with in the industry, and I’d say that’s quite a lot, all rely heavily on R.
And the Wonderful Wednesday webinar series that I just mentioned has created a library with all the cases. So there are over two dozen webinars now, and for all these webinars, there’s a dataset, there are data visualizations, and there’s code in there.
So even if you don’t know exactly how to get a bar chart or a line graph (by the way, we’ll cover that, of course, in the course), or even if you want something completely fancy, an exploding pie chart, which I would highly recommend not creating, there’s surely some code out there which you can really easily reuse. And mostly it’s based on ggplot2, as you mentioned.
Of course, we’ll also go into some data visuals that are specific to our area, for example Kaplan-Meier plots. Lots of courses out there teach you about data visualization, and they’ll talk about line charts and bar charts and these kinds of things. But mostly they will not cover what you really need in clinical trials.
And so that is something that we specifically go into, especially also showing uncertainty estimates and these kinds of things. Most of the generic data visualizations just show you the means and the proportions and all these kinds of different things, but not really the statistics and the deviation and the confidence intervals and all the things that we usually wanna see. So that is also something we’ll specifically cover. What is one piece of advice you would give to someone programming data visualizations in R, something you would say is absolutely vital?
[00:16:42] Thomas: Ooh, that’s a tough question. I think I would actually like to refer to something I’ve learned from you, which is: maybe don’t jump directly into the code, but actually take a step back, maybe pull out pen and paper, and sketch what it is that you want to display.
Because oftentimes when you use a tool, you think in the constraints, so to speak, of the tool, and you don’t necessarily want to constrain yourself right from the beginning. So yeah, I think that’s actually a very smart thing to do. Think first about what it is that you want to display, and only once you’ve made a decision on that, see how you can actually implement it in code.
[00:17:20] Alexander: Awesome. Thanks so much. Okay. Data visualizations. Well, I’m a big fan of it. The bread and butter of all what we do are summary tables.
[00:17:33] Thomas: Unfortunately that’s the case.
[00:17:34] Alexander: Unfortunately still, and there are typical things that we always need to create. So what will we cover in the course there, and are there any typical helper functions that are really useful there?
[00:17:53] Thomas: So we will definitely look at specific examples of the kinds of tables you would probably find in any clinical study report. So your AE table by system organ class and preferred term, with the unique frequencies within that. Your table one, your demographic table, will certainly also be our table one.
At least that’s what I have in mind right now. Something like change from baseline by visit, which is something you typically do for lab measures, for example. Another thing could be shift tables of abnormalities over time. So, yeah, as you said, the bread and butter of what we typically do. But what I’m aiming to do is, of course, show you how to do these specific tables, but also equip you with the knowledge and tools to be able to not only adapt them to whatever needs you have, but also create other kinds of new tables.
So you should not only learn how to create table X, Y, and Z, but actually the process behind, why do we have to create it in that way so that you can also create table A, B, C, and whatnot.
[00:18:57] Alexander: Yeah, and I think one of the nice things is that you can then very easily work further with all the results that are in these tables. One of the things that I hate about SAS, although you can also do it with SAS of course, is that very often the output is just put into a PDF or an RTF.
And then it’s so difficult to work with it further. If you just think of an output from a clinical trial, that will always be further processed. It’ll be used not just in the clinical trial report; it will be used in the abstract, in the manuscript, in another dossier, and maybe for a secondary manuscript and all these kinds of different things.
And I think it’s really, really good practice to make sure that you store all these results in an effective way so that you can also leverage what you learned in the previous module on data visualization. So I think both really go hand in hand.
[00:20:11] Thomas: Yeah, absolutely. And what you just mentioned, if you just have a PDF or an RTF and then you need to pull out some number programmatically: there are people who can do that, but no one should actually have to do that, because it is a horrible approach. And in R, for example, whether it’s a table or, actually, a ggplot is an even nicer example, just save that object before it gets rendered to something that you can see.
Let’s say, for example, you want to change the theming for a conference presentation. It should look slightly different from your clinical study report. You shouldn’t have to go back to the code, re-crunch all the numbers, and display them. You can just take that object, read it in, and change the formatting slightly.
The numbers didn’t change, you didn’t do anything to them, and then you’re good to go. Or for a table, for example, for your PowerPoint presentation, you may only want to display those three variables of interest in your demographic table rather than the 10 you had originally. It’s so easy to just subset those and then display them not in a PDF, but in a PowerPoint slide instead.
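One way to "save the object before it gets rendered" is base R's `saveRDS()` and `readRDS()`. The sketch below stores a small, made-up demographic summary as a data frame, reads it back, and subsets it for a slide without recomputing anything. The variable names, values, and file name are assumptions for illustration; the same pair of functions can be used for other R objects, such as a plot object.

```r
# Made-up demographic summary, stored as data rather than as a rendered PDF/RTF.
demog <- data.frame(
  variable = c("Age (years), mean", "Female, n", "Weight (kg), mean",
               "Height (cm), mean", "BMI (kg/m2), mean"),
  placebo  = c(54.2, 48, 78.1, 171.3, 26.6),
  active   = c(53.8, 51, 77.4, 170.9, 26.5)
)

# Save the result object once, when the analysis is run.
rds_path <- file.path(tempdir(), "demog_table.rds")
saveRDS(demog, rds_path)

# Later (e.g. for a presentation): read it back, no re-crunching of numbers.
reloaded <- readRDS(rds_path)

# Keep only the three variables of interest for the slide.
keep <- c("Age (years), mean", "Female, n", "BMI (kg/m2), mean")
slide_table <- reloaded[reloaded$variable %in% keep, ]
print(slide_table)
```

Because the stored object holds the numbers themselves, changing the layout for a slide or a manuscript becomes a formatting step rather than a rerun of the analysis.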
[00:21:21] Alexander: Thanks so much for the second episode about the SAS to R course, with lots of helpful tips for you to move from SAS to R. And if you wanna learn more about this course, just head over to The Effective Statistician and check for the course page, and you’ll find it. And if you’re really quick, you can still grab some of the live sessions that we have for this course. Thanks so much, Thomas, and see you in the next episode.
[00:21:56] Thomas: Thanks for having me. It’s been a pleasure.