OK, not everything, but in this episode you will get all the tips to make sure you avoid the most common mistakes and that your code looks professional.
Shafi Chowdhury is an expert programmer who has developed a style guide, which his clients apply broadly. He regularly gives training on SAS programming and built his own company based on these skills.
He walks us through the different points and clarifies why they are important from an efficiency as well as a quality perspective.
Shafi Chowdhury
I have over 20 years of experience as a statistical programmer in the Pharma industry. I worked for Pharma companies and CROs across Europe in many different therapeutic areas and in all phases of clinical trials before setting up my own consultancy firm. I believe knowledge should be shared and therefore I am a regular presenter at PhUSE conferences and regularly attend many other conferences including PSI conferences for Statisticians in the Pharmaceutical Industry. I also provide bespoke training and have a website to allow users to learn just the module they need at that time.
I specialise in reviewing processes and developing standards, tools, templates and macros to improve the expertise of individuals and efficiency of processes. As an independent consultant with all the proven experience behind me, I offer unbiased expert opinions which can be used by management to make their decisions. My aim is always to drive up Quality by Design.
Specialties:
- Writing SAS programs to check, modify, analyse and report any kind of data.
- Developing client-specific template programs and generic macros.
- Developing bespoke training programs to produce well-rounded programmers within weeks.
Transcript
Everything to know to write programs like a pro – Interview with Shafi Chowdhury – principles for good programming
00:09
Welcome to the Effective Statistician with Alexander Schacht and Benjamin Piske, the weekly podcast for statisticians in the health sector designed to improve your leadership skills, widen your business acumen and enhance your efficiency. In today’s episode, number 14, we’ll talk about everything to know to write programs like a pro, in the interview with Shafi Chowdhury about principles for good programming.
00:36
This podcast is sponsored by PSI, a global member organization dedicated to leading and promoting best practice and industry initiatives for statisticians. Learn more about upcoming events at psiweb.org.
01:00
Hello and welcome to another episode of the Effective Statistician, today again with Shafi. Last time we talked about how to build your own company, and we talked a lot about leadership and these kinds of things. If you haven’t listened to this episode, go back in your favorite podcast program and check out the other episode with Shafi.
01:27
It was really, really inspiring and motivational, even if you don’t want to build your own company. So there were lots and lots of good ideas in it. But today we’ll actually go over a quick guide to good programming practice. Hi, Shafi, how are you doing today? Hi, Alexander, I’m doing good, thank you. So good
01:51
programming practice is something that we probably should use every day. Where is this quick guide actually coming from? It’s actually from what I’ve seen in lots of different companies: you have a big guide to good programming practice that everyone usually has to read when they start, and then it’s stored on one of these dusty shelves that no one ever looks at.
02:20
Everyone’s forgotten by the time they start programming what was actually in there. So what I thought was, how do we resolve this? And so we created a one-page sheet, and with one client that’s what we did: we created a laminated sheet and put it on everyone’s desk. It’s just a small list of things that if you
02:49
bad programming habits and so on. So that’s really the aim. So you can achieve really good programming practice by following these simple steps as it were. So that was really the inspiration. So we’ll break these kind of programming steps down into three parts. First kind of before you actually get into the details of the programs, what do you do to get started?
03:18
Then we’ll talk a little bit about the programming style, so the core part. And once you have done all your programming, what’s coming then at the end in terms of checking. So let’s dive into the getting started. When I was programming at university, I think I was directly going into all the data sets and so on.
03:48
But when I actually started in the industry, and you were actually my teacher there, there were lots of other things and better things to do. So what should we all do when we get started with the programming? I think one of the key things is really to make sure you have an initiation file somewhere. So it’s like an autoexec.sas in SAS. Or if you have
04:16
other project folders, then you have somewhere an initiation file which defines all your libnames, all your options, and sets your formats. So I think these are the key things. If you just put all these in one file, then you don’t need to define them in every program. And so that in terms of like, for running the program, this is that this set is perfect.
04:46
where data is coming from. You don’t need to change it in lots of places, just one place. And the other thing is, when you are writing programs, then if you have standard ways of writing your header, program header, or writing your comments, that also helps. And these days in SAS, you have abbreviations, which are really superb.
05:14
So you can have an abbreviation called header. And then as soon as you write that, it brings up the standard header in your program editor. So you don’t need to worry about it. It’s the same thing for comments, sections. You can create some standard ways of writing so that it helps you to structure the program. So everything’s kind of set.
05:39
that you have in more than just one program at the start, all these kind of things would go into these autoexec files. Exactly. Yeah. Okay. Very good. I, for example, have even an abbreviation, which is called new prog. So when I type that, it sets up my complete program structure. So like with the header, with my name in it, it says, okay, at the top,
06:04
part, you know, get data, get your initial data, then there’s a comments section for processing and then a closing section and there’s the end of program. So it already gives me this whole structure. So it just makes life a little bit easier. That’s really nice to get started. Very, very quick, not just from a blank screen. Yeah, yeah.
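As a rough illustration of the initiation file described above, an autoexec.sas might look something like the sketch below. The paths, the librefs (`raw`, `adam`, `fmts`) and the particular options chosen are assumptions made for illustration; none of them come from the interview.

```sas
/* autoexec.sas: hypothetical project initiation file.        */
/* All paths and library names below are placeholders.        */

/* Libraries: the one place to change when data moves */
libname raw  "/projects/study001/data/raw" access=readonly;
libname adam "/projects/study001/data/adam";
libname fmts "/projects/study001/formats";

/* Options shared by every program in the project:            */
/*   mergenoby=error  a MERGE without a BY statement fails    */
/*   msglevel=i       extra informational notes in the log    */
/*   fmtsearch=       where SAS looks for format catalogs     */
options nodate nonumber
        mergenoby=error
        msglevel=i
        fmtsearch=(fmts work);
```

Each program can then begin with a single `%include` of this file, or SAS can be started with the AUTOEXEC system option pointing at it, so that no program repeats these definitions.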
06:31
Now let’s get into the next part, which is the programming style. What’s your first guide on that? The thing that I always tell people is that there are two key things that you must remember when you start programming. Any program, it doesn’t matter. One, that it will have to be changed. That is a given. And then second is that it might need to be changed.
07:00
either by someone else later on or you in a year’s time. Can you remember this program in a year’s time? Why? What are you doing in it? What are the different parts doing? So if you keep these two things in mind, then that will help you to write comments and stuff better in your program. So the key thing is really so keep this at the beginning and then start ahead. Make sure.
07:31
If you do nothing else, you have a standard program header. And that way, every one of your programs will say, actually, who’s created it, when it was created, why it’s created. You can add things like, you know, what data it’s using. But that can also be solved, you know, if you follow the rest of the structure. But at least what you’re trying to produce should be in this program header. Yeah, and I think it’s a good guidance to imagine.
08:00
that your future self has completely forgotten about your present self. Exactly. And we’ve often come across this, where you’re looking at programs like, what was I thinking? Why did I do this? And as you say, so much time has gone past. You just need to make sure when you’re writing comments, you explain why you’re doing something, not just that you’re merging this and this, because this they can see. But why are you merging? That’s the important bit.
08:29
especially if it’s going to be used somewhere further down the program. Okay. Yeah. Yeah. Okay. So you need the reasoning to connect to different parts of the bigger programs. So, yeah, the bigger the program, the more comments you should have in your program. And again, structure the program. So you have a section at the top of your program where you read in all your external data. It doesn’t matter where it’s used.
08:58
just read it at the beginning, and ideally with a keep statement so that you know what’s actually coming in. And what this does is this lets anyone else who takes over your program to really follow it straight away. So they can say, OK, this is where all the external data is coming in. And then you have another section where you do all your processing, and then you have another section where you do all your outputs. So basically, you divide all your programs into three parts.
09:27
process and all. Exactly. Yeah. And when you’re doing your output, then again, use a keep statement or an SQL statement where you don’t just use stars; you name the variables. That way, it’s really clear for anyone taking over what is actually being produced at the end. So if it’s a data set, if it’s an output file, again, of course, that’s different. But still, make sure it’s clear what is being produced at the end.
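The three-part layout described above (read, process, output) together with a standard header could be sketched roughly as follows. The program name, the librefs `raw` and `adam` (assumed to be assigned in an initiation file), and the data set and variable names are all hypothetical, and `raw.dm` is assumed to hold one row per subject.

```sas
/*-----------------------------------------------------------
  Program : ae_counts.sas            (hypothetical name)
  Author  : <name>
  Date    : <date>
  Purpose : Count adverse events per subject, by treatment arm
  Output  : adam.ae_counts
-----------------------------------------------------------*/

/*--- 1. Read all external data (raw and adam are assumed
         to be assigned elsewhere) --------------------------*/
data dm;
  set raw.dm (keep=usubjid arm);
run;

data ae;
  set raw.ae (keep=usubjid aeterm);
run;

/*--- 2. Processing ----------------------------------------*/
proc sql;
  create table ae_per_subject as
  select usubjid,
         count(*) as n_ae
  from ae
  group by usubjid
  order by usubjid;
quit;

proc sort data=dm;
  by usubjid;
run;

/* Subjects without any event get a count of zero because
   the table should include every randomised subject */
data ae_counts;
  merge dm (in=in_dm)
        ae_per_subject;
  by usubjid;
  if in_dm;
  if missing(n_ae) then n_ae = 0;
run;

/*--- 3. Output: name the variables that are delivered ------*/
data adam.ae_counts;
  set ae_counts (keep=usubjid arm n_ae);
run;
```

Someone taking over the program can see from the first section what comes in and from the last section exactly which variables go out.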
09:56
So in terms of actually writing the code and the formatting of that, do you have any guidance on how to make it easy to read? One of the things that I always tell people is avoid putting comments inside data steps. I know that if you have a bit of a long data step, then there’s always a temptation to explain what you’re just about to do.
10:26
But what that does is that distracts people from actually seeing what the overall program is doing, what the overall step is doing. So what I always say is put all your comments at the top of the data step. So above the data step, and then you go into the data and do things. And where there’s different parts, where it’s a long data step, then put like star A, star B, star C, and then put the comment in the again.
10:55
above the data step with A, merging this and this because of this reason, B, doing this and this, C, following this algorithm. So what this does, it allows people to actually read the SAS code much more clearly, and then they can easily look up to see if there’s some parts they’re not sure. They can always just look up and see, ah yeah, that’s what they’re doing here. But having too many comments in the programs just makes the programs difficult to read.
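The tagging convention described above might look roughly like this. The data sets `visits` and `baseline`, their variables, and the reasons given in the comment are invented for illustration; both inputs are assumed to be sorted by `usubjid`, with one row per subject in `baseline`.

```sas
/*-----------------------------------------------------------
  Derive change from baseline.
  A: Merge visits with baseline, because the change from
     baseline is needed later for the efficacy table.
  B: Keep only subjects with a baseline record, because the
     (assumed) analysis plan excludes the others.
  C: Derive the change from baseline.
-----------------------------------------------------------*/
data chg;
  *A;
  merge visits   (in=in_vis)
        baseline (in=in_base keep=usubjid base);
  by usubjid;
  *B;
  if in_vis and in_base;
  *C;
  chg = aval - base;
run;
```

The step itself stays short enough to read in one go, and the reasoning sits in one place directly above it.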
11:25
I mean, within the data steps. So basically, you have just the kind of tags in the data step and what these tags mean are actually all put together at the top. So you can see, OK, in line 20, there’s tag A and that does this. And in line 40, there’s this.
11:49
tag B and it does this, and in line 250 there is tag C and it does that. But you see it’s all at the top and you don’t need to kind of scroll through your program to see what it all does. So this is like for big data steps. So I would actually just put the comments just above the data steps. So not like right at the top of the program, but just above the data steps. But where it’s like a big complex program.
12:15
or sometimes like big macros and so on. Then what I usually have is I have a comment section at the top where I explain what is happening in the different sections of the program. So that someone can follow. If they need to update something, they know which part of the program to go to. So again, that helps with longer analysis programs, for example. What more do you have in terms of programming style? There’s some key things like, again, these are basics.
12:45
Write one statement per line. That might sound strange, but some people really write a whole data step in one line. Indent your program. Again, with good programming practice, one of the key things is really how easy is it to read. If you look at a program and it’s easy to read, you straight away have more faith in this program. You think, okay, at least this programmer knew what he was doing. It might not be correct and you might need to look into it deeper, but…
13:14
you start off with a good understanding, a good impression. If you look at a program and it’s really structured badly, there’s no indenting, there’s lots of statements in one line, you straight away start looking at this program with a guarded view. It’s like, oh, there’s gonna be problems. And it’s just in our nature, that’s what we’ll do. So always write one statement per line and indent.
13:41
So you know, you indent so that things are lining up. If you have a do loop, where your do starts and where it ends, make sure everything inside is indented. And again, be consistent. So that is really important, because I’ve seen people indenting two characters in some places, three in others, and four in others. And that’s also not helpful. So if you indent two spaces, then always indent two spaces.
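A small sketch of these rules: one statement per line, everything inside a `do` block indented, and the same two-space indent throughout. The data set name and the dosing logic are made up purely to have something to indent.

```sas
/* Hypothetical dosing schedule: 3 subjects, 4 weeks each */
data schedule;
  do subject = 1 to 3;
    do week = 1 to 4;
      dose = 25 * week;
      /* Cap the dose at 75 */
      if dose > 75 then do;
        dose = 75;
      end;
      output;
    end;
  end;
run;
```

Each `end` lines up with the `do` it closes, so the extent of every loop is visible without counting statements.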
14:09
And that, again, makes a huge difference when it comes to reading programs. Of course, you also need to make sure each of your data sets, they have meaningful names so that when someone looks at that data set, they know what it means. They know something about the data that’s contained inside. And if you are much more experienced, then you can even go for which part
14:38
of the program it’s in or what function it’s doing. But that’s much more later. But at a general level, just make sure every data set has a different name. Yeah, yeah. And not just TMP. Exactly. Or the number of times I’ve seen final, final, final. Or final one, final two, final three. If you already have a program and you’re updating it,
15:07
make sure you stay true to it, you’re consistent. If the data set which was final before is no longer final, change its name so that it has something meaningful. And that way, when someone else is looking at it, they can continue to follow this. And it doesn’t look like a new program where it’s been updated eight times by eight different people. Yeah, I think what I learned from you is the code.
15:36
needs to be beautiful somehow. It makes it far easier to actually change. Usually, these changes are kind of last-minute changes and under lots of pressure. Then you don’t want to go in there at 2 a.m.
16:03
We assist data set. Exactly. Yeah. So I mean, the easier it is to read, the better it is both for you and for any reviewer, and increasingly we’re having to send programs to the regulators. So again, if they open a program and it looks nicely set out, they’ll think, okay, this is a well thought out program. And then they will start looking at it with an open mind and a good impression, as opposed to opening it with a negative impression.
16:33
Yeah, that’s another good point. Yeah, completely agree. So now we have written our code. What comes next in terms of checking? You can’t really get away from basics. You must check the log for errors and warnings. These are like two of the most basic things. And the number of times I see programs where there are errors and warnings in the log and then you speak to the programmer, they explain, ah.
17:03
that’s okay, it doesn’t have any impact. You should always try to avoid getting the error or getting the warning. Especially warnings. If you can program around it, do that. So that you avoid these. But there are lots of other things where we need to really make sure we’re careful. So things like uninitialized values. It might be because you just haven’t set something to missing or something to zero.
17:29
But it’s really important to get rid of this, because it could also be you had a typo. And so you’re trying to use a variable which doesn’t exist in a formula. So again, if you see uninitialized, resolve it so that the program works without any of these things. Other things like when you’re merging data sets, you want to avoid things like repeats of BY values, because
17:56
this only appears as a note, so it doesn’t even appear as an error or warning, but it can actually have really huge significance if you’re not careful. So it often happens you get this like during many-to-many merge. So you’re merging two data sets and say, you know, you have a patient one on two different rows, and patient one on two different rows in the other data set. When you merge them together, it doesn’t
18:26
know how to merge it, it might not merge things correctly. So you need to make sure that you don’t have these cases. And if you do have these cases and you want to merge it like this, then you should use SQL. So yeah. And I think it’s always kind of maybe it works with current data set, but maybe it doesn’t with the next one. We often develop.
18:54
programs on, like, dirty data. So data is always changing until the end. So you really need to make sure you have your guards about you. And if you’re using SQL, for example, make sure that you don’t see Cartesian product in your log unless you’re expecting it. Again, it’s because there’s a many-to-many merge. And suddenly, instead of merging two data sets and you’re ending up with two records,
19:23
you could be merging two data sets, and you’re ending up with four records. So you really have to be careful when you check the log. So these are really critical parts. One of the things, again, it’s part of me which is very defensive. And think, actually, you should always program defensively. So you say, actually, if sex is male, then do this. Or female, then do that.
19:52
otherwise put a message to the log. So don’t just assume if it’s not male, then it will be female because we’ve always come across where there’s missing values or something else. So it’s really important to make sure you program defensively in these cases. Thanks a lot. That was a very nice quick overview of all the basics that we should all keep in mind when we do our programming.
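Two of the checks discussed above can be sketched in a few lines: listing repeated BY values before a merge, so that the note "MERGE statement has more than one data set with repeats of BY values" never appears, and branching defensively on a coded variable. The data sets `dm` and `vs`, the codes 'M' and 'F', and the "CHECK:" message prefix are assumptions for illustration; both inputs are assumed to be sorted by `usubjid`.

```sas
/* A: List subjects that occur more than once in dm. An empty
      dm_dups means dm has one row per subject, so the merge
      below cannot be many-to-many.                           */
proc sql;
  create table dm_dups as
  select usubjid,
         count(*) as n_rows
  from dm
  group by usubjid
  having count(*) > 1;
quit;

/* B: Handle each expected value explicitly and write a
      message to the log for anything else, rather than
      assuming "not male" means "female".                     */
data vs_labelled;
  merge vs (in=in_vs)
        dm (keep=usubjid sex);
  by usubjid;
  if in_vs;
  length sex_label $6;
  if sex = 'M' then sex_label = 'Male';
  else if sex = 'F' then sex_label = 'Female';
  else put 'CHECK: unexpected value of SEX: ' usubjid= sex=;
run;
```

Searching the log for the chosen prefix then shows every record that fell outside the expected values, including missing ones.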
20:21
And I think having these good habits helps a lot to make sure you get things done effectively, not just for yourself, but also for others that later need to work on your stuff. And I think if you need to pick up a program, you also want to have it really nicely structured so that it doesn’t take you a couple of days to actually understand what the program is actually doing.
20:50
before you can make this little tweak to update the program. Thank you.
20:58
Okay, thanks a lot Shafi for this short episode and talk to you soon. We thank PSI for sponsoring this show. Thanks for listening. Please visit thee to find the show notes and learn more about our podcast to boost your career as a statistician in the health sector. If you enjoyed the show, please tell your colleagues about it.
Join The Effective Statistician LinkedIn group
This group was set up to help each other to become more effective statisticians. We’ll run challenges in this group, e.g. around writing abstracts for conferences or other projects. I’ll also post into this group further content.
I want to help the community of statisticians, data scientists, programmers and other quantitative scientists to be more influential, innovative, and effective. I believe that as a community we can help our research, our regulatory and payer systems, and ultimately physicians and patients take better decisions based on better evidence.
I work to achieve a future in which everyone can access the right evidence in the right format at the right time to make sound decisions.
When my kids are sick, I want to have good evidence to discuss with the physician about the different therapy choices.
When my mother is sick, I want her to be able to understand the evidence.
When I get sick, I want to find evidence that I can trust and that helps me to have meaningful discussions with my healthcare professionals.
I want to live in a world where the media reports correctly about medical evidence and in which society distinguishes between fake evidence and real evidence.
Let’s work together to achieve this.