This episode is one of the most downloaded of all time—and for good reason. As data science continues to disrupt and redefine the healthcare and pharmaceutical industries, statisticians everywhere are asking: Where do I fit in? In this insightful conversation, two leaders from Cytel—Yannis Jemiai, Head of Consulting and Software, and Rajat Mukherjee, Head of Data Science—share their personal journeys from traditional statistics into data science, how the field is evolving, and why statisticians are uniquely positioned to lead the future of analytics in life sciences. Whether you’re curious, skeptical, or already exploring data science, this episode will inspire and equip you with practical insights.
What You’ll Learn:
✔ How two leading statisticians transitioned into data science
✔ The key differences (and overlaps) between data science, statistics, big data, and machine learning
✔ Why data science is more than hype—and why statisticians are needed more than ever
✔ The role of visualization and statistical learning in interpreting high-dimensional biomedical data
✔ Real-world applications of data science in biomarker discovery, precision medicine, and pharmacovigilance
✔ What makes data science in pharma different from tech giants like Google or Amazon
✔ Tips for statisticians who want to get started in data science
Why You Should Listen:
If you’re a statistician wondering whether data science is your next career step—or simply curious about how the two fields intersect—this episode offers an honest, expert-led exploration. Yannis and Rajat pull back the curtain on what data science really involves, how it’s transforming pharma and healthcare, and what skills and mindset statisticians can bring to this evolving space.
Links:
🔗 Explore Cytel’s data science insights and case studies
🔗 The Effective Statistician Academy – I offer free and premium resources to help you become a more effective statistician.
🔗 Medical Data Leaders Community – Join my network of statisticians and data leaders to enhance your influencing skills.
🔗 My New Book: How to Be an Effective Statistician – Volume 1 – It’s packed with insights to help statisticians, data scientists, and quantitative professionals excel as leaders, collaborators, and change-makers in healthcare and medicine.
🔗 PSI (Statistical Community in Healthcare) – Access webinars, training, and networking opportunities.
Join the Conversation:
Did you find this episode helpful? Share it with your colleagues and let me know your thoughts! Connect with me on LinkedIn and be part of the discussion.
Subscribe & Stay Updated:
Never miss an episode! Subscribe to The Effective Statistician on your favorite podcast platform and continue growing your influence as a statistician.
Yannis Jemiai

Yannis Jemiai has a pivotal role within Cytel, leading the company’s consulting and software business units, as well as the global marketing group. With Cytel Consulting he heads up an elite team of biostatisticians, skilled in applying the latest trial techniques and methods, to help our customers accelerate clinical development and mitigate portfolio risks.
Yannis also oversees the development of Cytel’s software product lines, including trial design packages East® and Compass®, and exact statistics applications StatXact® and LogXact®. Yannis guides global marketing efforts to raise awareness of and uncover new opportunities for the company’s growing range of clinical research services and specialized software.
His research has been published in numerous statistical journals. Dr. Jemiai earned his Ph.D. from Harvard University, an M.P.H. from Columbia University, and a B.A. in Molecular and Cellular Biology, also from Harvard.
Rajat Mukherjee

Rajat Mukherjee has 15 years of professional experience as an industry and academic statistician and brings a range of expert knowledge to Cytel’s customers. This includes work in pattern recognition problems for devices and biomarker discovery, Bayesian clinical trials, adaptive designs, and the design and analysis of complex epidemiological studies. His experience and expertise also include statistical computing, survival analysis, longitudinal analysis, nonparametric and semiparametric inference, as well as statistical classification and high-dimensional data. Rajat has a strong background and interest in the development and implementation of statistical methodology with application to real-life medical problems.
Transcript
Alexander: [00:00:00] You are listening to The Effective Statistician podcast, the weekly podcast with Alexander Schacht and Benjamin Piske, designed to help you reach your potential, lead great science, and serve patients while having a great work-life balance.
In addition to our premium courses on the Effective Statistician Academy, we also have lots of free resources for you across all kinds of different topics within that academy. Head over to theeffectivestatistician.com and find the Academy and much more for you to become an effective statistician. I’m producing this podcast in association with PSI, a community dedicated to leading and promoting the use of statistics within the health industry [00:01:00] for the benefit of patients.
Join PSI today to further develop your statistical capabilities with access to the video-on-demand content library, free registration to all PSI webinars, and much, much more. Head over to the PSI website at psiweb.org to learn more about PSI activities and become a PSI member today.
Welcome to this interview for The Effective Statistician with Cytel. We have Yannis and Rajat here with us today. Hello, together. Hello. Good morning. I also have, of course, again, my co-host Benjamin Piske here. Yeah. Very good. Today’s topic is data science. It’s probably the hot topic of this year, especially for PSI.
There are a couple of things going on. [00:02:00] There was a webinar about data science, and there is a big data science session at the PSI conference. I also recently read an article naming data scientist, for the third year in a row, the sexiest job in the US. So it’s obviously a really hot topic. Let’s dive into this.
Yannis, maybe you can start with an introduction of yourself. Where are you from? How did your career bring you to this space, and what are your special interests in statistics?
Yannis: Sure. I’ll begin. I am Yannis Jemiai. I’ve been working at Cytel for 13 years, and I head our consulting and software groups.
Cytel started as a software company in statistics, and we’ve continued to provide tools and services to the statistical community. I’m very interested in data science as it becomes more and more relevant to the world and to our particular industry, the life sciences and drug development.
I got my statistics [00:03:00] degree from Harvard. I was interested in causal inference, which is a very different direction than data science, and it is interesting to see how the two can be reconciled to assess causality as well as correlation in these types of problems.
Rajat: Hi everybody. My name is Rajat Mukherjee. I’m a trained statistician. I got my doctoral degree in statistics from the University of Wisconsin-Madison, mostly focusing on mathematical statistics and working with semiparametric models, but with applications in survival analysis. That was my interest in biostatistics.
Then I started working and teaching in public health, and that exposed me to the sort of statistical or data-related problems people look at in the healthcare industry in general. Then I started working as a consulting statistician at Cytel four years back, mostly dealing with innovative designs for [00:04:00] clinical trials.
But while doing this, some interesting, non-traditional statistical problems came to me. They were mostly related to factors such as environmental and genomic factors related to nutrition and diseases. That’s how I got my exposure to something not traditionally handled by statisticians.
Then I got into interesting projects dealing with biomedical images and signals for diagnostic purposes. That’s how I got into pure data science. I realized that statisticians alone cannot deal with the massive amount of computing and data parsing required to solve these issues and get useful solutions. Like Yannis mentioned, we recently started doing pure data science and now have a data science team at Cytel. [00:05:00] And currently I’m leading that data science team.
Benjamin: Both of you mentioned that statistics was your starting point and that you then went into data science. So what exactly distinguishes a statistician from a data scientist?
Rajat: If you google this question, you’ll get lots of different viewpoints, and it’s quite a debate: what is statistics versus data science?
My view is that data science is actually a component of statistics that is basically engaged in prediction as the end goal, as opposed to statistics, which is engaged not just in prediction but also in the design of experiments or clinical trials, estimation of parameters of interest, and hypothesis testing.
The main difference in terms of the technical expertise you need is that data science additionally relies heavily on computing. [00:06:00] That’s how I see data science.
Alexander: I recently talked to a friend who is much more a programmer than a statistician. He said to me, and I’m really not a very good programmer and much more a statistician, “Maybe in today’s world we are all data scientists.” And I said, “Maybe that’s true, but maybe a data scientist is a combination of us.” I recently read a quote that said a data scientist is someone who knows more about programming and computing than a statistician, and more about statistics than a pure programmer. Is that valid?
Rajat: I can relate.
Yannis: I would say we’re living in an age where there’s an explosion of data and everybody wants to extract information from that data. People are coming at this from various backgrounds and [00:07:00] disciplines. You could come in as a mathematician, a computer scientist, or a biologist, and you have bioinformatics.
All of these disciplines are interested in extracting meaning and information from data and acting upon it. Data science has been a tug of war between these disciplines, each trying to claim ownership. What most other disciplines lack, compared to statisticians, is training in the framework and understanding of probability and uncertainty.
Many disciplines come to data science extracting information from data and making predictions in a deterministic way. A lot of what people call machine learning or artificial intelligence is trying to repeat patterns, not necessarily accounting properly for the fundamentals we learn in statistics about proper sampling, design of experiments, [00:08:00] generalizability, and causality. A lot of people who are not trained as statisticians miss that point. So there’s a big role for statisticians. We should work with colleagues from other disciplines, but data science is a cross-disciplinary field where statisticians should play a leading role.
Alexander: I completely agree that there are lots of different people from diverse backgrounds going into this. I recently came across the profile of a person who called himself a data scientist, who obviously knew a lot about computing and who had, for me, a very surprising background as a patent lawyer.
I think that just speaks to the diversity of people who move into this hot topic. You mentioned the explosion of data overall, which probably contributes [00:09:00] to it being such a hot topic. Do you think there are further reasons why this is such a hot topic?
Yannis: I think there’s some level of hype and misunderstanding in the general public of what current methods can do. There is a certain level of magic, or magical perception, associated with words like artificial intelligence, for example, or big data. People are seeing some of the data science work come to life in their day-to-day lives.
Whenever people use Siri or Alexa, a lot of these methods are playing in the background, and that’s maybe what’s capturing the general [00:10:00] public’s imagination and excitement. And I think there is also a bit of concern. You probably see some people worried about what artificial intelligence, automation, and machine learning will do to their jobs and to the future of societies.
So I think there’s a mix of trepidation and excitement about what these things can bring, very much the same way that this has happened before with gene editing, cloning, and many other new technologies that capture people’s imagination without real understanding. It’s upon us as statisticians to explain to the general public what the limitations of these techniques are.
Alexander: What is the difference between data science and big data analytics? Is this the same? [00:11:00] I think very often these terms are used interchangeably.
Rajat: I think one of the applications of data science is to look at big data. You have these massive data sets. It could be human genome data or, in addition to that, social network data, and all of this can be combined. So you’re combining structured and unstructured data sets and looking for evidence. The big challenge for data science is to filter the noise from such a big volume of data and extract relevant information.
These are probably not competing areas. Rather, the problems of big data can be solved using data science techniques. That’s my way of looking at it. But it’s...
Alexander: Do you need a big data set to apply data science techniques?
Rajat: [00:12:00] If we look at biomarker discovery, the discovery process starts with small data sets. These problems typically have small n but big p: the number of parameters or factors is much bigger than the number of subjects or experimental units.
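The “small n, big p” setting described above can be illustrated with a short simulation. This is only a sketch and is not taken from the interview or from Cytel’s work: the function names, the sample sizes (20 subjects, 2,000 candidate features), and the use of pure Gaussian noise are all assumptions chosen for illustration. It suggests why filtering noise matters in discovery work: when features far outnumber subjects, some feature that is unrelated to the outcome will still look strongly correlated with it by chance.

```python
import math
import random


def pearson(x, y):
    """Sample Pearson correlation between two equal-length lists."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def max_abs_noise_correlation(n_subjects, n_features, seed=0):
    """Largest |correlation| between a noise outcome and noise features.

    Hypothetical helper: both the outcome and every feature are independent
    standard normal draws, so any correlation found here is spurious.
    """
    rng = random.Random(seed)
    outcome = [rng.gauss(0.0, 1.0) for _ in range(n_subjects)]
    best = 0.0
    for _ in range(n_features):
        feature = [rng.gauss(0.0, 1.0) for _ in range(n_subjects)]
        best = max(best, abs(pearson(feature, outcome)))
    return best


if __name__ == "__main__":
    # Same 20 subjects, increasingly many candidate "biomarkers".
    for p in (5, 50, 2000):
        print(p, round(max_abs_noise_correlation(20, p), 2))
```

Running the script prints the strongest spurious correlation for 5, 50, and 2,000 noise features. Because the same seed is reused, each larger run contains the smaller one, so the maximum can only grow as features are added. In practice this is one reason discovery findings are usually checked on independent data before being trusted.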
Benjamin: You mentioned biomarkers, so there’s an application for data scientists in the pharmaceutical and medical area. When you go through job ads, big companies like Google and Amazon are regularly looking for data scientists.
Who are the companies interested in data scientists, or in data science as you describe and offer it at Cytel?
Yannis: I think every business is examining what it does, trying to understand how it can best leverage data, and is interested in data scientists to do so. Many are excited and thinking, “Okay, I’ve got some data. If I just get a data scientist, we’ll do some amazing things.” But no, data science [00:13:00] addresses very specific problems. A lot of it is about prediction.
There are particular instances where it lends itself better. In life sciences and drug development, there is more design of experiments. There is more thought put into what questions we want to answer, and therefore what data we want to collect, and then how we analyze that data to answer the question properly. That’s happening more than ever with the whole discussion on estimands, where people are really trying to get back to the basics of why we are doing this.
What is the real question? What are we trying to estimate? That makes it different in terms of the data we handle. It tends to be more structured, particularly in clinical development. You could [00:14:00] argue that once you are on the market, you may collect commercial and safety data and get more unstructured data. But at least in clinical development, and in general, there’s a much more thoughtful approach to collecting data. Collecting data can be very expensive, so you don’t want to collect it thoughtlessly.
So that, I think, is where the type of data scientist we use in our industry would be different from those in the ads from Google, Amazon, et cetera. It also means statisticians have particularly important contributions to make in this area, since we are trained in designing experiments and making efficient use of data.
Alexander: Can you give a case study where data science had a profound impact?
Rajat: Historically, data science was more popular [00:15:00] in medicine than in business analytics. We can look back and think about brain imaging and biomedical signals being used for diagnostics, or patient-reported outcomes that were collected in real time.
There were people using them to quantify and then improve the quality of healthcare. So these things have been around for some time. But I think the biggest impact of data science in the field of medicine today is actually personalized and precision medicine.
That means biomarker discovery: biomarkers that could be used as diagnostic or predictive markers. Predictive biomarkers can predict whether a particular therapy would be useful for a particular patient, given their genetic, environmental, or other factors.
Alexander: If you have lots of data, do you use visualization [00:16:00] techniques within data science to handle it, analyze it, and draw meaningful conclusions?
Rajat: I think that’s the first step before using fancy statistical models and algorithms.
Machine learning is a buzzword, but there’s also a term called statistical learning. That comes from doing simple statistics, but it also includes visualization of data, especially when it comes to things like biomedical signals and images.
You have to rely on visualization because you’re talking about high-dimensional and genomics data. You have to look at features of this data, and that gives you insight into what kind of features you want to extract and incorporate into the statistical models that will be used in machine learning algorithms.
So visualization is a very important step. And once you have a prediction model or [00:17:00] a diagnostic algorithm, visualization is again quite important to showcase the results to the scientific community.
Alexander: So can I understand it this way: statistical learning would include descriptive statistics and visualization at the beginning, and could include machine learning at a later step? Yes, exactly.
Yannis: Depending on what you’re trying to predict, it may be interesting to understand what explains what is going on in the background. A lot of the techniques that you hear about, such as machine learning or deep learning, are often considered a black box, meaning you don’t quite know how the prediction is being made. Now, if you’re looking for a prediction of a recommended movie, say on Netflix, you may not need to know exactly how Netflix came up with the recommendation. You care [00:18:00] about whether the recommendation is something that looks interesting or not, and then you’ll decide whether to watch the movie. However, when you’re developing a drug, it is important to understand the mechanism of action, the biology, how all this is happening, and why certain patients are affected, not just that they are more likely to respond to this drug. The final users of these medicines, once approved and on the market, are going to want to understand which patients are responders and which ones are not.
So it’s not enough to use these methods in a vacuum. They need to be accompanied by some kind of understanding and meaning. And I think that’s where visualization, general inference, estimation, causal methods, and all the regular, standard tools that statisticians are familiar with really come into play.
We need a marriage of both disciplines to make things work in our field.
Alexander: You just brought up [00:19:00] another term, deep learning. So what differentiates deep learning from machine learning?
Rajat: Machine learning is basically the use of statistical algorithms to parse and learn from data.
This could be a small amount of data or a massive amount of data. The goal is then to apply what the machine has learned to make predictions and inform decisions. Deep learning is a technique within machine learning. In machine learning, the user basically has to go through the whole process of designing these algorithms, or putting them together in a controlled fashion, to make sure that the end goal, which is prediction, is useful and accurate. For example, you might start with huge amounts of data.
You’re looking at different covariates and [00:20:00] features. The question is which features are useful for discriminating whether a therapy is working for a particular kind of patient. That’s what goes on in machine learning.
Deep learning is an automatic tool for achieving all this. It mimics the human brain in terms of how the human brain learns, and it’s based on a layered structure of algorithms called artificial neural networks. So a lot of it is actually a black box, and it happens in an automatic way.
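As a rough illustration of the “layered structure” mentioned above, here is a minimal sketch of a two-layer artificial neural network. It is not taken from the interview: the function names and the weights are assumptions, and the weights are hand-picked rather than learned from data, which is the part that real deep learning automates. The example computes XOR, a pattern that no single linear layer can represent but that two stacked layers can.

```python
def relu(value):
    """Rectified linear unit: pass positive values through, clip negatives to 0."""
    return max(0.0, value)


def dense_layer(inputs, weights, biases, activation):
    """One fully connected layer: activation(weights . inputs + bias) per unit."""
    outputs = []
    for unit_weights, bias in zip(weights, biases):
        total = sum(w * x for w, x in zip(unit_weights, inputs)) + bias
        outputs.append(activation(total))
    return outputs


def tiny_network(x1, x2):
    """Two stacked layers with hand-picked (not learned) weights computing XOR."""
    # Hidden layer: two units that both look at x1 + x2, with different offsets.
    hidden = dense_layer(
        [x1, x2],
        weights=[[1.0, 1.0], [1.0, 1.0]],
        biases=[0.0, -1.0],
        activation=relu,
    )
    # Output layer: a linear combination of the hidden units.
    output = dense_layer(
        hidden,
        weights=[[1.0, -2.0]],
        biases=[0.0],
        activation=lambda value: value,
    )
    return output[0]


if __name__ == "__main__":
    for a, b in [(0, 0), (0, 1), (1, 0), (1, 1)]:
        print(a, b, tiny_network(float(a), float(b)))
```

In a trained network the weights and biases would be estimated from data by an optimization procedure, and there would typically be many more layers and units, which is where the black-box character described above comes from.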
Alexander: Okay, this sounds really interesting. I’m a statistician, and I would like to learn more. What would be good resources to dive deeper into these topics?
Rajat: There are a lot of universities. Stanford University, for example, comes to mind.
They offer online courses in data science, machine learning, and statistical learning. As a statistician, [00:21:00] my entry into doing this kind of work was actually Hastie and Tibshirani’s book, The Elements of Statistical Learning. For a statistician, that’s a good start.
Alexander: Yeah, that’s a very good book. I read that as well. Maybe I’m already a bit more of a data scientist. There’s the PSI conference coming up, for which Cytel is the main sponsor, and Cytel is planning activities in data science.
Can you tell us about that? What’s going to happen in Amsterdam?
Yannis: We’re excited to help organize these sessions. We’ve been brainstorming for a while about these, and there are so many topics that are relevant and directions we could go in.
So we’re organizing two sessions on the afternoon of June 5th, which will be the Tuesday. They will follow one of the keynote talks, by Stephen [00:22:00] Ruberg, actually on data science and big data itself. Our two sessions will focus on particular areas of application of data science.
I’ll let Rajat speak a little bit more about the topics that each of them will focus on.
Rajat: We are planning a data science session, which will be split up into two sessions. One session will talk about some case studies and the general problem of handling high-dimensional or big data.
Obviously, we’ll keep in mind that the audience will be a statistician audience, so we’ll select some case studies where statistics actually is the driving force in solving the problem. For the next session, we thought we would look at a data science subfield that is emerging and exciting and has potential application in medical research, which is [00:23:00] pharmacovigilance. You get information from any source about adverse events or side effects of drugs already on the market.
I think it’ll be a good mix of talking about some technical issues and also talking about data science topics in a general fashion.
Benjamin: Is there any type of statistician, or anything a statistician should bring into the talks, to better understand them?
Is there a general recommendation you can give regarding the audience you are targeting?
Yannis: We are trying to invite some speakers who will give a high-level understanding of the opportunities for statisticians to use their skills to improve data science in the development of medicines and medical devices.
We’re trying to mix that with some more technical, [00:24:00] applied topics on things like high-dimensional data, brain imaging, and other topics that would give a flavor of how this is really done in a practical sense. So there’ll be a bit of both types of talks, and I think in that way it should be accessible to most everyone interested in the topic.
Alexander: If people would like to learn more about Cytel and data science, what would be a good place to do that?
Yannis: We have a few ways to do that. One is through our website, where you have explanations of what we’re doing and some case studies, on our webpage as well as in our blog.
I would encourage you to follow our blog, which covers diverse topics. The other is to stop by Cytel at PSI and chat with us about what we do and how we can help.
Rajat: Our data scientists have been active [00:25:00] in not just doing applied work, but also writing about it. You can look at the Cytel website. There will be links to these blogs on particular topics in data science.
So that could be a very useful pre-read if you are planning to join our session.
Alexander: Great, thanks. We will put these links into the show notes. Thanks for being here today at The Effective Statistician.
Yannis: Yeah, appreciate it. Thanks.
Benjamin: Thanks, Yannis. Thanks, Rajat. Great talking to you.
Thanks for having us.
Alexander: See you at the PSI conference in Amsterdam. We thank PSI for sponsoring this show. Thanks for listening. Please visit theeffectivestatistician.com to find the show notes and learn more about our podcast to boost your career.
If you enjoyed the show, tell your colleagues.
This show was created with PSI, thanks to Reine and her team at VVS. Thank you for listening. Reach your potential, lead great science, serve patients. Be an [00:26:00] effective statistician.
Join The Effective Statistician LinkedIn group
This group was set up to help each other to become more effective statisticians. We’ll run challenges in this group, e.g. around writing abstracts for conferences or other projects. I’ll also post into this group further content.
I want to help the community of statisticians, data scientists, programmers and other quantitative scientists to be more influential, innovative, and effective. I believe that as a community we can help our research, our regulatory and payer systems, and ultimately physicians and patients take better decisions based on better evidence.
I work to achieve a future in which everyone can access the right evidence in the right format at the right time to make sound decisions.
When my kids are sick, I want to have good evidence to discuss with the physician about the different therapy choices.
When my mother is sick, I want her to be able to understand the evidence.
When I get sick, I want to find evidence that I can trust and that helps me to have meaningful discussions with my healthcare professionals.
I want to live in a world where the media reports correctly about medical evidence and in which society distinguishes between fake evidence and real evidence.
Let’s work together to achieve this.




