Is data science something for you? - The Effective Statistician

Dr. Alexander Schacht

Data science, big data, business analytics … these are all buzzwords in the industry at the moment and the hopes are high on what these areas will provide to our industry.

Cytel organized a long session about data science for the PSI Conference 2018 in Amsterdam (learn more and register here).

In this episode, we’ll speak about:

Why is data science such a hot topic at the moment?
How can we separate the different buzzwords?
Is this only something for Google, Amazon, and such companies?
What distinguishes a statistician from a data scientist?

What are the biggest areas of impact for data science?
What case studies are there, where data science had a profound impact?
Which problems might you face?
Which role do visualization approaches play in data science?
What’s the difference between “machine learning” and “deep learning”?
What would a statistician need to know to compete in this field?
Where are good resources for this (see also the Cytel blog here)?

Finally, we will dive into the sessions that Cytel chairs as the main sponsor of the PSI conference. You’ll learn whether these sessions are a good fit for you.

Yannis Jemiai has a pivotal role within Cytel, leading the company’s consulting and software business units, as well as the global marketing group. With Cytel Consulting he heads up an elite team of biostatisticians, skilled in applying the latest trial techniques and methods, to help our customers accelerate clinical development and mitigate portfolio risks.

Yannis also oversees the development of Cytel’s software product lines, including trial design packages East® and Compass®, and exact statistics applications StatXact® and LogXact®. Yannis guides global marketing efforts to raise awareness of and uncover new opportunities for the company’s growing range of clinical research services and specialized software.

His research has been published in numerous statistical journals. Dr. Jemiai earned his Ph.D. from Harvard University, an M.P.H. from Columbia University, and a B.A. in Molecular and Cellular Biology, also from Harvard.

Rajat Mukherjee

Rajat Mukherjee has 15 years of professional experience as an industry and academic statistician and brings a range of expert knowledge to Cytel’s customers. This includes work in pattern recognition problems for devices and biomarker discovery, Bayesian clinical trials, adaptive designs, and the design and analysis of complex epidemiological studies. His experience and expertise also include statistical computing, survival analysis, longitudinal analysis, nonparametric and semiparametric inference, as well as statistical classification and high-dimensional data. Rajat has a strong background and interest in the development and implementation of statistical methodology with application to real-life medical problems.

Transcript

Is data science something for you? Interview with Cytel statisticians Yannis Jemiai and Rajat Mukherjee

00:09
Welcome to the Effective Statistician with Alexander Schacht and Benjamin Piske, the weekly podcast for statisticians in the health sector, designed to improve your leadership skills, widen your business acumen and enhance your efficiency. Today’s episode, number 5: Is data science something for you? An interview with two Cytel statisticians, Yannis Jemiai and Rajat Mukherjee.

00:37
We will talk about data science and give you a little bit of insight on the upcoming PSI conference, what you can expect there from the Cytel session about data science. This podcast is sponsored by PSI, a global membership organization dedicated to leading and promoting best practice and industry initiatives for statisticians. Learn more about upcoming events at psiweb.org.

01:17
Welcome to this interview for the Effective Statistician with Cytel. We have Yannis and Rajat here with us today. Hello to you both. Hello. Good morning. I also have, of course, again, my co-host Benjamin Piske here. Excellent. Very good. So today’s topic is data science. It’s probably…

01:45
the hot topic of this year, especially for PSI. There are a couple of things going on. There was already a webinar about data science and there’s a big data science session at the PSI conference. I recently read an article saying that data scientist is, for the third year in a row, the sexiest job in the US. So,

02:14
obviously a really hot topic. So let’s dive into this. So, Yannis, Rajat, maybe you can start with a short introduction of yourselves. Where are you coming from? How did your career bring you to this space? And what are your special interests in the field of statistics? Sure. So I’ll begin. So I’m Yannis Jemiai.

02:44
I’ve been working at Cytel for 13 years and head up our consulting and software groups at Cytel. Cytel, as you might know, started out as a software company in the area of statistics and we’ve continued all this time to be very interested in hot topics in statistics and how we can provide tools and services to the statistical community

03:14
around those. In the last couple of years, I’ve gotten very interested in data science as it has become more and more relevant in the world and more and more relevant to our particular industry, life sciences and drug development. I got my statistical degree from Harvard. I was interested in causal inference, which

03:42
makes it particularly interesting because that’s a very different direction than data science. And it’s interesting to see how the two can be reconciled to assess causality as well as correlation in these types of problems. Hi, everybody. My name is Rajat Mukherjee. I’m a trained statistician. I got my doctoral degree in statistics from the University of

04:12
and mostly focusing and doing a lot of mathematical statistics, working with semi-parametric models, but with applications in survival analysis. That was my interest in biostatistics. And then I started working and teaching in public health. And that got me exposed to the sort of statistical or data related problems people look at.

04:41
in the healthcare industry in general. And then I started working as a consulting statistician in Cytel four years back, mostly dealing with innovative designs for clinical trials. But while doing this, it’s just basically a matter of chance that some interesting

05:10
you know, not so traditional statistical problems came to me. And they were mostly related to, you know, looking at, you know, factors such as environmental factors and genomic factors that were related to nutrition and diseases. So that’s how I first got a, you know,

05:38
my exposure to something that is not traditionally handled by statisticians. Then I got into some interesting projects dealing with biomedical images and biomedical signals to be used for diagnostic purposes. So that’s how I got into basically pure data science. And then I realized that…

06:07
you know, statisticians alone cannot deal with, you know, the massive amount of computing and data parsing that’s required to solve these issues and get useful solutions. So then, you know, at Cytel, like Yannis mentioned, we recently started, you know, doing pure data science, and now we have a

06:36
data science team at Cytel and currently I’m leading that data science team at Cytel. Okay, so you mentioned that the, or both of you mentioned that there was kind of a starting point, the statistics and you went into the area of data science. So what exactly is then distinguishing the statistician from a data scientist? So what is the extra?

07:03
I don’t know, experience or methodology or knowledge that you need to be a data scientist rather than a statistician. Well, if you Google this question, you will get lots of different viewpoints. And it’s quite a debate actually. What is statistics versus what is data science? My simplistic view is data science as practiced today.

07:31
is actually a component of statistics, which is basically engaged in predictions as the end goal, as opposed to statistics, which is also engaged in not just prediction, but also designing of experiments or clinical trials, estimation of parameters of interest and doing hypothesis testing.

07:56
The main difference in terms of technical expertise that you need for doing data science additionally to statistics is computing. So data science relies heavily on statistics and informatics and computing. That’s how I see data science.

08:17
I recently talked to a friend that is much more a programmer than a statistician. And he said to me, me being really not a very good programmer and much more a statistician, saying, well, maybe in today’s world, we are all data scientists. I said, maybe that’s true, but maybe data scientists is actually something more kind of a combination of us both.

08:47
I recently read a quote that said a data scientist is someone who knows more about programming and computing than a statistician and maybe less about statistics, but knows more about statistics than, let’s say, a pure programmer. Is that somewhat valid? I can relate to that, yes.

09:16
We’re living in an age where there’s an explosion of data and everybody wants to extract information from that data. And people are coming at this from various backgrounds and disciplines. You could come in as a mathematician, a computer scientist, a biologist, a bioinformatician. All of these disciplines are interested in somehow

09:45
extracting meaning and information from data and then acting upon it. Data science, I would say, has been a bit of a tug of war between all these disciplines trying to claim some ownership of this area. And what most of the other disciplines lack compared to statisticians is the, you know,

10:15
training in the framework and understanding of probability and uncertainty. And many disciplines come to data science, the discipline of extracting information from data and making predictions from it, in a very sort of deterministic way. And a lot of what people call machine learning or artificial intelligence is sort of trying to repeat patterns

10:45
and not necessarily accounting properly for such fundamentals that we learn in statistics about proper sampling, design of experiments, generalizability, causality. I think a lot of people who are not trained as statisticians miss that greater point.

11:12
And so I think there’s really a big role for statisticians in this area, although we would work with many of our colleagues and we should work with many of our colleagues from other disciplines. But I’d say data science is almost a cross-functional, cross-disciplinary area, but an area where statisticians should play a leading role. Yeah, I completely agree that there are lots of different…

11:39
people from very, very diverse backgrounds going into this. I recently came across the profile of a person who called himself a data scientist, who obviously knew a lot about computing and actually had, for me, a very surprising background as a patent lawyer.

12:05
I think that just speaks to this diversity of people that move into this kind of very, very hot topic. So you mentioned the explosion of data overall that probably contributes to being such a hot topic. Do you think there’s further kind of contributions? Why is this such a hot topic?

12:34
I think there’s some level of hype and misunderstanding in the general public of what current methods can do to explain things. There’s a certain level of magic or magical perception that is associated with words like artificial intelligence, for example, or big data.

13:03
people are seeing in their day-to-day life, some of the data science work come to life. So whenever people use Siri or Alexa, a lot of these methods are playing in the background and that’s maybe what’s capturing the general public

13:32
imagination and excitement. And I think also a bit of concern. You probably see some people worried about what artificial intelligence, automation, machine learning will do for their jobs and the future of our society. So I think there’s a mix of trepidation, excitement about what these things can bring.

14:02
Very much the same way that this has happened before with gene editing, cloning, and many other new technologies that capture people’s imagination without real understanding of what can be done and what cannot be done.

14:24
And it’s upon us as statisticians to explain to the general public what the limitations of these techniques are. So what is the difference here between data science and big data analytics? Is this the same? Because I think very often these terms are used interchangeably.

14:53
Well, I think one of the applications of data science is basically its application to big data. So basically, you have these massive data sets. It could be

15:22
a social network data set, or basically all this can be combined. So basically you’re combining structured and unstructured data sets and looking for evidence from wherever it’s possible. So basically the big challenge for the data scientists or for data science techniques to be used in a proper way is to filter

15:50
the massive amount of noise that you’re getting from such a big volume of data and to extract the relevant information. So, these are probably not competing areas, but the problems of big data can be solved using data science techniques. That’s my way of looking at it.
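
The point about filtering noise can be made concrete with a small simulation. The following is a minimal sketch, not something discussed in the interview: the function names, sample sizes, and seed are illustrative assumptions. It generates an outcome and a set of candidate features that are all pure noise, then reports the strongest correlation found by chance. With few subjects and many candidate features, at least one noise feature usually looks strongly related to the outcome.

```python
import math
import random


def pearson(x, y):
    """Pearson correlation between two equal-length lists of numbers."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


def max_spurious_correlation(n_subjects, n_features, seed=0):
    """Largest absolute correlation between a noise outcome and noise features.

    Outcome and features are independent standard normal draws, so any
    correlation found here is spurious by construction.
    """
    rng = random.Random(seed)  # hypothetical seed, fixed for repeatability
    outcome = [rng.gauss(0, 1) for _ in range(n_subjects)]
    best = 0.0
    for _ in range(n_features):
        feature = [rng.gauss(0, 1) for _ in range(n_subjects)]
        best = max(best, abs(pearson(feature, outcome)))
    return best


if __name__ == "__main__":
    few = max_spurious_correlation(n_subjects=10, n_features=5)
    many = max_spurious_correlation(n_subjects=10, n_features=1000)
    print(f"strongest chance correlation, 5 features:    {few:.2f}")
    print(f"strongest chance correlation, 1000 features: {many:.2f}")
```

Because every feature here is noise, a high value in the second line of output is a warning rather than a finding: screening many variables on few subjects calls for validation on independent data or some form of multiplicity control.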

16:18
But you don’t necessarily need to have a big data set to actually apply data science techniques. Yeah. No. So, you know, we look at something called biomarker discovery. And, you know, the whole discovery process starts with really small data sets, actually. Well, you’re looking at, you know,

16:45
problems typically where you have small n but big p. So the number of parameters or factors that you’re looking at is much bigger than the number of subjects or experimental units. You mentioned biomarkers, so there’s an application for the data scientist in the, let’s say, pharmaceutical and medical area.

17:15
In job ads, for example, you see that big companies like Google, Amazon, and others are regularly looking for data scientists. But who are the companies that are interested in data scientists or in data science as you describe it and as you offer it at Cytel? I think everybody at this point, every business, is sort of examining

17:43
what it does and trying to understand how it can best leverage data. And it is interested in data scientists to do so, as I said, because there’s a sort of popular imagination around what data scientists can do. Many people are just getting excited and thinking, okay, well, I’ve got some data. If I just get a data scientist, we’ll do some amazing things. But not…

18:11
You know, data science addresses very specific problems since as we’ve mentioned, a lot of it is about prediction. There are particular instances where it lends itself better to the problems and questions you’re trying to ask. And I think what’s important in our domain in the life sciences and drug development is that we…

18:40
there is a lot more design of experiment. There is more thought put into what are the questions we want to answer, and therefore what are the data that we want to collect, and then how we get to analyze that data to get to answer the question properly. And that ties, you know, that’s happening more than ever with the whole discussion.

19:09
on estimands, where people are coming back and really trying to get back to the basics of why are we doing this? What is the real question? What are we trying to estimate? And that I think makes it a little bit different in terms of the data that we handle, which tends to be more structured. But particularly in clinical development and

19:38
You could argue that once you’re on the market, you may collect commercial data and safety data and get more unstructured data, but at least in clinical development, and in general, there’s a much more thoughtful and considered approach to collecting data. And collecting data can be very expensive, so you don’t want to collect it thoughtlessly.

20:08
That, I think, is where the type of data scientists that we use in our industry would be somewhat different from all the ads you see with Google and Amazon, et cetera. And it also means that statisticians have particularly important contributions to make in this area since…

20:38
we are trained in designing experiments and making efficient use of data. Can you give some kind of case study where data science had a profound impact? Yeah, I think historically data science was actually more popular in medicine than business analytics.

21:07
you know, brain imaging and biomedical signals being used for diagnostics. You know, patient reported outcomes that were collected sort of in real time. They were sort of, you know, people using them to quantify the quality and to then improve the quality of healthcare. So these things were always around for, have been around for some time.

21:37
But I think the biggest impact of today, you know, of data science in the field of medicine is actually, you know, personalized and precision medicine. And also, you know, biomarker discovery, biomarkers that could be used as diagnostics or as predictive. So the biomarkers, predictive biomarkers can predict whether particular therapy would be

22:04
useful for a particular patient, given their genetic or environmental or other factors. If you have these kind of lots of data, do you also use visualization techniques within data science to handle that and to analyze it and to make meaningful conclusions out of it?

22:32
I think that’s actually the first step before you start using fancy statistical models and algorithms. So I guess machine learning is a buzzword, but there’s also a term called statistical learning. And that comes from…

22:56
You know, doing simple statistics, but also that includes visualization of data, especially when it comes to, you know, things like biomedical signals and images, you have to rely on visualization because you’re talking about high dimensional and even genomics data, you’re talking about high dimensional data and you have to look at, you know, features.

23:22
of this data and that gives you insight into what kind of features you want to extract to be incorporated into the statistical models that will be used in these machine learning algorithms. So I think visualization is a very important step, not to mention once you have a certain prediction model or a diagnostic algorithm

23:52
to showcase the results to the scientific community. I think visualization, again, is quite important. So can I understand that kind of statistical learning would kind of include descriptive statistics and visualization at the beginning and could include something like machine learning more at a later step? Yes, exactly. To answer that, I think,

24:22
depending on what you’re trying to predict, it may or may not be interesting to understand what explains what is going on in the background. So a lot of the techniques that you hear about, such as machine learning or deep learning, are often considered black boxes.

24:51
Meaning you don’t quite know how the prediction is being made. Now, if you’re looking for a prediction of a recommended movie, say on Netflix, then you may not need to know exactly how Netflix came up with a recommendation. You just care about what the recommendation is, whether it looks interesting or not, and then you’ll decide and watch the movie.

25:20
However, when you’re developing a drug, it is important to understand the mechanism of action of the drug, the biology, how all this is happening, why certain patients are affected, not just that they’re more likely to be responders to this drug. That may not be good enough. The final users…

25:45
of these medicines once they’re approved on the market are going to want to understand which patients are responders, which ones are not. And so it’s not enough to use these methods in a vacuum; that needs to be accompanied by some kind of understanding and meaning. And I think that’s where visualization, general inference, estimation methods and causal methods and all the sort of

26:12
regular tools, standard tools that statisticians are familiar with really come into play. We really need a marriage of both disciplines to make things work in our field. You just brought up another term, deep learning. So how does deep learning differ from machine learning?

26:36
Machine learning is just basically use of statistical algorithms to basically parse and learn from data. And this could be a small amount of data or a massive amount of data. And then the goal is to apply what the machine has learned through these algorithms to then apply and make predictions and other informed decisions.

27:04
Now deep learning is basically just a useful technique in machine learning.

27:13
In machine learning, basically the user has to go through the whole process of designing these algorithms or putting together these algorithms in a controlled fashion to make sure that the end goal, which is prediction, is useful and accurate. And for example, it means you might start with

27:43
huge amounts of data. Basically, you’re looking at a lot of different covariates and features. The question is which features are useful to, let’s say, make a discrimination between whether a particular therapy is working for a particular kind of patient. So that’s what goes on in machine

28:13
learning. Deep learning is an automatic tool for achieving all this. And it sort of mimics the human brain in terms of how the human brain learns. And it’s based on a layered structure of algorithms called artificial neural networks. So a lot of it is actually a black box, and it happens sort of in an automatic way. OK. So this all sounds really interesting.

28:43
If I’m a statistician and I would like to learn more about what you’ve said, what would be good resources that you would recommend to dive deeper into these topics? Well, there are a lot of universities, for example, Stanford University comes to my mind, that offer online courses in data science and machine learning and statistical learning and

29:13
As a statistician, my entry to doing this kind of work was actually Hastie and Tibshirani’s book, The Elements of Statistical Learning. I think for a statistician, that’s a very good start. Yes, that’s a very good book. I read that as well. Maybe I’m already, by reading this book, a little bit more of a data scientist as well.

29:41
There’s actually the PSI conference coming up, of which Cytel is the main sponsor, and where Cytel is also planning some activities in terms of data science. Can you tell us a little bit about that? What’s going to happen in Amsterdam regarding this? Absolutely. We’re very excited to help organize these sessions. We’ve been

30:11
brainstorming for a while about these. And there are so many topics that are relevant and directions we could go in. So we’re organizing two sessions on the afternoon of June 10th, which will be the Tuesday, which will follow one of the keynote talks, by Steven Ruberg, actually on data science and big data

30:41
itself. And our two sessions will focus on particular areas of application of data science. I’ll let Rajat speak a little bit more to each one of them, the topics that they will focus on. So I think currently we are planning on having a data science session which will be sort of split

31:09
just talking about some case studies and the general problem in handling of high dimensional or big data. So that could be, and obviously we’ll keep in mind that the audience will be a statistician audience. So we’ll select some case studies where statistics actually…

31:39
is the driving force in solving the problems. And for the next session, we thought, you know, we would look at a data science subfield, which is sort of emerging and exciting and has, you know, potential application in medical research, which is the field of pharmacovigilance. So basically, you know,

32:09
You get information from any source you can imagine in terms of looking at, you know, sort of adverse events or side effects of drugs that are already in the market. So such kind of things. So I think it will be a good mix of sort of looking at and talking about some technical issues,

32:39
you know, data science topics in a sort of general fashion. Okay, is there then any type of statistician or anything that the statistician should bring into the sessions or into the talks to better understand them? Is there a general recommendation you can give regarding the audience that you are targeting your talks to?

33:07
I think we’re trying to invite some speakers who will give high level understanding of the opportunities for statisticians to use their unique backgrounds and skills and capabilities to further promote and improve the way data science is done in…

33:34
the development of medicines and medical devices. We’re also trying to mix that with some more technical, very applied types of topics on

33:51
things like high-dimensional data, brain imaging, and other topics that would really give a flavor of how this is done really in a practical sense. So there’ll be a bit of both types of talks, and I think it should in that way be accessible to most everyone that would be interested in the topic. If people would like to learn more about Cytel and data science, what would be a good…

34:18
place to learn more about that. So we have, there’s a few ways to do that. One is through our website, where we have set up some explanations of what we’re doing. And there’s some case studies on our web page, as well as in our blog. And we encourage you to follow our blog, which has a lot of diverse

34:46
topics and is quite interesting. The other is, I’d say, to stop by the Cytel booth at PSI and just chat and have conversations with us about what we do and how we can help. Rajat, any other thoughts? Well, you know, our data scientists have been quite active and not just

35:12
you know, doing applied work, but also writing about it. So you can also look at all the blogs. So if you go to the Cytel website, there will be links to these blogs on particular topics in data science. So that could be a very useful pre-read if one is planning to come join our session. Great.

35:38
Thanks a lot. We will put these links also into the show notes. Thanks a lot for being here today at the Effective Statistician. Thank you. Yeah, we appreciate it. Thanks, Yannis. Thanks, Rajat. It was great talking to you. Thanks for having us. Definitely. See you all at the PSI conference in Amsterdam. Absolutely. See you. Looking forward.

36:02
We thank PSI for sponsoring this show. Thanks for listening. Please visit theeffectivestatistician.com to find the show notes and learn more about our podcast to boost your career as a statistician in the health sector. If you enjoyed the show, please tell your colleagues about it.

Join The Effective Statistician LinkedIn group

This group was set up to help each other to become more effective statisticians. We’ll run challenges in this group, e.g. around writing abstracts for conferences or other projects. I’ll also post into this group further content.

I want to help the community of statisticians, data scientists, programmers and other quantitative scientists to be more influential, innovative, and effective. I believe that as a community we can help our research, our regulatory and payer systems, and ultimately physicians and patients make better decisions based on better evidence.

I work to achieve a future in which everyone can access the right evidence in the right format at the right time to make sound decisions.

When my kids are sick, I want to have good evidence to discuss with the physician about the different therapy choices.

When my mother is sick, I want her to see the evidence and be able to understand it.

When I get sick, I want to find evidence that I can trust and that helps me to have meaningful discussions with my healthcare professionals.

I want to live in a world where the media reports correctly about medical evidence and in which society distinguishes between fake evidence and real evidence.

Let’s work together to achieve this.