Do you want to learn where the health sector might be heading in terms of data science?

Do you understand the opportunities and barriers in terms of the application of data science?

Are you prepared to learn the individual skills needed for these changes?

Then this interview with Ursula from Cytel provides you with the relevant answers. We are talking about:

  • What are big data, data science, machine learning and AI?
  • How was the survey run by Cytel?
  • What data sources will play an important role?
  • What do we hope to get out of applying data science given that more and more companies have dedicated resources in this area?
  • What are the barriers to achieving these goals? How can we overcome them?
  • Will RCTs become obsolete in the future?
  • What can statisticians do to prepare themselves for the future?

Ursula Garczarek

Ursula Garczarek, Ph.D. is an Associate Director of Strategic Consulting at Cytel. She has extensive experience in providing statistical support for clinical and non-clinical aspects of product development within both pharmaceutical and consumer companies. 

As a member of Cytel’s Strategic Consulting team, Ursula provides guidance to trial sponsors on optimizing their development strategy, and successfully implementing trial design innovations. She applies new and pragmatic methodologies to address the needs and requirements of the sponsor within the regulatory environment.

Prior to Cytel, Ursula was the Program Leader Data Science at Unilever R&D (NL), and Biostatistician at Roche Diagnostics GmbH (DE), developing multi-marker diagnostics based on proteomics and metabonomics approaches. She received her Ph.D. from the University of Dortmund (DE), in the context of a project on Machine Learning and Statistics within a collaborative research center on complexity reduction in high-dimensional data spaces.



Current trends in data science for pharma – findings from a qualitative survey and what we should do about it!

You are listening to episode number 48 of the Effective Statistician Podcast: Current trends in data science for pharma, findings from a qualitative survey and what we as statisticians should do about it. Welcome to the Effective Statistician Podcast with Alexander Schacht and Benjamin Pieske, the weekly podcast for statisticians in the health sector designed to improve your leadership skills, widen your business acumen and enhance your efficiency.

The PSI conference is coming up. It’s from the 2nd to the 5th of June in London and it’ll be awesome.

I have been to a couple of the PSI conferences, and not just because PSI is sponsoring this podcast, I’m a really, really big fan of it. Since I went there the first time, I’ve been there every year. The early bird rate is available until March 20th, so sign up fast to save some money.

I will present there as well. So just sign up, come to London and let’s meet there. We can have a chat about whatever you like, and it will be pretty interesting from the content perspective. For me, it’s the best conference I can think of for statisticians working in pharma or for a CRO. Register now, don’t waste your time, and ask your supervisor if you need to, but do it really, really fast. By the way, if you need reasons for your supervisor, just check out my LinkedIn page and you’ll find an article with, I think, 10 reasons why your supervisor should approve going to the conference. So check that out, ask your supervisor to approve it and then just come to London.

I’m really sure it will be a good investment of your time and you will enjoy it as well. There are lots of fun activities around it, so it’s not just learning, it’s also a lot of fun, I can tell you.

So today in this interview, we talk about the survey. You can learn where the health sector might be heading, and you can understand the opportunities and barriers that come with that. You will also learn what kind of individual skills you need to master these changes.

This podcast is created in association with PSI, a global member organization dedicated to leading and promoting best practice and industry initiatives.

Welcome to another episode of the Effective Statistician. And this time, again, we are speaking about data science. Like, about the same time last year, we talked about data science. And like last year, we have yet another guest from Cytel. Hi, Ursula. How are you doing? I’m fine. How are you doing, Alexander? And of course, as usual, I have my co-host with me. Hi, Benjamin. Hello, Alexander. Hello, Ursula.

Okay, so let’s get going. Ursula, as I already said, you’re coming from Cytel, but maybe you can tell us a little bit more about how your career has brought you to Cytel and into this area of data science.

Well, I started studying statistics in Dortmund, and then I actually did my PhD already on the interface between statistics and machine learning. That was in a collaborative research center, and the topic was complexity reduction in high-dimensional data spaces. So basically from the moment of my PhD on, I was between the two groups, between the machine learners and the statisticians.

On the machine learning side, at that time it was not called data science; it was called knowledge discovery in databases (KDD) and data mining. It was in the early 90s, late 90s. At that time already, there was the claim that statistics will die and KDD will take over.

At the same time, there are a lot of things that I really appreciate very much in the machine learning approaches. And when talking to statisticians being very nosy about machine learners, I always felt like I had to defend the machine learners, because they do some things really better. Anyway, so I started out between the two groups and it just continued throughout my whole career.

I started doing pattern recognition for metabonomics with Roche Diagnostics, and multi-marker biomarker search. That’s really what is now a very fancy area for data science. And then I went to Unilever, and there I was actually even going under the name of data science, though I wasn’t doing much more than statistics.

So a lot of experimental design, clinical studies as well, but really hardcore technical experimental design, which I think is a core expertise of statisticians and something they can bring to the table. But I was also doing all sorts of consumer science and sensory science, and for all the things that we were doing at Unilever, the term data science makes a lot of sense to me.

And for about a year now I’ve been with Cytel, doing real pharma statistics and planning of clinical studies. But of course I always have my second hat on when I see data, which is a data science hat.

Okay, very good. You mentioned a few terms already, and I would like to get a little bit more detail on what you mean, because we are mainly focusing on an audience from statistics. So it might be interesting to understand your definition, or how you would describe big data, data science, which we talked about a little bit, and machine learning, especially in comparison to statistics. You said you’ve been between machine learning and statistics.

Why are you in between? Is there no overlap? Yes, there is. How would you describe this?

I’m quite opinionated about where they differ and where they overlap. From its historical origin, machine learning is a branch of artificial intelligence. In the 50s and 60s they said: well, artificial intelligence, we are not there yet. We don’t have anything that comes even close. But maybe we can start by letting machines learn like a human would learn.

There’s already one very nice differentiation between statisticians and machine learners, in that machine learners approach the human way of learning and thinking as something admirable, something that would be really great if you could get the machine to copy it. Whereas the statisticians have their history in the scientific method, saying: oh, human reasoning is ever so false and faulty, we go down so many bad routes. So the machine learners come and say, humans are great and we just make machines even greater. And the statisticians say, humans are so bad at thinking, they can’t do it. And this is a constant theme with the statisticians versus the machine learners.

You are always the bad guys with the bad mood, the skeptics, the negative ones. That’s the perception of the statisticians in many things, and I think it’s even grounded in this: what is the thinking behind what you want to learn? Statisticians are trained, and they start, with the idea that they want to learn about a natural process, about something that is happening in the real world. So they think of their data as a sample, a random representative sample from the real world, which they then want to generalize to. One of the basic assumptions in statistics is that we have an IID sample from some unknown distribution. Well, we always start simple: in the beginning we always have an IID sample from a normal distribution, and then we…

A little bit more.

Or, even simpler, you have independent replications from a binary distribution.

Oh, yes.

We start with this idea about our data, and many of the methods that we generate come originally from this simplified environment, and then we make them more complex. The machine learners have the same on their side, but with a very different starting point, and that is the closed world assumption. They start with thinking that the data we have is a complete description of the problem and that it is without noise. They have a complete, correct description of the thing that they want to learn. That’s their prior assumption.

So a very deterministic approach, so to say.

Well, it’s the data. The data is the world. That is all you know.

OK.

So they start from this position, and then they make their machine learning methods better over time, adapting to the real situation and reflecting it in a better way. But when you have these two different starting points, one of the main differences really is that a statistician is trained to think of a representative sample of something in the world, whereas for machine learners, the data is the data is the data.

When you see where machine learning approaches are extremely good and where statistical approaches are extremely good, it really comes down to these starting assumptions: whenever the one or the other is more correct, you will use methods from the one or the other area. And if both are very good, then you have a big overlap.

Yes, or when both are actually not really good. I’m sorry, but in many cases our assumption of a representative IID sample has nothing to do with the data, and we still use our statistical approaches. And the data is never the closed world and is often absolutely noisy. This type of noisy, dirty data is treated equally well with both approaches.

Can you give an example of such a case where basically both assumptions are pretty bad?

Well, I would say electronic health records are an excellent example of being neither the closed world nor any representative IID sample.

Yeah, yeah. They fit really.

They have a lot of artifacts. They’re complicated. It’s nothing random. But they are also anything but the closed world for what you really want to describe, because there’s a lot which is not in that data which would be needed to understand the closed world.

Okay, makes sense. For such a case we would usually say we just used the wrong method. Find the right method.

Yeah, we’ll come back to non-parametrics at a different point in time. Okay, just a little bit of a joke from two statisticians from Göttingen.

Yeah, I think non-parametrics doesn’t cope with confounding and bias and everything. Non-parametrics helps you with a lot of deficiencies in the distributional assumptions, but not with these bias and confounding issues.

Well, I did my PhD in exactly these kinds of things, so you just need to have the right methods there as well. You can have non-parametric analysis of covariance as well, and these kinds of things. So there are some nice developments beyond the, let’s say, basic Wilcoxon-Mann-Whitney test.

Which is a fantastic method.

Yep. It also assumes IID, but it doesn’t assume a Gaussian distribution.
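As an editorial illustration of the test mentioned here, the following is a minimal Python sketch of a two-sided Wilcoxon-Mann-Whitney test using the large-sample normal approximation. The function name `mann_whitney_u` and the example numbers are assumptions made for this sketch, not something from the interview, and the variance formula ignores any correction for ties.

```python
import math
from statistics import NormalDist


def mann_whitney_u(x, y):
    """Return (U statistic for x, two-sided p-value) via the normal approximation.

    Assumes independent observations within and between the two samples.
    No tie correction is applied to the variance (a simplification).
    """
    # Pool both samples, remembering which group each value came from.
    pooled = sorted((value, group) for group, sample in ((0, x), (1, y)) for value in sample)

    # Assign 1-based ranks, giving tied values their average rank (midrank).
    ranks = [0.0] * len(pooled)
    i = 0
    while i < len(pooled):
        j = i
        while j + 1 < len(pooled) and pooled[j + 1][0] == pooled[i][0]:
            j += 1
        midrank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[k] = midrank
        i = j + 1

    n1, n2 = len(x), len(y)
    rank_sum_x = sum(r for r, (_, group) in zip(ranks, pooled) if group == 0)
    u_x = rank_sum_x - n1 * (n1 + 1) / 2

    # Under the null hypothesis, U has mean n1*n2/2 and this standard deviation.
    mean_u = n1 * n2 / 2
    sd_u = math.sqrt(n1 * n2 * (n1 + n2 + 1) / 12)
    z = (u_x - mean_u) / sd_u
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    return u_x, p_value


# Illustrative, made-up data with one extreme value in each group.
group_a = [1.1, 2.3, 0.7, 15.0, 3.2]
group_b = [4.5, 6.1, 5.2, 40.0, 7.7]
u, p = mann_whitney_u(group_a, group_b)
print(f"U = {u}, two-sided p = {p:.3f}")
```

Only the ranks enter the calculation, which is why no Gaussian assumption is needed. The independence assumption is still there, so the test does nothing about bias or confounding in how the data were collected.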

Okay, so thanks for the very nice intro. Our main topic for today is a survey that Cytel ran on trends in data science. Let’s speak a little bit about this survey. How was it actually run, from a methodological point of view? We talked about bias a little bit just now, so can you talk a little bit about how you did the survey?

Yeah, we did the survey at two conferences where Cytel also had what is called a booth, a place where we do marketing for our company and where people can come and talk about what Cytel is doing and what we can offer to them. One conference was mainly statisticians; that was the PSI conference last year in Amsterdam. The other was PhUSE in the US, which has more of the programmers. So by that you already see that you have two different cohorts with two different cultural backgrounds. And then it has to be people who come by a Cytel booth and do not escape when somebody asks them to answer questions in a survey. I think we even gave them a little elephant when they did it, so we have a selection bias towards people who are attracted to a little soft elephant.

It’s not a representative sample from anything, but it is people that were willing to share their thoughts on data science from programmers and statisticians predominantly. And we had 144 people that responded, which is, I think, quite a nice number in a specialized area and getting people to talk about a topic in their area which is relevant to them.

What did the survey aim for then? To get an opinion on what exactly? What were the questions?

The starting question was really: do you have a definition of data science? If so, give it to us. And this is one of the major problems. I think only a quarter or a third had anything like a definition of data science. Most people don’t have a definition of what it really is, and everyone uses the term slightly differently, which makes discussions around data science sometimes very funny, because people talk about completely different things.

Talking about definitions: as a statistician, I’m using the definition from Donoho, who in 2017 published a very nice article called “50 Years of Data Science”. It relates data science simply to exploratory statistics as introduced by Tukey.

Okay.

So people don’t really have a definition of data science, but the following questions were about their organizations: do you have a department dedicated to data science? And really, most of them worked for employers that already have a department for data science, whatever it is called. It could also be predictive analytics or any of these words. Within pharma and within academia, where many people came from, it was something like 70 to 80%. And within the CRO world, it was more than 50% where there is a department for data science.

Which I found really, really surprising. I wouldn’t have thought that. I thought that maybe certain bigger companies have that, but that it is already so widespread across all the different sectors is quite astonishing to me. What I also think is quite notable is the difference between pharma and CROs. Yes.

Do you have any kind of hypothesis of why that is? Any personal hypothesis? Yes, my hypothesis really is that the pharma companies and academic institutions have access to more diverse data sources, whereas the clinical CRO has clinical data. That’s it. Okay. Yeah.

Speaking about data sources, what data sources are actually playing an important role?

Well, within this whole pharma drug development area, I think the hot data sources are registries and electronic health records. The others are, of course, all the historical databases from historical clinical trials, the preclinical data, the cohort studies and the epidemiological studies.

So that also means observational studies and things like that.

Yeah, it’s the real world data that can lead to real world evidence. But I would always include the preclinical data as well as being a major source of information. And yeah, I think those would be my big buckets.

You mentioned already that there is a surprisingly high number of departments with data scientists out there. What do we then actually hope to get out of applying data science in this area where we all come from: the CRO business, pharma, academia?

When I look into what people said in the survey, one of the biggest hopes is better planning of clinical trials. For statisticians: if you ever had to come up with an expected sample size, a relevant effect size that you are targeting and a standard deviation, and you don’t have any good data, then you’re very sad. So for a statistician, I think it’s very clear that having historical databases is an extremely good thing for planning trials better. But what I hear more often is that one of the big challenges in clinical trials is planning the logistics: having the right target population, where you have enough sites that can recruit enough people.

So it’s both about sample sizing as well as having the right site selection, estimating the speed of recruitment, these kinds of things.

Yes, a lot. At the conference where I am just now, it is really all about patient recruitment, and about having registry data and information from electronic health records that help you to see where the patients are that you would want to enroll in your trial, and which inclusion and exclusion criteria you can use and still have a good population. Do these patients even exist? Or are you so stringent with your exclusion criteria that you have carved them all out?

Yes, I think we have all seen trials where we looked into the inclusion and exclusion criteria and thought, hmm, that’ll be difficult.
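To make the sample-size point from this exchange concrete, here is a small editorial sketch in Python. It uses the textbook normal-approximation formula for a two-arm parallel trial with a continuous endpoint, n per arm = 2 * ((z_(1-alpha/2) + z_(power)) * sd / delta)^2. The function name `n_per_arm` and all numbers are illustrative assumptions, not figures from the survey or from Cytel.

```python
import math
from statistics import NormalDist


def n_per_arm(delta, sd, alpha=0.05, power=0.80):
    """Approximate subjects per arm for a two-sided, two-sample comparison of means.

    delta: targeted difference in means (the relevant effect size)
    sd:    assumed common standard deviation (ideally taken from historical data)
    Uses the normal approximation, so it slightly underestimates a t-test based n.
    """
    standard_normal = NormalDist()
    z_alpha = standard_normal.inv_cdf(1 - alpha / 2)
    z_power = standard_normal.inv_cdf(power)
    return math.ceil(2 * ((z_alpha + z_power) * sd / delta) ** 2)


# The same targeted effect under two different guesses for the standard deviation.
for assumed_sd in (10, 12):
    print(f"sd = {assumed_sd}: {n_per_arm(delta=5, sd=assumed_sd)} subjects per arm")
```

The required sample size grows with the square of the assumed standard deviation, so a modest error in that guess changes the trial size noticeably. That is one way to see why good historical data matters for planning.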

How would data science help in finding out these points that you just mentioned?

Well, first of all, when you have data sources like registry data and electronic health record data, don’t underestimate how much work you have to do to get all those data sources together. That’s already data science: how to bring together data of very different types and formats, sitting in various places and changing all the time, and to have a repeatable process for bringing that data to your table. And I think this is one of the major differences between statisticians and machine learners, where machine learners are just so much better.

So it’s not so much about statistical models and feature selection and these kinds of things. It’s really about having a deep understanding of the data itself, getting unstructured data into structured data, these kinds of things.

Yes, it is data processing, it is data accessing, it is data transformation, it’s data representation.
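As an editorial illustration of what such a repeatable process could look like, the Python sketch below maps two differently formatted sources onto one common record layout and rebuilds the combined table from scratch on every call. Everything in it is hypothetical: the field names, the two tiny inline sources (`REGISTRY_CSV`, `EHR_JSON`) and the function names are assumptions made for this sketch and do not describe any real registry or health-record system.

```python
import csv
import io
import json

# Hypothetical registry export (CSV) and hypothetical EHR export (JSON).
REGISTRY_CSV = """patient_id,birth_year,dx
R1,1950,diabetes
R2,1984,asthma
"""
EHR_JSON = '[{"id": "E7", "dob": "1971-03-02", "diagnosis": "Diabetes "}]'


def from_registry(text):
    """Map registry rows onto the common record layout."""
    for row in csv.DictReader(io.StringIO(text)):
        yield {
            "source": "registry",
            "id": row["patient_id"],
            "birth_year": int(row["birth_year"]),
            "diagnosis": row["dx"].strip().lower(),
        }


def from_ehr(text):
    """Map EHR records onto the same layout, cleaning up formats on the way."""
    for record in json.loads(text):
        yield {
            "source": "ehr",
            "id": record["id"],
            "birth_year": int(record["dob"][:4]),  # keep only the year of the date
            "diagnosis": record["diagnosis"].strip().lower(),
        }


def refresh(registry_text, ehr_text):
    """Rebuild the combined table from the raw sources; safe to re-run on every update."""
    return list(from_registry(registry_text)) + list(from_ehr(ehr_text))


table = refresh(REGISTRY_CSV, EHR_JSON)
diabetics = [record for record in table if record["diagnosis"] == "diabetes"]
print(len(table), "records,", len(diabetics), "with diabetes")
```

Because `refresh` never edits earlier results in place, the same code can be run again whenever a source changes, which is the property that makes daily updates or a dashboard feasible.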

And then there is the mindset of: I’m not doing this process once, but I set it up so that I can query my databases every day for updates.

OK, yeah. So that you can basically create things like dashboards or stuff like that. OK. Really nice goals, but I’m pretty sure it’s not easy to achieve them. So what did the participants of the survey say about barriers to achieving these goals?

Those I have prepared for this talk, because I didn’t remember all of them. One of the things that I was most surprised by is that people from the programming and the statistics side named skill gaps. And I thought, well, if you bring programmers and statisticians together, they shouldn’t feel that they cannot do it. Programmers can do data merges, they can bring data sources together and stuff like that. Unfortunately, we had no follow-up question on where people feel that they have their skill gaps.

I would very much want people to say where they see their own gaps, because I actually believe that statisticians and programmers together don’t have such huge skill gaps. Could it be that it’s maybe more the fear of doing something that is a little bit more unpredictable than working on clinical trial data? Yeah, and actually, they probably have no experience in this.

To me, this basically reflects back what you said, Ursula, about the definition of data science. If they don’t even know how to define what they’re talking about, they can’t really say what knowledge they are lacking. I would have problems saying what I would need to learn or do in order to become a data scientist or to understand the data scientist’s view, because I don’t have the experience.

So it’s actually a very modest attitude: if I don’t understand what data science is, I shouldn’t believe that I have any skills that would lead to it. Whereas if you take a broader view, you could say: well, data science is just statistics, and I’m a statistician, so I can do it.

Well, I’m surely not a very good programmer. So I would need to have a really good programmer at my side, and then maybe a good programmer and me as a statistician, the two of us together would be a good data scientist.

I think we all need the data managers very much as well. So a programmer who’s also an extremely good data manager. But the skill gaps were one of the hurdles. Another one was getting the data into good shape. Money is necessary for that, so are companies spending enough money to get their data sources in shape? There is a lot of standardization that you still need to do. The claim that unstructured data is just as easy to digest as structured data is just not true. It’s fake news. So I think data standardization and data cleaning is a huge effort that needs to be done, and that translates into investment. That’s of course one of the hurdles that people also named quite a lot.

And then another one is really trusted solutions. There’s a lack of trust towards data science: trust by regulatory bodies, but also trust by scientists and by companies. Is it really delivering what it promises? So the trust issue is a big one.

Yeah, I think, like any innovation, it goes through this change and innovation cycle where you have a hype and completely exaggerated expectations of new things. Data science had that as well. There were some companies that made really very bold statements and then couldn’t deliver on them, or claimed things, and just a couple of weeks or months or years later you found out it was all fake, or it was overfitting the data and nothing that you could carry forward into the future. These kinds of things happen in every industry, and I’m pretty sure if you go back in time, you’ll see something similar for statistics. Just because it’s new doesn’t mean that there will not be some kind of normalization over time. I think that is just a natural process. But there are sometimes pretty bold statements in the data science world, one of which is, for example, that we can basically get rid of RCTs. What do you think about that?

Well, I’m a big fan of Stephen Senn. Get him started on it. Now, randomized controlled experimentation. I would really just say randomized controlled experimentation: whenever you want to find a causal relationship of the interventionist type, which means if I do A, I will get result B, you need randomized controlled experimentation. Full stop.

OK.

And that’s about as bold a statement as IID data from a normal distribution, and then you weaken it a little bit. But when you want to provide evidence for a causal relationship and you don’t do any controlled experimentation, you will not get anywhere. That’s the statistical success that we had over the last 300 years with the scientific method. Whenever you don’t do randomization and whenever you don’t do controlled experimentation, I’m sorry, you will go down all the routes of bias and confounding, you will never find out why things really happen, and we will continue to bid against thunderstorms.

But of course there are situations where we cannot do controlled experimentation at all. I was working for Unilever and we had these plant sterol margarines. If you really wanted to show the long-term effect of plant sterol margarines not on a surrogate endpoint but on a true clinical endpoint, you would have to force people, randomized, into eating either margarine or butter for 10 or 20 years of their life. Sorry, but this experiment will never happen. So there are many questions on health which we cannot tackle with randomized clinical trials, and then we have to find other plausible ways. The epidemiologists have been doing this hard work for several decades, and they do it very well. It’s an art.

And there are these other approaches for extrapolation, which we don’t use often enough, where we really use the data from one trial in one population and do not need as many subjects in the new study, by having a good way of extrapolating the expectation to the new population. I am a Bayesian by philosophy.
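One simple way to picture this kind of extrapolation is a Bayesian update in which the estimate from the source population serves as a down-weighted prior for the new population. The Python sketch below is an editorial illustration only: it assumes normally distributed estimates with known standard errors and uses a power-prior-style discount (`weight`), which is one of several possible approaches and not necessarily the one Ursula has in mind. The function name and the numbers are made up.

```python
import math


def extrapolate(source_mean, source_se, new_mean, new_se, weight=0.5):
    """Combine a source-population estimate with a new-population estimate.

    weight in [0, 1] discounts the source information:
      0   = ignore the source population entirely,
      1   = treat source data as fully exchangeable with the new population.
    Returns the posterior mean and posterior standard deviation (normal model).
    """
    if not 0 <= weight <= 1:
        raise ValueError("weight must lie between 0 and 1")

    prior_precision = weight / source_se ** 2  # discounted information from the source
    data_precision = 1 / new_se ** 2           # information from the new, smaller study
    posterior_precision = prior_precision + data_precision

    posterior_mean = (
        prior_precision * source_mean + data_precision * new_mean
    ) / posterior_precision
    return posterior_mean, math.sqrt(1 / posterior_precision)


# Illustrative numbers: a precise source estimate and a small, noisy new study.
mean, sd = extrapolate(source_mean=2.0, source_se=0.5, new_mean=1.0, new_se=1.0, weight=0.5)
print(f"posterior mean = {mean:.2f}, posterior sd = {sd:.2f}")
```

The posterior standard deviation is smaller than the standard error of the new study alone, which is the mechanism by which borrowing can reduce the number of subjects needed. How much to discount the source data is a scientific and regulatory judgment that this sketch does not settle.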

And so if you go from the adult population to pediatric populations, I would very much want to make more use of extrapolation from the one group to the other. There are various methods out there that people can use, and that will reduce the number of people in a clinical trial, but possibly also the number of clinical trials that we have to do.

So, I mentioned before that we are primarily talking to statisticians at the moment. How would you say they can prepare themselves for the future, given all the claims that the end of the statistician is quite near?

Well, what I would like statisticians to do more is, A, get their hands dirty in dirty data. From our basic training, I think we all had dirty data at some point in time, and we had fun playing around with it. So why wouldn’t we continue doing that when we go into pharma?

So playing around much more with existing data for planning, and getting our hands dirty in data. And I think this is very specific to pharma statisticians. There are pharma statisticians who don’t even do anything with data anymore. They just do planning and then they look into tables and plots that programmers produced.

Yep, that exists. Yep, absolutely.

And I don’t think that it makes statisticians very happy. So getting your hands dirty in data again, I think that’s one of the things. The other thing really is to not be afraid of thinking in terms of the data sources and how to bring them together, and then to get somebody who is more knowledgeable than yourself. Or if you’re still studying, really look into data processing. Look into scraping and all the types of processes that exist to get data from various sources, to extract tables from the internet and stuff like that.

I’m not really good at it, but at least I have looked at it. So I know what other people can do if I just tell them what I want.

I think that is the key thing. You don’t need to do everything yourself. If you know what’s possible, you can delegate these kinds of tasks. I wouldn’t say that every statistician now needs to be fluent in R, in Python, and a couple of other languages. I’m not sure that is the best use of your time. But if you see the data and you know that certain things are extractable from the data and you understand the data sources, then I think that’s really valuable.

Yeah. And the last thing for me, which is not really the last thing, is that I believe we should be like machine learners when it comes to setting up processes. For the process of getting the data in, updating data and running everything again, we are almost there, together with the programmers and the data managers, on the input side, but not when it comes to the reporting. I would very much want statisticians not to stop at a PDF report and tables and plots on paper. We should give our clients tools to play with their data.

If you want to learn more about that, just scroll back in your podcast player and go, for example, to the episode with Zak Skrivanek about visualization, or another episode about building your own company with Shafi Chowdhury. There are a couple of very nice examples there of doing things differently and thinking beyond the PDF as a key deliverable. And yes, there’s this really nice episode that we recorded, Benjamin, you and myself, about tables not being the key deliverables.

Oh, yeah. We had discussed previously that we are very much aligned on that idea, but I didn’t know that you even have…

Yeah, we discussed this before, so your introduction fits quite nicely. It’s kind of a red thread through all the different episodes, a recurring theme, so to say.

Okay, so another thing that I think statisticians can actually do to prepare themselves for the future is to invest in their development and training on these kinds of things. There is the upcoming PSI conference that you can go to, for example. There is a special interest group on data science that PSI is setting up at the moment. And there are webinars coming up and one-day events that are happening on a rather regular basis. So there are a lot of different activities where you can get in touch with these kinds of things and close your gaps.

OK, Ursula, what was the last thought you had on this topic?

Well, it’s not a last thought. It is really the topic that I’m currently most into with respect to data science, and that is data science ethics.

OK.

That is, working towards ethical guidelines for data scientists. I have just written, together with a colleague of mine from the University of Dortmund, an 18-page article, which is hopefully going to be published in February. There are a lot of people and a lot of initiatives around the globe on data science and ethics, or on artificial intelligence, algorithms and ethics. But they are mostly outside the pharma world. It’s more in the social media world, in the internet world, where you see that algorithms are being used in a way that is harmful to society. Data science is now really delivering on the promises that it made in the 90s in many respects, and now it shapes our communities and how humans interact with each other. So I feel very deeply about that. Because of this high influence, we need to get data science away from being an occupation without any scientific training, possibly even, where just anyone can do data science without thinking about what a specific algorithm is going to do to your fellow human beings, be it a client, a company, or society at large.

So this is really something where I feel data science has to become a profession, and a profession has a service ideal. We have to write down what we think a data scientist should do and how they should interact with society, companies, clients and their colleagues.

Thanks so much. I think that will be one of the topics that we’ll come back to in the future: ethics as statisticians and as data scientists. There’s lots to discuss there, for example bias: if you predict that people from a certain area, or people of a certain gender, are worse off in some respect, is that bias, not in our statistical sense of the term, but more in the political sense? I think there are lots of very interesting topics in that, and actually quite controversial and not easy ones. So we’ll surely talk about that in the future as well.

Thanks so much for this very nice interview at this really late time of day, and have a great time. Thank you both for staying up as well. Thank you.

This show was created in association with PSI. Thanks for listening. Please visit thee to find the show notes with all the material and learn more about our podcast to boost your career as a statistician in the health sector.

Join The Effective Statistician LinkedIn group

I want to help the community of statisticians, data scientists, programmers and other quantitative scientists to be more influential, innovative, and effective. I believe that as a community we can help our research, our regulatory and payer systems, and ultimately physicians and patients make better decisions based on better evidence.

I work to achieve a future in which everyone can access the right evidence in the right format at the right time to make sound decisions.

When my kids are sick, I want to have good evidence to discuss with the physician about the different therapy choices.

When my mother is sick, I want her to be able to understand the evidence.

When I get sick, I want to find evidence that I can trust and that helps me to have meaningful discussions with my healthcare professionals.

I want to live in a world where the media reports correctly about medical evidence and in which society distinguishes between fake evidence and real evidence.

Let’s work together to achieve this.